[jira] [Updated] (ARROW-2504) [Website] Add ApacheCon NA link

2018-11-25 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2504:
--
Labels: pull-request-available  (was: )

> [Website] Add ApacheCon NA link
> ---
>
> Key: ARROW-2504
> URL: https://issues.apache.org/jira/browse/ARROW-2504
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>
> See instructions in http://apache.org/events/README.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2504) [Website] Add ApacheCon NA link

2018-11-25 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698493#comment-16698493
 ] 

Tanya Schlusser commented on ARROW-2504:


Newbie here – this looks like a good first issue for me, so I'm claiming it. Thank you!

> [Website] Add ApacheCon NA link
> ---
>
> Key: ARROW-2504
> URL: https://issues.apache.org/jira/browse/ARROW-2504
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
>
> See instructions in http://apache.org/events/README.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3860) [Gandiva] [C++] Fix packaging broken recently

2018-11-25 Thread Praveen Kumar Desabandu (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698492#comment-16698492
 ] 

Praveen Kumar Desabandu commented on ARROW-3860:


"beware: if someone links to another JNI-wrapped C++ library with different 
symbols, it may crash."

Yup, ran into this already :) I had to fix it by hiding symbols from the JNI 
shared library.

OK, I will try to make this an input flag, so that Arrow can continue to 
distribute Gandiva with libstdc++ dynamically linked and we can substitute 
that with a statically linked one.

 

> [Gandiva] [C++] Fix packaging broken recently
> -
>
> Key: ARROW-3860
> URL: https://issues.apache.org/jira/browse/ARROW-3860
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Praveen Kumar Desabandu
>Assignee: Praveen Kumar Desabandu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> This 
> [commit|https://github.com/apache/arrow/commit/ba2b2ea2301f067cc95306e11546ddb6d402a55c#diff-d5e5df5984ba660e999a7c657039f6af]
>  broke Gandiva packaging by removing static linking of libstdc++. Since Dremio 
> consumes a fat JAR that includes the packaged Gandiva native libraries, we 
> need to statically link libstdc++.
> As suggested in the commit message, it will be re-introduced as a CMake flag.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-11-25 Thread Suvayu Ali (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698439#comment-16698439
 ] 

Suvayu Ali edited comment on ARROW-3874 at 11/26/18 3:13 AM:
-

I had installed {{llvm-devel}} using dnf.  cmake worked fine after installing 
{{llvm-static}}. Thanks!

But during the build I also noticed that many already-installed libraries are 
being downloaded:
{code:java}
[  2%] Performing download step (download, verify and extract) for 'protobuf_ep'
[  2%] Performing download step (download, verify and extract) for 'thrift_ep'
{code}

I have these installed:
{code:java}
$ rpm -qa thrift\* protobuf\* 
protobuf-3.5.0-4.fc28.x86_64
protobuf-compiler-3.5.0-4.fc28.x86_64
protobuf-java-3.5.0-4.fc28.noarch
protobuf-c-1.3.0-4.fc28.x86_64
protobuf-devel-3.5.0-4.fc28.x86_64
protobuf-lite-3.5.0-4.fc28.x86_64
thrift-devel-0.10.0-9.fc28.x86_64
thrift-0.10.0-9.fc28.x86_64
{code}

Am I missing some libraries there as well?


was (Author: suvayu):
I had installed {{llvm-devel}} using dnf.  cmake worked fine after installing 
{{llvm-static}}. Thanks!

But during the build I also noticed that many already-installed libraries are 
being downloaded:
{code:java}
[  2%] Performing download step (download, verify and extract) for 'protobuf_ep'
[  2%] Performing download step (download, verify and extract) for 'thrift_ep'
{code}
I have these installed:
{code:java}
$ rpm -qa thrift\* protobuf\* 
protobuf-3.5.0-4.fc28.x86_64
protobuf-compiler-3.5.0-4.fc28.x86_64
protobuf-java-3.5.0-4.fc28.noarch
protobuf-c-1.3.0-4.fc28.x86_64
protobuf-devel-3.5.0-4.fc28.x86_64
protobuf-lite-3.5.0-4.fc28.x86_64
thrift-devel-0.10.0-9.fc28.x86_64
thrift-0.10.0-9.fc28.x86_64
{code}

Am I missing some libraries there as well?

> [Gandiva] Cannot build: LLVM not detected correctly
> ---
>
> Key: ARROW-3874
> URL: https://issues.apache.org/jira/browse/ARROW-3874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Affects Versions: 0.12.0
> Environment: Fedora 28, master (8d5bfc65)
> gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
> llvm 6.0.1
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: cmake
> Attachments: CMakeError.log, CMakeOutput.log
>
>
> I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
> detecting LLVM on the system.
> {code}
> $ cd build/data-an/arrow/arrow/cpp/
> $ export ARROW_HOME=/opt/data-an
> $ mkdir release
> $ cd release/
> $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
> -DARROW_GANDIVA=ON ../
> [...]
> -- Found LLVM 6.0.1
> -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
> CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
>   Target X86 is not in the set of libraries.
> Call Stack (most recent call first):
>   cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
>   src/gandiva/CMakeLists.txt:25 (find_package)
> -- Configuring incomplete, errors occurred!
> {code}
> The cmake log files are attached.
> When I invoke cmake with options other than *Gandiva*, it finishes 
> successfully.
> Here are the llvm libraries that are installed on my system:
> {code}
> $ rpm -qa llvm\* | sort
> llvm3.9-libs-3.9.1-13.fc28.x86_64
> llvm4.0-libs-4.0.1-5.fc28.x86_64
> llvm-6.0.1-8.fc28.x86_64
> llvm-devel-6.0.1-8.fc28.x86_64
> llvm-libs-6.0.1-8.fc28.i686
> llvm-libs-6.0.1-8.fc28.x86_64
> $ ls /usr/lib64/libLLVM* /usr/include/llvm
> /usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so
> /usr/include/llvm:
> ADT  FuzzMutate  Object Support
> Analysis InitializePasses.h  ObjectYAML TableGen
> AsmParserIR  Option Target
> BinaryFormat IRReaderPassAnalysisSupport.h  Testing
> Bitcode  LineEditor  Passes ToolDrivers
> CodeGen  LinkAllIR.h Pass.h Transforms
> Config   LinkAllPasses.h PassInfo.h WindowsManifest
> DebugInfoLinker  PassRegistry.h WindowsResource
> Demangle LTO PassSupport.h  XRay
> ExecutionEngine  MC  ProfileData
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-11-25 Thread Suvayu Ali (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698439#comment-16698439
 ] 

Suvayu Ali commented on ARROW-3874:
---

I had installed {{llvm-devel}} using dnf.  cmake worked fine after installing 
{{llvm-static}}. Thanks!

But during the build I also noticed that many already-installed libraries are 
being downloaded:
{code:java}
[  2%] Performing download step (download, verify and extract) for 'protobuf_ep'
[  2%] Performing download step (download, verify and extract) for 'thrift_ep'
{code}
I have these installed:
{code:java}
$ rpm -qa thrift\* protobuf\* 
protobuf-3.5.0-4.fc28.x86_64
protobuf-compiler-3.5.0-4.fc28.x86_64
protobuf-java-3.5.0-4.fc28.noarch
protobuf-c-1.3.0-4.fc28.x86_64
protobuf-devel-3.5.0-4.fc28.x86_64
protobuf-lite-3.5.0-4.fc28.x86_64
thrift-devel-0.10.0-9.fc28.x86_64
thrift-0.10.0-9.fc28.x86_64
{code}

Am I missing some libraries there as well?

> [Gandiva] Cannot build: LLVM not detected correctly
> ---
>
> Key: ARROW-3874
> URL: https://issues.apache.org/jira/browse/ARROW-3874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Affects Versions: 0.12.0
> Environment: Fedora 28, master (8d5bfc65)
> gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
> llvm 6.0.1
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: cmake
> Attachments: CMakeError.log, CMakeOutput.log
>
>
> I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
> detecting LLVM on the system.
> {code}
> $ cd build/data-an/arrow/arrow/cpp/
> $ export ARROW_HOME=/opt/data-an
> $ mkdir release
> $ cd release/
> $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
> -DARROW_GANDIVA=ON ../
> [...]
> -- Found LLVM 6.0.1
> -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
> CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
>   Target X86 is not in the set of libraries.
> Call Stack (most recent call first):
>   cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
>   src/gandiva/CMakeLists.txt:25 (find_package)
> -- Configuring incomplete, errors occurred!
> {code}
> The cmake log files are attached.
> When I invoke cmake with options other than *Gandiva*, it finishes 
> successfully.
> Here are the llvm libraries that are installed on my system:
> {code}
> $ rpm -qa llvm\* | sort
> llvm3.9-libs-3.9.1-13.fc28.x86_64
> llvm4.0-libs-4.0.1-5.fc28.x86_64
> llvm-6.0.1-8.fc28.x86_64
> llvm-devel-6.0.1-8.fc28.x86_64
> llvm-libs-6.0.1-8.fc28.i686
> llvm-libs-6.0.1-8.fc28.x86_64
> $ ls /usr/lib64/libLLVM* /usr/include/llvm
> /usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so
> /usr/include/llvm:
> ADT  FuzzMutate  Object Support
> Analysis InitializePasses.h  ObjectYAML TableGen
> AsmParserIR  Option Target
> BinaryFormat IRReaderPassAnalysisSupport.h  Testing
> Bitcode  LineEditor  Passes ToolDrivers
> CodeGen  LinkAllIR.h Pass.h Transforms
> Config   LinkAllPasses.h PassInfo.h WindowsManifest
> DebugInfoLinker  PassRegistry.h WindowsResource
> Demangle LTO PassSupport.h  XRay
> ExecutionEngine  MC  ProfileData
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3877) [C++] Provide access to "maximum decompressed size" functions in compression libraries (if they exist)

2018-11-25 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3877:
---

 Summary: [C++] Provide access to "maximum decompressed size" 
functions in compression libraries (if they exist)
 Key: ARROW-3877
 URL: https://issues.apache.org/jira/browse/ARROW-3877
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.13.0


As a follow-up to ARROW-3831, some compression libraries have a function to 
provide a hint for sizing the output buffer (if it is not known already) for 
one-shot decompression. This would be helpful for sizing allocations in such 
cases.
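
For illustration, a sketch of the kind of hint meant here, using the 
python-zstandard binding (an assumption for the example; the actual work is in 
the C++ codecs): zstd records the content size in the frame header, so a 
decompressed-size hint can be read back without decompressing.
{code:python}
import zstandard as zstd

# Recent python-zstandard writes the content size into the frame header
# by default; get_frame_parameters() reads it back without decompressing.
frame = zstd.ZstdCompressor().compress(b'x' * 1000)
params = zstd.get_frame_parameters(frame)
print(params.content_size)  # sizing hint for the output buffer
{code}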



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3831) [C++] arrow::util::Codec::Decompress() doesn't return decompressed data size

2018-11-25 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3831.
-
   Resolution: Fixed
Fix Version/s: 0.12.0

Issue resolved by pull request 3024
[https://github.com/apache/arrow/pull/3024]
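
To illustrate the fixed contract, a hypothetical Python rendering (the real 
change is in the C++ {{Codec}} API): the callee fills a caller-provided buffer 
and returns how many bytes of it are valid.
{code:python}
import zlib

def decompress_into(compressed: bytes, output_buffer: bytearray) -> int:
    # Hypothetical helper mirroring Codec::Decompress() after the fix:
    # fill the caller's buffer and report the decompressed data size.
    data = zlib.decompress(compressed)
    output_buffer[:len(data)] = data
    return len(data)

buf = bytearray(1024)                         # over-sized output buffer
n = decompress_into(zlib.compress(b'hello'), buf)
print(bytes(buf[:n]))                         # only the first n bytes are valid
{code}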

> [C++] arrow::util::Codec::Decompress() doesn't return decompressed data size
> 
>
> Key: ARROW-3831
> URL: https://issues.apache.org/jira/browse/ARROW-3831
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.11.1
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We can't know the decompressed data size when we only have the compressed 
> data. The current {{arrow::util::Codec::Decompress()}} doesn't return the 
> decompressed data size, so we can't know which data in {{output_buffer}} can 
> be used.
> FYI: {{arrow::util::Codec::Compress()}} returns compressed data size.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3677) [Go] implement FixedSizedBinary array

2018-11-25 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3677.
-
   Resolution: Fixed
Fix Version/s: 0.12.0

Issue resolved by pull request 3012
[https://github.com/apache/arrow/pull/3012]

> [Go] implement FixedSizedBinary array
> -
>
> Key: ARROW-3677
> URL: https://issues.apache.org/jira/browse/ARROW-3677
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go
>Reporter: Sebastien Binet
>Assignee: Alexandre Crayssac
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2591) [Python] Segmentation fault when writing empty ListType column to Parquet

2018-11-25 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2591:

Fix Version/s: 0.12.0

> [Python] Segmentation fault when writing empty ListType column to Parquet
> -
>
> Key: ARROW-2591
> URL: https://issues.apache.org/jira/browse/ARROW-2591
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0, 0.9.0
>Reporter: jacques
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The context is the following: I am currently dealing with sparse column 
> serialization in Parquet. In some cases, many lines are empty; I can also have 
> columns containing only empty lists.
> However, I get a segmentation fault when I try to write those columns, filled 
> only with empty lists, to Parquet.
> Here is a simple code snippet that reproduces the segmentation fault I had:
> {noformat}
> In [1]: import pyarrow as pa
> In [2]: import pyarrow.parquet as pq
> In [3]: pa_ar = pa.array([[],[]],pa.list_(pa.int32()))
> In [4]: table = pa.Table.from_arrays([pa_ar],["test"])
> In [5]: pq.write_table(
>    ...: table=table,
>    ...: where="test.parquet",
>    ...: compression="snappy",
>    ...: flavor="spark"
>    ...: )
> Segmentation fault
> {noformat}
> May I have it fixed?
> Best
> Jacques



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3860) [Gandiva] [C++] Fix packaging broken recently

2018-11-25 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698289#comment-16698289
 ] 

Wes McKinney commented on ARROW-3860:
-

It isn't appropriate to statically link libstdc++ in general binary library 
distributions -- that's part of why I've been making a stink about this issue: 
the binaries it produces can only be used in a very narrow context. 

If you use the libraries to build an application, depending on the platform 
where you build, you may have symbol conflicts that will yield segfaults / core 
dumps. 

I can see the argument for building the binaries this way in a JAR distribution 
of the JNI bindings. Though, beware: if someone links to another JNI-wrapped 
C++ library with different symbols, it may crash. 

> [Gandiva] [C++] Fix packaging broken recently
> -
>
> Key: ARROW-3860
> URL: https://issues.apache.org/jira/browse/ARROW-3860
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Praveen Kumar Desabandu
>Assignee: Praveen Kumar Desabandu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> This 
> [commit|https://github.com/apache/arrow/commit/ba2b2ea2301f067cc95306e11546ddb6d402a55c#diff-d5e5df5984ba660e999a7c657039f6af]
>  broke Gandiva packaging by removing static linking of libstdc++. Since Dremio 
> consumes a fat JAR that includes the packaged Gandiva native libraries, we 
> need to statically link libstdc++.
> As suggested in the commit message, it will be re-introduced as a CMake flag.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2591) [Python] Segmentation fault when writing empty ListType column to Parquet

2018-11-25 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698286#comment-16698286
 ] 

Wes McKinney commented on ARROW-2591:
-

> May I have it fixed?

[~jafournier] for future reference, it isn't ideal in open source projects to 
ask volunteers to fix bugs for you in this way. After you report the bug, if 
another developer deems it a priority, they may fix it. Otherwise, if they do 
not fix it and you need the fix sooner, we would be glad to accept a pull 
request. 

> [Python] Segmentation fault when writing empty ListType column to Parquet
> -
>
> Key: ARROW-2591
> URL: https://issues.apache.org/jira/browse/ARROW-2591
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0, 0.9.0
>Reporter: jacques
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: parquet, pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The context is the following: I am currently dealing with sparse column 
> serialization in Parquet. In some cases, many lines are empty; I can also have 
> columns containing only empty lists.
> However, I get a segmentation fault when I try to write those columns, filled 
> only with empty lists, to Parquet.
> Here is a simple code snippet that reproduces the segmentation fault I had:
> {noformat}
> In [1]: import pyarrow as pa
> In [2]: import pyarrow.parquet as pq
> In [3]: pa_ar = pa.array([[],[]],pa.list_(pa.int32()))
> In [4]: table = pa.Table.from_arrays([pa_ar],["test"])
> In [5]: pq.write_table(
>    ...: table=table,
>    ...: where="test.parquet",
>    ...: compression="snappy",
>    ...: flavor="spark"
>    ...: )
> Segmentation fault
> {noformat}
> May I have it fixed?
> Best
> Jacques



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2591) [Python] Segmentation fault when writing empty ListType column to Parquet

2018-11-25 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2591:

Summary: [Python] Segmentation fault when writing empty ListType column to 
Parquet  (was: [Python] Segmentation fault issue in pq.write_table)

> [Python] Segmentation fault when writing empty ListType column to Parquet
> -
>
> Key: ARROW-2591
> URL: https://issues.apache.org/jira/browse/ARROW-2591
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0, 0.9.0
>Reporter: jacques
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: parquet, pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The context is the following: I am currently dealing with sparse column 
> serialization in Parquet. In some cases, many lines are empty; I can also have 
> columns containing only empty lists.
> However, I get a segmentation fault when I try to write those columns, filled 
> only with empty lists, to Parquet.
> Here is a simple code snippet that reproduces the segmentation fault I had:
> {noformat}
> In [1]: import pyarrow as pa
> In [2]: import pyarrow.parquet as pq
> In [3]: pa_ar = pa.array([[],[]],pa.list_(pa.int32()))
> In [4]: table = pa.Table.from_arrays([pa_ar],["test"])
> In [5]: pq.write_table(
>    ...: table=table,
>    ...: where="test.parquet",
>    ...: compression="snappy",
>    ...: flavor="spark"
>    ...: )
> Segmentation fault
> {noformat}
> May I have it fixed?
> Best
> Jacques



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2026) [Python] µs timestamps saved as int64 even if use_deprecated_int96_timestamps=True

2018-11-25 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698283#comment-16698283
 ] 

Wes McKinney commented on ARROW-2026:
-

I agree that the correct fix here is probably to coerce all timestamps to 
nanoseconds with INT96 storage: if you are passing 
{{use_deprecated_int96_timestamps=True}}, you are probably using Apache 
Impala (which is what Redshift Spectrum uses, as far as I understand it) or 
another system that expects INT96 nanosecond timestamps. 

It is a bit of a rough edge to have to go through and convert all your 
timestamps to nanoseconds before writing to Parquet. 

[~xhochy] do you have thoughts about this? 
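
A minimal sketch of that caller-side conversion (an illustration, not a 
pyarrow feature): cast microsecond values up to nanoseconds before writing, so 
that the INT96 path applies.
{code:python}
import datetime

import pyarrow as pa
import pyarrow.parquet as pq

# Lossless upcast us -> ns, then write with the deprecated INT96 flag.
arr_us = pa.array([datetime.datetime.now()], pa.timestamp('us'))
arr_ns = arr_us.cast(pa.timestamp('ns'))
table = pa.Table.from_arrays([arr_ns], ['last_updated'])
pq.write_table(table, 'test_file.parquet',
               use_deprecated_int96_timestamps=True)
{code}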

> [Python] µs timestamps saved as int64 even if 
> use_deprecated_int96_timestamps=True
> --
>
> Key: ARROW-2026
> URL: https://issues.apache.org/jira/browse/ARROW-2026
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: OS: Mac OS X 10.13.2
> Python: 3.6.4
> PyArrow: 0.8.0
>Reporter: Diego Argueta
>Priority: Major
>  Labels: parquet, redshift, timestamps
> Fix For: 0.12.0
>
>
> When writing to a Parquet file, if `use_deprecated_int96_timestamps` is True, 
> timestamps are only written as 96-bit integers if the timestamp has 
> nanosecond resolution. This is a problem because Amazon Redshift timestamps 
> only have microsecond resolution, yet Redshift requires them to be stored in 
> 96-bit format in Parquet files.
> I'd expect the use_deprecated_int96_timestamps flag to cause _all_ timestamps 
> to be written as 96 bits, regardless of resolution. If this is a deliberate 
> design decision, it'd be immensely helpful if it were explicitly documented 
> as part of the argument.
>  
> To reproduce:
>  
> 1. Create a table with a timestamp having microsecond or millisecond 
> resolution, and save it to a Parquet file. Be sure to set 
> `use_deprecated_int96_timestamps` to True.
>  
> {code:java}
> import datetime
> import pyarrow
> from pyarrow import parquet
> schema = pyarrow.schema([
> pyarrow.field('last_updated', pyarrow.timestamp('us')),
> ])
> data = [
> pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('us')),
> ]
> table = pyarrow.Table.from_arrays(data, ['last_updated'])
> with open('test_file.parquet', 'wb') as fdesc:
> parquet.write_table(table, fdesc,
> use_deprecated_int96_timestamps=True)
> {code}
>  
> 2. Inspect the file. I used parquet-tools:
>  
> {noformat}
> dak@tux ~ $ parquet-tools meta test_file.parquet
> file:         file:/Users/dak/test_file.parquet
> creator:      parquet-cpp version 1.3.2-SNAPSHOT
> file schema:  schema
> 
> last_updated: OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1
> row group 1:  RC:1 TS:76 OFFSET:4
> 
> last_updated:  INT64 SNAPPY DO:4 FPO:28 SZ:76/72/0.95 VC:1 
> ENC:PLAIN,PLAIN_DICTIONARY,RLE{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2026) [Python] µs timestamps saved as int64 even if use_deprecated_int96_timestamps=True

2018-11-25 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2026:

Fix Version/s: 0.12.0

> [Python] µs timestamps saved as int64 even if 
> use_deprecated_int96_timestamps=True
> --
>
> Key: ARROW-2026
> URL: https://issues.apache.org/jira/browse/ARROW-2026
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: OS: Mac OS X 10.13.2
> Python: 3.6.4
> PyArrow: 0.8.0
>Reporter: Diego Argueta
>Priority: Major
>  Labels: parquet, redshift, timestamps
> Fix For: 0.12.0
>
>
> When writing to a Parquet file, if `use_deprecated_int96_timestamps` is True, 
> timestamps are only written as 96-bit integers if the timestamp has 
> nanosecond resolution. This is a problem because Amazon Redshift timestamps 
> only have microsecond resolution, yet Redshift requires them to be stored in 
> 96-bit format in Parquet files.
> I'd expect the use_deprecated_int96_timestamps flag to cause _all_ timestamps 
> to be written as 96 bits, regardless of resolution. If this is a deliberate 
> design decision, it'd be immensely helpful if it were explicitly documented 
> as part of the argument.
>  
> To reproduce:
>  
> 1. Create a table with a timestamp having microsecond or millisecond 
> resolution, and save it to a Parquet file. Be sure to set 
> `use_deprecated_int96_timestamps` to True.
>  
> {code:java}
> import datetime
> import pyarrow
> from pyarrow import parquet
> schema = pyarrow.schema([
> pyarrow.field('last_updated', pyarrow.timestamp('us')),
> ])
> data = [
> pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('us')),
> ]
> table = pyarrow.Table.from_arrays(data, ['last_updated'])
> with open('test_file.parquet', 'wb') as fdesc:
> parquet.write_table(table, fdesc,
> use_deprecated_int96_timestamps=True)
> {code}
>  
> 2. Inspect the file. I used parquet-tools:
>  
> {noformat}
> dak@tux ~ $ parquet-tools meta test_file.parquet
> file:         file:/Users/dak/test_file.parquet
> creator:      parquet-cpp version 1.3.2-SNAPSHOT
> file schema:  schema
> 
> last_updated: OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1
> row group 1:  RC:1 TS:76 OFFSET:4
> 
> last_updated:  INT64 SNAPPY DO:4 FPO:28 SZ:76/72/0.95 VC:1 
> ENC:PLAIN,PLAIN_DICTIONARY,RLE{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2026) [Python] µs timestamps saved as int64 even if use_deprecated_int96_timestamps=True

2018-11-25 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2026:

Fix Version/s: (was: 0.12.0)

> [Python] µs timestamps saved as int64 even if 
> use_deprecated_int96_timestamps=True
> --
>
> Key: ARROW-2026
> URL: https://issues.apache.org/jira/browse/ARROW-2026
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: OS: Mac OS X 10.13.2
> Python: 3.6.4
> PyArrow: 0.8.0
>Reporter: Diego Argueta
>Priority: Major
>  Labels: parquet, redshift, timestamps
>
> When writing to a Parquet file, if `use_deprecated_int96_timestamps` is True, 
> timestamps are only written as 96-bit integers if the timestamp has 
> nanosecond resolution. This is a problem because Amazon Redshift timestamps 
> only have microsecond resolution, yet Redshift requires them to be stored in 
> 96-bit format in Parquet files.
> I'd expect the use_deprecated_int96_timestamps flag to cause _all_ timestamps 
> to be written as 96 bits, regardless of resolution. If this is a deliberate 
> design decision, it'd be immensely helpful if it were explicitly documented 
> as part of the argument.
>  
> To reproduce:
>  
> 1. Create a table with a timestamp having microsecond or millisecond 
> resolution, and save it to a Parquet file. Be sure to set 
> `use_deprecated_int96_timestamps` to True.
>  
> {code:java}
> import datetime
> import pyarrow
> from pyarrow import parquet
> schema = pyarrow.schema([
> pyarrow.field('last_updated', pyarrow.timestamp('us')),
> ])
> data = [
> pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('us')),
> ]
> table = pyarrow.Table.from_arrays(data, ['last_updated'])
> with open('test_file.parquet', 'wb') as fdesc:
> parquet.write_table(table, fdesc,
> use_deprecated_int96_timestamps=True)
> {code}
>  
> 2. Inspect the file. I used parquet-tools:
>  
> {noformat}
> dak@tux ~ $ parquet-tools meta test_file.parquet
> file:         file:/Users/dak/test_file.parquet
> creator:      parquet-cpp version 1.3.2-SNAPSHOT
> file schema:  schema
> 
> last_updated: OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1
> row group 1:  RC:1 TS:76 OFFSET:4
> 
> last_updated:  INT64 SNAPPY DO:4 FPO:28 SZ:76/72/0.95 VC:1 
> ENC:PLAIN,PLAIN_DICTIONARY,RLE{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-11-25 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698282#comment-16698282
 ] 

Wes McKinney commented on ARROW-3874:
-

Also, as noted in https://issues.apache.org/jira/browse/ARROW-3846, our 
FindLLVM.cmake needs to be revamped so that it can also work on Windows.

> [Gandiva] Cannot build: LLVM not detected correctly
> ---
>
> Key: ARROW-3874
> URL: https://issues.apache.org/jira/browse/ARROW-3874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Affects Versions: 0.12.0
> Environment: Fedora 28, master (8d5bfc65)
> gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
> llvm 6.0.1
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: cmake
> Attachments: CMakeError.log, CMakeOutput.log
>
>
> I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
> detecting LLVM on the system.
> {code}
> $ cd build/data-an/arrow/arrow/cpp/
> $ export ARROW_HOME=/opt/data-an
> $ mkdir release
> $ cd release/
> $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
> -DARROW_GANDIVA=ON ../
> [...]
> -- Found LLVM 6.0.1
> -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
> CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
>   Target X86 is not in the set of libraries.
> Call Stack (most recent call first):
>   cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
>   src/gandiva/CMakeLists.txt:25 (find_package)
> -- Configuring incomplete, errors occurred!
> {code}
> The cmake log files are attached.
> When I invoke cmake with options other than *Gandiva*, it finishes 
> successfully.
> Here are the llvm libraries that are installed on my system:
> {code}
> $ rpm -qa llvm\* | sort
> llvm3.9-libs-3.9.1-13.fc28.x86_64
> llvm4.0-libs-4.0.1-5.fc28.x86_64
> llvm-6.0.1-8.fc28.x86_64
> llvm-devel-6.0.1-8.fc28.x86_64
> llvm-libs-6.0.1-8.fc28.i686
> llvm-libs-6.0.1-8.fc28.x86_64
> $ ls /usr/lib64/libLLVM* /usr/include/llvm
> /usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so
> /usr/include/llvm:
> ADT  FuzzMutate  Object Support
> Analysis InitializePasses.h  ObjectYAML TableGen
> AsmParserIR  Option Target
> BinaryFormat IRReaderPassAnalysisSupport.h  Testing
> Bitcode  LineEditor  Passes ToolDrivers
> CodeGen  LinkAllIR.h Pass.h Transforms
> Config   LinkAllPasses.h PassInfo.h WindowsManifest
> DebugInfoLinker  PassRegistry.h WindowsResource
> Demangle LTO PassSupport.h  XRay
> ExecutionEngine  MC  ProfileData
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3728) [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch

2018-11-25 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3728.
-
Resolution: Fixed

Issue resolved by pull request 3029
[https://github.com/apache/arrow/pull/3029]
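
For anyone on an older version, a minimal sketch of a manual workaround 
(assuming, as described below, that only the pandas metadata differs): strip 
the schema-level metadata so the schemas compare equal.
{code:python}
import pyarrow.parquet as pq

files = ['part-0.parquet', 'part-1.parquet']  # hypothetical input files
writer = None
for path in files:
    # Drop the pandas metadata that differs between otherwise-identical schemas.
    table = pq.read_table(path).replace_schema_metadata(None)
    if writer is None:
        writer = pq.ParquetWriter('merged.parquet', table.schema)
    writer.write_table(table)
writer.close()
{code}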

> [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch
> ---
>
> Key: ARROW-3728
> URL: https://issues.apache.org/jira/browse/ARROW-3728
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0, 0.11.0, 0.11.1
> Environment: Python 3.6.3
> OSX 10.14
>Reporter: Micah Williamson
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> From: 
> https://stackoverflow.com/questions/53214288/merging-parquet-files-pandas-meta-in-schema-mismatch
>  
> I am trying to merge multiple parquet files into one. Their schemas are 
> identical field-wise, but my {{ParquetWriter}} is complaining that they are 
> not. After some investigation I found that the pandas metadata in the schemas 
> is different, causing this error.
>  
> Sample-
> {code:python}
> import pyarrow.parquet as pq
> pq_tables=[]
> for file_ in files:
> pq_table = pq.read_table(f'{MESS_DIR}/{file_}')
> pq_tables.append(pq_table)
> if writer is None:
> writer = pq.ParquetWriter(COMPRESSED_FILE, schema=pq_table.schema, 
> use_deprecated_int96_timestamps=True)
> writer.write_table(table=pq_table)
> {code}
> The error-
> {code}
> Traceback (most recent call last):
>   File "{PATH_TO}/main.py", line 68, in lambda_handler
> writer.write_table(table=pq_table)
>   File 
> "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 335, in write_table
> raise ValueError(msg)
> ValueError: Table schema does not match schema used to create file:
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3875) [Python] Try to cast or normalize schemas when writing a table to ParquetWriter

2018-11-25 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3875:
---

 Summary: [Python] Try to cast or normalize schemas when writing a 
table to ParquetWriter
 Key: ARROW-3875
 URL: https://issues.apache.org/jira/browse/ARROW-3875
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


If we can cast safely to the file schema, it would improve usability to do so 
automatically. This auto-normalize behavior could be toggled on/off if so 
desired.
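
A minimal sketch of what this might look like from Python (illustrative only, 
assuming a pyarrow version with {{Table.cast}}):
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

file_schema = pa.schema([pa.field('x', pa.int64())])
writer = pq.ParquetWriter('out.parquet', file_schema)

table = pa.Table.from_arrays([pa.array([1, 2], pa.int32())], ['x'])
if table.schema != file_schema:
    # int32 -> int64 widens without data loss, so a safe cast succeeds.
    table = table.cast(file_schema, safe=True)
writer.write_table(table)
writer.close()
{code}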



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3876) [Python] Try to cast or normalize schemas when writing a table to ParquetWriter

2018-11-25 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3876:
---

 Summary: [Python] Try to cast or normalize schemas when writing a 
table to ParquetWriter
 Key: ARROW-3876
 URL: https://issues.apache.org/jira/browse/ARROW-3876
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


If we can cast safely to the file schema, it would improve usability to do so 
automatically. This auto-normalize behavior could be toggled on/off if so 
desired.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3728) [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch

2018-11-25 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3728:

Fix Version/s: 0.12.0

> [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch
> ---
>
> Key: ARROW-3728
> URL: https://issues.apache.org/jira/browse/ARROW-3728
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0, 0.11.0, 0.11.1
> Environment: Python 3.6.3
> OSX 10.14
>Reporter: Micah Williamson
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> From: 
> https://stackoverflow.com/questions/53214288/merging-parquet-files-pandas-meta-in-schema-mismatch
>  
> I am trying to merge multiple parquet files into one. Their schemas are 
> identical field-wise, but my {{ParquetWriter}} is complaining that they are 
> not. After some investigation I found that the pandas metadata in the schemas 
> is different, causing this error.
>  
> Sample-
> {code:python}
> import pyarrow.parquet as pq
> pq_tables=[]
> for file_ in files:
> pq_table = pq.read_table(f'{MESS_DIR}/{file_}')
> pq_tables.append(pq_table)
> if writer is None:
> writer = pq.ParquetWriter(COMPRESSED_FILE, schema=pq_table.schema, 
> use_deprecated_int96_timestamps=True)
> writer.write_table(table=pq_table)
> {code}
> The error-
> {code}
> Traceback (most recent call last):
>   File "{PATH_TO}/main.py", line 68, in lambda_handler
> writer.write_table(table=pq_table)
>   File 
> "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 335, in write_table
> raise ValueError(msg)
> ValueError: Table schema does not match schema used to create file:
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3728) [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch

2018-11-25 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3728:
---

Assignee: Krisztian Szucs

> [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch
> ---
>
> Key: ARROW-3728
> URL: https://issues.apache.org/jira/browse/ARROW-3728
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0, 0.11.0, 0.11.1
> Environment: Python 3.6.3
> OSX 10.14
>Reporter: Micah Williamson
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: parquet, pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> From: 
> https://stackoverflow.com/questions/53214288/merging-parquet-files-pandas-meta-in-schema-mismatch
>  
> I am trying to merge multiple parquet files into one. Their schemas are 
> identical field-wise, but my {{ParquetWriter}} is complaining that they are 
> not. After some investigation I found that the pandas metadata in the schemas 
> is different, causing this error.
>  
> Sample-
> {code:python}
> import pyarrow.parquet as pq
> pq_tables=[]
> for file_ in files:
> pq_table = pq.read_table(f'{MESS_DIR}/{file_}')
> pq_tables.append(pq_table)
> if writer is None:
> writer = pq.ParquetWriter(COMPRESSED_FILE, schema=pq_table.schema, 
> use_deprecated_int96_timestamps=True)
> writer.write_table(table=pq_table)
> {code}
> The error-
> {code}
> Traceback (most recent call last):
>   File "{PATH_TO}/main.py", line 68, in lambda_handler
> writer.write_table(table=pq_table)
>   File 
> "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 335, in write_table
> raise ValueError(msg)
> ValueError: Table schema does not match schema used to create file:
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-11-25 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698277#comment-16698277
 ] 

Wes McKinney commented on ARROW-3874:
-

How did you install LLVM? You're missing the LLVM static libraries, so you're 
going to have some problems in any case. Here's what my LLVM lib directory 
looks like, using libraries from the Ubuntu 14.04 apt repository on 
apt.llvm.org:

https://gist.github.com/wesm/4cc5c786c4fc37310b9af3b24a819fa2

> [Gandiva] Cannot build: LLVM not detected correctly
> ---
>
> Key: ARROW-3874
> URL: https://issues.apache.org/jira/browse/ARROW-3874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Affects Versions: 0.12.0
> Environment: Fedora 28, master (8d5bfc65)
> gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
> llvm 6.0.1
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: cmake
> Attachments: CMakeError.log, CMakeOutput.log
>
>
> I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
> detecting LLVM on the system.
> {code}
> $ cd build/data-an/arrow/arrow/cpp/
> $ export ARROW_HOME=/opt/data-an
> $ mkdir release
> $ cd release/
> $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
> -DARROW_GANDIVA=ON ../
> [...]
> -- Found LLVM 6.0.1
> -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
> CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
>   Target X86 is not in the set of libraries.
> Call Stack (most recent call first):
>   cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
>   src/gandiva/CMakeLists.txt:25 (find_package)
> -- Configuring incomplete, errors occurred!
> {code}
> The cmake log files are attached.
> When I invoke cmake with options other than *Gandiva*, it finishes 
> successfully.
> Here are the llvm libraries that are installed on my system:
> {code}
> $ rpm -qa llvm\* | sort
> llvm3.9-libs-3.9.1-13.fc28.x86_64
> llvm4.0-libs-4.0.1-5.fc28.x86_64
> llvm-6.0.1-8.fc28.x86_64
> llvm-devel-6.0.1-8.fc28.x86_64
> llvm-libs-6.0.1-8.fc28.i686
> llvm-libs-6.0.1-8.fc28.x86_64
> $ ls /usr/lib64/libLLVM* /usr/include/llvm
> /usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so
> /usr/include/llvm:
> ADT  FuzzMutate  Object Support
> Analysis InitializePasses.h  ObjectYAML TableGen
> AsmParserIR  Option Target
> BinaryFormat IRReaderPassAnalysisSupport.h  Testing
> Bitcode  LineEditor  Passes ToolDrivers
> CodeGen  LinkAllIR.h Pass.h Transforms
> Config   LinkAllPasses.h PassInfo.h WindowsManifest
> DebugInfoLinker  PassRegistry.h WindowsResource
> Demangle LTO PassSupport.h  XRay
> ExecutionEngine  MC  ProfileData
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1957) [Python] Handle nanosecond timestamps in parquet serialization

2018-11-25 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698273#comment-16698273
 ] 

Wes McKinney commented on ARROW-1957:
-

Nanoseconds are being added as a supported timestamp resolution for INT64 
storage; see

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L261

Upgrading to make use of this in Parquet C++ is not trivial (the umbrella JIRA 
for this is PARQUET-1411), so I suggest leaving this issue open until this is 
supported in Arrow writes.
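
In the meantime, a minimal sketch of the existing (lossy) escape hatch, 
assuming a pyarrow version with {{coerce_timestamps}} and 
{{allow_truncated_timestamps}}:
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

n = 3
df = pd.DataFrame({'x': range(n)}, index=pd.DatetimeIndex(start='2017-01-01',
                  freq='1n', periods=n))
# Explicitly coerce ns -> us and opt in to the precision loss.
pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet',
               coerce_timestamps='us', allow_truncated_timestamps=True)
{code}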

> [Python] Handle nanosecond timestamps in parquet serialization
> --
>
> Key: ARROW-1957
> URL: https://issues.apache.org/jira/browse/ARROW-1957
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Python 3.6.4.  Mac OSX and CentOS Linux release 
> 7.3.1611.  Pandas 0.21.1.
>Reporter: Jordan Samuels
>Priority: Minor
>  Labels: parquet
>
> The following code
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> n=3
> df = pd.DataFrame({'x': range(n)}, index=pd.DatetimeIndex(start='2017-01-01', 
> freq='1n', periods=n))
> pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet'){code}
> results in:
> {{ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 
> 14832288001}}
> The desired effect is that we can save nanosecond resolution without losing 
> precision (e.g. conversion to ms).  Note that if {{freq='1u'}} is used, the 
> code runs properly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1957) [Python] Write nanosecond timestamps using new NANO LogicalType Parquet unit

2018-11-25 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1957:

Summary: [Python] Write nanosecond timestamps using new NANO LogicalType 
Parquet unit  (was: [Python] Handle nanosecond timestamps in parquet 
serialization)

> [Python] Write nanosecond timestamps using new NANO LogicalType Parquet unit
> 
>
> Key: ARROW-1957
> URL: https://issues.apache.org/jira/browse/ARROW-1957
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Python 3.6.4.  Mac OSX and CentOS Linux release 
> 7.3.1611.  Pandas 0.21.1.
>Reporter: Jordan Samuels
>Priority: Minor
>  Labels: parquet
>
> The following code
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> n=3
> df = pd.DataFrame({'x': range(n)}, index=pd.DatetimeIndex(start='2017-01-01', 
> freq='1n', periods=n))
> pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet'){code}
> results in:
> {{ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 
> 14832288001}}
> The desired effect is that we can save nanosecond resolution without losing 
> precision (e.g. conversion to ms).  Note that if {{freq='1u'}} is used, the 
> code runs properly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1957) [Python] Handle nanosecond timestamps in parquet serialization

2018-11-25 Thread Krisztian Szucs (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698266#comment-16698266
 ] 

Krisztian Szucs commented on ARROW-1957:


[~wesmckinn] Invalid?

> [Python] Handle nanosecond timestamps in parquet serialization
> --
>
> Key: ARROW-1957
> URL: https://issues.apache.org/jira/browse/ARROW-1957
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Python 3.6.4.  Mac OSX and CentOS Linux release 
> 7.3.1611.  Pandas 0.21.1.
>Reporter: Jordan Samuels
>Priority: Minor
>  Labels: parquet
>
> The following code
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> n=3
> df = pd.DataFrame({'x': range(n)}, index=pd.DatetimeIndex(start='2017-01-01', 
> freq='1n', periods=n))
> pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet'){code}
> results in:
> {{ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 
> 14832288001}}
> The desired effect is that we can save nanosecond resolution without losing 
> precision (e.g. conversion to ms).  Note that if {{freq='1u'}} is used, the 
> code runs properly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected

2018-11-25 Thread Suvayu Ali (JIRA)
Suvayu Ali created ARROW-3874:
-

 Summary: [Gandiva] Cannot build: LLVM not detected
 Key: ARROW-3874
 URL: https://issues.apache.org/jira/browse/ARROW-3874
 Project: Apache Arrow
  Issue Type: Bug
  Components: Gandiva
Affects Versions: 0.12.0
 Environment: Fedora 28, master (8d5bfc65)
gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
llvm 6.0.1
Reporter: Suvayu Ali
 Attachments: CMakeError.log, CMakeOutput.log

I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
detecting LLVM on the system.
{code}
$ cd build/data-an/arrow/arrow/cpp/
$ export ARROW_HOME=/opt/data-an
$ mkdir release
$ cd release/
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
-DARROW_GANDIVA=ON ../
[...]
-- Found LLVM 6.0.1
-- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
  Target X86 is not in the set of libraries.
Call Stack (most recent call first):
  cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
  src/gandiva/CMakeLists.txt:25 (find_package)


-- Configuring incomplete, errors occurred!
{code}
The cmake log files are attached.

When I invoke cmake with options other than *Gandiva*, it finishes successfully.

Here are the llvm libraries that are installed on my system:
{code}
$ rpm -qa llvm\* | sort
llvm3.9-libs-3.9.1-13.fc28.x86_64
llvm4.0-libs-4.0.1-5.fc28.x86_64
llvm-6.0.1-8.fc28.x86_64
llvm-devel-6.0.1-8.fc28.x86_64
llvm-libs-6.0.1-8.fc28.i686
llvm-libs-6.0.1-8.fc28.x86_64
$ ls /usr/lib64/libLLVM* /usr/include/llvm
/usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so

/usr/include/llvm:
ADT  FuzzMutate  Object Support
Analysis InitializePasses.h  ObjectYAML TableGen
AsmParserIR  Option Target
BinaryFormat IRReaderPassAnalysisSupport.h  Testing
Bitcode  LineEditor  Passes ToolDrivers
CodeGen  LinkAllIR.h Pass.h Transforms
Config   LinkAllPasses.h PassInfo.h WindowsManifest
DebugInfoLinker  PassRegistry.h WindowsResource
Demangle LTO PassSupport.h  XRay
ExecutionEngine  MC  ProfileData
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-11-25 Thread Suvayu Ali (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suvayu Ali updated ARROW-3874:
--
Summary: [Gandiva] Cannot build: LLVM not detected correctly  (was: 
[Gandiva] Cannot build: LLVM not detected)

> [Gandiva] Cannot build: LLVM not detected correctly
> ---
>
> Key: ARROW-3874
> URL: https://issues.apache.org/jira/browse/ARROW-3874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Affects Versions: 0.12.0
> Environment: Fedora 28, master (8d5bfc65)
> gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
> llvm 6.0.1
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: cmake
> Attachments: CMakeError.log, CMakeOutput.log
>
>
> I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
> detecting LLVM on the system.
> {code}
> $ cd build/data-an/arrow/arrow/cpp/
> $ export ARROW_HOME=/opt/data-an
> $ mkdir release
> $ cd release/
> $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
> -DARROW_GANDIVA=ON ../
> [...]
> -- Found LLVM 6.0.1
> -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
> CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
>   Target X86 is not in the set of libraries.
> Call Stack (most recent call first):
>   cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
>   src/gandiva/CMakeLists.txt:25 (find_package)
> -- Configuring incomplete, errors occurred!
> {code}
> The cmake log files are attached.
> When I invoke cmake with options other than *Gandiva*, it finishes 
> successfully.
> Here are the llvm libraries that are installed on my system:
> {code}
> $ rpm -qa llvm\* | sort
> llvm3.9-libs-3.9.1-13.fc28.x86_64
> llvm4.0-libs-4.0.1-5.fc28.x86_64
> llvm-6.0.1-8.fc28.x86_64
> llvm-devel-6.0.1-8.fc28.x86_64
> llvm-libs-6.0.1-8.fc28.i686
> llvm-libs-6.0.1-8.fc28.x86_64
> $ ls /usr/lib64/libLLVM* /usr/include/llvm
> /usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so
> /usr/include/llvm:
> ADT  FuzzMutate  Object Support
> Analysis InitializePasses.h  ObjectYAML TableGen
> AsmParserIR  Option Target
> BinaryFormat IRReaderPassAnalysisSupport.h  Testing
> Bitcode  LineEditor  Passes ToolDrivers
> CodeGen  LinkAllIR.h Pass.h Transforms
> Config   LinkAllPasses.h PassInfo.h WindowsManifest
> DebugInfoLinker  PassRegistry.h WindowsResource
> Demangle LTO PassSupport.h  XRay
> ExecutionEngine  MC  ProfileData
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-2026) [Python] µs timestamps saved as int64 even if use_deprecated_int96_timestamps=True

2018-11-25 Thread Krisztian Szucs (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698229#comment-16698229
 ] 

Krisztian Szucs edited comment on ARROW-2026 at 11/25/18 7:15 PM:
--

What's the expectation here? 

Currently only NANO timestamps are supported for INT96 writing; should we 
support all of the units?
Or just document it somewhere?


was (Author: kszucs):
What's the expectation here? Currently only NANO timestamps are supported 
for INT96 writing; should we support all of the units?

> [Python] µs timestamps saved as int64 even if 
> use_deprecated_int96_timestamps=True
> --
>
> Key: ARROW-2026
> URL: https://issues.apache.org/jira/browse/ARROW-2026
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: OS: Mac OS X 10.13.2
> Python: 3.6.4
> PyArrow: 0.8.0
>Reporter: Diego Argueta
>Priority: Major
>  Labels: parquet, redshift, timestamps
> Fix For: 0.12.0
>
>
> When writing to a Parquet file, if `use_deprecated_int96_timestamps` is True, 
> timestamps are only written as 96-bit integers if the timestamp has 
> nanosecond resolution. This is a problem because Amazon Redshift timestamps 
> only have microsecond resolution, yet Redshift requires them to be stored in 
> 96-bit format in Parquet files.
> I'd expect the use_deprecated_int96_timestamps flag to cause _all_ timestamps 
> to be written as 96 bits, regardless of resolution. If this is a deliberate 
> design decision, it'd be immensely helpful if it were explicitly documented 
> as part of the argument.
>  
> To reproduce:
>  
> 1. Create a table with a timestamp having microsecond or millisecond 
> resolution, and save it to a Parquet file. Be sure to set 
> `use_deprecated_int96_timestamps` to True.
>  
> {code:java}
> import datetime
> import pyarrow
> from pyarrow import parquet
> schema = pyarrow.schema([
> pyarrow.field('last_updated', pyarrow.timestamp('us')),
> ])
> data = [
> pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('us')),
> ]
> table = pyarrow.Table.from_arrays(data, ['last_updated'])
> with open('test_file.parquet', 'wb') as fdesc:
> parquet.write_table(table, fdesc,
> use_deprecated_int96_timestamps=True)
> {code}
>  
> 2. Inspect the file. I used parquet-tools:
>  
> {noformat}
> dak@tux ~ $ parquet-tools meta test_file.parquet
> file:         file:/Users/dak/test_file.parquet
> creator:      parquet-cpp version 1.3.2-SNAPSHOT
> file schema:  schema
> 
> last_updated: OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1
> row group 1:  RC:1 TS:76 OFFSET:4
> 
> last_updated:  INT64 SNAPPY DO:4 FPO:28 SZ:76/72/0.95 VC:1 
> ENC:PLAIN,PLAIN_DICTIONARY,RLE{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3728) [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch

2018-11-25 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3728:
--
Labels: parquet pull-request-available  (was: parquet)

> [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch
> ---
>
> Key: ARROW-3728
> URL: https://issues.apache.org/jira/browse/ARROW-3728
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0, 0.11.0, 0.11.1
> Environment: Python 3.6.3
> OSX 10.14
>Reporter: Micah Williamson
>Priority: Major
>  Labels: parquet, pull-request-available
>
> From: 
> https://stackoverflow.com/questions/53214288/merging-parquet-files-pandas-meta-in-schema-mismatch
>  
> I am trying to merge multiple parquet files into one. Their schemas are 
> identical field-wise, but my {{ParquetWriter}} is complaining that they are 
> not. After some investigation I found that the pandas metadata in the schemas 
> is different, causing this error.
>  
> Sample-
> {code:python}
> import pyarrow.parquet as pq
> pq_tables=[]
> for file_ in files:
> pq_table = pq.read_table(f'{MESS_DIR}/{file_}')
> pq_tables.append(pq_table)
> if writer is None:
> writer = pq.ParquetWriter(COMPRESSED_FILE, schema=pq_table.schema, 
> use_deprecated_int96_timestamps=True)
> writer.write_table(table=pq_table)
> {code}
> The error-
> {code}
> Traceback (most recent call last):
>   File "{PATH_TO}/main.py", line 68, in lambda_handler
> writer.write_table(table=pq_table)
>   File 
> "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 335, in write_table
> raise ValueError(msg)
> ValueError: Table schema does not match schema used to create file:
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3860) [Gandiva] [C++] Fix packaging broken recently

2018-11-25 Thread Praveen Kumar Desabandu (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698237#comment-16698237
 ] 

Praveen Kumar Desabandu commented on ARROW-3860:


The same packaging script is used by Krisztian in the crossbow repo: 
[https://github.com/kszucs/crossbow/tree/build-355-gandiva-jar-trusty]. I 
thought these would be used in the Arrow release build to release Gandiva 
along with the other Arrow libraries, so I was wondering if I could change the 
script to produce the Gandiva libraries statically linked to libstdc++.

> [Gandiva] [C++] Fix packaging broken recently
> -
>
> Key: ARROW-3860
> URL: https://issues.apache.org/jira/browse/ARROW-3860
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Praveen Kumar Desabandu
>Assignee: Praveen Kumar Desabandu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> This 
> [commit|https://github.com/apache/arrow/commit/ba2b2ea2301f067cc95306e11546ddb6d402a55c#diff-d5e5df5984ba660e999a7c657039f6af]
>  broke Gandiva packaging by removing static linking of std c++. Since Dremio 
> consumes a fat jar that includes the packaged Gandiva native libraries, we 
> need to statically link std c++.
> As suggested in the commit message, this will be re-introduced as a CMake flag.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2591) [Python] Segmentation fault issue in pq.write_table

2018-11-25 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2591:
--
Labels: parquet pull-request-available  (was: parquet)

> [Python] Segmentation fault issue in pq.write_table
> ---
>
> Key: ARROW-2591
> URL: https://issues.apache.org/jira/browse/ARROW-2591
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0, 0.9.0
>Reporter: jacques
>Priority: Major
>  Labels: parquet, pull-request-available
>
> Context is the following: I am currently dealing with sparse column 
> serialization in parquet. In some cases many lines are empty, and I can also 
> have columns containing only empty lists.
> However, I get a segmentation fault when I try to write those columns filled 
> only with empty lists to parquet.
> Here is a simple code snippet that reproduces the segmentation fault:
> {noformat}
> In [1]: import pyarrow as pa
> In [2]: import pyarrow.parquet as pq
> In [3]: pa_ar = pa.array([[],[]],pa.list_(pa.int32()))
> In [4]: table = pa.Table.from_arrays([pa_ar],["test"])
> In [5]: pq.write_table(
>    ...: table=table,
>    ...: where="test.parquet",
>    ...: compression="snappy",
>    ...: flavor="spark"
>    ...: )
> Segmentation fault
> {noformat}
> May I have it fixed?
> Best
> Jacques
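>  
> (A defensive sketch, not a verified fix, assuming one simply wants to avoid 
> the crash until the writer is patched: detect the all-empty-list column 
> before calling write_table and skip or special-case it.)
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> pa_ar = pa.array([[], []], pa.list_(pa.int32()))
> table = pa.Table.from_arrays([pa_ar], ["test"])
> 
> def all_empty_lists(arr):
>     # to_pylist() yields one Python list (or None) per row.
>     values = arr.to_pylist()
>     return len(values) > 0 and all(not v for v in values)
> 
> if all_empty_lists(pa_ar):
>     print("skipping write: all-empty list column currently segfaults")
> else:
>     pq.write_table(table=table, where="test.parquet",
>                    compression="snappy", flavor="spark")
> {code}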



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2591) [Python] Segmentation fault issue in pq.write_table

2018-11-25 Thread Krisztian Szucs (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-2591:
--

Assignee: Krisztian Szucs

> [Python] Segmentation fault issue in pq.write_table
> ---
>
> Key: ARROW-2591
> URL: https://issues.apache.org/jira/browse/ARROW-2591
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0, 0.9.0
>Reporter: jacques
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: parquet, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Context is the following: I am currently dealing with sparse column 
> serialization in parquet. In some cases many lines are empty, and I can also 
> have columns containing only empty lists.
> However, I get a segmentation fault when I try to write those columns filled 
> only with empty lists to parquet.
> Here is a simple code snippet that reproduces the segmentation fault:
> {noformat}
> In [1]: import pyarrow as pa
> In [2]: import pyarrow.parquet as pq
> In [3]: pa_ar = pa.array([[],[]],pa.list_(pa.int32()))
> In [4]: table = pa.Table.from_arrays([pa_ar],["test"])
> In [5]: pq.write_table(
>    ...: table=table,
>    ...: where="test.parquet",
>    ...: compression="snappy",
>    ...: flavor="spark"
>    ...: )
> Segmentation fault
> {noformat}
> May I have it fixed?
> Best
> Jacques



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3869) [Rust] "invalid fastbin errors" since Rust nightly-2018-11-03

2018-11-25 Thread Andy Grove (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698234#comment-16698234
 ] 

Andy Grove commented on ARROW-3869:
---

I have been unable to conclusively debug this so far ... I commented out the 
code in the drop method so that memory is never freed, and I still hit the 
same error.

> [Rust] "invalid fastbin errors" since Rust nightly-2018-11-03
> -
>
> Key: ARROW-3869
> URL: https://issues.apache.org/jira/browse/ARROW-3869
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.10.0
> Environment: Ubuntu 16.04.5 LTS.
> gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Many tests in my DataFusion project started failing since Rust 
> nightly-2018-11-03 with this error when calling Arrow to create new arrays:
> {code:java}
> *** Error in 
> `/home/andy/git/andygrove/datafusion/target/debug/deps/datafusion-fe1bbf92d599090f':
>  invalid fastbin entry (free): 0x7f5c3c005710 ***
> === Backtrace: =
> /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f5c499aa7e5]
> /lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f5c499b337a]
> /lib/x86_64-linux-gnu/libc.so.6(+0x82d52)[0x7f5c499b5d52]
> /lib/x86_64-linux-gnu/libc.so.6(posix_memalign+0x11d)[0x7f5c499ba71d]
> /home/andy/git/andygrove/datafusion/target/debug/deps/datafusion-fe1bbf92d599090f(_ZN5arrow6memory16allocate_aligned17h22616ca11b0b7ea8E+0x38)[0x565548dea148]{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2591) [Python] Segmentation fault issue in pq.write_table

2018-11-25 Thread Krisztian Szucs (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-2591:
---
Summary: [Python] Segmentation fault issue in pq.write_table  (was: 
[Python] Segmentationfault issue in pq.write_table)

> [Python] Segmentation fault issue in pq.write_table
> ---
>
> Key: ARROW-2591
> URL: https://issues.apache.org/jira/browse/ARROW-2591
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0, 0.9.0
>Reporter: jacques
>Priority: Major
>  Labels: parquet
>
> Context is the following: I am currently dealing with sparse column 
> serialization in parquet. In some cases many lines are empty, and I can also 
> have columns containing only empty lists.
> However, I get a segmentation fault when I try to write those columns filled 
> only with empty lists to parquet.
> Here is a simple code snippet that reproduces the segmentation fault:
> {noformat}
> In [1]: import pyarrow as pa
> In [2]: import pyarrow.parquet as pq
> In [3]: pa_ar = pa.array([[],[]],pa.list_(pa.int32()))
> In [4]: table = pa.Table.from_arrays([pa_ar],["test"])
> In [5]: pq.write_table(
>    ...: table=table,
>    ...: where="test.parquet",
>    ...: compression="snappy",
>    ...: flavor="spark"
>    ...: )
> Segmentation fault
> {noformat}
> May I have it fixed?
> Best
> Jacques



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2026) [Python] µs timestamps saved as int64 even if use_deprecated_int96_timestamps=True

2018-11-25 Thread Krisztian Szucs (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698229#comment-16698229
 ] 

Krisztian Szucs commented on ARROW-2026:


What's the expectation here? Currently only NANO timestamps are supported 
for Int96 writing; should we support all of the units?
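
In the meantime, a minimal sketch of a common workaround, assuming it is 
acceptable to coerce the column to nanoseconds (the output file name is just a 
placeholder): build the array with nanosecond resolution so the current 
writer stores it as INT96.

{code:python}
import datetime

import pyarrow
from pyarrow import parquet

# Nanosecond resolution is the one case the writer currently maps to INT96.
data = [pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('ns'))]
table = pyarrow.Table.from_arrays(data, ['last_updated'])

with open('test_file_ns.parquet', 'wb') as fdesc:
    parquet.write_table(table, fdesc, use_deprecated_int96_timestamps=True)
{code}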

> [Python] µs timestamps saved as int64 even if 
> use_deprecated_int96_timestamps=True
> --
>
> Key: ARROW-2026
> URL: https://issues.apache.org/jira/browse/ARROW-2026
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: OS: Mac OS X 10.13.2
> Python: 3.6.4
> PyArrow: 0.8.0
>Reporter: Diego Argueta
>Priority: Major
>  Labels: parquet, redshift, timestamps
> Fix For: 0.12.0
>
>
> When writing to a Parquet file, if `use_deprecated_int96_timestamps` is True, 
> timestamps are only written as 96-bit integers if the timestamp has 
> nanosecond resolution. This is a problem because Amazon Redshift timestamps 
> only have microsecond resolution, yet Redshift requires them to be stored in 
> 96-bit format in Parquet files.
> I'd expect the use_deprecated_int96_timestamps flag to cause _all_ timestamps 
> to be written as 96 bits, regardless of resolution. If this is a deliberate 
> design decision, it'd be immensely helpful if it were explicitly documented 
> as part of the argument.
>  
> To reproduce:
>  
> 1. Create a table with a timestamp having microsecond or millisecond 
> resolution, and save it to a Parquet file. Be sure to set 
> `use_deprecated_int96_timestamps` to True.
>  
> {code:python}
> import datetime
> import pyarrow
> from pyarrow import parquet
> 
> schema = pyarrow.schema([
>     pyarrow.field('last_updated', pyarrow.timestamp('us')),
> ])
> data = [
>     pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('us')),
> ]
> table = pyarrow.Table.from_arrays(data, ['last_updated'])
> with open('test_file.parquet', 'wb') as fdesc:
>     parquet.write_table(table, fdesc,
>                         use_deprecated_int96_timestamps=True)
> {code}
>  
> 2. Inspect the file. I used parquet-tools:
>  
> {noformat}
> dak@tux ~ $ parquet-tools meta test_file.parquet
> file:         file:/Users/dak/test_file.parquet
> creator:      parquet-cpp version 1.3.2-SNAPSHOT
> file schema:  schema
> --------------------------------------------------------------------------------
> last_updated: OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1
> row group 1:  RC:1 TS:76 OFFSET:4
> --------------------------------------------------------------------------------
> last_updated:  INT64 SNAPPY DO:4 FPO:28 SZ:76/72/0.95 VC:1 ENC:PLAIN,PLAIN_DICTIONARY,RLE
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)