[jira] [Created] (ARROW-2032) [C++] ORC ep installs on each call to ninja build (even if no work to do)

2018-01-24 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2032: Summary: [C++] ORC ep installs on each call to ninja build (even if no work to do) Key: ARROW-2032 URL: https://issues.apache.org/jira/browse/ARROW-2032 Project: Apache

[jira] [Created] (ARROW-2031) HadoopFileSystem isn't pickleable

2018-01-24 Thread Jim Crist (JIRA)
Jim Crist created ARROW-2031: Summary: HadoopFileSystem isn't pickleable Key: ARROW-2031 URL: https://issues.apache.org/jira/browse/ARROW-2031 Project: Apache Arrow Issue Type: Improvement

[jira] [Created] (ARROW-2030) NativeFile's Attributes are not exposed in child classes without explicit initialization

2018-01-24 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-2030: Summary: NativeFile's Attributes are not exposed in child classes without explicit initialization Key: ARROW-2030 URL: https://issues.apache.org/jira/browse/ARROW-2030

[jira] [Created] (ARROW-2029) [Python] Program crash on `HdfsFile.tell` if file is closed

2018-01-24 Thread Jim Crist (JIRA)
Jim Crist created ARROW-2029: Summary: [Python] Program crash on `HdfsFile.tell` if file is closed Key: ARROW-2029 URL: https://issues.apache.org/jira/browse/ARROW-2029 Project: Apache Arrow Iss

Re: Help triaging Arrow GitHub issues

2018-01-24 Thread Uwe L. Korn
Thank you Wes for cleaning most of them up! We are now down to 3. One of them has an active discussion; we will probably move this to JIRA soon. The next, about time drifts, is something I think I have also seen with a turbodbc user (independent of Arrow) so I'll probably look a bit deeper into that

[jira] [Created] (ARROW-2028) [Python] extra_cmake_args needs to be passed through shlex.split

2018-01-24 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-2028: Summary: [Python] extra_cmake_args needs to be passed through shlex.split Key: ARROW-2028 URL: https://issues.apache.org/jira/browse/ARROW-2028 Project: Apache Arrow
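
The issue title above suggests that a single string of extra CMake arguments should be tokenized with `shlex.split` before being handed to the build command. A minimal sketch of what that tokenization does, using only the standard library; the helper name `split_cmake_args` and the sample argument string are illustrative assumptions, not taken from Arrow's build scripts:

```python
import shlex


def split_cmake_args(extra_cmake_args):
    """Tokenize a shell-style argument string (hypothetical helper)."""
    return shlex.split(extra_cmake_args)


# Hypothetical value: the quoted flag contains a space and must stay one token.
raw = '-DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-O2 -g"'
args = split_cmake_args(raw)
print(args)
```

A plain `raw.split()` would break the quoted flag into two tokens, which is the kind of problem shell-aware splitting avoids.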

[jira] [Created] (ARROW-2027) [C++] ipc::Message::SerializeTo does not pad the message body

2018-01-24 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2027: Summary: [C++] ipc::Message::SerializeTo does not pad the message body Key: ARROW-2027 URL: https://issues.apache.org/jira/browse/ARROW-2027 Project: Apache Arrow
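
To illustrate what padding a message body means in general, here is a small Python sketch that rounds a byte length up to an alignment boundary and appends zero bytes. The 8-byte alignment and the helper names `padded_length` and `pad_body` are assumptions made for illustration; the snippet above does not state the alignment Arrow requires, and this is not the C++ implementation.

```python
def padded_length(nbytes, alignment=8):
    """Round nbytes up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (nbytes + alignment - 1) & ~(alignment - 1)


def pad_body(body, alignment=8):
    """Append zero bytes so len(result) is a multiple of alignment."""
    return body + b"\x00" * (padded_length(len(body), alignment) - len(body))


print(padded_length(13), len(pad_body(b"abcde")))
```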

Re: [Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-24 Thread Daniel Lemire
Here are some realistic tabular data sets... https://github.com/lemire/RealisticTabularDataSets They are small by modern standards but they are also one GitHub clone away. - Daniel On Wed, Jan 24, 2018 at 2:26 PM, Wes McKinney wrote: > Thanks Ted. I will echo these comments and recommend to r

Re: [Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-24 Thread Wes McKinney
Thanks Ted. I will echo these comments and recommend to run tests on larger and preferably "real" datasets rather than randomly generated ones. The more repetition and less entropy in a dataset, the better Parquet performs relative to other storage options. Web-scale datasets often exhibit these ch

[jira] [Created] (ARROW-2026) Timestamps saved as int64 even if use_deprecated_int96_timestamps=True

2018-01-24 Thread Diego Argueta (JIRA)
Diego Argueta created ARROW-2026: Summary: Timestamps saved as int64 even if use_deprecated_int96_timestamps=True Key: ARROW-2026 URL: https://issues.apache.org/jira/browse/ARROW-2026 Project: Apache

Re: [Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-24 Thread Ted Dunning
Simba, nice summary. I think that there may be some issues with your tests. In particular, you are storing essentially uniform random values. That might be a viable test in some situations, but there are many where there is considerably less entropy in the data being stored. For instance, if you store
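
The point about entropy can be checked with a quick standard-library experiment. This sketch uses `zlib` as a stand-in codec (an assumption; the thread is about Snappy, Brotli and Blosc, which are not in the standard library) and compares uniform random bytes against a highly repetitive payload of the same size. The function name and the sample payload are hypothetical.

```python
import os
import zlib


def compression_ratio(data):
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data)) / len(data)


n = 100_000
random_bytes = os.urandom(n)          # high entropy: close to incompressible
repetitive = b"status=OK;" * (n // 10)  # low entropy: same 10 bytes repeated

ratio_random = compression_ratio(random_bytes)
ratio_repetitive = compression_ratio(repetitive)
print(f"random: {ratio_random:.3f}, repetitive: {ratio_repetitive:.3f}")
```

Random input should stay near a ratio of 1.0 while the repetitive input shrinks to a small fraction, which is why benchmarks on randomly generated columns tend to understate what a compressed format can achieve on real data.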

Re: Arrow sync today at 17:00 UTC

2018-01-24 Thread Wes McKinney
Brief meeting today. Attendees and topics discussed as follows: Attendees - Wes (Two Sigma) - Expand Interval metadata in format spec - 0.9.0 milestone - Simba - Uwe (Blue Yonder) - C++ - Li (Two Sigma) - Dwight (Revirda) - Sidd (Dremio) - Struct change merge - Interval - Phillip Cloud

[jira] [Created] (ARROW-2025) [Python/C++] HDFS Client disconnect closes all open clients

2018-01-24 Thread Jim Crist (JIRA)
Jim Crist created ARROW-2025: Summary: [Python/C++] HDFS Client disconnect closes all open clients Key: ARROW-2025 URL: https://issues.apache.org/jira/browse/ARROW-2025 Project: Apache Arrow Iss

Re: Arrow sync today at 17:00 UTC

2018-01-24 Thread Bryan Cutler
I can't make the sync today, will catch up later. Bryan On Jan 24, 2018 6:30 AM, "Wes McKinney" wrote: > Join us at https://meet.google.com/vtm-teks-phx >

Arrow sync today at 17:00 UTC

2018-01-24 Thread Wes McKinney
Join us at https://meet.google.com/vtm-teks-phx

Re: [Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-24 Thread simba nyatsanga
Hi Uwe, thanks. I've attached a Google Sheet link https://docs.google.com/spreadsheets/d/1by1vCaO2p24PLq_NAA5Ckh1n3i-SoFYrRcfi1siYKFQ/edit#gid=0 Kind Regards Simba On Wed, 24 Jan 2018 at 15:07 Uwe L. Korn wrote: > Hello Simba, > > your plots did not come through. Try uploading them somewhere

Re: [Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-24 Thread Uwe L. Korn
Hello Simba, your plots did not come through. Try uploading them somewhere and link to them in the mails. Attachments are always stripped on Apache mailing lists. Uwe On Wed, Jan 24, 2018, at 1:48 PM, simba nyatsanga wrote: > Hi Everyone, > > I did some benchmarking to compare the disk size per

[Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-24 Thread simba nyatsanga
Hi Everyone, I did some benchmarking to compare the disk size performance when writing Pandas DataFrames to Parquet files using Snappy and Brotli compression. I then compared these numbers with those of my current file storage solution. In my current (non-Arrow+Parquet) solution, every column in

Typelib file for namespace 'Arrow' not found for go examples

2018-01-24 Thread Mike Sam
Hi, I am trying to use Arrow from Go by following the instructions at https://github.com/apache/arrow/tree/master/c_glib/example/go. Everything goes OK until I run % git clone https://github.com/apache/arrow.git ~/arrow % cd ~/arrow/c_glib/example/go % make generate This retu