Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-04-26 Thread Antoine Pitrou
It's "arbitrary" from Arrow's point of view, because Arrow itself cannot represent this data (except as a binary blob). Though, as Micah said, this may change at some point. Instead of extending Arrow to fit this use case, perhaps it would be better to write a separate library that sits atop Ar

[jira] [Created] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas

2019-04-26 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5220: Summary: [Python] index / unknown columns in specified schema in Table.from_pandas Key: ARROW-5220 URL: https://issues.apache.org/jira/browse/ARROW-5220

[jira] [Created] (ARROW-5221) Improvement the performance of class SegmentsUtil

2019-04-26 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5221: --- Summary: Improvement the performance of class SegmentsUtil Key: ARROW-5221 URL: https://issues.apache.org/jira/browse/ARROW-5221 Project: Apache Arrow Issue Type: Impr

Re: [VOTE] Add 64-bit offset list, binary, string (utf8) types to the Arrow columnar format

2019-04-26 Thread Brian Bowman
Can non-Arrow PMC members/committers vote? If so, +1 -Brian On 4/25/19, 4:34 PM, "Wes McKinney" wrote: EXTERNAL In a recent mailing list discussion [1] Micah Kornfield has proposed to add new list and variable-size binary and unicode types to the Arrow columnar format

Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-26 Thread Wes McKinney
hi Brian, I doubt that such a change could be made on a short time horizon. Collecting feedback and building consensus (if it is even possible) with stakeholders would take some time. The appropriate place to have the discussion is here on the mailing list, though. Thanks On Mon, Apr 8, 2019 at 1

Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-26 Thread Brian Bowman
Hello Wes, Thanks for the info! I'm working to better understand Parquet/Arrow design and development processes. No hurry for LARGE_BYTE_ARRAY. -Brian On 4/26/19, 11:14 AM, "Wes McKinney" wrote: EXTERNAL hi Brian, I doubt that such a change could be made on a short

[jira] [Created] (ARROW-5222) [Python] Issues with installing pyarrow for development on MacOS

2019-04-26 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-5222: -- Summary: [Python] Issues with installing pyarrow for development on MacOS Key: ARROW-5222 URL: https://issues.apache.org/jira/browse/ARROW-5222 Project: Apache Ar

[Rust] [Format] Should Null Bitmaps be Padded to 8 or 64 Bits?

2019-04-26 Thread Neville Dipale
Hi Arrow developers, I'm currently working on IPC in Rust, specifically reading Arrow files. I've noticed that null buffers/bitmaps are always padded to 64 bits (from pyarrow, not sure about others), while in Rust we pad to 8 bits. 1. Is this fine re. Rust per the spec? I'm having issues with re

Re: [Rust] [Format] Should Null Bitmaps be Padded to 8 or 64 Bits?

2019-04-26 Thread Micah Kornfield
Hi Neville, Here is my understanding. Per the spec [1], 8 bytes of padding is allowed/required but 64 bytes is recommended (Is "bits" in your e-mail a typo?). The main rationale is to allow SIMD instructions. For actual record batches only padding to a multiple of 8 bytes is required [2]. N

Re: [Rust] [Format] Should Null Bitmaps be Padded to 8 or 64 Bits?

2019-04-26 Thread Wes McKinney
The Buffer struct / metadata need not be a multiple of 8 bytes necessarily but you must write padding bytes when emitting the IPC protocol. So if your validity bitmap is 2 bytes in-memory then you must write at least 6 more bytes of padding on the wire. On Fri, Apr 26, 2019, 3:48 PM Micah Kornfiel
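
The padding arithmetic discussed in this thread can be sketched as follows. This is a minimal illustration, not code from any Arrow implementation: the helper names `bitmap_nbytes`, `padded_length`, and `pad_buffer` are hypothetical, and filling the padding with zero bytes is an assumption made here for the example. It reproduces the case above, where a 2-byte validity bitmap needs at least 6 more bytes on the wire.

```python
def bitmap_nbytes(num_values: int) -> int:
    """Bytes needed to hold one validity bit per value (hypothetical helper)."""
    return (num_values + 7) // 8


def padded_length(nbytes: int, alignment: int = 8) -> int:
    """Round nbytes up to the next multiple of alignment (hypothetical helper)."""
    if nbytes < 0 or alignment <= 0:
        raise ValueError("nbytes must be >= 0 and alignment must be > 0")
    return (nbytes + alignment - 1) // alignment * alignment


def pad_buffer(data: bytes, alignment: int = 8) -> bytes:
    """Append padding so the result is a multiple of alignment bytes.

    Assumption: padding bytes are zero-filled.
    """
    return data + b"\x00" * (padded_length(len(data), alignment) - len(data))


# A bitmap covering 10 values occupies 2 bytes in memory.
in_memory = bytes([0b11111111, 0b00000011])
on_wire = pad_buffer(in_memory, alignment=8)      # 8-byte multiple, as discussed for IPC
recommended = pad_buffer(in_memory, alignment=64)  # 64-byte multiple, the recommended size
```

With these helpers, `len(on_wire) - len(in_memory)` gives the number of padding bytes a writer would emit for the 8-byte case, and the same function covers the 64-byte recommendation by changing `alignment`.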