Re: [Rust] [Format] Should Null Bitmaps be Padded to 8 or 64 Bits?
The Buffer struct / metadata need not be a multiple of 8 bytes, but you must write padding bytes when emitting the IPC protocol. So if your validity bitmap is 2 bytes in memory, then you must write at least 6 more bytes of padding on the wire.

On Fri, Apr 26, 2019, 3:48 PM Micah Kornfield wrote:
> Hi Neville,
>
> Here is my understanding. Per the spec [1], 8 bytes of padding is allowed/required but 64 bytes is recommended (is "bits" in your e-mail a typo?). The main rationale is to allow SIMD instructions.
>
> For actual record batches, only padding to a multiple of 8 bytes is required [2].
>
> Note that slicing of buffers might still require mem-copies/padding.
>
> Thanks,
> Micah
>
> [1] https://arrow.apache.org/docs/format/Layout.html#alignment-and-padding
> [2] https://arrow.apache.org/docs/format/IPC.html
>
> On Fri, Apr 26, 2019 at 1:29 PM Neville Dipale wrote:
> > Hi Arrow developers,
> >
> > I'm currently working on IPC in Rust, specifically reading Arrow files. I've noticed that null buffers/bitmaps are always padded to 64 bits (from pyarrow, not sure about others), while in Rust we pad to 8 bits.
> >
> > 1. Is this fine for Rust per the spec?
> >
> > I'm having issues with reading, but only because I'm comparing array data and not only the values and nullness of slots. I see this being more of a problem when writing to files and streams, as we'd need to pad null buffers almost every time (since for large arrays IPC could need 2048 while we have 2046, so it's not a small-data issue).
> >
> > 2. If implementations are allowed to choose either 8 or 64, are the Rust committers happy with us changing to 64-bit padding?
> >
> > The benefits of changing to 64 would be removing the need to pad the buffer when writing to streams and files, and it'll make us more compatible with other implementations. I suspect this would still come up as an issue when we get to add Rust to interop tests.
> >
> > I tried changing to 64-bit before writing this mail, but bit-fu is still beyond my knowledge, so I'd need help from someone else with implementing this, or at least letting me know which lines to change. I don't mind then making sure all tests still pass.
> >
> > My goal is to complete IPC work by the 0.14 release, so this is a bit urgent as I'm stuck right now.
> >
> > Thanks
> > Neville
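The padding arithmetic this thread is about is simple to sketch. A minimal illustration in Python (the helper names are made up for illustration and are not part of any Arrow implementation):

```python
def padded_length(nbytes: int, alignment: int = 64) -> int:
    """Round a buffer length up to the next multiple of `alignment`."""
    return ((nbytes + alignment - 1) // alignment) * alignment

def pad_buffer(buf: bytes, alignment: int = 64) -> bytes:
    """Append zero bytes so the buffer length is a multiple of `alignment`,
    as required when emitting buffers over the IPC protocol."""
    return buf + b"\x00" * (padded_length(len(buf), alignment) - len(buf))

# Wes's example: a 2-byte validity bitmap needs 6 more bytes of padding
# for 8-byte alignment on the wire.
assert len(pad_buffer(b"\xff\x0f", 8)) == 8

# Neville's example: a 2046-byte bitmap pads to 2048 under 64-byte alignment.
assert padded_length(2046, 64) == 2048
```

The same round-up formula works for any power-of-two alignment, which is why switching Rust from 8-byte to 64-byte padding is mostly a matter of changing the alignment constant and re-padding on writes.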
Re: [Rust] [Format] Should Null Bitmaps be Padded to 8 or 64 Bits?
Hi Neville,

Here is my understanding. Per the spec [1], 8 bytes of padding is allowed/required but 64 bytes is recommended (is "bits" in your e-mail a typo?). The main rationale is to allow SIMD instructions.

For actual record batches, only padding to a multiple of 8 bytes is required [2].

Note that slicing of buffers might still require mem-copies/padding.

Thanks,
Micah

[1] https://arrow.apache.org/docs/format/Layout.html#alignment-and-padding
[2] https://arrow.apache.org/docs/format/IPC.html

On Fri, Apr 26, 2019 at 1:29 PM Neville Dipale wrote:
> Hi Arrow developers,
>
> I'm currently working on IPC in Rust, specifically reading Arrow files. I've noticed that null buffers/bitmaps are always padded to 64 bits (from pyarrow, not sure about others), while in Rust we pad to 8 bits.
>
> 1. Is this fine for Rust per the spec?
>
> I'm having issues with reading, but only because I'm comparing array data and not only the values and nullness of slots. I see this being more of a problem when writing to files and streams, as we'd need to pad null buffers almost every time (since for large arrays IPC could need 2048 while we have 2046, so it's not a small-data issue).
>
> 2. If implementations are allowed to choose either 8 or 64, are the Rust committers happy with us changing to 64-bit padding?
>
> The benefits of changing to 64 would be removing the need to pad the buffer when writing to streams and files, and it'll make us more compatible with other implementations. I suspect this would still come up as an issue when we get to add Rust to interop tests.
>
> I tried changing to 64-bit before writing this mail, but bit-fu is still beyond my knowledge, so I'd need help from someone else with implementing this, or at least letting me know which lines to change. I don't mind then making sure all tests still pass.
>
> My goal is to complete IPC work by the 0.14 release, so this is a bit urgent as I'm stuck right now.
>
> Thanks
> Neville
[Rust] [Format] Should Null Bitmaps be Padded to 8 or 64 Bits?
Hi Arrow developers,

I'm currently working on IPC in Rust, specifically reading Arrow files. I've noticed that null buffers/bitmaps are always padded to 64 bits (from pyarrow, not sure about others), while in Rust we pad to 8 bits.

1. Is this fine for Rust per the spec?

I'm having issues with reading, but only because I'm comparing array data and not only the values and nullness of slots. I see this being more of a problem when writing to files and streams, as we'd need to pad null buffers almost every time (since for large arrays IPC could need 2048 while we have 2046, so it's not a small-data issue).

2. If implementations are allowed to choose either 8 or 64, are the Rust committers happy with us changing to 64-bit padding?

The benefits of changing to 64 would be removing the need to pad the buffer when writing to streams and files, and it'll make us more compatible with other implementations. I suspect this would still come up as an issue when we get to add Rust to interop tests.

I tried changing to 64-bit before writing this mail, but bit-fu is still beyond my knowledge, so I'd need help from someone else with implementing this, or at least letting me know which lines to change. I don't mind then making sure all tests still pass.

My goal is to complete IPC work by the 0.14 release, so this is a bit urgent as I'm stuck right now.

Thanks
Neville
[jira] [Created] (ARROW-5222) [Python] Issues with installing pyarrow for development on MacOS
Neal Richardson created ARROW-5222:
--

Summary: [Python] Issues with installing pyarrow for development on MacOS
Key: ARROW-5222
URL: https://issues.apache.org/jira/browse/ARROW-5222
Project: Apache Arrow
Issue Type: Improvement
Components: Documentation, Python
Reporter: Neal Richardson
Fix For: 0.14.0

I tried following the [instructions|https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst] for installing pyarrow for developers on macOS, and I ran into quite a bit of difficulty. I'm hoping we can improve our documentation and/or tooling to make this a smoother process. I know we can't anticipate every quirk of everyone's dev environment, but in my case, I was getting set up on a new machine, so this was from a clean slate. I'm also new to contributing to the project, so I'm a "clean slate" in that regard too, and my ignorance may be exposing other assumptions in the docs.

# The instructions recommend using conda, but as this [Stack Overflow question|https://stackoverflow.com/questions/55798166/cmake-fails-with-when-attempting-to-compile-simple-test-program] notes, cmake fails. Uwe helpfully suggested installing an older MacOS SDK from [here|https://github.com/phracker/MacOSX-SDKs/releases]. That may work, but I'm personally wary of installing binaries from an unofficial GitHub account, let alone recording that in our docs as an official recommendation. Either way, we should update the docs either to note this necessity or to recommend against installing with conda on macOS.
# After that, I tried the Homebrew path. Ultimately this did succeed, but it was rough. It seemed that I had to `brew install` a lot of packages that weren't included in the arrow/python/Brewfile (i.e. try to cmake, see what missing dependency it failed on, `brew install` it, retry `cmake`, and repeat). Among the libs I installed this way were double-conversion, snappy, brotli, protobuf, gtest, rapidjson, flatbuffers, lz4, zstd, c-ares, and boost. It's not clear how many of these extra dependencies I had to install because I'd only installed the Xcode command-line tools and not the full Xcode from the App Store; regardless, the Brewfile should be complete if we want to use it.
# In searching Jira for the double-conversion issue (the first one I hit), I found [this issue/PR|https://github.com/apache/arrow/pull/4132/files], which added double-conversion to a different Brewfile, in c_glib. So I tried `brew bundle` installing that Brewfile. It would probably be good to have a common Brewfile for the C++ setup, which the Python and glib ones could load and then add any extra dependencies, if necessary. That way, there's one place to add common dependencies.
# I got close here but still had issues with `BOOST_HOME` not being found, even though I had brew-installed it. From the console output, it appeared that even though I was not using conda and did not have an active conda environment (I'd even done `conda env remove --name pyarrow-dev`), the cmake configuration script detected that conda existed and decided to use conda to resolve dependencies. I tried setting lots of different environment variables to tell cmake not to use conda, but ultimately I was only able to get past this by deleting conda from my system entirely.
# This let me get to the point of being able to `import pyarrow`. But then running tests failed because the `hypothesis` package was not installed. I see that it is included in requirements-test.txt and setup.py under tests_require, but I followed the installation instructions and this package did not end up in my virtualenv. `pip install hypothesis` resolved it.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
Re: Need 64-bit Integer length for Parquet ByteArray Type
Hello Wes,

Thanks for the info! I'm working to better understand Parquet/Arrow design and development processes. No hurry for LARGE_BYTE_ARRAY.

-Brian

On 4/26/19, 11:14 AM, "Wes McKinney" wrote:

EXTERNAL

hi Brian,

I doubt that such a change could be made on a short time horizon. Collecting feedback and building consensus (if it is even possible) with stakeholders would take some time. The appropriate place to have the discussion is here on the mailing list, though.

Thanks

On Mon, Apr 8, 2019 at 1:37 PM Brian Bowman wrote:
>
> Hello Wes/all,
>
> A new LARGE_BYTE_ARRAY type in Parquet would satisfy SAS' needs without resorting to other alternatives. Is this something that could be done in Parquet over the next few months? I have a lot of experience with file formats/storage layer internals and can contribute for Parquet C++.
>
> -Brian
>
> On 4/5/19, 3:44 PM, "Wes McKinney" wrote:
>
> EXTERNAL
>
> hi Brian,
>
> Just to comment from the C++ side -- the 64-bit issue is a limitation of the Parquet format itself and not related to the C++ implementation. It would possibly be interesting to add a LARGE_BYTE_ARRAY type with 64-bit offset encoding (we are discussing doing much the same in Apache Arrow for in-memory).
>
> - Wes
>
> On Fri, Apr 5, 2019 at 2:11 PM Ryan Blue wrote:
> >
> > I don't think that's what you would want to do. Parquet will eventually compress large values, but not after making defensive copies and attempting to encode them. In the end, it will be a lot more overhead, plus the work to make it possible. I think you'd be much better off compressing before storing in Parquet if you expect good compression rates.
> >
> > On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman wrote:
> > >
> > > My hope is that these large ByteArray values will encode/compress to a fraction of their original size. FWIW, cpp/src/parquet/column_writer.cc/.h has int64_t offset and length fields all over the place.
> > >
> > > External file references to BLOBs are doable but not the elegant, integrated solution I was hoping for.
> > >
> > > -Brian
> > >
> > > On Apr 5, 2019, at 1:53 PM, Ryan Blue wrote:
> > >
> > > *EXTERNAL*
> > > Looks like we will need a new encoding for this:
> > > https://github.com/apache/parquet-format/blob/master/Encodings.md
> > >
> > > That doc specifies that the plain encoding uses a 4-byte length. That's not going to be a quick fix.
> > >
> > > Now that I'm thinking about this a bit more, does it make sense to support byte arrays that are more than 2GB? That's far larger than the size of a row group, let alone a page. This would completely break memory management in the JVM implementation.
> > >
> > > Can you solve this problem using a BLOB type that references an external file with the gigantic values? It seems to me that values this large should go in separate files, not in a Parquet file, where they would destroy any benefit from using the format.
> > >
> > > On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman wrote:
> > >
> > >> Hello Ryan,
> > >>
> > >> Looks like it's limited by both the Parquet implementation and the Thrift message methods. Am I missing anything?
> > >>
> > >> From cpp/src/parquet/types.h
> > >>
> > >> struct ByteArray {
> > >>   ByteArray() : len(0), ptr(NULLPTR) {}
> > >>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
> > >>   uint32_t len;
> > >>   const uint8_t* ptr;
> > >> };
> > >>
> > >> From cpp/src/parquet/thrift.h
> > >>
> > >> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* deserialized_msg) {
> > >> inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out)
> > >>
> > >> -Brian
> > >>
> > >> On 4/5/19, 1:32 PM, "Ryan Blue" wrote:
> > >>
> > >> EXTERNAL
> > >>
> > >> Hi Brian,
> > >>
> > >> This seems like something we should allow. What imposes the current limit? Is it in the thrift format, or just the implementations?
> > >>
> > >> On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman wrote:
> > >>
> > >> > All,
> > >> >
> > >>
Re: Need 64-bit Integer length for Parquet ByteArray Type
hi Brian,

I doubt that such a change could be made on a short time horizon. Collecting feedback and building consensus (if it is even possible) with stakeholders would take some time. The appropriate place to have the discussion is here on the mailing list, though.

Thanks

On Mon, Apr 8, 2019 at 1:37 PM Brian Bowman wrote:
>
> Hello Wes/all,
>
> A new LARGE_BYTE_ARRAY type in Parquet would satisfy SAS' needs without resorting to other alternatives. Is this something that could be done in Parquet over the next few months? I have a lot of experience with file formats/storage layer internals and can contribute for Parquet C++.
>
> -Brian
>
> On 4/5/19, 3:44 PM, "Wes McKinney" wrote:
>
> EXTERNAL
>
> hi Brian,
>
> Just to comment from the C++ side -- the 64-bit issue is a limitation of the Parquet format itself and not related to the C++ implementation. It would possibly be interesting to add a LARGE_BYTE_ARRAY type with 64-bit offset encoding (we are discussing doing much the same in Apache Arrow for in-memory).
>
> - Wes
>
> On Fri, Apr 5, 2019 at 2:11 PM Ryan Blue wrote:
> >
> > I don't think that's what you would want to do. Parquet will eventually compress large values, but not after making defensive copies and attempting to encode them. In the end, it will be a lot more overhead, plus the work to make it possible. I think you'd be much better off compressing before storing in Parquet if you expect good compression rates.
> >
> > On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman wrote:
> > >
> > > My hope is that these large ByteArray values will encode/compress to a fraction of their original size. FWIW, cpp/src/parquet/column_writer.cc/.h has int64_t offset and length fields all over the place.
> > >
> > > External file references to BLOBs are doable but not the elegant, integrated solution I was hoping for.
> > >
> > > -Brian
> > >
> > > On Apr 5, 2019, at 1:53 PM, Ryan Blue wrote:
> > >
> > > *EXTERNAL*
> > > Looks like we will need a new encoding for this:
> > > https://github.com/apache/parquet-format/blob/master/Encodings.md
> > >
> > > That doc specifies that the plain encoding uses a 4-byte length. That's not going to be a quick fix.
> > >
> > > Now that I'm thinking about this a bit more, does it make sense to support byte arrays that are more than 2GB? That's far larger than the size of a row group, let alone a page. This would completely break memory management in the JVM implementation.
> > >
> > > Can you solve this problem using a BLOB type that references an external file with the gigantic values? It seems to me that values this large should go in separate files, not in a Parquet file, where they would destroy any benefit from using the format.
> > >
> > > On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman wrote:
> > >
> > >> Hello Ryan,
> > >>
> > >> Looks like it's limited by both the Parquet implementation and the Thrift message methods. Am I missing anything?
> > >>
> > >> From cpp/src/parquet/types.h
> > >>
> > >> struct ByteArray {
> > >>   ByteArray() : len(0), ptr(NULLPTR) {}
> > >>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
> > >>   uint32_t len;
> > >>   const uint8_t* ptr;
> > >> };
> > >>
> > >> From cpp/src/parquet/thrift.h
> > >>
> > >> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* deserialized_msg) {
> > >> inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out)
> > >>
> > >> -Brian
> > >>
> > >> On 4/5/19, 1:32 PM, "Ryan Blue" wrote:
> > >>
> > >> EXTERNAL
> > >>
> > >> Hi Brian,
> > >>
> > >> This seems like something we should allow. What imposes the current limit? Is it in the thrift format, or just the implementations?
> > >>
> > >> On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman wrote:
> > >>
> > >> > All,
> > >> >
> > >> > SAS requires support for storing varying-length character and binary blobs with a 2^64 max length in Parquet. Currently, the ByteArray len field is a uint32_t. Looks like this will require incrementing the Parquet file format version and changing ByteArray len to uint64_t.
> > >> >
> > >> > Have there been any requests for this or other Parquet developments that require file format versioning changes?
> > >> >
> > >> > I realize this is a non-trivial ask. Thanks for
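For context on the limit being discussed: the ByteArray `len` field is a `uint32_t`, and the PLAIN encoding for BYTE_ARRAY stores each value as a 4-byte little-endian length prefix followed by the raw bytes (per the Encodings.md document linked above). A small Python sketch of that framing (illustrative only, not Parquet implementation code):

```python
import struct

UINT32_MAX = 2**32 - 1  # cap imposed by the 4-byte length field

def plain_encode(value: bytes) -> bytes:
    """Parquet's PLAIN encoding for a single BYTE_ARRAY value:
    a 4-byte little-endian length prefix followed by the raw bytes."""
    if len(value) > UINT32_MAX:
        raise OverflowError("value exceeds the 4-byte length prefix")
    return struct.pack("<I", len(value)) + value

assert plain_encode(b"hello") == b"\x05\x00\x00\x00hello"
```

A LARGE_BYTE_ARRAY type as proposed would need a wider length field (e.g. 8 bytes), which is why it requires a new encoding rather than a quick fix.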
Re: [VOTE] Add 64-bit offset list, binary, string (utf8) types to the Arrow columnar format
Can non-Arrow PMC members/committers vote? If so, +1

-Brian

On 4/25/19, 4:34 PM, "Wes McKinney" wrote:

EXTERNAL

In a recent mailing list discussion [1], Micah Kornfield has proposed to add new list and variable-size binary and unicode types to the Arrow columnar format with 64-bit signed integer offsets, to be used in addition to the existing 32-bit offset varieties. These will be implemented as new types in the Type union in Schema.fbs (the particular names can be debated in the PR that implements them):

LargeList
LargeBinary
LargeString [UTF8]

While very large contiguous columns are not a principal use case for the columnar format, it has been observed empirically that there are applications that use the format to represent datasets where realizations of data can sometimes exceed the 2^31 - 1 "capacity" of a column and cannot be easily (or at all) split into smaller chunks.

Please vote whether to accept the changes. The vote will be open for at least 72 hours.

[ ] +1 Accept the additions to the columnar format
[ ] +0
[ ] -1 Do not accept the changes because...

Thanks,
Wes

[1]: https://lists.apache.org/thread.html/8088eca21b53906315e2bbc35eb2d246acf10025b5457eccc7a0e8a3@%3Cdev.arrow.apache.org%3E
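The 2^31 - 1 "capacity" above comes from the 32-bit signed offsets of the existing variable-size types: the final offset into the data buffer must fit in an int32, so the total byte length of all values in one array is bounded by it. A quick back-of-the-envelope sketch (plain Python arithmetic; the average value size is a hypothetical figure, not from the proposal):

```python
INT32_MAX = 2**31 - 1

# With int32 offsets, the sum of all value lengths in one string/binary
# array must stay at or below INT32_MAX bytes.
avg_value_size = 1024  # bytes per value, hypothetical
max_rows = INT32_MAX // avg_value_size
print(max_rows)  # about 2 million rows of 1 KiB values before offsets overflow
```

With the proposed 64-bit-offset Large types, the bound moves to 2^63 - 1 bytes, which removes the need to chunk such columns.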
[jira] [Created] (ARROW-5221) Improve the performance of class SegmentsUtil
Liya Fan created ARROW-5221:
---

Summary: Improve the performance of class SegmentsUtil
Key: ARROW-5221
URL: https://issues.apache.org/jira/browse/ARROW-5221
Project: Apache Arrow
Issue Type: Improvement
Reporter: Liya Fan
Assignee: Liya Fan

Improve the performance of class SegmentsUtil in two ways:

1. In method allocateReuseBytes, the generated byte array should be cached for reuse if the size does not exceed MAX_BYTES_LENGTH. However, the array is not cached when bytes.length < length, and this leads to performance overhead:

if (bytes == null) {
  if (length <= MAX_BYTES_LENGTH) {
    bytes = new byte[MAX_BYTES_LENGTH];
    BYTES_LOCAL.set(bytes);
  } else {
    bytes = new byte[length];
  }
} else if (bytes.length < length) {
  bytes = new byte[length];
}

2. To evaluate the offset, an integer is ANDed with a mask to clear the low bits, and then shifted right. The AND is useless:

((index & BIT_BYTE_POSITION_MASK) >>> 3)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
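Point 2 is easy to check. Assuming, as the report describes, that BIT_BYTE_POSITION_MASK clears the low three bits before the shift by 3, the AND has no effect, because the shift discards those bits anyway (quick sketch in Python; the mask constant here is a hypothetical stand-in, not copied from the Arrow source):

```python
# Hypothetical stand-in for BIT_BYTE_POSITION_MASK: a 32-bit mask
# with the low 3 bits cleared, per the issue description.
MASK = 0xFFFFFFF8

for index in (0, 1, 7, 8, 63, 64, 1023, 2**31 - 1):
    # Clearing bits that the right shift by 3 discards changes nothing:
    assert (index & MASK) >> 3 == index >> 3
```

So `index >>> 3` alone produces the same offset for any non-negative index, saving one AND per call.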
[jira] [Created] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas
Joris Van den Bossche created ARROW-5220:

Summary: [Python] index / unknown columns in specified schema in Table.from_pandas
Key: ARROW-5220
URL: https://issues.apache.org/jira/browse/ARROW-5220
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Joris Van den Bossche

The {{Table.from_pandas}} method allows specifying a schema ("This can be used to indicate the type of columns if we cannot infer it automatically."). But if you also want to specify the type of the index, you get an error:

{code:python}
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
df.index = pd.Index(['a', 'b', 'c'], name='index')
my_schema = pa.schema([('index', pa.string()),
                       ('a', pa.int64()),
                       ('b', pa.float64()),
                       ])
table = pa.Table.from_pandas(df, schema=my_schema)
{code}

gives {{KeyError: 'index'}} (because it tries to look up the "column names" from the schema in the dataframe, and thus does not find column 'index').

This also has the consequence that re-using the schema does not work: {{table1 = pa.Table.from_pandas(df1); table2 = pa.Table.from_pandas(df2, schema=table1.schema)}}

Extra note: unknown columns in general also give this error (columns specified in the schema that are not in the dataframe). At least in pyarrow 0.11, this did not give an error (e.g. noticed this from the code in the example in ARROW-3861). So before, unknown columns in the specified schema were ignored, while now they raise an error. Was this a conscious change? (Before, specifying the index in the schema also "worked" in the sense that it didn't raise an error, but it was ignored, so it didn't actually do what you would expect.)

Questions:
- I think we should support specifying the index in the passed {{schema}}, so that the example above works (although this might be complicated with a RangeIndex that is not serialized any more).
- But what to do in general with additional columns in the schema that are not in the DataFrame? Are we fine with keeping this an error as it is now (the error message could be improved), or do we want to ignore them again? (Or it could actually add them as all-null columns to the table.)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
Re: Use arrow as a general data serialization framework in distributed stream data processing
It's "arbitrary" from Arrow's point of view, because Arrow itself cannot represent this data (except as a binary blob). Though, as Micah said, this may change at some point.

Instead of extending Arrow to fit this use case, perhaps it would be better to write a separate library that sits atop Arrow for your purposes?

Regards

Antoine.

Le 26/04/2019 à 04:20, Shawn Yang a écrit :
> Hi Antoine,
> It's not an arbitrary data type; it's similar to the data types in https://spark.apache.org/docs/latest/sql-reference.html#data-types and https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/table/sql.html#data-types. Our framework is similar to Flink streaming, but is written in C++/Java/Python, and data needs to be transferred from a Java process to a Python process by TCP, or shared memory if they are on the same machine. For example, one case is online learning: the features are generated in Java streaming, and the training data is then transferred to a Python TensorFlow worker for training. In a system such as Flink, data is row by row, not columnar, so there needs to be a serialization framework that serializes data row by row in a language-independent way for C++/Java/Python.
>
> Regards