[jira] [Created] (ARROW-9820) Plugin Architecture for Filesystem and File IO
Lawrence Chan created ARROW-9820: Summary: Plugin Architecture for Filesystem and File IO Key: ARROW-9820 URL: https://issues.apache.org/jira/browse/ARROW-9820 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Lawrence Chan Adding a new custom filesystem with corresponding file I/O streams is quite a process at the moment. It looks like HDFS and S3FS are basically hardcoded in many places. It would be useful to develop a plugin system to allow users to interface with other data stores without maintaining a permanent fork with hardcoded changes. We can either do runtime plugins or compile-time plugins. Runtime is more user-friendly, but with C++, ABI compatibility is fairly delicate. So we would either want to use a C ABI or accept a you're-on-your-own situation where the user is expected to be very careful with versioning and compiler flags. With compile-time plugins, maybe there's a way to have the CMake machinery build third-party code and also register those new URI schemes automatically. -- This message was sent by Atlassian Jira (v8.3.4#803005)
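To make the compile-time option above more concrete, here is a minimal sketch of a URI-scheme registry that third-party code could populate when it is linked in. Everything in it (`FileSystem`, `FileSystemRegistry`, `MyStoreFileSystem`, the `mystore` scheme) is a hypothetical name invented for illustration and is not Arrow's actual API; it only shows the registration pattern, not how Arrow would implement it.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical minimal filesystem interface (NOT Arrow's real class).
class FileSystem {
 public:
  virtual ~FileSystem() = default;
  virtual std::string type_name() const = 0;
};

// A factory builds a filesystem instance from a full URI string.
using FileSystemFactory =
    std::function<std::unique_ptr<FileSystem>(const std::string& uri)>;

// Hypothetical process-wide registry mapping URI schemes to factories.
class FileSystemRegistry {
 public:
  static FileSystemRegistry& Instance() {
    static FileSystemRegistry registry;  // constructed on first use
    return registry;
  }

  // Returns false if the scheme was already registered.
  bool Register(const std::string& scheme, FileSystemFactory factory) {
    return factories_.emplace(scheme, std::move(factory)).second;
  }

  // Looks up the factory for the URI's scheme and invokes it.
  std::unique_ptr<FileSystem> FromUri(const std::string& uri) const {
    const auto pos = uri.find("://");
    if (pos == std::string::npos) {
      throw std::invalid_argument("URI has no scheme: " + uri);
    }
    const auto it = factories_.find(uri.substr(0, pos));
    if (it == factories_.end()) {
      throw std::invalid_argument("no filesystem registered for: " + uri);
    }
    return it->second(uri);
  }

 private:
  std::map<std::string, FileSystemFactory> factories_;
};

// --- Third-party plugin code, compiled into the same build ---
class MyStoreFileSystem : public FileSystem {
 public:
  std::string type_name() const override { return "mystore"; }
};

// Static registration: runs during static initialization of this file.
static const bool kMyStoreRegistered = FileSystemRegistry::Instance().Register(
    "mystore", [](const std::string&) -> std::unique_ptr<FileSystem> {
      return std::unique_ptr<FileSystem>(new MyStoreFileSystem());
    });

// Usage: resolve a URI through the registry.
inline void DemoLookup() {
  auto fs = FileSystemRegistry::Instance().FromUri("mystore://bucket/key");
  assert(fs->type_name() == "mystore");
}
```

A runtime variant would presumably need the same registry behind a C ABI entry point loaded from a shared library, which is where the versioning and compiler-flag concerns mentioned above come in.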
[jira] [Created] (ARROW-9816) Escape quotes in config.h
Lawrence Chan created ARROW-9816: Summary: Escape quotes in config.h Key: ARROW-9816 URL: https://issues.apache.org/jira/browse/ARROW-9816 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 1.0.0, 1.0.1 Reporter: Lawrence Chan Currently the config.h file is generated without the `ESCAPE_QUOTES` option, which causes quotes in e.g. CXXFLAGS to break config.h parsing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-5167) [C++] Upgrade string-view-light to latest
[ https://issues.apache.org/jira/browse/ARROW-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lawrence Chan updated ARROW-5167: - Summary: [C++] Upgrade string-view-light to latest (was: Upgrade string-view-light to latest) > [C++] Upgrade string-view-light to latest > - > > Key: ARROW-5167 > URL: https://issues.apache.org/jira/browse/ARROW-5167 > Project: Apache Arrow > Issue Type: Bug >Reporter: Lawrence Chan >Priority: Minor > > string-view-lite had a typo in one of its macros (fixed in > https://github.com/martinmoene/string-view-lite/commit/2f2cce35293b0027056e5449b2c05b5f9c3e89ff). > We should vendor the latest version in the next Arrow release. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5167) Upgrade string-view-light to latest
Lawrence Chan created ARROW-5167: Summary: Upgrade string-view-light to latest Key: ARROW-5167 URL: https://issues.apache.org/jira/browse/ARROW-5167 Project: Apache Arrow Issue Type: Bug Reporter: Lawrence Chan string-view-lite had a typo in one of its macros (fixed in https://github.com/martinmoene/string-view-lite/commit/2f2cce35293b0027056e5449b2c05b5f9c3e89ff). We should vendor the latest version in the next Arrow release. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-2290) [C++/Python] Add ability to set codec options for lz4 codec
[ https://issues.apache.org/jira/browse/ARROW-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797398#comment-16797398 ] Lawrence Chan edited comment on ARROW-2290 at 3/20/19 6:08 PM: --- Maybe I don't have permissions, or maybe I'm overlooking something, but I can't seem to edit the issue title? EDIT: derp, I'm not the owner, Wes opened this on my behalf in converting my GH issue. was (Author: llchan): Maybe I don't have permissions, or maybe I'm overlooking something, but I can't seem to edit the issue title? > [C++/Python] Add ability to set codec options for lz4 codec > --- > > Key: ARROW-2290 > URL: https://issues.apache.org/jira/browse/ARROW-2290 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Wes McKinney >Priority: Major > > The LZ4 library has many parameters, currently we do not expose these in C++ > or Python -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2290) [C++/Python] Add ability to set codec options for lz4 codec
[ https://issues.apache.org/jira/browse/ARROW-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797398#comment-16797398 ] Lawrence Chan commented on ARROW-2290: -- Maybe I don't have permissions, or maybe I'm overlooking something, but I can't seem to edit the issue title? > [C++/Python] Add ability to set codec options for lz4 codec > --- > > Key: ARROW-2290 > URL: https://issues.apache.org/jira/browse/ARROW-2290 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Wes McKinney >Priority: Major > > The LZ4 library has many parameters, currently we do not expose these in C++ > or Python -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4568) Add version macros to headers
Lawrence Chan created ARROW-4568: Summary: Add version macros to headers Key: ARROW-4568 URL: https://issues.apache.org/jira/browse/ARROW-4568 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Lawrence Chan It would be useful to have compile-time macros in the headers specifying the major/minor/patch versions, so that users can more easily maintain code that can be built with a range of arrow versions. Other nice-to-haves: - Maybe a "combiner" func that basically spits out the value as an easy to compare integer e.g. 12000 for 0.12.0 or something. - Git hash -- This message was sent by Atlassian JIRA (v7.6.3#76005)
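As a rough illustration of the request above, the sketch below shows what such version macros and the "combiner" could look like. The `MYLIB_*` names and the packing formula are assumptions chosen for this example (the formula is picked so that 0.12.0 maps to 12000, as in the description); they are not taken from any Arrow header.

```cpp
#include <cassert>
#include <string>

// Hypothetical macro names, for illustration only.
#define MYLIB_VERSION_MAJOR 0
#define MYLIB_VERSION_MINOR 12
#define MYLIB_VERSION_PATCH 0

// "Combiner": packs (major, minor, patch) into one comparable integer.
#define MYLIB_MAKE_VERSION(major, minor, patch) \
  ((major) * 1000000 + (minor) * 1000 + (patch))

#define MYLIB_VERSION                                            \
  MYLIB_MAKE_VERSION(MYLIB_VERSION_MAJOR, MYLIB_VERSION_MINOR,   \
                     MYLIB_VERSION_PATCH)

// 0.12.0 packs to 12000 under the formula above.
static_assert(MYLIB_VERSION == 12000, "0.12.0 should pack to 12000");

// User code can then select an implementation at compile time.
#if MYLIB_VERSION >= MYLIB_MAKE_VERSION(0, 12, 0)
inline std::string ReaderApi() { return "new"; }
#else
inline std::string ReaderApi() { return "legacy"; }
#endif

// Usage: the branch taken matches the declared version.
inline void DemoVersionCheck() { assert(ReaderApi() == "new"); }
```

Because the comparison happens in the preprocessor, code paths for versions the user is not building against are never compiled, which is what makes it practical to support a range of releases from one source tree.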
[jira] [Comment Edited] (ARROW-2296) [C++] Add num_rows to file footer
[ https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395564#comment-16395564 ] Lawrence Chan edited comment on ARROW-2296 at 3/12/18 5:37 PM: --- Yeah, I was thinking somewhere in the Footer struct, so we don't need to walk all the batches to sum them up. Also, they are indeed in the existing RecordBatch metadata, but the current implementation is inside a .cc file and I'd have to either copy+paste or modify my build to expose more of the existing code. Maybe we could expose something like this on the RecordBatchFileReader? {code:cpp} Status ReadRecordBatchMessage(int i, const flatbuf::RecordBatch** metadata) const; {code} Then it'd be possible to read the length fields without copying a bunch of code. Not sure if this is a good idea though, since it seems that we don't usually expose the flatbuffers through the public API. Maybe just a {code:cpp} int64_t num_rows() const; {code} is all I really want, and that can read the new Footer field once it's in there, and walk the batches in the current format? was (Author: llchan): Yeah, I was thinking somewhere in the Footer struct, so we don't need to walk all the batches to sum them up. Also, they are indeed in the existing RecordBatch metadata, but the current implementation is inside a .cc file and I'd have to either copy+paste or modify my build to expose more of the existing code. Maybe we could expose something like this on the RecordBatchFileReader? {code:cpp} Status ReadRecordBatchMessage(int i, const flatbuf::RecordBatch** metadata) const; {code} Then it'd be possible to read the length fields without copying some of the other stuff. Not sure if this is a good idea though, since it seems that we don't usually expose the flatbuffers through the public API. Maybe just a {code:cpp} int64_t num_rows() const; {code} is all I really want, and that can read the new Footer field once it's in there, and walk the batches in the current format? 
> [C++] Add num_rows to file footer > - > > Key: ARROW-2296 > URL: https://issues.apache.org/jira/browse/ARROW-2296 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Format >Reporter: Lawrence Chan >Priority: Minor > > Maybe I'm overlooking something, but I don't see something on the API surface > to get the number of rows in a arrow file without reading all the record > batches. This is useful when we want to read into contiguous buffers, because > it allows us to allocate the right sizes up front. > I'd like to propose that we add `num_rows` as a field in the file footer so > it's easy to query without reading the whole file. > Meanwhile, before we get that added to the official format fbs, it would be > nice to have a method that iterates over the record batch headers and sums up > the lengths without reading the actual record batch body. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
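The interim approach described above (sum the per-batch lengths from the headers without touching the batch bodies) can be sketched roughly as follows. `BatchHeader` and `FileReaderSketch` are hypothetical stand-ins written for this example so that it is self-contained; they are not the real `RecordBatchFileReader` or flatbuffer types, and a real implementation would read the headers from the file's metadata.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the per-batch metadata: only the row count
// ("length") matters for this sketch.
struct BatchHeader {
  int64_t length;
};

// Hypothetical stand-in for a file reader that has already parsed the
// batch headers but has not read any batch bodies.
class FileReaderSketch {
 public:
  explicit FileReaderSketch(std::vector<BatchHeader> headers)
      : headers_(std::move(headers)) {}

  int num_record_batches() const { return static_cast<int>(headers_.size()); }

  const BatchHeader& header(int i) const {
    return headers_.at(static_cast<std::size_t>(i));
  }

  // Walks the headers and sums their lengths; no batch body is read.
  int64_t num_rows() const {
    int64_t total = 0;
    for (int i = 0; i < num_record_batches(); ++i) {
      total += header(i).length;
    }
    return total;
  }

 private:
  std::vector<BatchHeader> headers_;
};

// Usage: three batches of 3, 5 and 0 rows give 8 rows in total.
inline void DemoNumRows() {
  std::vector<BatchHeader> headers = {{3}, {5}, {0}};
  FileReaderSketch reader(headers);
  assert(reader.num_rows() == 8);
}
```

If a `num_rows` field were later added to the footer, the same method could return that field directly and fall back to this walk for files written in the current format.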
[jira] [Updated] (ARROW-2296) Add num_rows to file footer
[ https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lawrence Chan updated ARROW-2296: - Description: Maybe I'm overlooking something, but I don't see something on the API surface to get the number of rows in a arrow file without reading all the record batches. This is useful when we want to read into contiguous buffers, because it allows us to allocate the right sizes up front. I'd like to propose that we add `num_rows` as a field in the file footer so it's easy to query without reading the whole file. Meanwhile, before we get that added to the official format fbs, it would be nice to have a method that iterates over the record batch headers and sums up the lengths without reading the actual record batch body. was: Maybe I'm overlooking something, but I don't see something on the API surface to get the number of rows in a arrow file without reading all the record batches. I'd like to propose that we add `num_rows` as a field in the file footer so it's easy to query without reading the whole file. Meanwhile, before we get that added to the official format fbs, it would be nice to have a method that iterates over the record batch headers and sums up the lengths without reading the actual record batch body. > Add num_rows to file footer > --- > > Key: ARROW-2296 > URL: https://issues.apache.org/jira/browse/ARROW-2296 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Format >Reporter: Lawrence Chan >Priority: Minor > > Maybe I'm overlooking something, but I don't see something on the API surface > to get the number of rows in a arrow file without reading all the record > batches. This is useful when we want to read into contiguous buffers, because > it allows us to allocate the right sizes up front. > I'd like to propose that we add `num_rows` as a field in the file footer so > it's easy to query without reading the whole file. 
> Meanwhile, before we get that added to the official format fbs, it would be > nice to have a method that iterates over the record batch headers and sums up > the lengths without reading the actual record batch body. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2296) Add num_rows to file footer
[ https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lawrence Chan updated ARROW-2296: - Component/s: C++ > Add num_rows to file footer > --- > > Key: ARROW-2296 > URL: https://issues.apache.org/jira/browse/ARROW-2296 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Format >Reporter: Lawrence Chan >Priority: Minor > > Maybe I'm overlooking something, but I don't see something on the API surface > to get the number of rows in a arrow file without reading all the record > batches. > I'd like to propose that we add `num_rows` as a field to the footer so it's > easy to query without reading the whole file. > Meanwhile, before we get that added to the official format fbs, it would be > nice to have a method that iterates over the record batch headers and sums up > the lengths without reading the actual record batch body. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2296) Add num_rows to file footer
[ https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lawrence Chan updated ARROW-2296: - Component/s: Format > Add num_rows to file footer > --- > > Key: ARROW-2296 > URL: https://issues.apache.org/jira/browse/ARROW-2296 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Format >Reporter: Lawrence Chan >Priority: Minor > > Maybe I'm overlooking something, but I don't see something on the API surface > to get the number of rows in a arrow file without reading all the record > batches. > I'd like to propose that we add `num_rows` as a field to the footer so it's > easy to query without reading the whole file. > Meanwhile, before we get that added to the official format fbs, it would be > nice to have a method that iterates over the record batch headers and sums up > the lengths without reading the actual record batch body. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2296) Add num_rows to file footer
Lawrence Chan created ARROW-2296: Summary: Add num_rows to file footer Key: ARROW-2296 URL: https://issues.apache.org/jira/browse/ARROW-2296 Project: Apache Arrow Issue Type: Improvement Reporter: Lawrence Chan Maybe I'm overlooking something, but I don't see anything on the API surface to get the number of rows in an Arrow file without reading all the record batches. I'd like to propose that we add `num_rows` as a field to the footer so it's easy to query without reading the whole file. Meanwhile, before we get that added to the official format fbs, it would be nice to have a method that iterates over the record batch headers and sums up the lengths without reading the actual record batch body. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2296) Add num_rows to file footer
[ https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lawrence Chan updated ARROW-2296: - Description: Maybe I'm overlooking something, but I don't see something on the API surface to get the number of rows in a arrow file without reading all the record batches. I'd like to propose that we add `num_rows` as a field in the file footer so it's easy to query without reading the whole file. Meanwhile, before we get that added to the official format fbs, it would be nice to have a method that iterates over the record batch headers and sums up the lengths without reading the actual record batch body. was: Maybe I'm overlooking something, but I don't see something on the API surface to get the number of rows in a arrow file without reading all the record batches. I'd like to propose that we add `num_rows` as a field to the footer so it's easy to query without reading the whole file. Meanwhile, before we get that added to the official format fbs, it would be nice to have a method that iterates over the record batch headers and sums up the lengths without reading the actual record batch body. > Add num_rows to file footer > --- > > Key: ARROW-2296 > URL: https://issues.apache.org/jira/browse/ARROW-2296 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Format >Reporter: Lawrence Chan >Priority: Minor > > Maybe I'm overlooking something, but I don't see something on the API surface > to get the number of rows in a arrow file without reading all the record > batches. > I'd like to propose that we add `num_rows` as a field in the file footer so > it's easy to query without reading the whole file. > Meanwhile, before we get that added to the official format fbs, it would be > nice to have a method that iterates over the record batch headers and sums up > the lengths without reading the actual record batch body. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2295) Add to_numpy functions
[ https://issues.apache.org/jira/browse/ARROW-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lawrence Chan updated ARROW-2295: - Description: There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to propose that we include both. Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho is very confusing :). I think it would be more intuitive for the `to_pandas()` functions to return `pandas.Series` and `pandas.DataFrame` objects, and the `to_numpy()` functions to return `numpy.ndarray` and either a dict of `numpy.ndarray` or a structured `numpy.ndarray` depending on a flag, for example. The `to_pandas()` function is of course welcome to use the `to_numpy()` func to avoid the additional index and whatnot of the `pandas.Series`. was: There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to propose that we include both. Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho is very confusing :). I think it would be more intuitive for the `to_pandas()` functions to return `pandas.Series` and `pandas.DataFrame` objects, and the `to_numpy()` functions to return `numpy.ndarray` and either a dict of `numpy.ndarray` or a structured `numpy.ndarray` depending on a flag, for example. The `to_pandas()` function is of course welcome to use the `to_numpy()` func to avoid the additional indexes and whatnot of the `pandas.Series`. > Add to_numpy functions > -- > > Key: ARROW-2295 > URL: https://issues.apache.org/jira/browse/ARROW-2295 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Lawrence Chan >Priority: Minor > > There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to > propose that we include both. > Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho > is very confusing :). 
I think it would be more intuitive for the > `to_pandas()` functions to return `pandas.Series` and `pandas.DataFrame` > objects, and the `to_numpy()` functions to return `numpy.ndarray` and either > a dict of `numpy.ndarray` or a structured `numpy.ndarray` depending on a > flag, for example. The `to_pandas()` function is of course welcome to use the > `to_numpy()` func to avoid the additional index and whatnot of the > `pandas.Series`. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2295) Add to_numpy functions
[ https://issues.apache.org/jira/browse/ARROW-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lawrence Chan updated ARROW-2295: - Description: There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to propose that we include both. Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho is very confusing :). I think it would be more intuitive for the `to_pandas()` functions to return `pandas.Series` and `pandas.DataFrame` objects, and the `to_numpy()` functions to return `numpy.ndarray` and either a ordered dict of `numpy.ndarray` or a structured `numpy.ndarray` depending on a flag, for example. The `to_pandas()` function is of course welcome to use the `to_numpy()` func to avoid the additional index and whatnot of the `pandas.Series`. was: There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to propose that we include both. Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho is very confusing :). I think it would be more intuitive for the `to_pandas()` functions to return `pandas.Series` and `pandas.DataFrame` objects, and the `to_numpy()` functions to return `numpy.ndarray` and either a dict of `numpy.ndarray` or a structured `numpy.ndarray` depending on a flag, for example. The `to_pandas()` function is of course welcome to use the `to_numpy()` func to avoid the additional index and whatnot of the `pandas.Series`. > Add to_numpy functions > -- > > Key: ARROW-2295 > URL: https://issues.apache.org/jira/browse/ARROW-2295 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Lawrence Chan >Priority: Minor > > There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to > propose that we include both. > Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho > is very confusing :). 
I think it would be more intuitive for the > `to_pandas()` functions to return `pandas.Series` and `pandas.DataFrame` > objects, and the `to_numpy()` functions to return `numpy.ndarray` and either > a ordered dict of `numpy.ndarray` or a structured `numpy.ndarray` depending > on a flag, for example. The `to_pandas()` function is of course welcome to > use the `to_numpy()` func to avoid the additional index and whatnot of the > `pandas.Series`. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2295) Add to_numpy functions
Lawrence Chan created ARROW-2295: Summary: Add to_numpy functions Key: ARROW-2295 URL: https://issues.apache.org/jira/browse/ARROW-2295 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Lawrence Chan There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to propose that we include both. Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho is very confusing :). I think it would be more intuitive for the `to_pandas()` functions to return `pandas.Series` and `pandas.DataFrame` objects, and the `to_numpy()` functions to return `numpy.ndarray` and either a dict of `numpy.ndarray` or a structured `numpy.ndarray` depending on a flag, for example. The `to_pandas()` function is of course welcome to use the `to_numpy()` func to avoid the additional indexes and whatnot of the `pandas.Series`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392124#comment-16392124 ] Lawrence Chan edited comment on ARROW-300 at 3/9/18 2:09 AM: - What did we decide with this? Imho there's still a use case for compressed arrow files due to the limited storage types in parquet. I don't really love the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away with compression. I tried to hack it up with FixedLenByteArray but there are a slew of complications with that, not to mention alignment concerns etc. Anyways I'm happy to help on this, but I'm not familiar enough with the code base to place it in the right spot. If we make a branch with some TODOs/placeholders I can probably plug in more easily. was (Author: llchan): What did we decide with this? Imho there's still a use case for compressed arrow files due to the limited storage types in parquet. I don't really love the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away with compression. Happy to help, but I'm not familiar enough with the code base to place it in the right spot. If we make a branch with some TODOs/placeholders I can probably plug in more easily. > [Format] Add buffer compression option to IPC file format > - > > Key: ARROW-300 > URL: https://issues.apache.org/jira/browse/ARROW-300 > Project: Apache Arrow > Issue Type: New Feature > Components: Format >Reporter: Wes McKinney >Priority: Major > Fix For: 0.10.0 > > > It may be useful if data is to be sent over the wire to compress the data > buffers themselves as their being written in the file layout. > I would propose that we keep this extremely simple with a global buffer > compression setting in the file Footer. Probably only two compressors worth > supporting out of the box would be zlib (higher compression ratios) and lz4 > (better performance). > What does everyone think? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392124#comment-16392124 ] Lawrence Chan edited comment on ARROW-300 at 3/9/18 2:00 AM: - What did we decide with this? Imho there's still a use case for compressed arrow files due to the limited storage types in parquet. I don't really love the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away with compression. Happy to help, but I'm not familiar enough with the code base to place it in the right spot. If we make a branch with some TODOs/placeholders I can probably plug in more easily. was (Author: llchan): What did we decide with this? Imho there's still a use case for compressed arrow files due to the limited storage types in parquet. I don't really love the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away with compression. My current workaround uses a fixed length byte array but it's pretty clunky to do this efficiently, at least in the parquet-cpp implementation. There are maybe also some alignment concerns with that approach that I'm just ignoring right now. Happy to help, but I'm not familiar enough with the code base to place it in the right spot. If we make a branch with some TODOs/placeholders I can probably plug in more easily. > [Format] Add buffer compression option to IPC file format > - > > Key: ARROW-300 > URL: https://issues.apache.org/jira/browse/ARROW-300 > Project: Apache Arrow > Issue Type: New Feature > Components: Format >Reporter: Wes McKinney >Priority: Major > Fix For: 0.10.0 > > > It may be useful if data is to be sent over the wire to compress the data > buffers themselves as their being written in the file layout. > I would propose that we keep this extremely simple with a global buffer > compression setting in the file Footer. Probably only two compressors worth > supporting out of the box would be zlib (higher compression ratios) and lz4 > (better performance). 
> What does everyone think? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392124#comment-16392124 ] Lawrence Chan edited comment on ARROW-300 at 3/8/18 11:46 PM: -- What did we decide with this? Imho there's still a use case for compressed arrow files due to the limited storage types in parquet. I don't really love the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away with compression. My current workaround uses a fixed length byte array but it's pretty clunky to do this efficiently, at least in the parquet-cpp implementation. There are maybe also some alignment concerns with that approach that I'm just ignoring right now. Happy to help, but I'm not familiar enough with the code base to place it in the right spot. If we make a branch with some TODOs/placeholders I can probably plug in more easily. was (Author: llchan): What did we decide with this? Imho there's still a use case for compressed arrow files due to the limited storage types in parquet. I don't really love the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away with compression. My current workaround uses a fixed length byte array but it's pretty clunky to do this efficiently, at least in the parquet-cpp implementation. There are maybe also some alignment concerns with that latter approach that I'm just ignoring right now. Happy to help, but I'm not familiar enough with the code base to place it in the right spot. If we make a branch with some TODOs/placeholders I can probably plug in more easily. > [Format] Add buffer compression option to IPC file format > - > > Key: ARROW-300 > URL: https://issues.apache.org/jira/browse/ARROW-300 > Project: Apache Arrow > Issue Type: New Feature > Components: Format >Reporter: Wes McKinney >Priority: Major > Fix For: 0.10.0 > > > It may be useful if data is to be sent over the wire to compress the data > buffers themselves as their being written in the file layout. 
> I would propose that we keep this extremely simple with a global buffer > compression setting in the file Footer. Probably only two compressors worth > supporting out of the box would be zlib (higher compression ratios) and lz4 > (better performance). > What does everyone think? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392124#comment-16392124 ] Lawrence Chan commented on ARROW-300: - What did we decide with this? Imho there's still a use case for compressed arrow files due to the limited storage types in parquet. I don't really love the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away with compression. My current workaround uses a fixed length byte array but it's pretty clunky to do this efficiently, at least in the parquet-cpp implementation. There are maybe also some alignment concerns with that latter approach that I'm just ignoring right now. Happy to help, but I'm not familiar enough with the code base to place it in the right spot. If we make a branch with some TODOs/placeholders I can probably plug in more easily. > [Format] Add buffer compression option to IPC file format > - > > Key: ARROW-300 > URL: https://issues.apache.org/jira/browse/ARROW-300 > Project: Apache Arrow > Issue Type: New Feature > Components: Format >Reporter: Wes McKinney >Priority: Major > Fix For: 0.10.0 > > > It may be useful if data is to be sent over the wire to compress the data > buffers themselves as their being written in the file layout. > I would propose that we keep this extremely simple with a global buffer > compression setting in the file Footer. Probably only two compressors worth > supporting out of the box would be zlib (higher compression ratios) and lz4 > (better performance). > What does everyone think? -- This message was sent by Atlassian JIRA (v7.6.3#76005)