[jira] [Created] (ARROW-9820) Plugin Architecture for Filesystem and File IO

2020-08-21 Thread Lawrence Chan (Jira)
Lawrence Chan created ARROW-9820:


 Summary: Plugin Architecture for Filesystem and File IO
 Key: ARROW-9820
 URL: https://issues.apache.org/jira/browse/ARROW-9820
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Lawrence Chan


Adding a new custom filesystem with corresponding file i/o streams is quite a 
process at the moment.  Looks like HDFS and S3FS are basically hardcoded in 
many places.  It would be useful to develop a plugin system to allow users to 
interface with other data stores without maintaining a permanent fork with 
hardcoded changes.

We can either do runtime plugins or compile-time plugins.  Runtime is more 
user-friendly, but with C++, ABI compatibility is fairly delicate.  So we would 
either want to use a C ABI or accept a you're-on-your-own situation where the 
user is expected to be very careful with versioning and compiler flags.

With compile-time plugins, maybe there's a way to have the cmake machinery 
build third party code and also register those new URI schemes automatically.
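
As a rough illustration of the compile-time route, here is a minimal sketch of a scheme registry that plugin code could hook into with a static registrar object. Everything in it (ExampleFileSystem, Registry, FileSystemRegistrar, FileSystemFromUri, the "mystore" scheme) is a hypothetical name made up for this sketch and is not part of Arrow's existing API.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical minimal interface; NOT Arrow's real FileSystem class.
class ExampleFileSystem {
 public:
  virtual ~ExampleFileSystem() = default;
  virtual std::string type_name() const = 0;
};

using FileSystemFactory =
    std::function<std::unique_ptr<ExampleFileSystem>(const std::string& uri)>;

// A function-local static avoids static-initialization-order problems when
// registrars in other translation units run before main().
inline std::map<std::string, FileSystemFactory>& Registry() {
  static std::map<std::string, FileSystemFactory> registry;
  return registry;
}

// Plugin code defines one static instance of this per URI scheme.
struct FileSystemRegistrar {
  FileSystemRegistrar(const std::string& scheme, FileSystemFactory factory) {
    Registry()[scheme] = std::move(factory);
  }
};

// Extract the scheme ("mystore" from "mystore://bucket/key") and dispatch.
inline std::unique_ptr<ExampleFileSystem> FileSystemFromUri(
    const std::string& uri) {
  const auto pos = uri.find("://");
  if (pos == std::string::npos) {
    throw std::invalid_argument("URI has no scheme: " + uri);
  }
  const auto it = Registry().find(uri.substr(0, pos));
  if (it == Registry().end()) {
    throw std::invalid_argument("no filesystem registered for: " + uri);
  }
  return it->second(uri);
}

// --- What a third-party plugin source file might contain (hypothetical) ---
class MyStoreFileSystem : public ExampleFileSystem {
 public:
  std::string type_name() const override { return "mystore"; }
};

static FileSystemRegistrar register_mystore(
    "mystore", [](const std::string&) -> std::unique_ptr<ExampleFileSystem> {
      return std::make_unique<MyStoreFileSystem>();
    });
```

With something like this, the cmake machinery would only need to compile the extra plugin source into the library; the registration happens as a side effect of static initialization. A runtime variant over a C ABI would need a different mechanism, which this sketch does not cover.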



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9816) Escape quotes in config.h

2020-08-20 Thread Lawrence Chan (Jira)
Lawrence Chan created ARROW-9816:


 Summary: Escape quotes in config.h
 Key: ARROW-9816
 URL: https://issues.apache.org/jira/browse/ARROW-9816
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 1.0.0, 1.0.1
Reporter: Lawrence Chan


Currently the config.h file is generated without the `ESCAPE_QUOTES` option, 
which causes quotes in e.g. CXXFLAGS to break config.h parsing.
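
To make the failure mode concrete, here is a small sketch of what C-style quote escaping does to a substituted value before it lands inside a string literal in a generated header. The helper names (EscapeQuotes, RenderDefine) and the macro name used in the usage note are hypothetical; this only imitates the effect of the `ESCAPE_QUOTES` option of cmake's configure_file(), it is not how cmake implements it.

```cpp
#include <string>

// Imitates C-style quote escaping: each '"' becomes '\"'.
std::string EscapeQuotes(const std::string& value) {
  std::string out;
  out.reserve(value.size());
  for (char c : value) {
    if (c == '"') out += '\\';
    out += c;
  }
  return out;
}

// Renders one line of a generated header as a string-valued macro.
// Without escaping, a quote inside `value` ends the literal early and the
// header no longer parses.
std::string RenderDefine(const std::string& name, const std::string& value,
                         bool escape_quotes) {
  return "#define " + name + " \"" +
         (escape_quotes ? EscapeQuotes(value) : value) + "\"";
}
```

For a flag such as -DNAME="x", RenderDefine("EXAMPLE_CXX_FLAGS", ..., false) produces a line whose string literal is terminated by the first embedded quote, while passing true yields a well-formed literal.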





[jira] [Updated] (ARROW-5167) [C++] Upgrade string-view-light to latest

2019-04-12 Thread Lawrence Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lawrence Chan updated ARROW-5167:
-
Summary: [C++] Upgrade string-view-light to latest  (was: Upgrade 
string-view-light to latest)

> [C++] Upgrade string-view-light to latest
> -
>
> Key: ARROW-5167
> URL: https://issues.apache.org/jira/browse/ARROW-5167
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Lawrence Chan
>Priority: Minor
>
> string-view-lite had a typo in one of its macros (fixed in 
> https://github.com/martinmoene/string-view-lite/commit/2f2cce35293b0027056e5449b2c05b5f9c3e89ff).
>   We should vendor the latest version in the next Arrow release.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5167) Upgrade string-view-light to latest

2019-04-12 Thread Lawrence Chan (JIRA)
Lawrence Chan created ARROW-5167:


 Summary: Upgrade string-view-light to latest
 Key: ARROW-5167
 URL: https://issues.apache.org/jira/browse/ARROW-5167
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Lawrence Chan


string-view-lite had a typo in one of its macros (fixed in 
https://github.com/martinmoene/string-view-lite/commit/2f2cce35293b0027056e5449b2c05b5f9c3e89ff).
  We should vendor the latest version in the next Arrow release.





[jira] [Comment Edited] (ARROW-2290) [C++/Python] Add ability to set codec options for lz4 codec

2019-03-20 Thread Lawrence Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797398#comment-16797398
 ] 

Lawrence Chan edited comment on ARROW-2290 at 3/20/19 6:08 PM:
---

Maybe I don't have permissions, or maybe I'm overlooking something, but I can't 
seem to edit the issue title?

EDIT: derp, I'm not the owner, Wes opened this on my behalf in converting my GH 
issue.


was (Author: llchan):
Maybe I don't have permissions, or maybe I'm overlooking something, but I can't 
seem to edit the issue title?

> [C++/Python] Add ability to set codec options for lz4 codec
> ---
>
> Key: ARROW-2290
> URL: https://issues.apache.org/jira/browse/ARROW-2290
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>
> The LZ4 library has many parameters; currently we do not expose these in C++ 
> or Python





[jira] [Commented] (ARROW-2290) [C++/Python] Add ability to set codec options for lz4 codec

2019-03-20 Thread Lawrence Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797398#comment-16797398
 ] 

Lawrence Chan commented on ARROW-2290:
--

Maybe I don't have permissions, or maybe I'm overlooking something, but I can't 
seem to edit the issue title?

> [C++/Python] Add ability to set codec options for lz4 codec
> ---
>
> Key: ARROW-2290
> URL: https://issues.apache.org/jira/browse/ARROW-2290
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>
> The LZ4 library has many parameters; currently we do not expose these in C++ 
> or Python





[jira] [Created] (ARROW-4568) Add version macros to headers

2019-02-13 Thread Lawrence Chan (JIRA)
Lawrence Chan created ARROW-4568:


 Summary: Add version macros to headers
 Key: ARROW-4568
 URL: https://issues.apache.org/jira/browse/ARROW-4568
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Lawrence Chan


It would be useful to have compile-time macros in the headers specifying the 
major/minor/patch versions, so that users can more easily maintain code that 
can be built with a range of arrow versions.

Other nice-to-haves:
- Maybe a "combiner" func that basically spits out the value as an easy to 
compare integer e.g. 12000 for 0.12.0 or something.
- Git hash
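
A minimal sketch of what such macros and the "combiner" could look like. The EXAMPLE_ prefixed names are hypothetical placeholders (this issue does not fix a naming scheme), and the weights are chosen only so that 0.12.0 encodes as 12000 as suggested above.

```cpp
// Hypothetical macro names, hard-coded here to version 0.12.0.
#define EXAMPLE_VERSION_MAJOR 0
#define EXAMPLE_VERSION_MINOR 12
#define EXAMPLE_VERSION_PATCH 0

// Combine the parts into one easy-to-compare integer: 0.12.0 -> 12000.
#define EXAMPLE_MAKE_VERSION(major, minor, patch) \
  ((major) * 1000000 + (minor) * 1000 + (patch))
#define EXAMPLE_VERSION                                               \
  EXAMPLE_MAKE_VERSION(EXAMPLE_VERSION_MAJOR, EXAMPLE_VERSION_MINOR, \
                       EXAMPLE_VERSION_PATCH)

static_assert(EXAMPLE_VERSION == 12000, "0.12.0 should encode as 12000");

// User code can then select an implementation at preprocessing time.
#if EXAMPLE_VERSION >= EXAMPLE_MAKE_VERSION(0, 12, 0)
constexpr bool kHasNewApi = true;
#else
constexpr bool kHasNewApi = false;
#endif
```

Keeping the combiner as a preprocessor macro rather than a constexpr function means it also works inside #if, which is where most version checks end up.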





[jira] [Comment Edited] (ARROW-2296) [C++] Add num_rows to file footer

2018-03-12 Thread Lawrence Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395564#comment-16395564
 ] 

Lawrence Chan edited comment on ARROW-2296 at 3/12/18 5:37 PM:
---

Yeah, I was thinking somewhere in the Footer struct, so we don't need to walk 
all the batches to sum them up.

Also they are indeed in the existing RecordBatch metadata, but the current 
implementation is inside a .cc file and I'd have to either copy+paste or modify 
my build to expose more of the existing code. Maybe we could expose something 
like this on the RecordBatchFileReader?
{code:cpp}
Status ReadRecordBatchMessage(int i, const flatbuf::RecordBatch** metadata) const;
{code}
Then it'd be possible to read the length fields without copying a bunch of 
code. Not sure if this is a good idea though, since it seems that we don't 
usually expose the flatbuffers through the public API. Maybe just a 
{code:cpp}
int64_t num_rows() const;
{code}
is all I really want, and that can read the new Footer field once it's in 
there, and walk the batches in the current format?
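
Roughly what I have in mind, as a self-contained sketch. ExampleBatchHeader and ExampleFooter are hypothetical stand-ins, not the real flatbuffers types, and the -1 sentinel for "field not present" is just an assumption for illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-batch metadata: only the row count.
struct ExampleBatchHeader {
  int64_t length;
};

// Hypothetical stand-in for the file footer.
struct ExampleFooter {
  int64_t num_rows = -1;  // proposed field; -1 means "not present" (old files)
  std::vector<ExampleBatchHeader> batches;
};

// Prefer the footer field when present; otherwise walk the batch headers
// and sum their lengths, without touching any record batch body.
int64_t NumRows(const ExampleFooter& footer) {
  if (footer.num_rows >= 0) return footer.num_rows;
  int64_t total = 0;
  for (const auto& batch : footer.batches) total += batch.length;
  return total;
}
```

Callers get the same answer for old and new files; only the cost differs (one field read versus one metadata read per batch).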


was (Author: llchan):
Yeah, I was thinking somewhere in the Footer struct, so we don't need to walk 
all the batches to sum them up.

Also they are indeed in the existing RecordBatch metadata, but the current 
implementation is inside a .cc file and I'd have to either copy+paste or modify 
my build to expose more of the existing code. Maybe we could expose something 
like this on the RecordBatchFileReader?
{code:cpp}
Status ReadRecordBatchMessage(int i, const flatbuf::RecordBatch** metadata) const;
{code}
Then it'd be possible to read the length fields without copying some of the 
other stuff. Not sure if this is a good idea though, since it seems that we 
don't usually expose the flatbuffers through the public API. Maybe just a 
{code:cpp}
int64_t num_rows() const;
{code}
is all I really want, and that can read the new Footer field once it's in 
there, and walk the batches in the current format?

> [C++] Add num_rows to file footer
> -
>
> Key: ARROW-2296
> URL: https://issues.apache.org/jira/browse/ARROW-2296
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Format
>Reporter: Lawrence Chan
>Priority: Minor
>
> Maybe I'm overlooking something, but I don't see something on the API surface 
> to get the number of rows in an Arrow file without reading all the record 
> batches. This is useful when we want to read into contiguous buffers, because 
> it allows us to allocate the right sizes up front.
> I'd like to propose that we add `num_rows` as a field in the file footer so 
> it's easy to query without reading the whole file.
> Meanwhile, before we get that added to the official format fbs, it would be 
> nice to have a method that iterates over the record batch headers and sums up 
> the lengths without reading the actual record batch body.





[jira] [Updated] (ARROW-2296) Add num_rows to file footer

2018-03-09 Thread Lawrence Chan (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lawrence Chan updated ARROW-2296:
-
Description: 
Maybe I'm overlooking something, but I don't see something on the API surface 
to get the number of rows in an Arrow file without reading all the record 
batches. This is useful when we want to read into contiguous buffers, because 
it allows us to allocate the right sizes up front.

I'd like to propose that we add `num_rows` as a field in the file footer so 
it's easy to query without reading the whole file.

Meanwhile, before we get that added to the official format fbs, it would be 
nice to have a method that iterates over the record batch headers and sums up 
the lengths without reading the actual record batch body.

  was:
Maybe I'm overlooking something, but I don't see something on the API surface 
to get the number of rows in an Arrow file without reading all the record 
batches.

I'd like to propose that we add `num_rows` as a field in the file footer so 
it's easy to query without reading the whole file.

Meanwhile, before we get that added to the official format fbs, it would be 
nice to have a method that iterates over the record batch headers and sums up 
the lengths without reading the actual record batch body.


> Add num_rows to file footer
> ---
>
> Key: ARROW-2296
> URL: https://issues.apache.org/jira/browse/ARROW-2296
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Format
>Reporter: Lawrence Chan
>Priority: Minor
>
> Maybe I'm overlooking something, but I don't see something on the API surface 
> to get the number of rows in an Arrow file without reading all the record 
> batches. This is useful when we want to read into contiguous buffers, because 
> it allows us to allocate the right sizes up front.
> I'd like to propose that we add `num_rows` as a field in the file footer so 
> it's easy to query without reading the whole file.
> Meanwhile, before we get that added to the official format fbs, it would be 
> nice to have a method that iterates over the record batch headers and sums up 
> the lengths without reading the actual record batch body.





[jira] [Updated] (ARROW-2296) Add num_rows to file footer

2018-03-09 Thread Lawrence Chan (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lawrence Chan updated ARROW-2296:
-
Component/s: C++

> Add num_rows to file footer
> ---
>
> Key: ARROW-2296
> URL: https://issues.apache.org/jira/browse/ARROW-2296
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Format
>Reporter: Lawrence Chan
>Priority: Minor
>
> Maybe I'm overlooking something, but I don't see something on the API surface 
> to get the number of rows in an Arrow file without reading all the record 
> batches.
> I'd like to propose that we add `num_rows` as a field to the footer so it's 
> easy to query without reading the whole file.
> Meanwhile, before we get that added to the official format fbs, it would be 
> nice to have a method that iterates over the record batch headers and sums up 
> the lengths without reading the actual record batch body.





[jira] [Updated] (ARROW-2296) Add num_rows to file footer

2018-03-09 Thread Lawrence Chan (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lawrence Chan updated ARROW-2296:
-
Component/s: Format

> Add num_rows to file footer
> ---
>
> Key: ARROW-2296
> URL: https://issues.apache.org/jira/browse/ARROW-2296
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Format
>Reporter: Lawrence Chan
>Priority: Minor
>
> Maybe I'm overlooking something, but I don't see something on the API surface 
> to get the number of rows in an Arrow file without reading all the record 
> batches.
> I'd like to propose that we add `num_rows` as a field to the footer so it's 
> easy to query without reading the whole file.
> Meanwhile, before we get that added to the official format fbs, it would be 
> nice to have a method that iterates over the record batch headers and sums up 
> the lengths without reading the actual record batch body.





[jira] [Created] (ARROW-2296) Add num_rows to file footer

2018-03-09 Thread Lawrence Chan (JIRA)
Lawrence Chan created ARROW-2296:


 Summary: Add num_rows to file footer
 Key: ARROW-2296
 URL: https://issues.apache.org/jira/browse/ARROW-2296
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Lawrence Chan


Maybe I'm overlooking something, but I don't see something on the API surface 
to get the number of rows in an Arrow file without reading all the record 
batches.

I'd like to propose that we add `num_rows` as a field to the footer so it's 
easy to query without reading the whole file.

Meanwhile, before we get that added to the official format fbs, it would be 
nice to have a method that iterates over the record batch headers and sums up 
the lengths without reading the actual record batch body.





[jira] [Updated] (ARROW-2296) Add num_rows to file footer

2018-03-09 Thread Lawrence Chan (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lawrence Chan updated ARROW-2296:
-
Description: 
Maybe I'm overlooking something, but I don't see something on the API surface 
to get the number of rows in an Arrow file without reading all the record 
batches.

I'd like to propose that we add `num_rows` as a field in the file footer so 
it's easy to query without reading the whole file.

Meanwhile, before we get that added to the official format fbs, it would be 
nice to have a method that iterates over the record batch headers and sums up 
the lengths without reading the actual record batch body.

  was:
Maybe I'm overlooking something, but I don't see something on the API surface 
to get the number of rows in an Arrow file without reading all the record 
batches.

I'd like to propose that we add `num_rows` as a field to the footer so it's 
easy to query without reading the whole file.

Meanwhile, before we get that added to the official format fbs, it would be 
nice to have a method that iterates over the record batch headers and sums up 
the lengths without reading the actual record batch body.


> Add num_rows to file footer
> ---
>
> Key: ARROW-2296
> URL: https://issues.apache.org/jira/browse/ARROW-2296
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Format
>Reporter: Lawrence Chan
>Priority: Minor
>
> Maybe I'm overlooking something, but I don't see something on the API surface 
> to get the number of rows in an Arrow file without reading all the record 
> batches.
> I'd like to propose that we add `num_rows` as a field in the file footer so 
> it's easy to query without reading the whole file.
> Meanwhile, before we get that added to the official format fbs, it would be 
> nice to have a method that iterates over the record batch headers and sums up 
> the lengths without reading the actual record batch body.





[jira] [Updated] (ARROW-2295) Add to_numpy functions

2018-03-09 Thread Lawrence Chan (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lawrence Chan updated ARROW-2295:
-
Description: 
There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to 
propose that we include both.

Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho is 
very confusing :). I think it would be more intuitive for the `to_pandas()` 
functions to return `pandas.Series` and `pandas.DataFrame` objects, and the 
`to_numpy()` functions to return `numpy.ndarray` and either a dict of 
`numpy.ndarray` or a structured `numpy.ndarray` depending on a flag, for 
example. The `to_pandas()` function is of course welcome to use the 
`to_numpy()` func to avoid the additional index and whatnot of the 
`pandas.Series`.

 

  was:
There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to 
propose that we include both.

Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho is 
very confusing :). I think it would be more intuitive for the `to_pandas()` 
functions to return `pandas.Series` and `pandas.DataFrame` objects, and the 
`to_numpy()` functions to return `numpy.ndarray` and either a dict of 
`numpy.ndarray` or a structured `numpy.ndarray` depending on a flag, for 
example. The `to_pandas()` function is of course welcome to use the 
`to_numpy()` func to avoid the additional indexes and whatnot of the 
`pandas.Series`.

 


> Add to_numpy functions
> --
>
> Key: ARROW-2295
> URL: https://issues.apache.org/jira/browse/ARROW-2295
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Lawrence Chan
>Priority: Minor
>
> There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to 
> propose that we include both.
> Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho 
> is very confusing :). I think it would be more intuitive for the 
> `to_pandas()` functions to return `pandas.Series` and `pandas.DataFrame` 
> objects, and the `to_numpy()` functions to return `numpy.ndarray` and either 
> a dict of `numpy.ndarray` or a structured `numpy.ndarray` depending on a 
> flag, for example. The `to_pandas()` function is of course welcome to use the 
> `to_numpy()` func to avoid the additional index and whatnot of the 
> `pandas.Series`.
>  





[jira] [Updated] (ARROW-2295) Add to_numpy functions

2018-03-09 Thread Lawrence Chan (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lawrence Chan updated ARROW-2295:
-
Description: 
There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to 
propose that we include both.

Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho is 
very confusing :). I think it would be more intuitive for the `to_pandas()` 
functions to return `pandas.Series` and `pandas.DataFrame` objects, and the 
`to_numpy()` functions to return `numpy.ndarray` and either an ordered dict of 
`numpy.ndarray` or a structured `numpy.ndarray` depending on a flag, for 
example. The `to_pandas()` function is of course welcome to use the 
`to_numpy()` func to avoid the additional index and whatnot of the 
`pandas.Series`.

 

  was:
There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to 
propose that we include both.

Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho is 
very confusing :). I think it would be more intuitive for the `to_pandas()` 
functions to return `pandas.Series` and `pandas.DataFrame` objects, and the 
`to_numpy()` functions to return `numpy.ndarray` and either a dict of 
`numpy.ndarray` or a structured `numpy.ndarray` depending on a flag, for 
example. The `to_pandas()` function is of course welcome to use the 
`to_numpy()` func to avoid the additional index and whatnot of the 
`pandas.Series`.

 


> Add to_numpy functions
> --
>
> Key: ARROW-2295
> URL: https://issues.apache.org/jira/browse/ARROW-2295
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Lawrence Chan
>Priority: Minor
>
> There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to 
> propose that we include both.
> Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho 
> is very confusing :). I think it would be more intuitive for the 
> `to_pandas()` functions to return `pandas.Series` and `pandas.DataFrame` 
> objects, and the `to_numpy()` functions to return `numpy.ndarray` and either 
> an ordered dict of `numpy.ndarray` or a structured `numpy.ndarray` depending 
> on a flag, for example. The `to_pandas()` function is of course welcome to 
> use the `to_numpy()` func to avoid the additional index and whatnot of the 
> `pandas.Series`.
>  





[jira] [Created] (ARROW-2295) Add to_numpy functions

2018-03-09 Thread Lawrence Chan (JIRA)
Lawrence Chan created ARROW-2295:


 Summary: Add to_numpy functions
 Key: ARROW-2295
 URL: https://issues.apache.org/jira/browse/ARROW-2295
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Lawrence Chan


There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to 
propose that we include both.

Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho is 
very confusing :). I think it would be more intuitive for the `to_pandas()` 
functions to return `pandas.Series` and `pandas.DataFrame` objects, and the 
`to_numpy()` functions to return `numpy.ndarray` and either a dict of 
`numpy.ndarray` or a structured `numpy.ndarray` depending on a flag, for 
example. The `to_pandas()` function is of course welcome to use the 
`to_numpy()` func to avoid the additional indexes and whatnot of the 
`pandas.Series`.

 





[jira] [Comment Edited] (ARROW-300) [Format] Add buffer compression option to IPC file format

2018-03-08 Thread Lawrence Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392124#comment-16392124
 ] 

Lawrence Chan edited comment on ARROW-300 at 3/9/18 2:09 AM:
-

What did we decide with this? Imho there's still a use case for compressed 
arrow files due to the limited storage types in parquet. I don't really love 
the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away 
with compression. I tried to hack it up with FixedLenByteArray but there are a 
slew of complications with that, not to mention alignment concerns etc.

Anyways I'm happy to help on this, but I'm not familiar enough with the code 
base to place it in the right spot. If we make a branch with some 
TODOs/placeholders I can probably plug in more easily.


was (Author: llchan):
What did we decide with this? Imho there's still a use case for compressed 
arrow files due to the limited storage types in parquet. I don't really love 
the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away 
with compression.

Happy to help, but I'm not familiar enough with the code base to place it in 
the right spot. If we make a branch with some TODOs/placeholders I can probably 
plug in more easily.

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> It may be useful if data is to be sent over the wire to compress the data 
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-300) [Format] Add buffer compression option to IPC file format

2018-03-08 Thread Lawrence Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392124#comment-16392124
 ] 

Lawrence Chan edited comment on ARROW-300 at 3/9/18 2:00 AM:
-

What did we decide with this? Imho there's still a use case for compressed 
arrow files due to the limited storage types in parquet. I don't really love 
the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away 
with compression.

Happy to help, but I'm not familiar enough with the code base to place it in 
the right spot. If we make a branch with some TODOs/placeholders I can probably 
plug in more easily.


was (Author: llchan):
What did we decide with this? Imho there's still a use case for compressed 
arrow files due to the limited storage types in parquet. I don't really love 
the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away 
with compression. My current workaround uses a fixed length byte array but it's 
pretty clunky to do this efficiently, at least in the parquet-cpp 
implementation. There are maybe also some alignment concerns with that approach 
that I'm just ignoring right now.

Happy to help, but I'm not familiar enough with the code base to place it in 
the right spot. If we make a branch with some TODOs/placeholders I can probably 
plug in more easily.

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> It may be useful if data is to be sent over the wire to compress the data 
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?





[jira] [Comment Edited] (ARROW-300) [Format] Add buffer compression option to IPC file format

2018-03-08 Thread Lawrence Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392124#comment-16392124
 ] 

Lawrence Chan edited comment on ARROW-300 at 3/8/18 11:46 PM:
--

What did we decide with this? Imho there's still a use case for compressed 
arrow files due to the limited storage types in parquet. I don't really love 
the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away 
with compression. My current workaround uses a fixed length byte array but it's 
pretty clunky to do this efficiently, at least in the parquet-cpp 
implementation. There are maybe also some alignment concerns with that approach 
that I'm just ignoring right now.

Happy to help, but I'm not familiar enough with the code base to place it in 
the right spot. If we make a branch with some TODOs/placeholders I can probably 
plug in more easily.


was (Author: llchan):
What did we decide with this? Imho there's still a use case for compressed 
arrow files due to the limited storage types in parquet. I don't really love 
the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away 
with compression. My current workaround uses a fixed length byte array but it's 
pretty clunky to do this efficiently, at least in the parquet-cpp 
implementation. There are maybe also some alignment concerns with that latter 
approach that I'm just ignoring right now.

Happy to help, but I'm not familiar enough with the code base to place it in 
the right spot. If we make a branch with some TODOs/placeholders I can probably 
plug in more easily.

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> It may be useful if data is to be sent over the wire to compress the data 
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?





[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2018-03-08 Thread Lawrence Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392124#comment-16392124
 ] 

Lawrence Chan commented on ARROW-300:
-

What did we decide with this? Imho there's still a use case for compressed 
arrow files due to the limited storage types in parquet. I don't really love 
the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away 
with compression. My current workaround uses a fixed length byte array but it's 
pretty clunky to do this efficiently, at least in the parquet-cpp 
implementation. There are maybe also some alignment concerns with that latter 
approach that I'm just ignoring right now.

Happy to help, but I'm not familiar enough with the code base to place it in 
the right spot. If we make a branch with some TODOs/placeholders I can probably 
plug in more easily.

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> It may be useful if data is to be sent over the wire to compress the data 
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?


