[jira] [Created] (ARROW-8055) [GLib][Ruby] Add some metadata bindings to GArrowSchema
Kouhei Sutou created ARROW-8055: --- Summary: [GLib][Ruby] Add some metadata bindings to GArrowSchema Key: ARROW-8055 URL: https://issues.apache.org/jira/browse/ARROW-8055 Project: Apache Arrow Issue Type: Improvement Components: GLib, Ruby Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Summary of RLE and other compression efforts?
Hey Evan, thank you for the interest.

There has been some effort for compressing floating-point data on the Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not compress floating-point data but makes it more compressible for when a compressor, such as ZSTD, LZ4, etc., is used. It only works well for high-entropy floating-point data, at least 15 bits of entropy per element. I suppose the encoding might actually also make sense for high-entropy integer data but I am not super sure. For low-entropy data, the dictionary encoding is good, though I suspect there can be room for performance improvements.

This is my final report for the encoding: https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf

Note that at some point my investigation turned out to be quite the same solution as the one in https://github.com/powturbo/Turbo-Transpose. Maybe the points I sent can be helpful.

Kind regards,
Martin

From: evan_c...@apple.com on behalf of Evan Chan
Sent: Tuesday, March 10, 2020 5:15:48 AM
To: dev@arrow.apache.org
Subject: Summary of RLE and other compression efforts?

Hi folks,

I’m curious about the state of efforts for more compressed encodings in the Arrow columnar format. I saw discussions previously about RLE, but is there a place to summarize all of the different efforts that are ongoing to bring more compressed encodings? Is there an effort to compress floating point or integer data using techniques such as XOR compression and Delta-Delta? I can contribute to some of these efforts as well.

Thanks,
Evan
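[Editor's note] The byte-stream-split transform Martin describes can be sketched in a few lines: scatter byte k of every fixed-width element into its own stream, so the slowly-varying sign/exponent bytes end up contiguous and a general-purpose compressor can exploit them. This is an illustrative sketch of the idea only, not Parquet's implementation; the float32 width and helper names are assumptions.

```python
import struct
import zlib

def byte_stream_split(values):
    # Pack as little-endian float32, then gather byte k of every
    # 4-byte element into stream k (illustrative, not Parquet's code).
    raw = struct.pack("<%df" % len(values), *values)
    return b"".join(raw[k::4] for k in range(4))

def byte_stream_merge(encoded, count):
    # Inverse transform: re-interleave the four byte streams.
    streams = [encoded[k * count:(k + 1) * count] for k in range(4)]
    raw = bytes(streams[k][i] for i in range(count) for k in range(4))
    return list(struct.unpack("<%df" % count, raw))

values = [1.5 * i for i in range(1000)]
encoded = byte_stream_split(values)
assert byte_stream_merge(encoded, len(values)) == values

# The transform itself saves nothing; any win appears only after a
# compressor runs over the split layout.
plain_size = len(zlib.compress(struct.pack("<%df" % len(values), *values)))
split_size = len(zlib.compress(encoded))
```

As the report notes, whether `split_size` beats `plain_size` depends heavily on the entropy profile of the data.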
[jira] [Created] (ARROW-8056) [R] Support read and write orc file format
Dyfan Jones created ARROW-8056: -- Summary: [R] Support read and write orc file format Key: ARROW-8056 URL: https://issues.apache.org/jira/browse/ARROW-8056 Project: Apache Arrow Issue Type: New Feature Reporter: Dyfan Jones

Currently the R package can read/write arrow, feather, parquet, etc. How feasible is it for the orc file format to be supported with read/write capabilities?
[jira] [Created] (ARROW-8057) Schema equality not roundtrip safe
Florian Jetter created ARROW-8057: - Summary: Schema equality not roundtrip safe Key: ARROW-8057 URL: https://issues.apache.org/jira/browse/ARROW-8057 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Florian Jetter

When performing schema roundtrips, the equality check for fields breaks. This is a regression from PyArrow 0.16.0. The equality check for entire schemas has never worked (but should, from my POV).

{code:python}
import pyarrow.parquet as pq
import pyarrow as pa

print(pa.__version__)

fields = [
    pa.field("bool", pa.bool_()),
    pa.field("byte", pa.binary()),
    pa.field("date", pa.date32()),
    pa.field("datetime64", pa.timestamp("us")),
    pa.field("float32", pa.float64()),
    pa.field("float64", pa.float64()),
    pa.field("int16", pa.int64()),
    pa.field("int32", pa.int64()),
    pa.field("int64", pa.int64()),
    pa.field("int8", pa.int64()),
    pa.field("null", pa.null()),
    pa.field("uint16", pa.uint64()),
    pa.field("uint32", pa.uint64()),
    pa.field("uint64", pa.uint64()),
    pa.field("uint8", pa.uint64()),
    pa.field("unicode", pa.string()),
    pa.field("array_float32", pa.list_(pa.float64())),
    pa.field("array_float64", pa.list_(pa.float64())),
    pa.field("array_int16", pa.list_(pa.int64())),
    pa.field("array_int32", pa.list_(pa.int64())),
    pa.field("array_int64", pa.list_(pa.int64())),
    pa.field("array_int8", pa.list_(pa.int64())),
    pa.field("array_uint16", pa.list_(pa.uint64())),
    pa.field("array_uint32", pa.list_(pa.uint64())),
    pa.field("array_uint64", pa.list_(pa.uint64())),
    pa.field("array_uint8", pa.list_(pa.uint64())),
    pa.field("array_unicode", pa.list_(pa.string())),
]

schema = pa.schema(fields)

buf = pa.BufferOutputStream()
pq.write_metadata(schema, buf)
reader = pa.BufferReader(buf.getvalue().to_pybytes())
reconstructed_schema = pq.read_schema(reader)

assert reconstructed_schema == reconstructed_schema
assert reconstructed_schema[0] == reconstructed_schema[0]

# This breaks on master / regression from 0.16.0
assert schema[0] == reconstructed_schema[0]

# This never worked but should
assert reconstructed_schema == schema
assert schema == reconstructed_schema
{code}
[DISCUSS][Java] Support non-nullable vectors
Dear all,

A non-nullable vector is one that is guaranteed to contain no nulls. We want to support non-nullable vectors in Java.

*Motivations:*
1. It is widely used in practice. For example, in a database engine, a column can be declared as not null, so it cannot contain null values.
2. Non-nullable vectors have significant performance advantages compared with their nullable counterparts:
1) the memory space of the validity buffer can be saved;
2) manipulation of the validity buffer can be bypassed;
3) some if-else branches can be replaced by sequential instructions (by the JIT compiler), leading to high throughput for the CPU pipeline.

*Potential Cost:*
For nullable vectors, there can be extra checks against the nullability flag. So we must change the code in a way that minimizes the cost.

*Proposed Changes:*
1. There is no need to create new vector classes. We add a final boolean to the vector base classes as the nullability flag. The value of the flag can be obtained from the field when creating the vector.
2. Add a method "boolean isNullable()" to the root interface ValueVector.
3. If a vector is non-nullable, its validity buffer should be an empty buffer (not null, so much of the existing logic can be left unchanged).
4. For operations involving validity buffers (e.g. isNull, get, set), we use the nullability flag to bypass manipulations of the validity buffer.

Therefore, it should be possible to support the feature with small code changes. BTW, please note that similar behavior has already been supported in C++.

Would you please give your valuable feedback?

Best,
Liya Fan
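[Editor's note] The proposed flag-based bypass can be illustrated with a toy vector (in Python for brevity; the class and method names are made up for illustration and do not mirror Arrow's Java classes): a nullability flag fixed at construction lets the hot paths skip the validity buffer entirely.

```python
class ToyIntVector:
    """Toy model of the proposal: a fixed nullability flag set at
    construction lets hot paths skip the validity buffer entirely.
    (Names are illustrative, not Arrow's Java API.)"""

    def __init__(self, nullable=True):
        self._nullable = nullable     # "final" flag, taken from the Field
        self._values = []
        self._validity = bytearray()  # stays empty when non-nullable

    def is_nullable(self):
        return self._nullable

    def append(self, value):
        if value is None:
            if not self._nullable:
                raise ValueError("non-nullable vector cannot hold nulls")
            self._values.append(0)
            self._validity.append(0)
        else:
            self._values.append(value)
            if self._nullable:        # bypass validity writes otherwise
                self._validity.append(1)

    def is_null(self, i):
        # Non-nullable vectors answer without touching the validity buffer.
        return self._nullable and self._validity[i] == 0

v = ToyIntVector(nullable=False)
v.append(7)
assert v.is_null(0) is False
assert len(v._validity) == 0  # no validity buffer maintained
```

The same shape applies to point 4 of the proposal: every validity-buffer manipulation is guarded by the flag, so non-nullable vectors pay only a predictable branch.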
[jira] [Created] (ARROW-8058) [C++][Python][Dataset] Provide an option to skip validation in FileSystemDatasetFactoryOptions
Ben Kietzman created ARROW-8058: --- Summary: [C++][Python][Dataset] Provide an option to skip validation in FileSystemDatasetFactoryOptions Key: ARROW-8058 URL: https://issues.apache.org/jira/browse/ARROW-8058 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset, Python Affects Versions: 0.16.0 Reporter: Ben Kietzman Assignee: Ben Kietzman Fix For: 1.0.0

This can be costly and is not always necessary. At the same time we could move file validation into the scan tasks; currently all files are inspected as the dataset is constructed, which can be expensive if the filesystem is slow. We'll be performing the validation multiple times but the check will be cheap since at scan time we'll be reading the file into memory anyway.
Re: [DISCUSS][Java] Support non-nullable vectors
hi Liya,

In C++ we elect certain faster code paths when the null count is 0 or computed to be zero. When the null count is 0, we do not allocate a validity bitmap. And there is a "nullable" metadata-only flag at the Field level. Could the same kinds of optimizations be implemented in Java without introducing a "nullable" concept?

- Wes

On Tue, Mar 10, 2020 at 8:13 AM Fan Liya wrote:
> [...]
Re: [DISCUSS][Java] Support non-nullable vectors
Hi Wes,

Thanks a lot for your quick reply. I think what you mentioned is almost exactly what we want to do in Java. The concept is not important. Maybe there are only some minor differences:

1. In C++, the null_count is mutable, while in Java, once a vector is constructed as non-nullable, its null count can only be 0.
2. In C++, a non-nullable array's validity buffer is null, while in Java, the buffer is an empty buffer, and cannot be changed.

Best,
Liya Fan

On Tue, Mar 10, 2020 at 9:26 PM Wes McKinney wrote:
> [...]
Re: Making a patch 0.16.1 Arrow release
It seems like the consensus is to push for a 0.17.0 major release sooner rather than doing a patch release, since releases in general are costly. This is fine with me. I see that a 0.17.0 milestone has been created in JIRA and some JIRA gardening has begun. Do you think we can be in a position to release by the week of March 23 or the week of March 30?

On Thu, Mar 5, 2020 at 8:39 PM Wes McKinney wrote:
>
> If people are generally on board with accelerating a 0.17.0 major
> release, then I would suggest renaming "1.0.0" to "0.17.0" and
> beginning to do issue gardening to whittle things down to
> critical-looking bugs and high probability patches for the next couple
> of weeks.
>
> On Thu, Mar 5, 2020 at 11:31 AM Wes McKinney wrote:
> >
> > I recall there are some other issues that have been reported or fixed
> > that are critical and not yet marked with 0.16.1.
> >
> > I'm also OK with doing a 0.17.0 release sooner
> >
> > On Thu, Mar 5, 2020 at 11:31 AM Neal Richardson wrote:
> > >
> > > I would also be more supportive of doing 0.17 earlier instead of a patch
> > > release.
> > >
> > > Neal
> > >
> > > On Thu, Mar 5, 2020 at 9:29 AM Neal Richardson wrote:
> > > >
> > > > If releases were costless to make, I'd be all for it, but it's not clear
> > > > to me that it's worth the diversion from other priorities to make a release
> > > > right now. Nothing on
> > > > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%200.16.1
> > > > jumps out to me as super urgent--what are you seeing as critical?
> > > >
> > > > If we did decide to go forward, would it be possible to do a release that
> > > > is limited to the affected implementations (say, do a Python-only release)?
> > > > That might reduce the cost of building and verifying enough to make it
> > > > reasonable to consider.
> > > >
> > > > Neal
> > > >
> > > > On Thu, Mar 5, 2020 at 8:19 AM Krisztián Szűcs wrote:
> > > >
> > > >> On Thu, Mar 5, 2020 at 5:07 PM Wes McKinney wrote:
> > > >> >
> > > >> > hi folks,
> > > >> >
> > > >> > There have been a number of critical issues reported (many of them
> > > >> > fixed already) since 0.16.0 was released. Is there interest in
> > > >> > preparing a patch 0.16.1 release (with backported patches onto a
> > > >> > maint-0.16.x branch as with 0.15.1) since the next major release is a
> > > >> > minimum of 6-8 weeks away from general availability?
> > > >> >
> > > >> > Did the 0.15.1 patch release helper script that Krisztian wrote get
> > > >> > contributed as a PR?
> > > >> Not yet, but it is available at
> > > >> https://gist.github.com/kszucs/b2743546044ccd3215e5bb34fa0d76a0
> > > >> >
> > > >> > Thanks
> > > >> > Wes
Re: [jira] [Created] (ARROW-8049) [C++] Upgrade bundled Thrift version to 0.13.0
Unsubscribe -Don On Mon, Mar 9, 2020 at 6:19 PM Wes McKinney (Jira) wrote: > Wes McKinney created ARROW-8049: > --- > > Summary: [C++] Upgrade bundled Thrift version to 0.13.0 > Key: ARROW-8049 > URL: https://issues.apache.org/jira/browse/ARROW-8049 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ > Reporter: Wes McKinney > Fix For: 0.17.0 > > > Follow up to discussion in ARROW-6821 > > > > -- > This message was sent by Atlassian Jira > (v8.3.4#803005) >
[jira] [Created] (ARROW-8059) [Python] Make FileSystem objects serializable
Joris Van den Bossche created ARROW-8059: Summary: [Python] Make FileSystem objects serializable Key: ARROW-8059 URL: https://issues.apache.org/jira/browse/ARROW-8059 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche It would be good to be able to pickle {{pyarrow.fs.FileSystem}} objects (eg for use in dask.distributed) cc [~apitrou]
[jira] [Created] (ARROW-8060) [Python] Make dataset Expression objects serializable
Joris Van den Bossche created ARROW-8060: Summary: [Python] Make dataset Expression objects serializable Key: ARROW-8060 URL: https://issues.apache.org/jira/browse/ARROW-8060 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche It would be good to be able to pickle pyarrow.dataset.Expression objects (eg for use in dask.distributed)
[jira] [Created] (ARROW-8061) [C++][Dataset] Ability to specify granularity of ParquetFileFragment (support row groups)
Joris Van den Bossche created ARROW-8061: Summary: [C++][Dataset] Ability to specify granularity of ParquetFileFragment (support row groups) Key: ARROW-8061 URL: https://issues.apache.org/jira/browse/ARROW-8061 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset Reporter: Joris Van den Bossche

Specifically for parquet (not sure if it will be relevant for other file formats as well; for IPC/feather, potentially the record batch), it would be useful to target row groups instead of files as fragments. Quoting the original design documents: _"In datasets consisting of many fragments, the dataset API must expose the granularity of fragments in a public way to enable parallel processing, if desired."_ And a comment from Wes on that: _"a single Parquet file can "export" one or more fragments based on settings. The default might be to split fragments based on row group"_

Currently, the level on which fragments are defined (at least in the typical partitioned parquet dataset) is "1 file == 1 fragment". Would it be possible or desirable to make this more fine-grained, where you could also opt to have a fragment per row group? We could have a ParquetFragment that has this option, and a ParquetFileFormat specific option to say what the granularity of a fragment is (file vs row group)?

cc [~fsaintjacques] [~bkietz]
[jira] [Created] (ARROW-8062) [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file
Joris Van den Bossche created ARROW-8062: Summary: [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file Key: ARROW-8062 URL: https://issues.apache.org/jira/browse/ARROW-8062 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset, Python Reporter: Joris Van den Bossche

Partitioned parquet datasets sometimes come with {{_metadata}} / {{_common_metadata}} files. Those files include information about the schema of the full dataset and potentially all RowGroup metadata as well (for {{_metadata}}).

Using those files during the creation of a parquet {{Dataset}} can give a more efficient factory (using the stored schema instead of inferring the schema from unioning the schemas of all files + using the paths to individual parquet files instead of crawling the directory). Basically, based on those files, the schema, list of paths and partition expressions (the information that is needed to create a Dataset) could be constructed. Such logic could be put in a different factory class, eg {{ParquetManifestFactory}} (as suggested by [~fsaintjacques]).
[jira] [Created] (ARROW-8063) [Python] Add user guide documentation for Datasets API
Joris Van den Bossche created ARROW-8063: Summary: [Python] Add user guide documentation for Datasets API Key: ARROW-8063 URL: https://issues.apache.org/jira/browse/ARROW-8063 Project: Apache Arrow Issue Type: Improvement Reporter: Joris Van den Bossche Fix For: 0.17.0

Currently, we only have API docs (https://arrow.apache.org/docs/python/api/dataset.html), but we also need prose docs explaining what the dataset module does, with examples. This can also include guidelines on how to use this instead of the ParquetDataset API (depending on how we end up doing ARROW-8039); this aspect is also covered by ARROW-8047.
Re: [Rust] Dictionary encoding for strings?
I believe that dictionary encoding in-memory was very recently implemented (February 28) in https://github.com/apache/arrow/commit/c7a7d2dcc46ed06593b994cb54c5eaf9ccd1d21d#diff-72812e30873455dcee2ce2d1ee26e4ab. Not sure about the other questions.

On Mon, Mar 9, 2020 at 11:07 PM Evan Chan wrote:
>
> Hi,
>
> Does the Rust implementation support dictionary encoded strings? It is not
> in the documentation anywhere, but there seem to be some variable-sized
> dictionary structs in the code base.
> If not, is there a plan to support it?
> Does DataFusion support reading from dictionary strings?
>
> It seems all the examples in DataFusion and the Rust part are focused on
> numbers. How robust is the string support, and how robust is the string
> functionality overall?
>
> Thanks,
> Evan
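[Editor's note] For readers new to the term: dictionary encoding replaces each string with an index into a dictionary of distinct values, which is what makes repeated strings cheap to store. A minimal sketch of the idea (in Python for brevity; this is illustrative, not the Rust implementation):

```python
def dictionary_encode(values):
    """Encode strings as (indices, dictionary): each value is replaced
    by the index of its first occurrence in the dictionary."""
    dictionary = []
    seen = {}
    indices = []
    for v in values:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        indices.append(seen[v])
    return indices, dictionary

indices, dictionary = dictionary_encode(["a", "b", "a", "a", "c"])
assert indices == [0, 1, 0, 0, 2]
assert dictionary == ["a", "b", "c"]
```

An Arrow DictionaryArray pairs exactly such an index array with a dictionary array of the distinct values.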
[jira] [Created] (ARROW-8064) [Dev] Implement Comment bot via Github actions
Krisztian Szucs created ARROW-8064: -- Summary: [Dev] Implement Comment bot via Github actions Key: ARROW-8064 URL: https://issues.apache.org/jira/browse/ARROW-8064 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Krisztian Szucs Assignee: Krisztian Szucs À la {{@ursabot}}.
[jira] [Created] (ARROW-8065) [C++][Dataset] Untangle Dataset, Fragment and ScanOptions
Francois Saint-Jacques created ARROW-8065: - Summary: [C++][Dataset] Untangle Dataset, Fragment and ScanOptions Key: ARROW-8065 URL: https://issues.apache.org/jira/browse/ARROW-8065 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques

We should be able to list fragments without going through the Scanner/ScanOptions hoops. This exposes a flaw in the current API, where it requires a ScanOptions to create a Fragment; this is also a problem for ARROW-7824, i.e. why do we need a ScanOptions (read manifest) to write record batches to a given path?

# Remove {{ScanOptions}} from Fragment's properties and move it into {{Fragment::Scan}} parameters.
# Remove {{ScanOptions}} from {{Dataset::GetFragments}}; if required, we can still provide an alternate signature, e.g. {{Dataset::GetFragments(std::shared_ptr predicate)}}, for sub-tree pruning in FileSystemDataset.
# Fragment constructor should take a schema (and store it as a property), usually extracted from the Dataset schema. Update the schema() method accordingly.
[jira] [Created] (ARROW-8066) PyArrow discards timezones
Markovtsev Vadim created ARROW-8066: --- Summary: PyArrow discards timezones Key: ARROW-8066 URL: https://issues.apache.org/jira/browse/ARROW-8066 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.16.0 Reporter: Markovtsev Vadim The description is at [https://github.com/pandas-dev/pandas/issues/32587]
[jira] [Created] (ARROW-8067) [Python] FindPython3 fails on Python 3.5
Wes McKinney created ARROW-8067: --- Summary: [Python] FindPython3 fails on Python 3.5 Key: ARROW-8067 URL: https://issues.apache.org/jira/browse/ARROW-8067 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Wes McKinney Fix For: 0.17.0

{code}
-- Could NOT find Backtrace (missing: Backtrace_LIBRARY Backtrace_INCLUDE_DIR)
-- Found PythonInterp: C:/Miniconda/python.exe (found version "3.7.4")
-- Found PythonLibs: C:/Miniconda/libs/Python37.lib
CMake Error at cmake_modules/FindNumPy.cmake:58 (message):
  NumPy import failure:

  Traceback (most recent call last):
    File "", line 1, in
  ModuleNotFoundError: No module named 'numpy'
Call Stack (most recent call first):
  cmake_modules/FindPython3Alt.cmake:31 (find_package)
  src/arrow/python/CMakeLists.txt:22 (find_package)

-- Configuring incomplete, errors occurred!
See also "C:/Users/wesmc/code/arrow/cpp/build/CMakeFiles/CMakeOutput.log".
See also "C:/Users/wesmc/code/arrow/cpp/build/CMakeFiles/CMakeError.log".
{code}

This appears to work in 0.16.0.
Re: Summary of RLE and other compression efforts?
See this past mailing list thread

https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937%40%3Cdev.arrow.apache.org%3E

and associated PR

https://github.com/apache/arrow/pull/4815

There hasn't been a lot of movement on this, primarily because all the key people who've expressed interest in it have been really busy with other matters (myself included). Having RLE encoding in memory at minimum would be a huge benefit for a number of applications, so it would be great to continue the discussion and create a more comprehensive proposal document describing what we would like to implement (and what we do not want to implement).

On Tue, Mar 10, 2020 at 3:41 AM Radev, Martin wrote:
> [...]
Re: Summary of RLE and other compression efforts?
Martin,

Many thanks for the links.

My main concern is not actually FP and integer data, but sparse string data. Having many very sparse arrays, each with a bitmap and values (assume dictionary also), would be really expensive. I have some ideas I’d like to throw out there, around something like a MapArray (think of it essentially as dictionaries of keys and values, plus List> for example), but something optimized for sparseness.

Overall, while I appreciate the design of Arrow arrays to be super fast for computation, I’d like to be able to keep more of such data in memory, thus I’m interested in more compact representations that ideally don’t need a compressor. More like encoding.

I saw a thread in the middle of last year about RLE encodings; this would be in the right direction, I think. It could be implemented properly such that it doesn’t make random access that bad.

As for FP, I have my own scheme which is scale tested, SIMD friendly and should perform relatively well, and can fit in with different predictors including XOR, DFCM, etc. Due to the high cardinality of most such data, I wonder if it wouldn’t be simpler to stick with one such scheme for all FP data.

Anyways, I’m most curious about whether there is a plan to implement RLE, the FP schemes you describe, etc. and bring them to Arrow. I.e., is there a plan for space-efficient encodings overall for Arrow?

Thanks very much,
Evan

> On Mar 10, 2020, at 1:41 AM, Radev, Martin wrote:
> [...]
Re: Summary of RLE and other compression efforts?
Thank you Wes. If the stars line up I’d be interested in joining and contributing to this effort. I have a ton of ideas around efficient encodings for different types of data. > On Mar 10, 2020, at 2:52 PM, Wes McKinney wrote: > > See this past mailing list thread > > https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937%40%3Cdev.arrow.apache.org%3E > > and associated PR > > https://github.com/apache/arrow/pull/4815 > > There hasn't been a lot of movement on this but primarily because all > the key people who've expressed interest in it have been really busy > with other matters (myself included). Have RLE-encoding in memory at > minimum would be a huge benefit for a number of applications, so it > would be great to continue the discussion and create a more > comprehensive proposal document describing what we would like to > implement (and what we do not want to implement) > > On Tue, Mar 10, 2020 at 3:41 AM Radev, Martin wrote: >> >> Hey Evan, >> >> >> thank you for the interest. >> >> There has been some effort for compressing floating-point data on the >> Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not >> compress floating point data but makes it more compressible for when a >> compressor, such as ZSTD, LZ4, etc, is used. It only works well for >> high-entropy floating-point data, somewhere at least as large as >= 15 bits >> of entropy per element. I suppose the encoding might actually also make >> sense for high-entropy integer data but I am not super sure. >> For low-entropy data, the dictionary encoding is good though I suspect there >> can be room for performance improvements. >> This is my final report for the encoding here: >> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf >> >> Note that at some point my investigation turned out be quite the same >> solution as the one in https://github.com/powturbo/Turbo-Transpose. 
>> >> >> Maybe the points I sent can be helpful. >> >> >> Kind regards, >> >> Martin >> >> >> From: evan_c...@apple.com on behalf of Evan Chan >> >> Sent: Tuesday, March 10, 2020 5:15:48 AM >> To: dev@arrow.apache.org >> Subject: Summary of RLE and other compression efforts? >> >> Hi folks, >> >> I’m curious about the state of efforts for more compressed encodings in the >> Arrow columnar format. I saw discussions previously about RLE, but is there >> a place to summarize all of the different efforts that are ongoing to bring >> more compressed encodings? >> >> Is there an effort to compress floating point or integer data using >> techniques such as XOR compression and Delta-Delta? I can contribute to >> some of these efforts as well. >> >> Thanks, >> Evan
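For intuition, the byte-stream-split transform Martin describes can be sketched in a few lines of Python. This is a toy float32 illustration with invented function names, not Parquet's actual page-level implementation:

```python
import struct

def byte_stream_split(values):
    """Scatter byte i of each float32 into stream i, so that bytes with
    similar roles (e.g. exponent bytes) become adjacent and compress
    better under a general-purpose compressor such as ZSTD or LZ4."""
    raw = b"".join(struct.pack("<f", v) for v in values)
    return b"".join(raw[i::4] for i in range(4))

def byte_stream_unsplit(encoded):
    """Inverse transform: re-interleave the four byte streams."""
    n = len(encoded) // 4
    streams = [encoded[i * n:(i + 1) * n] for i in range(4)]
    raw = bytes(streams[j][i] for i in range(n) for j in range(4))
    return [struct.unpack("<f", raw[i * 4:(i + 1) * 4])[0] for i in range(n)]
```

The transform itself saves no space; the win comes from feeding the split streams to a compressor afterwards, as the report above measures.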
Re: Summary of RLE and other compression efforts?
On Tue, Mar 10, 2020 at 5:01 PM Evan Chan wrote: > > Martin, > > Many thanks for the links. > > My main concern is not actually FP and integer data, but sparse string data. > Having many very sparse arrays, each with a bitmap and values (assume > dictionary also), would be really expensive. I have some ideas I’d like to > throw out there, around something like a MapArray (think of it essentially as > dictionaries of keys and values, plus List> for example), but > something optimized for sparseness. > > Overall, while I appreciate that Arrow arrays are designed to be super fast for > computation, I’d like to be able to keep more of such data in memory, so > I’m interested in more compact representations that ideally don’t need a > compressor; more like an encoding. > > I saw a thread in the middle of last year about RLE encodings; this would be in the > right direction, I think. It could be implemented such that it > doesn’t make random access too costly. > > As for FP, I have my own scheme which is scale-tested, SIMD-friendly, > should perform relatively well, and can fit in with different predictors > including XOR, DFCM, etc. Due to the high cardinality of most such data, I > wonder if it wouldn’t be simpler to stick with one such scheme for all FP > data. > > Anyway, I’m most curious about whether there is a plan to implement RLE, the FP > schemes you describe, etc., and bring them to Arrow. > I.e., is there a plan for > space-efficient encodings overall for Arrow? It's been discussed many times in the past. As Arrow is developed by volunteers, if someone volunteers their time to work on it, it can happen. The first step would be to build consensus about what sort of protocol-level additions (see the format/ directory and associated documentation) are needed. Once there is consensus about what to build and a complete specification for that, then implementation can move forward.
> Thanks very much, > Evan
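The RLE direction discussed above can keep random access reasonably cheap if run ends (offsets) are stored instead of run lengths. A minimal Python sketch of that idea follows; this is one hypothetical layout for illustration, not a committed Arrow format:

```python
import bisect

def rle_encode(values):
    """Collapse consecutive equal values into two parallel lists:
    run-end offsets and run values."""
    ends, vals = [], []
    for v in values:
        if vals and vals[-1] == v:
            ends[-1] += 1  # extend the current run
        else:
            ends.append((ends[-1] if ends else 0) + 1)
            vals.append(v)
    return ends, vals

def rle_get(ends, vals, i):
    """Random access to logical index i in O(log n_runs):
    binary-search the run-end offsets for the run containing i."""
    return vals[bisect.bisect_right(ends, i)]
```

Storing run ends rather than lengths is what makes `rle_get` a single binary search, which is one way to address Evan's concern about random access cost.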
[jira] [Created] (ARROW-8068) [Python] Externalize option whether to bundle zlib DLL in Python packages
Wes McKinney created ARROW-8068: --- Summary: [Python] Externalize option whether to bundle zlib DLL in Python packages Key: ARROW-8068 URL: https://issues.apache.org/jira/browse/ARROW-8068 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney I ran into an esoteric situation in ARROW-8015 where I built the C++ library with all bundled dependencies, including zlib. I then built a Python wheel, but the Python build failed when using {{PYARROW_BUNDLE_ARROW_CPP=1}} because it could not find {{zlib.dll}}. The failure points were both in CMakeLists.txt and in setup.py. Perhaps this situation will only arise rarely, but we may want to add a flag to toggle off the zlib bundling behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8069) [C++] Should the default value of "check_metadata" arguments of Equals methods be "true"?
Wes McKinney created ARROW-8069: --- Summary: [C++] Should the default value of "check_metadata" arguments of Equals methods be "true"? Key: ARROW-8069 URL: https://issues.apache.org/jira/browse/ARROW-8069 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney We just changed the default in Python to False for usability reasons. Since C++ has different usability considerations, we don't necessarily need to have the default be the same, but I'm curious if anyone has any opinions one way or the other. I would be weakly supportive of changing the default to false -- This message was sent by Atlassian Jira (v8.3.4#803005)
[NIGHTLY] Arrow Build Report for Job nightly-2020-03-10-0
Arrow Build Report for Job nightly-2020-03-10-0

All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0

Failed Tasks:
- centos-7: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-github-centos-7
- centos-8: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-github-centos-8
- conda-linux-gcc-py38: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-linux-gcc-py38
- debian-stretch: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-github-debian-stretch
- gandiva-jar-osx: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-gandiva-jar-osx
- gandiva-jar-trusty: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-gandiva-jar-trusty
- homebrew-cpp: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-homebrew-cpp
- test-conda-cpp-valgrind: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-circle-test-conda-cpp-valgrind
- test-conda-python-3.7-pandas-master: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-circle-test-conda-python-3.7-pandas-master
- test-conda-python-3.7-turbodbc-latest: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-circle-test-conda-python-3.7-turbodbc-master
- test-r-rhub-debian-gcc-devel: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-test-r-rhub-debian-gcc-devel
- test-r-rhub-ubuntu-gcc-release: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-test-r-rhub-ubuntu-gcc-release
- test-r-rstudio-r-base-3.6-opensuse15: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-test-r-rstudio-r-base-3.6-opensuse15
- test-r-rstudio-r-base-3.6-opensuse42: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-test-r-rstudio-r-base-3.6-opensuse42
- ubuntu-xenial: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-github-ubuntu-xenial
- wheel-osx-cp35m: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-wheel-osx-cp35m
- wheel-osx-cp36m: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-wheel-osx-cp36m
- wheel-osx-cp37m: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-wheel-osx-cp37m
- wheel-osx-cp38: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-wheel-osx-cp38

Succeeded Tasks:
- centos-6: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-github-centos-6
- conda-linux-gcc-py36: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-linux-gcc-py37
- conda-osx-clang-py36: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-osx-clang-py38
- conda-win-vs2015-py36: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-azure-conda-win-vs2015-py38
- debian-buster: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-github-debian-buster
- macos-r-autobrew: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-travis-macos-r-autobrew
- test-conda-cpp: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-circle-test-conda-cpp
- test-conda-python-3.6: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-202
[jira] [Created] (ARROW-8070) [Python] Casting Segfault
Daniel Nugent created ARROW-8070: Summary: [Python] Casting Segfault Key: ARROW-8070 URL: https://issues.apache.org/jira/browse/ARROW-8070 Project: Apache Arrow Issue Type: Bug Reporter: Daniel Nugent

Was messing around with some nested arrays and found a pretty easy-to-reproduce segfault:

{code:java}
Python 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:33:48)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np, pyarrow as pa
>>> pa.__version__
'0.16.0'
>>> np.__version__
'1.18.1'
>>> x = [np.array([b'a', b'b'])]
>>> a = pa.array(x, pa.list_(pa.binary()))
>>> a
[
  [
    61,
    62
  ]
]
>>> a.cast(pa.string())
Segmentation fault
{code}

I don't know if that cast makes sense, but I left the checks on, so I would not expect a segfault from it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8071) [GLib] Build error with configure
Kouhei Sutou created ARROW-8071: --- Summary: [GLib] Build error with configure Key: ARROW-8071 URL: https://issues.apache.org/jira/browse/ARROW-8071 Project: Apache Arrow Issue Type: Bug Components: GLib Reporter: Kouhei Sutou Assignee: Kouhei Sutou This is introduced by ARROW-8055. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-03-10-0
Hi, Failures of Linux packages will be fixed by https://github.com/apache/arrow/pull/6575 . Sorry. Thanks, -- kou In <5e6834bf.1c69fb81.a268f.f...@mx.google.com> "[NIGHTLY] Arrow Build Report for Job nightly-2020-03-10-0" on Tue, 10 Mar 2020 17:45:51 -0700 (PDT), Crossbow wrote:
Re: Summary of RLE and other compression efforts?
+1 to what Wes said. I'm hoping to have some more time to spend on this end of Q2/beginning of Q3 if no progress is made by then. I still think we should be careful on what is added to the spec, in particular, we should be focused on encodings that can be used to improve computational efficiency rather than just smaller size. Also, it is important to note that any sort of encoding/compression must be supportable across multiple languages/platforms. Thanks, Micah On Tue, Mar 10, 2020 at 3:12 PM Wes McKinney wrote:
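For reference, the XOR compression Evan asks about is the core transform of Gorilla-style floating-point compression. Below is a hedged Python sketch of just the transform, assuming float64 values and omitting the bit-packing stage that delivers the actual space savings; function names are invented for the example:

```python
import struct

def f64_bits(x):
    """Reinterpret a float64 as its 64-bit integer bit pattern."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def xor_deltas(values):
    """XOR each value's bit pattern with its predecessor. Repeated or
    slowly-varying series produce deltas with long runs of zero bits,
    which a bit-level packer can then store very compactly."""
    prev, out = 0, []
    for v in values:
        bits = f64_bits(v)
        out.append(bits ^ prev)
        prev = bits
    return out
```

A repeated value yields a delta of exactly zero, which is why this family of schemes works so well on sensor-like time series.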
Re: [jira] [Created] (ARROW-8049) [C++] Upgrade bundled Thrift version to 0.13.0
Hi Don, I believe you should send an e-mail to dev-unsubscr...@arrow.apache.org instead of simply replying to the list. Thanks, Micah On Tue, Mar 10, 2020 at 8:57 AM Don Hilborn wrote: > Unsubscribe > > > -Don > > > On Mon, Mar 9, 2020 at 6:19 PM Wes McKinney (Jira) > wrote: > > Wes McKinney created ARROW-8049: > > --- > > > > Summary: [C++] Upgrade bundled Thrift version to 0.13.0 > > Key: ARROW-8049 > > URL: https://issues.apache.org/jira/browse/ARROW-8049 > > Project: Apache Arrow > > Issue Type: Improvement > > Components: C++ > > Reporter: Wes McKinney > > Fix For: 0.17.0 > > > > > > Follow up to discussion in ARROW-6821 > > > > > > > > -- > > This message was sent by Atlassian Jira > > (v8.3.4#803005) > > >
Re: [Java] Port vector validate functionality
I agree, it would also be good to run with some of the fuzzed IPC files. On Fri, Mar 6, 2020 at 6:20 AM Wes McKinney wrote: > Seems useful. It may be a good idea to run within integration tests as > an extra sanity check also > > On Fri, Mar 6, 2020 at 2:27 AM Ji Liu wrote: > > > > > > Hi all, > > On the C++ side, we already have array validate functionality[1] but no > similar functionality on the Java side. > > I was wondering whether we should port this to the Java implementation. Since we > already have a visitor interface[2], it seems not very complicated. I > created an issue to track this[3]. > > > > > > Thanks, > > Ji Liu > > > > [1] > https://github.com/apache/arrow/blob/6600a39ffe149971afd5ad3c78c2b538cdc03cfd/cpp/src/arrow/array/validate.h > > [2] > https://github.com/apache/arrow/blob/6600a39ffe149971afd5ad3c78c2b538cdc03cfd/java/vector/src/main/java/org/apache/arrow/vector/compare/VectorVisitor.java > > [3] https://issues.apache.org/jira/browse/ARROW-8020 >
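To illustrate the kind of invariants such a validate pass checks, here is a small Python sketch; it is illustrative only, with an invented function name, and a real Java port would hang off the VectorVisitor interface referenced above:

```python
def validate_list_offsets(offsets, values_length):
    """Check the structural invariants of a list array's offsets buffer:
    the first offset is 0, offsets are non-decreasing, and the last
    offset does not run past the child values buffer."""
    errors = []
    if not offsets or offsets[0] != 0:
        errors.append("first offset must be 0")
    if any(a > b for a, b in zip(offsets, offsets[1:])):
        errors.append("offsets must be non-decreasing")
    if offsets and offsets[-1] > values_length:
        errors.append("last offset exceeds values length")
    return errors
```

Running checks like these against fuzzed IPC files, as suggested above, would catch malformed buffers before they reach compute code.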
Re: [Discuss] [Java] Implement vector diff functionality
I'm in favor of this. I think this can be combined with a custom matcher for Google's Truth [1] library to make a lot of our unit tests much more readable. [1] https://github.com/google/truth On Thu, Mar 5, 2020 at 11:29 PM Ji Liu wrote: > > Hi all, > On the C++ side, we already have array diff functionality[1] for array equality > testing, to make it easy to see differences between arrays and reduce > debugging time. > I think it would be good to have similar functionality on the Java side for better > testing facilities, and I created an issue to track this[2]. > > > Thanks, > Ji Liu > > [1] > https://github.com/apache/arrow/blob/6600a39ffe149971afd5ad3c78c2b538cdc03cfd/cpp/src/arrow/array/diff.h > [2] https://issues.apache.org/jira/browse/ARROW-8019
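As a rough analogue of what a diff facility produces, here is a Python sketch built on the standard library's difflib; it is a hypothetical helper for illustration, not Arrow's actual edit-script algorithm:

```python
import difflib

def diff_report(expected, actual):
    """Return human-readable descriptions of where two value sequences
    diverge, instead of a bare equality-assertion failure."""
    report = []
    matcher = difflib.SequenceMatcher(a=expected, b=actual)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # tags: replace / delete / insert
            report.append(
                f"{tag}: expected[{i1}:{i2}]={expected[i1:i2]!r} "
                f"actual[{j1}:{j2}]={actual[j1:j2]!r}")
    return report
```

Wired into a test matcher (e.g. a Truth Subject on the Java side), output like this pinpoints the mismatching slice rather than dumping both arrays.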
Re: [DISCUSS][Java] Support non-nullable vectors
Hi Liya Fan, I'm a little concerned that this will change assumptions for at least some of the clients using the library (some might always rely on the validity buffer being present). I think this is a good feature to have for the reasons you mentioned. It seems like there would need to be some sort of configuration bit to set for this behavior. But I'd be worried about the code complexity this would introduce. Thanks, Micah On Tue, Mar 10, 2020 at 6:42 AM Fan Liya wrote: > Hi Wes, > > Thanks a lot for your quick reply. > I think what you mentioned is almost exactly what we want to do in Java. The > concept is not important. > > Maybe there are only some minor differences: > 1. In C++, the null_count is mutable, while in Java, once a vector is > constructed as non-nullable, its null count can only be 0. > 2. In C++, a non-nullable array's validity buffer is null, while in Java, > the buffer is an empty buffer, and cannot be changed. > > Best, > Liya Fan > > On Tue, Mar 10, 2020 at 9:26 PM Wes McKinney wrote: > > hi Liya, > > > > In C++ we elect certain faster code paths when the null count is 0 or > > computed to be zero. When the null count is 0, we do not allocate a > > validity bitmap. And there is a "nullable" metadata-only flag at the > > Field level. Could the same kinds of optimizations be implemented in > > Java without introducing a "nullable" concept? > > > > - Wes > > > > On Tue, Mar 10, 2020 at 8:13 AM Fan Liya wrote: > > > > > > Dear all, > > > > > > A non-nullable vector is one that is guaranteed to contain no nulls. We > > > want to support non-nullable vectors in Java. > > > > > > *Motivations:* > > > 1. It is widely used in practice. For example, in a database engine, a > > > column can be declared as not null, so it cannot contain null values. > > > 2. Non-nullable vectors have significant performance advantages compared > > with > their nullable counterparts, such as: > > > 1) the memory space of the validity buffer can be saved.
> > > 2) manipulation of the validity buffer can be bypassed > > > 3) some if-else branches can be replaced by sequential instructions (by > > > the JIT compiler), leading to high throughput for the CPU pipeline. > > > > > > *Potential Cost:* > > > For nullable vectors, there can be extra checks against the nullability > > > flag. So we must change the code in a way that minimizes the cost. > > > > > > *Proposed Changes:* > > > 1. There is no need to create new vector classes. We add a final boolean > to > > > the vector base classes as the nullability flag. The value of the flag > can > > > be obtained from the field when creating the vector. > > > 2. Add a method "boolean isNullable()" to the root interface ValueVector. > > > 3. If a vector is non-nullable, its validity buffer should be an empty > > > buffer (not null, so much of the existing logic can be left unchanged). > > > 4. For operations involving validity buffers (e.g. isNull, get, set), we > > > use the nullability flag to bypass manipulations of the validity buffer. > > > > > > Therefore, it should be possible to support the feature with small code > > > changes. > > > > > > BTW, please note that similar behaviors have already been supported in > > C++. > > > > > > Would you please give your valuable feedback? > > > > > > Best, > > > Liya Fan > > >
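The proposal's points 1 to 4 can be modeled in a few lines of Python. The class below is a toy sketch with an invented name, not the actual Java vector hierarchy; it only illustrates how the flag lets null checks bypass the validity buffer:

```python
class SketchVector:
    """Toy model of a vector carrying a final nullability flag."""

    def __init__(self, values, validity=None, nullable=True):
        self._values = values
        self._nullable = nullable
        # Point 3: non-nullable -> empty (but non-None) validity buffer,
        # so code that touches the buffer's identity keeps working.
        self._validity = [] if not nullable else (validity or [True] * len(values))

    def is_nullable(self):  # point 2: flag exposed on the interface
        return self._nullable

    def is_null(self, i):
        if not self._nullable:
            return False  # point 4: bypass the validity buffer entirely
        return not self._validity[i]

    def null_count(self):
        if not self._nullable:
            return 0  # guaranteed zero, never recomputed
        return sum(1 for v in self._validity if not v)
```

The `is_null` fast path is where the branch-elimination benefit mentioned in the motivation would come from: for non-nullable vectors the JIT can fold the check to a constant.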
[jira] [Created] (ARROW-8072) Add const constraint when parsing data
Siyuan Zhuang created ARROW-8072: Summary: Add const constraint when parsing data Key: ARROW-8072 URL: https://issues.apache.org/jira/browse/ARROW-8072 Project: Apache Arrow Issue Type: Improvement Components: C++ - Plasma Reporter: Siyuan Zhuang Assignee: Siyuan Zhuang Input data for plasma protocol.h/protocol.cc should be const. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8073) [GLib] Add binding of arrow::fs::PathForest
Kenta Murata created ARROW-8073: --- Summary: [GLib] Add binding of arrow::fs::PathForest Key: ARROW-8073 URL: https://issues.apache.org/jira/browse/ARROW-8073 Project: Apache Arrow Issue Type: New Feature Components: GLib Reporter: Kenta Murata Assignee: Kenta Murata -- This message was sent by Atlassian Jira (v8.3.4#803005)