Re: Summary of RLE and other compression efforts?
Hi Evan, Response inline. This is just my opinion, others are welcome to chime in. Apologies for the long reply. > I’m wondering how important retaining strict O(1) random access is for > Arrow’s use cases? Would “substantially sub-linear” or “approximately > O(c)” where c is a very small constant (a la B-Tree) work? I don't think this has been formally discussed past the initial design consideration for the format. We should be careful of terminology, RLE only provides O(lg(n)) random access to a data-element. In either case I would lean against constant factor additions to the format and encoding proposal. Some of the techniques brought up here might be interesting to add to something like Parquet or create an adapter between FiloDB data and Arrow. I feel a little uncomfortable in the fact that there isn't a more clearly defined dividing line for what belongs in Arrow and what doesn't. I suppose this is what discussions like these are about :) In my mind Arrow has two primary goals: 1. Wide accessibility 2. Speed, primarily with the assumption that CPU time is the constraining resource. Some of the technologies in the Arrow ecosystem (e.g. Feather, Flight) have blurred this line a little bit. In some cases these technologies are dominated by other constraints as seen by Wes's compression proposal [1]. Given these points, adding complexity and constant factors, at least for an initial implementation, are something that needs to be looked at very carefully. I would likely change my mind if there was a demonstration that the complexity adds substantial value across a variety of data sets/workloads beyond simpler versions. I also have my own, SIMD-friendly, variable-length encoding which can > compress both floating point and integers quickly. Per above I'm worried about adding constant factors here. Happy to give more details, perhaps in a separate channel if needed. For Apache projects the mailing list is best place to discuss these things (or a link to a design doc posted to the list). If a higher bandwidth medium is needed there is a sync-up every other week on Wednesdays 9 AM Pacific Time (there is usually a reminder sent out the night before) and notes are posted back to the list. -Micah [1] https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E On Sat, Mar 14, 2020 at 10:23 PM Evan Chan wrote: > Micah, > > Thanks for the link to the PR. I’ll add more comments about those > specific proposals there, though maybe I should outline ideas here as I’m > not sure the PR is the best place to expand on details. > > I’m wondering how important retaining strict O(1) random access is for > Arrow’s use cases? Would “substantially sub-linear” or “approximately > O(c)” where c is a very small constant (a la B-Tree) work? > > > > I'm not sure if this is provided was provided as an example as something > to add, but I believe this is already supported in the spec. > > > In the longer term I think it is potentially something Arrow should > include, but given the current number of developers contributing across all > languages I'm wary of tacking on too much before we have working reference > implementations. > > > I can understand trying to tackle less, just one new proposal and see it > implemented widely first. > > > If you think there are better encodings to tackle in the short term I'd > welcome feedback on the proposal (or a more formal proposal of your own). > > > An alternative to RLE with storing run-ends for every element, is to use > RLE or variable-length encoding schemes, but combine them with an index or > offsets for every Nth element. This means you get O(c) random access where > you have to decode no more than N elements at a time. The N can be > adjusted to tune for either better compression (larger N) or faster reads > (less N to decode at a time). > > I also have my own, SIMD-friendly, variable-length encoding which can > compress both floating point and integers quickly. Some details here: > > https://github.com/filodb/FiloDB/blob/develop/doc/compression.md#predictive-nibblepacking > > Happy to give more details, perhaps in a separate channel if needed. > > -Evan > > > Thanks, > -Micah > > [1] https://github.com/apache/arrow/pull/4815/files > > > > On Wed, Mar 11, 2020 at 9:24 AM Evan Chan wrote: > >> Sure thing. >> >> Computation speed needs to be thought about in context We might find >> something which takes up half the space to be a little more computationally >> expensive, but in the grand scheme of things is faster to compute as more >> of it can fit in memory, and it saves I/O. I definitely agree that just >> making something smaller without other gains might not be worth it. >> >> Some small changes help both computation and size. For example, encoding >> nulls in dictionary values helps reduce the need for both bitmap storage >> and lookup. >> >> So I suppose the process goes
[NIGHTLY] Arrow Build Report for Job nightly-2020-03-16-1
Arrow Build Report for Job nightly-2020-03-16-1 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1 Failed Tasks: - gandiva-jar-trusty: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-travis-gandiva-jar-trusty - test-conda-cpp-valgrind: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-cpp-valgrind - test-conda-python-3.7-turbodbc-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-python-3.7-turbodbc-latest - test-conda-python-3.7-turbodbc-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-python-3.7-turbodbc-master - test-debian-c-glib: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-debian-c-glib - wheel-osx-cp35m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-travis-wheel-osx-cp35m - wheel-osx-cp36m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-travis-wheel-osx-cp36m - wheel-osx-cp37m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-travis-wheel-osx-cp37m - wheel-osx-cp38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-travis-wheel-osx-cp38 Succeeded Tasks: - centos-6: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-github-centos-6 - centos-7: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-github-centos-7 - centos-8: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-github-centos-8 - conda-linux-gcc-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-azure-conda-linux-gcc-py36 - conda-linux-gcc-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-azure-conda-linux-gcc-py37 - conda-linux-gcc-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-azure-conda-linux-gcc-py38 - conda-osx-clang-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-azure-conda-osx-clang-py36 - conda-osx-clang-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-azure-conda-osx-clang-py37 - conda-osx-clang-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-azure-conda-osx-clang-py38 - conda-win-vs2015-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-azure-conda-win-vs2015-py36 - conda-win-vs2015-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-azure-conda-win-vs2015-py37 - conda-win-vs2015-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-azure-conda-win-vs2015-py38 - debian-buster: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-github-debian-buster - debian-stretch: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-github-debian-stretch - gandiva-jar-osx: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-travis-gandiva-jar-osx - homebrew-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-travis-homebrew-cpp - macos-r-autobrew: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-travis-macos-r-autobrew - test-conda-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-cpp - test-conda-python-3.6: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-python-3.6 - test-conda-python-3.7-dask-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-python-3.7-dask-latest - test-conda-python-3.7-hdfs-2.9.2: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-python-3.7-hdfs-2.9.2 - test-conda-python-3.7-kartothek-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-python-3.7-kartothek-latest - test-conda-python-3.7-kartothek-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-python-3.7-kartothek-master - test-conda-python-3.7-pandas-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-python-3.7-pandas-latest - test-conda-python-3.7-pandas-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-python-3.7-pandas-master - test-conda-python-3.7-spark-master: URL:
[jira] [Created] (ARROW-8134) [C++][CI] Revisit the flaky S3 tests caused by recent minio
Krisztian Szucs created ARROW-8134: -- Summary: [C++][CI] Revisit the flaky S3 tests caused by recent minio Key: ARROW-8134 URL: https://issues.apache.org/jira/browse/ARROW-8134 Project: Apache Arrow Issue Type: Task Components: C++, Continuous Integration Reporter: Krisztian Szucs See change in https://github.com/apache/arrow/pull/6632/files#diff-01448dc0a0e3217fd1579b81a915d593R843 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8133) [CI] Github Actions sometimes fail to checkout Arrow
Krisztian Szucs created ARROW-8133: -- Summary: [CI] Github Actions sometimes fail to checkout Arrow Key: ARROW-8133 URL: https://issues.apache.org/jira/browse/ARROW-8133 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration Reporter: Krisztian Szucs Assignee: Krisztian Szucs Failing build https://github.com/apache/arrow/pull/6632/checks?check_run_id=511663097 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8132) [C++] arrow-s3fs-test failing on master
Hatem Helal created ARROW-8132: -- Summary: [C++] arrow-s3fs-test failing on master Key: ARROW-8132 URL: https://issues.apache.org/jira/browse/ARROW-8132 Project: Apache Arrow Issue Type: Improvement Reporter: Hatem Helal Log: [https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/branch/master/job/9dgr7xl635yuwh7y#L1917] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8131) Add dynamic attributes to PyArrow ExtensionArray
Paul Balanca created ARROW-8131: --- Summary: Add dynamic attributes to PyArrow ExtensionArray Key: ARROW-8131 URL: https://issues.apache.org/jira/browse/ARROW-8131 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.16.0 Environment: Ubuntu 19.10 + Python 3.7 Reporter: Paul Balanca In the present implementation, the interface of the class `ExtensionArray` is not extendable by user. One can not easily inherit from it, as the constructor __init__ can not be called directly, or it does not allow adding dynamically atttributes. Keeping the current design with build methods `from_*`, I believe it could then make sense to allow dynamic attributes in `ExtensionArray` (see [https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#dynamic-attributes]). The runtime & size cost of the Python objects would be fairly minimal, compared to increased flexibility it would allow. A typical use case where it could be useful would be dynamic mixins (added by custom Factory), allowing projects based on PyArrow to extend (! :)) the interface with specific business logic. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8130) [C++][Gandiva] Fix dexVisitor
Prudhvi Porandla created ARROW-8130: --- Summary: [C++][Gandiva] Fix dexVisitor Key: ARROW-8130 URL: https://issues.apache.org/jira/browse/ARROW-8130 Project: Apache Arrow Issue Type: Task Reporter: Prudhvi Porandla -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8129) [C++][Compute] Refine compare sorting kernel
Yibo Cai created ARROW-8129: --- Summary: [C++][Compute] Refine compare sorting kernel Key: ARROW-8129 URL: https://issues.apache.org/jira/browse/ARROW-8129 Project: Apache Arrow Issue Type: Improvement Reporter: Yibo Cai Assignee: Yibo Cai Sorting kernel implements two comparison functions, [CompareValues|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L67] use array.Value() for numeric data and [CompareViews|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L72] uses array.GetView() for non-numeric ones. It can be simplified by using GetView() only as all data types support GetView(). To my surprise, benchmark shows about 40% performance improvement after the change. After some digging, I find in current code, the [comparison callback|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L94] is not inlined (check disassembled code), it leads to a function call. It's very bad for this hot loop. Using only GetView() fixes this issue, code inlined okay. -- This message was sent by Atlassian Jira (v8.3.4#803005)