Re: Summary of RLE and other compression efforts?

2020-03-16 Thread Micah Kornfield
Hi Evan,
Response inline.  This is just my opinion, others are welcome to chime in.
Apologies for the long reply.


> I’m wondering how important retaining strict O(1) random access is for
> Arrow’s use cases?  Would “substantially sub-linear” or “approximately
> O(c)” where c is a very small constant (a la B-Tree) work?


I don't think this has been formally discussed past the initial design
consideration for the format.  We should be careful of terminology,  RLE
only provides O(lg(n)) random access to a data-element.

In either case I would lean against constant factor additions to the format
and encoding proposal.  Some of the techniques brought up here might be
interesting to add to something like Parquet or create an adapter between
FiloDB data and Arrow.

I feel a little uncomfortable in the fact that there isn't a more clearly
defined dividing line for what belongs in Arrow and what doesn't.  I
suppose this is what discussions like these are about :)

In my mind Arrow has two primary goals:
1.  Wide accessibility
2.  Speed, primarily with the assumption that CPU time is the constraining
resource.  Some of the technologies in the Arrow ecosystem (e.g. Feather,
Flight) have blurred this line a little bit. In some cases these
technologies are dominated by other constraints as seen by Wes's
compression proposal [1].

Given these points, adding complexity and constant factors, at least for an
initial implementation, are something that needs to be looked at very
carefully.  I would likely change my mind if there was a demonstration that
the complexity adds substantial value across a variety of data
sets/workloads beyond simpler versions.

 I also have my own, SIMD-friendly, variable-length encoding which can
> compress both floating point and integers quickly.

Per above I'm worried about adding constant factors here.

Happy to give more details, perhaps in a separate channel if needed.


For Apache projects the mailing list is best place to discuss these things
(or a link to a design doc posted to the list). If a higher bandwidth
medium is needed there is a sync-up every other week on Wednesdays 9 AM
Pacific Time (there is usually a reminder sent out the night before) and
notes are posted back to the list.

-Micah


[1]
https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E


On Sat, Mar 14, 2020 at 10:23 PM Evan Chan  wrote:

> Micah,
>
> Thanks for the link to the PR.  I’ll add more comments about those
> specific proposals there, though maybe I should outline ideas here as I’m
> not sure the PR is the best place to expand on details.
>
> I’m wondering how important retaining strict O(1) random access is for
> Arrow’s use cases?  Would “substantially sub-linear” or “approximately
> O(c)” where c is a very small constant (a la B-Tree) work?
>
>
>
> I'm not sure if this is provided was provided as an example as something
> to add, but I believe this is already supported in the spec.
>
>
> In the longer term I think it is potentially something Arrow should
> include, but given the current number of developers contributing across all
> languages I'm wary of tacking on too much before we have working  reference
> implementations.
>
>
> I can understand trying to tackle less, just one new proposal and see it
> implemented widely first.
>
>
> If you think there are better encodings to tackle in the short term I'd
> welcome feedback on the proposal (or a more formal proposal of your own).
>
>
> An alternative to RLE with storing run-ends for every element, is to use
> RLE or variable-length encoding schemes, but combine them with an index or
> offsets for every Nth element.  This means you get O(c) random access where
> you have to decode no more than N elements at a time.  The N can be
> adjusted to tune for either better compression (larger N) or faster reads
> (less N to decode at a time).
>
> I also have my own, SIMD-friendly, variable-length encoding which can
> compress both floating point and integers quickly.   Some details here:
>
> https://github.com/filodb/FiloDB/blob/develop/doc/compression.md#predictive-nibblepacking
>
> Happy to give more details, perhaps in a separate channel if needed.
>
> -Evan
>
>
> Thanks,
> -Micah
>
> [1] https://github.com/apache/arrow/pull/4815/files
>
>
>
> On Wed, Mar 11, 2020 at 9:24 AM Evan Chan  wrote:
>
>> Sure thing.
>>
>> Computation speed needs to be thought about in context  We might find
>> something which takes up half the space to be a little more computationally
>> expensive, but in the grand scheme of things is faster to compute as more
>> of it can fit in memory, and it saves I/O.   I definitely agree that just
>> making something smaller without other gains might not be worth it.
>>
>> Some small changes help both computation and size.  For example, encoding
>> nulls in dictionary values helps reduce the need for both bitmap storage
>> and lookup.
>>
>> So I suppose the process goes 

[NIGHTLY] Arrow Build Report for Job nightly-2020-03-16-1

2020-03-16 Thread Crossbow


Arrow Build Report for Job nightly-2020-03-16-1

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1

Failed Tasks:
- gandiva-jar-trusty:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-travis-gandiva-jar-trusty
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-cpp-valgrind
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-python-3.7-turbodbc-master
- test-debian-c-glib:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-debian-c-glib
- wheel-osx-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-travis-wheel-osx-cp35m
- wheel-osx-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-travis-wheel-osx-cp36m
- wheel-osx-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-travis-wheel-osx-cp37m
- wheel-osx-cp38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-travis-wheel-osx-cp38

Succeeded Tasks:
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-github-centos-6
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-github-centos-7
- centos-8:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-github-centos-8
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-azure-conda-linux-gcc-py38
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-azure-conda-win-vs2015-py38
- debian-buster:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-github-debian-buster
- debian-stretch:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-github-debian-stretch
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-travis-gandiva-jar-osx
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-travis-homebrew-cpp
- macos-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-travis-macos-r-autobrew
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-cpp
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-kartothek-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-python-3.7-kartothek-latest
- test-conda-python-3.7-kartothek-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-python-3.7-kartothek-master
- test-conda-python-3.7-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-python-3.7-pandas-latest
- test-conda-python-3.7-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-16-1-circle-test-conda-python-3.7-pandas-master
- test-conda-python-3.7-spark-master:
  URL: 

[jira] [Created] (ARROW-8134) [C++][CI] Revisit the flaky S3 tests caused by recent minio

2020-03-16 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8134:
--

 Summary: [C++][CI] Revisit the flaky S3 tests caused by recent 
minio 
 Key: ARROW-8134
 URL: https://issues.apache.org/jira/browse/ARROW-8134
 Project: Apache Arrow
  Issue Type: Task
  Components: C++, Continuous Integration
Reporter: Krisztian Szucs


See change in 
https://github.com/apache/arrow/pull/6632/files#diff-01448dc0a0e3217fd1579b81a915d593R843



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8133) [CI] Github Actions sometimes fail to checkout Arrow

2020-03-16 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8133:
--

 Summary: [CI] Github Actions sometimes fail to checkout Arrow
 Key: ARROW-8133
 URL: https://issues.apache.org/jira/browse/ARROW-8133
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


Failing build 
https://github.com/apache/arrow/pull/6632/checks?check_run_id=511663097



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8132) [C++] arrow-s3fs-test failing on master

2020-03-16 Thread Hatem Helal (Jira)
Hatem Helal created ARROW-8132:
--

 Summary: [C++] arrow-s3fs-test failing on master
 Key: ARROW-8132
 URL: https://issues.apache.org/jira/browse/ARROW-8132
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Hatem Helal


Log:

[https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/branch/master/job/9dgr7xl635yuwh7y#L1917]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8131) Add dynamic attributes to PyArrow ExtensionArray

2020-03-16 Thread Paul Balanca (Jira)
Paul Balanca created ARROW-8131:
---

 Summary: Add dynamic attributes to PyArrow ExtensionArray
 Key: ARROW-8131
 URL: https://issues.apache.org/jira/browse/ARROW-8131
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.16.0
 Environment: Ubuntu 19.10 + Python 3.7
Reporter: Paul Balanca


In the present implementation, the interface of the class `ExtensionArray` is 
not extendable by user. One can not easily inherit from it, as the constructor 
__init__ can not be called directly, or it does not allow adding dynamically 
atttributes.

Keeping the current design with build methods `from_*`, I believe it could then 
make sense to allow dynamic attributes in `ExtensionArray` (see 
[https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#dynamic-attributes]).
 The runtime & size cost of the Python objects would be fairly minimal, 
compared to increased flexibility it would allow.

A typical use case where it could be useful would be dynamic mixins (added by 
custom Factory), allowing projects based on PyArrow to extend (! :)) the 
interface with specific business logic. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8130) [C++][Gandiva] Fix dexVisitor

2020-03-16 Thread Prudhvi Porandla (Jira)
Prudhvi Porandla created ARROW-8130:
---

 Summary: [C++][Gandiva] Fix dexVisitor
 Key: ARROW-8130
 URL: https://issues.apache.org/jira/browse/ARROW-8130
 Project: Apache Arrow
  Issue Type: Task
Reporter: Prudhvi Porandla






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8129) [C++][Compute] Refine compare sorting kernel

2020-03-16 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8129:
---

 Summary: [C++][Compute] Refine compare sorting kernel
 Key: ARROW-8129
 URL: https://issues.apache.org/jira/browse/ARROW-8129
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Yibo Cai
Assignee: Yibo Cai


Sorting kernel implements two comparison functions, 
[CompareValues|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L67]
 use array.Value() for numeric data and 
[CompareViews|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L72]
 uses array.GetView() for non-numeric ones. It can be simplified by using 
GetView() only as all data types support GetView().

To my surprise, benchmark shows about 40% performance improvement after the 
change.

After some digging, I find in current code, the [comparison 
callback|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L94]
 is not inlined (check disassembled code), it leads to a function call. It's 
very bad for this hot loop. Using only GetView() fixes this issue, code inlined 
okay.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)