Re: Compression in Arrow - Question

2020-08-29 Thread Micah Kornfield
Hi Mark,
See the most recent previous discussion about alternate encodings [1].
This is something that should be added in the long run; I'd personally
prefer to start with simpler encodings.

I don't think we should add anything more with regard to
compression/encoding until at least 3 languages support the current
compression methods that are in the specification.  C++ has them
implemented, there is some work in Java, and I think we should have at
least one more.

-Micah

[1] https://lists.apache.org/thread.html/r1d9d707c481c53c13534f7c72d75c7a90dc7b2b9966c6c0772d0e416%40%3Cdev.arrow.apache.org%3E

On Sat, Aug 29, 2020 at 4:04 PM  wrote:

>
> I was looking at compression in Arrow and had a couple of questions.
>
> If I've understood compression correctly, it is currently only used
> 'in flight' in either IPC or Arrow Flight, using block compression, but
> the data is still decoded into RAM at the destination in full array form.
> Is this correct?
>
>
> Given that Arrow is a columnar format, has any thought been given to an
> option to have the data compressed both in memory and in flight, using some
> of the columnar techniques?
> As I deal primarily with time-series numerical data, I was thinking that
> some of the algorithms from the Gorilla paper [1] for floats and
> timestamps (delta-of-delta) or similar might be appropriate.
>
> The interface functions could still iterate over the data and produce raw
> values, so this is transparent to users of the data, but the data
> blocks/arrays in memory are actually compressed.
>
> With this method, blocks could come out of a database/source, through the
> data service, across the wire (Flight), and land in the consuming
> application's memory without ever being decompressed or processed until
> final use.
>
>
> Crazy thought ?
>
>
> Regards
>
> Mark.
>
>
> [1]: https://www.vldb.org/pvldb/vol8/p1816-teller.pdf
>
>


Compression in Arrow - Question

2020-08-29 Thread mark


I was looking at compression in Arrow and had a couple of questions.

If I've understood compression correctly, it is currently only used 'in flight'
in either IPC or Arrow Flight, using block compression, but the data is still
decoded into RAM at the destination in full array form. Is this correct?


Given that Arrow is a columnar format, has any thought been given to an option
to have the data compressed both in memory and in flight, using some of the
columnar techniques?
As I deal primarily with time-series numerical data, I was thinking that some
of the algorithms from the Gorilla paper [1] for floats and timestamps
(delta-of-delta) or similar might be appropriate.
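
To make the delta-of-delta idea concrete, here is a minimal sketch in plain
Python. It is not Arrow API and not the exact Gorilla bit layout: the function
names are hypothetical, and the variable-length bit packing described in the
paper is left out. It only shows why regularly spaced timestamps turn into
mostly zeros, which is what a later bit-packing step would exploit.

```python
from typing import List


def delta_of_delta_encode(timestamps: List[int]) -> List[int]:
    """Encode as [first value, first delta, then delta-of-deltas].

    Hypothetical helper for illustration; the previous delta starts at 0,
    so the second output element is simply the first delta.
    """
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev = timestamps[0]
    prev_delta = 0
    for ts in timestamps[1:]:
        delta = ts - prev
        encoded.append(delta - prev_delta)  # change in spacing, often 0
        prev, prev_delta = ts, delta
    return encoded


def delta_of_delta_decode(encoded: List[int]) -> List[int]:
    """Inverse of delta_of_delta_encode."""
    if not encoded:
        return []
    decoded = [encoded[0]]
    value = encoded[0]
    delta = 0
    for dod in encoded[1:]:
        delta += dod    # rebuild the running delta
        value += delta  # rebuild the timestamp
        decoded.append(value)
    return decoded


# Samples roughly every 60 seconds, with one arriving a second late.
samples = [1000, 1060, 1120, 1180, 1241]
print(delta_of_delta_encode(samples))  # [1000, 60, 0, 0, 1]
```

Decoding the encoded list with delta_of_delta_decode gives back the original
samples exactly, so the transform itself is lossless for integer timestamps.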

The interface functions could still iterate over the data and produce raw
values, so this is transparent to users of the data, but the data blocks/arrays
in memory are actually compressed.
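
A rough sketch of what I mean by a transparent interface, again in plain
Python and under my own assumptions: XorFloatColumn is a made-up class, not
anything in Arrow. It keeps only the XOR of each float's bit pattern with the
previous one (the first step of the Gorilla float scheme) and decodes lazily
on iteration. The leading/trailing-zero bit packing that would actually save
space is omitted here.

```python
import struct
from typing import Iterable, Iterator, List


def _to_bits(x: float) -> int:
    """Reinterpret a float64 as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def _from_bits(b: int) -> float:
    """Reinterpret an unsigned 64-bit integer as a float64."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]


class XorFloatColumn:
    """Hypothetical column that stores XOR-ed bit patterns, not raw floats."""

    def __init__(self, values: Iterable[float]) -> None:
        self._xors: List[int] = []
        prev = 0
        for v in values:
            bits = _to_bits(v)
            # Repeated or similar values give 0 or many zero bits here.
            self._xors.append(bits ^ prev)
            prev = bits

    def __len__(self) -> int:
        return len(self._xors)

    def __iter__(self) -> Iterator[float]:
        # Callers see ordinary floats; decoding happens one value at a time.
        prev = 0
        for x in self._xors:
            prev ^= x
            yield _from_bits(prev)


column = XorFloatColumn([1.5, 1.5, 2.0, -3.25])
for value in column:
    print(value)
```

The consumer loop at the end never sees the encoded form, which is the
property I am after.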

With this method, blocks could come out of a database/source, through the data
service, across the wire (Flight), and land in the consuming application's
memory without ever being decompressed or processed until final use.


Crazy thought ?


Regards

Mark. 


[1]: https://www.vldb.org/pvldb/vol8/p1816-teller.pdf



ORC writer

2020-08-29 Thread Ying Zhou
Hi,

I’m interested in writing a binder so that we can write ORC files in Arrow. I
likely should contribute mostly to
https://github.com/apache/arrow/tree/master/cpp/src/arrow/adapters/orc
as well as editing the relevant Python/Cython files, right? Moreover, I would
like to ask whether there is any existing branch with partly finished work on
ORC writers. Thanks!

Ying Zhou

[NIGHTLY] Arrow Build Report for Job nightly-2020-08-29-0

2020-08-29 Thread Crossbow


Arrow Build Report for Job nightly-2020-08-29-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0

Failed Tasks:
- conda-osx-clang-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-azure-conda-osx-clang-py38
- test-conda-python-3.7-hdfs-2.9.2:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-github-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-kartothek-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-github-test-conda-python-3.7-kartothek-latest
- test-conda-python-3.7-kartothek-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-github-test-conda-python-3.7-kartothek-master

Succeeded Tasks:
- centos-6-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-github-centos-6-amd64
- centos-7-aarch64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-travis-centos-7-aarch64
- centos-7-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-github-centos-7-amd64
- centos-8-aarch64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-travis-centos-8-aarch64
- centos-8-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-github-centos-8-amd64
- conda-clean:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-azure-conda-clean
- conda-linux-gcc-py36-cpu:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-azure-conda-linux-gcc-py36-cpu
- conda-linux-gcc-py36-cuda:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-azure-conda-linux-gcc-py36-cuda
- conda-linux-gcc-py37-cpu:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-azure-conda-linux-gcc-py37-cpu
- conda-linux-gcc-py37-cuda:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-azure-conda-linux-gcc-py37-cuda
- conda-linux-gcc-py38-cpu:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-azure-conda-linux-gcc-py38-cuda
- conda-win-vs2017-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-azure-conda-win-vs2017-py36
- conda-win-vs2017-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-azure-conda-win-vs2017-py37
- conda-win-vs2017-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-azure-conda-win-vs2017-py38
- debian-buster-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-github-debian-buster-amd64
- debian-buster-arm64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-travis-debian-buster-arm64
- debian-stretch-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-github-debian-stretch-amd64
- debian-stretch-arm64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-travis-debian-stretch-arm64
- example-cpp-minimal-build-static-system-dependency:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-github-example-cpp-minimal-build-static-system-dependency
- example-cpp-minimal-build-static:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-github-example-cpp-minimal-build-static
- gandiva-jar-osx:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-travis-gandiva-jar-osx
- gandiva-jar-xenial:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-travis-gandiva-jar-xenial
- homebrew-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-travis-homebrew-cpp
- homebrew-r-autobrew:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-travis-homebrew-r-autobrew
- nuget:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-github-nuget
- test-conda-cpp-valgrind:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-github-test-conda-cpp-valgrind
- test-conda-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-29-0-github-test-conda-cpp
- test-conda-python-3.6-pandas-0.23:
  URL: 
htt