Write a parquet file with delta encoding enabled

2020-03-23 Thread Omega Gamage
I was trying to write a parquet file with delta encoding. This page states
that Parquet supports three types of delta encodings:

(DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY).

Since Spark, PySpark, and PyArrow do not allow us to specify the encoding
method, I was curious how one can write a file with delta encoding enabled.

However, I found on the internet that if I have columns of Timestamp type,
Parquet will use delta encoding. So I used the following *Scala* code to
create a parquet file, but the encoding used is not a delta encoding.


// Requires a SparkSession in scope (e.g. `spark` in spark-shell).
import org.apache.spark.sql.functions.col
import spark.implicits._

val df = Seq(
  "2018-05-01",
  "2018-05-02",
  "2018-05-03",
  "2018-05-04",
  "2018-05-05",
  "2018-05-06",
  "2018-05-07",
  "2018-05-08",
  "2018-05-09",
  "2018-05-10"
).toDF("Id")
val df2 = df.withColumn("Timestamp", col("Id").cast("timestamp"))
val df3 = df2.withColumn("Date", col("Id").cast("date"))

df3.coalesce(1).write.format("parquet").mode("append").save("date_time2")

parquet-tools shows the following information regarding the written parquet
file.

file schema: spark_schema
Id:         OPTIONAL BINARY L:STRING R:0 D:1
Timestamp:  OPTIONAL INT96 R:0 D:1
Date:       OPTIONAL INT32 L:DATE R:0 D:1

row group 1: RC:31 TS:1100 OFFSET:4
Id:         BINARY SNAPPY DO:0 FPO:4 SZ:230/487/2.12 VC:31 ENC:RLE,PLAIN,BIT_PACKED ST:[min: 2018-05-01, max: 2018-05-31, num_nulls: 0]
Timestamp:  INT96 SNAPPY DO:0 FPO:234 SZ:212/436/2.06 VC:31 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[num_nulls: 0, min/max not defined]
Date:       INT32 SNAPPY DO:0 FPO:446 SZ:181/177/0.98 VC:31 ENC:RLE,PLAIN,BIT_PACKED ST:[min: 2018-05-01, max: 2018-05-31, num_nulls: 0]

As you can see, no column has used delta encoding.

My questions are:

1) How can I write a parquet file with delta encoding? (Example code in
Scala or Python would be great.)
2) How do I decide which delta encoding (DELTA_BINARY_PACKED,
DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY) to use?
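
For reference, a possible route for question 1 -- an untested sketch, so
treat the behaviour described here as an assumption rather than a confirmed
answer: Spark writes Parquet through parquet-mr, and parquet-mr's v2 writer
(selected with the "parquet.writer.version" property) chooses the DELTA_*
encodings for pages that are not dictionary-encoded. Building on df3 above:

// Untested sketch: switch parquet-mr to its v2 value writers.
spark.sparkContext.hadoopConfiguration.set("parquet.writer.version", "v2")
// Disable dictionary encoding so the delta encodings become visible;
// dictionary encoding otherwise tends to win for small columns like these.
spark.sparkContext.hadoopConfiguration.set("parquet.enable.dictionary", "false")

df3.coalesce(1).write.format("parquet").save("date_time_v2")

If that assumption holds, it also suggests an answer to question 2: the
writer picks the encoding from the physical type (DELTA_BINARY_PACKED for
INT32/INT64 columns such as Date, DELTA_BYTE_ARRAY or DELTA_LENGTH_BYTE_ARRAY
for BINARY columns such as Id), so you do not normally choose among the
three yourself.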


[jira] [Created] (ARROW-8184) [Packaging] Use arrow-nightlies (or similar) organization name on Anaconda and Gemfury to host the nightlies

2020-03-23 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8184:
--

 Summary: [Packaging] Use arrow-nightlies (or similar) organization 
name on Anaconda and Gemfury to host the nightlies
 Key: ARROW-8184
 URL: https://issues.apache.org/jira/browse/ARROW-8184
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


Currently I've set up the scripts to use Ursa Labs's accounts, but we should 
prefer a more neutral org.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8185) [Packaging] Document the available nightly wheels, conda and R packages under the development section

2020-03-23 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8185:
--

 Summary: [Packaging] Document the available nightly wheels, conda 
and R packages under the development section
 Key: ARROW-8185
 URL: https://issues.apache.org/jira/browse/ARROW-8185
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


The packaging scripts upload the artifacts to package-manager-specific 
hosting services like Anaconda and Gemfury. We should document this in a form 
that conforms to the [ASF 
Policy|https://www.apache.org/dev/release-distribution.html#unreleased].

For more information see the conversation at 
https://github.com/apache/arrow/pull/6669#issuecomment-601947006



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [Discuss][FlightRPC] Extensions to Flight: "DoBidirectional"

2020-03-23 Thread David Li
Hey Wes,

Thanks for the review. I've broken out the format change into this PR:
https://github.com/apache/arrow/pull/6686

Best,
David

On 3/22/20, Wes McKinney  wrote:
> hi David,
>
> I did a preliminary view and things look to be on the right track
> there. What do you think about breaking out the protocol changes (and
> adding appropriate comments) so we can have a vote on that in
> relatively short order?
>
> - Wes
>
> On Wed, Mar 18, 2020 at 9:06 AM David Li  wrote:
>>
>> Following up here, I've submitted a draft implementation for C++:
>> https://github.com/apache/arrow/pull/6656
>>
>> The core functionality is there, but there are still holes that I need
>> to implement. Compared to the draft spec, the client also sends a
>> FlightDescriptor to begin with, though it's currently not exposed.
>> This provides consistency with DoGet/DoPut which also send a message
>> to begin with to describe the stream to the server.
>>
>> Andy, I hope this helps clarify whether it meets your needs.
>>
>> Best,
>> David
>>
>> On 2/25/20, David Li  wrote:
>> > Hey Andy,
>> >
>> > I've been rather busy unfortunately. I had started on an
>> > implementation in C++ to provide as part of this discussion, but it's
>> > not complete. I'm hoping to have more done in March.
>> >
>> > Best,
>> > David
>> >
>> > On 2/25/20, Andy Grove  wrote:
>> >> I was wondering if there had been any momentum on this (the
>> >> BiDirectional
>> >> RPC design)?
>> >>
>> >> I'm interested in this for the use case of Apache Spark sending a
>> >> stream
>> >> of
>> >> data to another process to invoke custom code and then receive a
>> >> stream
>> >> back with the transformed data.
>> >>
>> >> Thanks,
>> >>
>> >> Andy.
>> >>
>> >>
>> >>
>> >> On Fri, Dec 13, 2019 at 12:12 PM Jacques Nadeau 
>> >> wrote:
>> >>
>> >>> I support moving forward with the current proposal.
>> >>>
>> >>> On Thu, Dec 12, 2019 at 12:20 PM David Li 
>> >>> wrote:
>> >>>
>> >>> > Just following up here again, any other thoughts?
>> >>> >
>> >>> > I think we do have justifications for potentially separate streams
>> >>> > in
>> >>> > a call, but that's more of an orthogonal question - it doesn't need
>> >>> > to
>> >>> > be addressed here. I do agree that it very much complicates things.
>> >>> >
>> >>> > Thanks,
>> >>> > David
>> >>> >
>> >>> > On 11/29/19, Wes McKinney  wrote:
>> >>> > > I would generally agree with this. Note that you have the
>> >>> > > possibility
>> >>> > > to use unions-of-structs to send record batches with different
>> >>> > > schemas
>> >>> > > in the same stream, though with some added complexity on each
>> >>> > > side
>> >>> > >
>> >>> > > On Thu, Nov 28, 2019 at 10:37 AM Jacques Nadeau
>> >>> > > 
>> >>> > wrote:
>> >>> > >>
>> >>> > >> I'd vote for explicitly not supported. We should keep our
>> >>> > >> primitives
>> >>> > >> narrow.
>> >>> > >>
>> >>> > >> On Wed, Nov 27, 2019, 1:17 PM David Li 
>> >>> > >> wrote:
>> >>> > >>
>> >>> > >> > Thanks for the feedback.
>> >>> > >> >
>> >>> > >> > I do think if we had explicitly embraced gRPC from the
>> >>> > >> > beginning,
>> >>> > >> > there are a lot of places where things could be made more
>> >>> > >> > ergonomic,
>> >>> > >> > including with the metadata fields. But it would also have
>> >>> > >> > locked
>> >>> out
>> >>> > >> > us of potential future transports.
>> >>> > >> >
>> >>> > >> > On another note: I hesitate to put too much into this method,
>> >>> > >> > but
>> >>> > >> > we
>> >>> > >> > are looking at use cases where potentially, a client may want
>> >>> > >> > to
>> >>> > >> > upload multiple distinct datasets (with differing schemas).
>> >>> > >> > (This
>> >>> is a
>> >>> > >> > little tentative, and I can get more details...) Right now,
>> >>> > >> > each
>> >>> > >> > logical stream in Flight must have a single, consistent
>> >>> > >> > schema;
>> >>> would
>> >>> > >> > it make sense to look at ways to relax this, or declare this
>> >>> > >> > explicitly out of scope (and require multiple calls and
>> >>> > >> > coordination
>> >>> > >> > with the deployment topology) in order to accomplish this?
>> >>> > >> >
>> >>> > >> > Best,
>> >>> > >> > David
>> >>> > >> >
>> >>> > >> > On 11/27/19, Jacques Nadeau  wrote:
>> >>> > >> > > Fair enough. I'm okay with the bytes approach and the
>> >>> > >> > > proposal
>> >>> looks
>> >>> > >> > > good
>> >>> > >> > > to me.
>> >>> > >> > >
>> >>> > >> > > On Fri, Nov 8, 2019 at 11:37 AM David Li
>> >>> > >> > > 
>> >>> > >> > > wrote:
>> >>> > >> > >
>> >>> > >> > >> I've updated the proposal.
>> >>> > >> > >>
>> >>> > >> > >> On the subject of Protobuf Any vs bytes, and how to handle
>> >>> > >> > >> errors/metadata, I still think using bytes is preferable:
>> >>> > >> > >> - It doesn't require (conditionally) exposing or wrapping
>> >>> Protobuf
>> >>> > >> > types,
>> >>> > >> > >> - We wouldn't be able to practically expose the Protobuf
>> >>> > >> > >> field
>> >>> > >> > >> to
>> >>> > >> > >> C++
>> >>> > >> > >> users withou

[jira] [Created] (ARROW-8186) [Python] Dataset expression != returns bool instead of expression for invalid value

2020-03-23 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8186:


 Summary: [Python] Dataset expression != returns bool instead of 
expression for invalid value
 Key: ARROW-8186
 URL: https://issues.apache.org/jira/browse/ARROW-8186
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


It's a bit of a strange case, but e.g. when doing {{!= {3}}} you get a boolean 
result instead of an expression:

{code}
In [8]: ds.field('col') != 3
Out[8]: 

In [9]: ds.field('col') != {3}
Out[9]: True
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[NIGHTLY] Arrow Build Report for Job nightly-2020-03-23-0

2020-03-23 Thread Crossbow


Arrow Build Report for Job nightly-2020-03-23-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0

Failed Tasks:
- debian-stretch:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-github-debian-stretch
- gandiva-jar-trusty:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-travis-gandiva-jar-trusty
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-circle-test-conda-cpp-valgrind
- test-r-linux-as-cran:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-github-test-r-linux-as-cran
- wheel-manylinux1-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-azure-wheel-manylinux1-cp37m
- wheel-manylinux2010-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-azure-wheel-manylinux2010-cp37m
- wheel-osx-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-travis-wheel-osx-cp36m
- wheel-win-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-appveyor-wheel-win-cp36m
- wheel-win-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-appveyor-wheel-win-cp37m
- wheel-win-cp38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-appveyor-wheel-win-cp38

Succeeded Tasks:
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-github-centos-6
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-github-centos-7
- centos-8:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-github-centos-8
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-azure-conda-win-vs2015-py38
- debian-buster:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-github-debian-buster
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-travis-gandiva-jar-osx
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-travis-homebrew-cpp
- macos-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-travis-macos-r-autobrew
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-circle-test-conda-cpp
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-kartothek-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-circle-test-conda-python-3.7-kartothek-latest
- test-conda-python-3.7-kartothek-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-circle-test-conda-python-3.7-kartothek-master
- test-conda-python-3.7-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-circle-test-conda-python-3.7-pandas-latest
- test-conda-python-3.7-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-0-circle-test-conda-python-3.7-pandas-master
- test-conda-python-3.7-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03

[jira] [Created] (ARROW-8187) [R] Make test assertions robust to i18n

2020-03-23 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8187:
--

 Summary: [R] Make test assertions robust to i18n
 Key: ARROW-8187
 URL: https://issues.apache.org/jira/browse/ARROW-8187
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Antoine Pitrou
Assignee: Neal Richardson
 Fix For: 0.17.0


{code}
── 1. Failure: codec_is_available (@test-compressed.R#22)  ─
`codec_is_available("sdfasdf")` threw an error with unexpected message.
Expected match: "'arg' should be one of"
Actual message: "'arg' doit être un de “UNCOMPRESSED”, “SNAPPY”, “GZIP”, 
“BROTLI”, “ZSTD”, “LZ4”, “LZO”, “BZ2”"
Backtrace:
  1. testthat::expect_error(codec_is_available("sdfasdf"), "'arg' should be one 
of") testthat/test-compressed.R:22:2
  6. arrow::codec_is_available("sdfasdf")
  8. arrow:::compression_from_name(type)
  9. purrr::map_int(...)
 10. arrow:::.f(.x[[i]], ...)
 11. base::match.arg(toupper(.x), names(CompressionType))

── 2. Failure: time type unit validation (@test-data-type.R#298)  ──
`time32("years")` threw an error with unexpected message.
Expected match: "'arg' should be one of"
Actual message: "'arg' doit être un de “ms”, “s”"
Backtrace:
 1. testthat::expect_error(time32("years"), "'arg' should be one of") 
testthat/test-data-type.R:298:2
 6. arrow::time32("years")
 7. base::match.arg(unit)

── 3. Failure: time type unit validation (@test-data-type.R#305)  ──
`time64("years")` threw an error with unexpected message.
Expected match: "'arg' should be one of"
Actual message: "'arg' doit être un de “ns”, “us”"
Backtrace:
 1. testthat::expect_error(time64("years"), "'arg' should be one of") 
testthat/test-data-type.R:305:2
 6. arrow::time64("years")
 7. base::match.arg(unit)

── 4. Failure: decimal type and validation (@test-data-type.R#387)  
`decimal()` threw an error with unexpected message.
Expected match: "argument \"precision\" is missing, with no default"
Actual message: "l'argument \"precision\" est manquant, avec aucune valeur par 
défaut"
Backtrace:
 1. testthat::expect_error(decimal(), "argument \"precision\" is missing, with 
no default") testthat/test-data-type.R:387:2
 6. arrow::decimal()

── 5. Failure: decimal type and validation (@test-data-type.R#389)  
`decimal(4)` threw an error with unexpected message.
Expected match: "argument \"scale\" is missing, with no default"
Actual message: "l'argument \"scale\" est manquant, avec aucune valeur par 
défaut"
Backtrace:
 1. testthat::expect_error(decimal(4), "argument \"scale\" is missing, with no 
default") testthat/test-data-type.R:389:2
 6. arrow::decimal(4)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8188) [R] Adapt to latest checks in R-devel

2020-03-23 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8188:
--

 Summary: [R] Adapt to latest checks in R-devel
 Key: ARROW-8188
 URL: https://issues.apache.org/jira/browse/ARROW-8188
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 0.17.0


See https://github.com/ursa-labs/crossbow/runs/526813242 for example.

1. checkbashisms is now complaining about a few things
2. Latest R-devel actually runs the donttest examples with --as-cran, and one 
fails.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Java API for parsing Parquet

2020-03-23 Thread Hasara Maithree
Hi all,

Is there a Java API for parsing Parquet format to Arrow format?

Thank You


Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-03-23 Thread David Li
Thanks. I've set up an AWS account for my own testing for now. I've
also submitted a PR to add a basic benchmark which can be run
self-contained, against a local Minio instance, or against S3:
https://github.com/apache/arrow/pull/6675

I ran the benchmark from my local machine, and I can test from EC2
sometime as well. Performance is not ideal, but I'm being limited by
my home internet connection - coalescing small chunked reads is (as
expected) as fast as reading the file in one go, and in the PR
(testing against localhost where we're not limited by bandwidth), it's
faster than either option.

----------------------------------------------------------------------------------------------------
Benchmark                                          Time            CPU  Iterations
----------------------------------------------------------------------------------------------------
MinioFixture/ReadAll1Mib/real_time           223416933 ns     9098743 ns    413   4.47594MB/s    4.47594 items/s
MinioFixture/ReadAll100Mib/real_time        6068938152 ns   553319299 ns     10   16.4773MB/s   0.164773 items/s
MinioFixture/ReadAll500Mib/real_time       30735046155 ns  2620718364 ns      2   16.2681MB/s  0.0325361 items/s
MinioFixture/ReadChunked100Mib/real_time    9625661666 ns   448637141 ns     12   10.3889MB/s   0.103889 items/s
MinioFixture/ReadChunked500Mib/real_time   58736796101 ns  2070237834 ns      2   8.51255MB/s  0.0170251 items/s
MinioFixture/ReadCoalesced100Mib/real_time  6982902546 ns    22553824 ns     10   14.3207MB/s   0.143207 items/s
MinioFixture/ReadCoalesced500Mib/real_time 29923239648 ns   112736805 ns      3   16.7094MB/s  0.0334188 items/s
MinioFixture/ReadParquet250K/real_time     21934689795 ns  2052758161 ns      3   9.90955MB/s  0.0455899 items/s
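
To make "coalescing" concrete, here is a small illustrative sketch in Scala
(an illustration of the idea only, not the actual C++ implementation in the
PR): byte ranges that are close together are merged into one larger read
whenever the gap between them is below a threshold, trading a few wasted
bytes for far fewer round trips.

// Illustration only: merge read ranges separated by small gaps so that many
// small column-chunk reads become a few larger requests.
case class ReadRange(offset: Long, length: Long) { def end: Long = offset + length }

def coalesce(ranges: Seq[ReadRange], maxGap: Long): Seq[ReadRange] =
  ranges.sortBy(_.offset).foldLeft(List.empty[ReadRange]) {
    case (prev :: rest, r) if r.offset - prev.end <= maxGap =>
      // Extend the previous range to also cover this one (and the small hole).
      ReadRange(prev.offset, math.max(prev.end, r.end) - prev.offset) :: rest
    case (acc, r) => r :: acc
  }.reverse

// Three 4 KiB reads spaced 1 KiB apart collapse into a single 14 KiB request.
coalesce(Seq(ReadRange(0, 4096), ReadRange(5120, 4096), ReadRange(10240, 4096)),
         maxGap = 8192)  // -> List(ReadRange(0, 14336))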

Best,
David


On 3/22/20, Wes McKinney  wrote:
> On Thu, Mar 19, 2020 at 10:04 AM David Li  wrote:
>>
>> > That's why it's important that we set ourselves up to do performance
>> > testing in a realistic environment in AWS rather than simulating it.
>>
>> For my clarification, what are the plans for this (if any)? I couldn't
>> find any prior discussion, though it sounds like the discussion around
>> cloud CI capacity would be one step towards this.
>>
>> In the short term we could make tests/benchmarks configurable to not
>> point at a Minio instance so individual developers can at least try
>> things.
>
> It probably makes sense to begin investing in somewhat portable
> tooling to assist with running S3-related unit tests and benchmarks
> inside AWS. This could include initial Parquet dataset generation and
> other things.
>
> As far as testing, I'm happy to pay for some AWS costs (within
> reason). AWS might be able to donate some credits to us also
>
>> Best,
>> David
>>
>> On 3/18/20, David Li  wrote:
>> > For us it applies to S3-like systems, not only S3 itself, at least.
>> >
>> > It does make sense to limit it to some filesystems. The behavior would
>> > be opt-in at the Parquet reader level, so at the Datasets or
>> > Filesystem layer we can take care of enabling the flag for filesystems
>> > where it actually helps.
>> >
>> > I've filed these issues:
>> > - ARROW-8151 to benchmark S3File+Parquet
>> > (https://issues.apache.org/jira/browse/ARROW-8151)
>> > - ARROW-8152 to split large reads
>> > (https://issues.apache.org/jira/browse/ARROW-8152)
>> > - PARQUET-1820 to use a column filter hint with coalescing
>> > (https://issues.apache.org/jira/browse/PARQUET-1820)
>> >
>> > in addition to PARQUET-1698 which is just about pre-buffering the
>> > entire row group (which we can now do with ARROW-7995).
>> >
>> > Best,
>> > David
>> >
>> > On 3/18/20, Antoine Pitrou  wrote:
>> >>
>> >> On 18/03/2020 at 18:30, David Li wrote:
>>  Instead of S3, you can use the Slow streams and Slow filesystem
>>  implementations.  It may better protect against varying external
>>  conditions.
>> >>>
>> >>> I think we'd want several different benchmarks - we want to ensure we
>> >>> don't regress local filesystem performance, and we also want to
>> >>> measure in an actual S3 environment. It would also be good to measure
>> >>> S3-compatible systems like Google's.
>> >>>
>> > - Use the coalescing inside the Parquet reader (even without a
>> > column
>> > filter hint - this would subsume PARQUET-1698)
>> 
>>  I'm assuming this would be done at the RowGroupReader level, right?
>> >>>
>> >>> Ideally we'd be able to coalesce across row groups as well, though
>> >>> maybe it'd be easier to start with within-row-group-only (I need to
>> >>> familiarize myself with the reader more).
>> >>>
>>  I don't understand what the "advantage" would be.  Can you
>>  elaborate?
>> >>>
>> >>> As Wes said, empirically you can get more bandwidth out of S3 with
>> >>> multiple concurrent HTTP requests. There is a cost to doing so
>> >>> (establishing a new connection takes time), hence why the coalescing
>> >>> tries to group small reads (to fully utilize one co

[jira] [Created] (ARROW-8189) [Python] Python bindings for C++ Builder classes

2020-03-23 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8189:
---

 Summary: [Python] Python bindings for C++ Builder classes
 Key: ARROW-8189
 URL: https://issues.apache.org/jira/browse/ARROW-8189
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Wes McKinney


I thought there was a JIRA about this already. It would be useful to have 
minimal exposure of the builder classes for use in Python or Cython



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-23 Thread Wes McKinney
hi folks,

Sorry it's taken me a little while to produce supporting benchmarks.

* I implemented experimental trivial body buffer compression in
https://github.com/apache/arrow/pull/6638
* I hooked up the Arrow IPC file format with compression as the new
Feather V2 format in
https://github.com/apache/arrow/pull/6694#issuecomment-602906476

I tested a couple of real-world datasets from a prior blog post
https://ursalabs.org/blog/2019-10-columnar-perf/ with ZSTD and LZ4
codecs

The complete results are here
https://github.com/apache/arrow/pull/6694#issuecomment-602906476

Summary:

* Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on
the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae
dataset. So that's a huge space savings
* Single-threaded decompression speeds exceeding 2-4 GByte/s with LZ4
and 1.2-3 GByte/s with ZSTD

I would have to do some more engineering to test throughput changes
with Flight, but given these results, my guess is that on slower
networking (e.g. 1 Gigabit) the compression and decompression overhead
is small compared with the time savings from the high compression
ratios. If people would like to see these numbers to help make a
decision, I can take a closer look

As far as what Micah said about having a limited number of
compressors: I would be in favor of having just LZ4 and ZSTD. It seems
anecdotally that these outperform Snappy in most real world scenarios
and generally have > 1 GB/s decompression performance. Some Linux
distributions (Arch at least) have already started adopting ZSTD over
LZMA or GZIP [1]

- Wes

[1]: 
https://www.archlinux.org/news/now-using-zstandard-instead-of-xz-for-package-compression/

On Fri, Mar 6, 2020 at 8:42 AM Fan Liya  wrote:
>
> Hi Wes,
>
> Thanks a lot for the additional information.
> Looking forward to see the good results from your experiments.
>
> Best,
> Liya Fan
>
> On Thu, Mar 5, 2020 at 11:42 PM Wes McKinney  wrote:
>
> > I see, thank you.
> >
> > For such a scenario, implementations would need to define a
> > "UserDefinedCodec" interface to enable codecs to be registered from
> > third party code, similar to what is done for extension types [1]
> >
> > I'll update this thread when I get my experimental C++ patch up to see
> > what I'm thinking at least for the built-in codecs we have like ZSTD.
> >
> >
> > https://github.com/apache/arrow/blob/apache-arrow-0.16.0/docs/source/format/Columnar.rst#extension-types
> >
> > On Thu, Mar 5, 2020 at 7:56 AM Fan Liya  wrote:
> > >
> > > Hi Wes,
> > >
> > > Thanks a lot for your further clarification.
> > >
> > > Some of my prelimiary thoughts:
> > >
> > > 1. We assign a unique GUID to each pair of compression/decompression
> > > strategies. The GUID is stored as part of the Message.custom_metadata.
> > When
> > > receiving the GUID, the receiver knows which decompression strategy to
> > use.
> > >
> > > 2. We serialize the decompression strategy, and store it into the
> > > Message.custom_metadata. The receiver can decompress data after
> > > deserializing the strategy.
> > >
> > > Method 1 is generally used in static strategy scenarios while method 2 is
> > > generally used in dynamic strategy scenarios.
> > >
> > > Best,
> > > Liya Fan
> > >
> > > On Wed, Mar 4, 2020 at 11:39 PM Wes McKinney 
> > wrote:
> > >
> > > > Okay, I guess my question is how the receiver is going to be able to
> > > > determine how to "rehydrate" the record batch buffers:
> > > >
> > > > What I've proposed amounts to the following:
> > > >
> > > > * UNCOMPRESSED: the current behavior
> > > > * ZSTD/LZ4/...: each buffer is compressed and written with an int64
> > > > length prefix
> > > >
> > > > (I'm close to putting up a PR implementing an experimental version of
> > > > this that uses Message.custom_metadata to transmit the codec, so this
> > > > will make the implementation details more concrete)
> > > >
> > > > So in the USER_DEFINED case, how will the library know how to obtain
> > > > the uncompressed buffer? Is some additional metadata structure
> > > > required to provide instructions?
> > > >
> > > > On Wed, Mar 4, 2020 at 8:05 AM Fan Liya  wrote:
> > > > >
> > > > > Hi Wes,
> > > > >
> > > > > I am thinking of adding an option named "USER_DEFINED" (or something
> > > > > similar) to enum CompressionType in your proposal.
> > > > > IMO, this option should be used primarily in Flight.
> > > > >
> > > > > Best,
> > > > > Liya Fan
> > > > >
> > > > > On Wed, Mar 4, 2020 at 11:12 AM Wes McKinney 
> > > > wrote:
> > > > >
> > > > > > On Tue, Mar 3, 2020, 8:11 PM Fan Liya 
> > wrote:
> > > > > >
> > > > > > > Sure. I agree with you that we should not overdo this.
> > > > > > > I am wondering if we should provide an option to allow users to
> > > > plugin
> > > > > > > their customized compression strategies.
> > > > > > >
> > > > > >
> > > > > > Can you provide a patch showing changes to Message.fbs (or
> > Schema.fbs)
> > > > that
> > > > > > make this idea more concrete?
> > > > > >
> > > > >

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-23 Thread Antoine Pitrou


On 24/03/2020 at 00:39, Wes McKinney wrote:
> 
> As far as what Micah said about having a limited number of
> compressors: I would be in favor of having just LZ4 and ZSTD.

+1, exactly my thought as well.

Regards

Antoine.


Re: Java API for parsing Parquet

2020-03-23 Thread Wes McKinney
There's an effort to expose the C++ Parquet library to Java via JNI
that seems promising

https://issues.apache.org/jira/browse/ARROW-6720

On Mon, Mar 23, 2020 at 11:15 AM Hasara Maithree
 wrote:
>
> Hi all,
>
> Is there a Java API for parsing Parque format to Arrow format?
>
> Thank You


Re: Write a parquet file with delta encoding enabled

2020-03-23 Thread Wes McKinney
These encodings are not available for use in the Parquet C++ library
yet -- partially implemented but not thoroughly tested or exposed in
the public API -- so it's not possible to generate them from Python. I
don't know about Java; you may want to ask on the Parquet mailing list

On Mon, Mar 23, 2020 at 2:30 AM Omega Gamage  wrote:
>
> I was trying to write a parquet file with delta encoding. This page states
> that parquet supports three types of delta encodings:
>
> (DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY).
>
> Since spark, pyspark or pyarrow does not allow us to specify the encoding
> method. I was curious how one can write a file with delta encoding enabled?
>
> However, I found on the internet that, if I have columns with TimeStamp
> type parquet will use delta encoding. So I used the following code in
> *Scala* to create a parquet file. But encoding is not a delta.
>
>
> val df = Seq(("2018-05-01"),
> ("2018-05-02"),
> ("2018-05-03"),
> ("2018-05-04"),
> ("2018-05-05"),
> ("2018-05-06"),
> ("2018-05-07"),
> ("2018-05-08"),
> ("2018-05-09"),
> ("2018-05-10")
> ).toDF("Id")
> val df2 = df.withColumn("Timestamp", (col("Id").cast("timestamp")))
> val df3 = df2.withColumn("Date", (col("Id").cast("date")))
>
> df3.coalesce(1).write.format("parquet").mode("append").save("date_time2")
>
> parquet-tools shows the following information regarding the written parquet
> file.
>
> file schema: spark_schema
> Id:         OPTIONAL BINARY L:STRING R:0 D:1
> Timestamp:  OPTIONAL INT96 R:0 D:1
> Date:       OPTIONAL INT32 L:DATE R:0 D:1
>
> row group 1: RC:31 TS:1100 OFFSET:4
> Id:         BINARY SNAPPY DO:0 FPO:4 SZ:230/487/2.12 VC:31 ENC:RLE,PLAIN,BIT_PACKED ST:[min: 2018-05-01, max: 2018-05-31, num_nulls: 0]
> Timestamp:  INT96 SNAPPY DO:0 FPO:234 SZ:212/436/2.06 VC:31 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[num_nulls: 0, min/max not defined]
> Date:       INT32 SNAPPY DO:0 FPO:446 SZ:181/177/0.98 VC:31 ENC:RLE,PLAIN,BIT_PACKED ST:[min: 2018-05-01, max: 2018-05-31, num_nulls: 0]
>
> As you can see, no column has used delta encoding.
>
> My question is,
>
> 1) How can I write a parquet file with delta encoding? (If you can provide
> an example code in scala or python that would be great.) 2) How to decide
> which "delta encoding": (DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY,
> DELTA_BYTE_ARRAY) to use?


[jira] [Created] (ARROW-8190) [C++][Flight] Allow setting IpcWriteOptions and IpcReadOptions in Flight IPC message reader and writer classes

2020-03-23 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8190:
---

 Summary: [C++][Flight] Allow setting IpcWriteOptions and 
IpcReadOptions in Flight IPC message reader and writer classes
 Key: ARROW-8190
 URL: https://issues.apache.org/jira/browse/ARROW-8190
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, FlightRPC
Reporter: Wes McKinney


Follow up work to ARROW-7979



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8191) [Packaging][APT] Fix cmake removal in Debian GNU/Linux Stretch

2020-03-23 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-8191:
---

 Summary: [Packaging][APT] Fix cmake removal in Debian GNU/Linux 
Stretch
 Key: ARROW-8191
 URL: https://issues.apache.org/jira/browse/ARROW-8191
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[NIGHTLY] Arrow Build Report for Job nightly-2020-03-23-1

2020-03-23 Thread Crossbow


Arrow Build Report for Job nightly-2020-03-23-1

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1

Failed Tasks:
- debian-stretch:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-github-debian-stretch
- gandiva-jar-trusty:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-travis-gandiva-jar-trusty
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-circle-test-conda-cpp-valgrind
- test-r-linux-as-cran:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-github-test-r-linux-as-cran
- wheel-win-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-appveyor-wheel-win-cp36m
- wheel-win-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-appveyor-wheel-win-cp37m
- wheel-win-cp38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-appveyor-wheel-win-cp38

Succeeded Tasks:
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-github-centos-6
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-github-centos-7
- centos-8:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-github-centos-8
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-azure-conda-linux-gcc-py38
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-azure-conda-win-vs2015-py38
- debian-buster:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-github-debian-buster
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-travis-gandiva-jar-osx
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-travis-homebrew-cpp
- macos-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-travis-macos-r-autobrew
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-circle-test-conda-cpp
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-kartothek-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-circle-test-conda-python-3.7-kartothek-latest
- test-conda-python-3.7-kartothek-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-circle-test-conda-python-3.7-kartothek-master
- test-conda-python-3.7-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-circle-test-conda-python-3.7-pandas-latest
- test-conda-python-3.7-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-circle-test-conda-python-3.7-pandas-master
- test-conda-python-3.7-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-circle-test-conda-python-3.7-spark-master
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-23-1-circle-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.7:

[jira] [Created] (ARROW-8192) [C++] script for unpack avx512 intrinsics code

2020-03-23 Thread Frank Du (Jira)
Frank Du created ARROW-8192:
---

 Summary: [C++] script for unpack avx512 intrinsics code
 Key: ARROW-8192
 URL: https://issues.apache.org/jira/browse/ARROW-8192
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Frank Du
Assignee: Frank Du


Use a script to generate AVX512 intrinsics code for the arrow/util/bpacking 
function, similar to [https://github.com/lemire/FrameOfReference/tree/master/scripts]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8193) [C++] arrow-future-test fails to compile on gcc 4.8

2020-03-23 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8193:
---

 Summary: [C++] arrow-future-test fails to compile on gcc 4.8
 Key: ARROW-8193
 URL: https://issues.apache.org/jira/browse/ARROW-8193
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


{code}
In file included from /usr/include/c++/4.8/memory:64:0,
 from /home/wesm/code/arrow/cpp/src/arrow/util/future.h:22,
 from 
/home/wesm/code/arrow/cpp/src/arrow/util/future_test.cc:18:
/usr/include/c++/4.8/bits/stl_construct.h: In instantiation of ‘void 
std::_Construct(_T1*, _Args&& ...) [with _T1 = arrow::MoveOnlyDataType; _Args = 
{const arrow::MoveOnlyDataType&}]’:
/usr/include/c++/4.8/bits/stl_uninitialized.h:75:53:   required from ‘static 
_ForwardIterator 
std::__uninitialized_copy<_TrivialValueTypes>::__uninit_copy(_InputIterator, 
_InputIterator, _ForwardIterator) [with _InputIterator = 
__gnu_cxx::__normal_iterator > 
>; _ForwardIterator = arrow::MoveOnlyDataType*; bool _TrivialValueTypes = 
false]’
/usr/include/c++/4.8/bits/stl_uninitialized.h:117:41:   required from 
‘_ForwardIterator std::uninitialized_copy(_InputIterator, _InputIterator, 
_ForwardIterator) [with _InputIterator = __gnu_cxx::__normal_iterator > >; _ForwardIterator = 
arrow::MoveOnlyDataType*]’
/usr/include/c++/4.8/bits/stl_uninitialized.h:258:63:   required from 
‘_ForwardIterator std::__uninitialized_copy_a(_InputIterator, _InputIterator, 
_ForwardIterator, std::allocator<_Tp>&) [with _InputIterator = 
__gnu_cxx::__normal_iterator > 
>; _ForwardIterator = arrow::MoveOnlyDataType*; _Tp = arrow::MoveOnlyDataType]’
/usr/include/c++/4.8/bits/stl_vector.h:316:32:   required from 
‘std::vector<_Tp, _Alloc>::vector(const std::vector<_Tp, _Alloc>&) [with _Tp = 
arrow::MoveOnlyDataType; _Alloc = std::allocator]’
/home/wesm/code/arrow/cpp/src/arrow/result.h:417:5:   required from ‘void 
arrow::Result::ConstructValue(U&&) [with U = 
std::vector 
>&; T = std::vector >]’
/home/wesm/code/arrow/cpp/src/arrow/result.h:167:42:   required from 
‘arrow::Result::Result(U&&) [with U = std::vector >&; E = void; T = 
std::vector >]’
/home/wesm/code/arrow/cpp/src/arrow/util/iterator.h:159:12:   required from 
‘arrow::Result > arrow::Iterator::ToVector() [with T 
= arrow::MoveOnlyDataType]’
/home/wesm/code/arrow/cpp/src/arrow/testing/gtest_util.h:419:3:   required from 
‘std::vector arrow::IteratorToVector(arrow::Iterator) [with T = 
arrow::MoveOnlyDataType]’
/home/wesm/code/arrow/cpp/src/arrow/util/future_test.cc:610:61:   required from 
‘void arrow::FutureTestBase::TestBasicAsCompleted() [with T = 
arrow::MoveOnlyDataType]’
/home/wesm/code/arrow/cpp/src/arrow/util/future_test.cc:708:52:   required from 
‘void 
arrow::FutureIteratorTest_BasicAsCompleted_Test::TestBody() 
[with gtest_TypeParam_ = arrow::MoveOnlyDataType]’
/home/wesm/code/arrow/cpp/src/arrow/type.h:756:20:   required from here
/usr/include/c++/4.8/bits/stl_construct.h:75:7: error: use of deleted function 
‘arrow::MoveOnlyDataType::MoveOnlyDataType(const arrow::MoveOnlyDataType&)’
 { ::new(static_cast(__p)) _T1(std::forward<_Args>(__args)...); }
   ^
/home/wesm/code/arrow/cpp/src/arrow/util/future_test.cc:70:3: error: declared 
here
   MoveOnlyDataType(const MoveOnlyDataType& other) = delete;
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)