Review request for ARROW-7808's PR (Dataset Java API)

2021-01-28 Thread Hongze Zhang
Hi All, Sorry to send a request to all, but I would like to ask whether anyone could help finish the review of PR #7030 [1]. As of now the PR contains the following parts: 1. Base dataset API for the Java language (which follows the shape of the C++ API) 2. A JNI-based implementation of FileSystem…

Re: lz4 compressed arrow between Python & Java

2021-01-28 Thread Micah Kornfield
We should be extending the archery IPC integration tests for this (ideally no files checked in). On Thursday, January 28, 2021, Fan Liya wrote:
> Hi Joris,
> The Java support for lz4 compression is ongoing (https://github.com/apache/arrow/pull/8949). Integration with C++/Python is not fin…

Re: lz4 compressed arrow between Python & Java

2021-01-28 Thread Fan Liya
Hi Joris, The Java support for lz4 compression is ongoing (https://github.com/apache/arrow/pull/8949). Integration with C++/Python is not finished yet. We would appreciate it if you could share the file to help us with the integration test. Best, Liya Fan. On Fri, Jan 29, 2021 at 2:41 AM Antoine…

Fast Data Explanation on GPU/FPGA

2021-01-28 Thread James Thomas
Hi All, I've been working on a library for fast explanation of tabular data: https://github.com/jjthomas/fast_data_explanation. I've implemented acceleration on GPU and FPGA (using the Amazon F1 platform). I think this is an example of a pretty simple but useful workload going from Pandas to acce…

Re: lz4 compressed arrow between Python & Java

2021-01-28 Thread Antoine Pitrou
On 28/01/2021 at 19:38, Wes McKinney wrote:
> It still seems notable that our generic LZ4-compressed output stream cannot be read by Java (independent of Arrow and the Arrow IPC format).
That and the custom LZ4 framing used by Parquet-Java... Apparently the Java ecosystem can't implement p…

Re: lz4 compressed arrow between Python & Java

2021-01-28 Thread Wes McKinney
It still seems notable that our generic LZ4-compressed output stream cannot be read by Java (independent of Arrow and the Arrow IPC format). On Thu, Jan 28, 2021 at 12:30 PM Antoine Pitrou wrote:
> On Thu, 28 Jan 2021 18:19:00 + Joris Peeters wrote:
> > To be fair, I'm happy to apply i…

Re: lz4 compressed arrow between Python & Java

2021-01-28 Thread Antoine Pitrou
On Thu, 28 Jan 2021 18:19:00 + Joris Peeters wrote:
> To be fair, I'm happy to apply it at IPC level. Just didn't realise that was a thing. IIUC what Antoine suggests, though, then just (leaving Python as-is and) changing my Java to
>
>     var is = new FileInputStream(path.toFile());
> …

Re: lz4 compressed arrow between Python & Java

2021-01-28 Thread Joris Peeters
Aha, OK! Thanks for the help all. I'll keep an eye on the Java side for the IPC compression, but for my current purpose doing full stream compression is totally fine. On Thu, Jan 28, 2021 at 6:22 PM Micah Kornfield wrote:
> The application level compression Java support for compression is being…

Re: lz4 compressed arrow between Python & Java

2021-01-28 Thread Micah Kornfield
Application-level compression support in Java is being worked on (I would need to double-check whether the PR has been merged), and I don't think it's been integration-tested with C++/Python. I would imagine it would run into a similar issue with not being able to decode linked blocks.

Re: lz4 compressed arrow between Python & Java

2021-01-28 Thread Joris Peeters
To be fair, I'm happy to apply it at IPC level. Just didn't realise that was a thing. IIUC what Antoine suggests, though, then just (leaving Python as-is and) changing my Java to

    var is = new FileInputStream(path.toFile());
    var reader = new ArrowStreamReader(is, allocator);
    var schema…

Re: lz4 compressed arrow between Python & Java

2021-01-28 Thread Micah Kornfield
It might be worth opening an issue with the lz4-java library. It seems like the Java implementation doesn't fully support the LZ4 stream protocol? Antoine, in this case it looks like Joris is applying the compression and decompression at the file level, NOT the IPC level. On Thu, Jan 28, 2021…

Re: lz4 compressed arrow between Python & Java

2021-01-28 Thread Antoine Pitrou
On 28/01/2021 at 17:59, Joris Peeters wrote:
> From Python, I'm dumping an LZ4-compressed arrow stream to a file, using
>
>     with pa.output_stream(path, compression = 'lz4') as fh:
>         writer = pa.RecordBatchStreamWriter(fh, table.schema)
>         writer.write_table(table)
> …

Re: Pandas Block Manager

2021-01-28 Thread Wes McKinney
My position on this is that we should work with the pandas community toward eliminating the BlockManager data structure, as this will solve a multitude of problems and also make things better for Arrow. I am not supportive of the IPC format changes in the PR. On Wed, Jan 27, 2021 at 6:27…

Re: lz4 compressed arrow between Python & Java

2021-01-28 Thread Wes McKinney
hi Joris -- this isn't a use case that we intend for most users (we intend for users to instead use the LZ4 compression option that is part of the IPC format itself, rather than something layered on externally), but it would be good to make sure that our LZ4 streams are interoperable across…

Re: Introducing Buzz, Arrow powered serverless query engine

2021-01-28 Thread Andrew Lamb
I would, for one, enjoy such a presentation. On Thu, Jan 28, 2021 at 11:15 AM Rémi Dettai wrote:
> Thank you for the support! I might do a quick (5 min) presentation during the next Rust sync call if you are interested!
> Remi
> On Wed, Jan 27, 2021 at 19:40, Daniël Heres wrote:
> …

Re: [C++] Shall we modify the ORC reader?

2021-01-28 Thread Ying Zhou
Hi, Many thanks Deepak! I really want to edit the ORC reader to read ORC MAPs as Arrow MAPs now, and it's not a serious hassle to do so. Is there anyone who needs the read-ORC-maps-as-lists-of-structs functionality? If not, I will likely do it in my current PR. Ying. > On Jan 19, 2021, at 8:4…

lz4 compressed arrow between Python & Java

2021-01-28 Thread Joris Peeters
From Python, I'm dumping an LZ4-compressed arrow stream to a file, using

    with pa.output_stream(path, compression = 'lz4') as fh:
        writer = pa.RecordBatchStreamWriter(fh, table.schema)
        writer.write_table(table)
        writer.close()

I then try reading this file from Java, star…

Re: Introducing Buzz, Arrow powered serverless query engine

2021-01-28 Thread Rémi Dettai
Thank you for the support! I might do a quick (5 min) presentation during the next Rust sync call if you are interested! Remi. On Wed, Jan 27, 2021 at 19:40, Daniël Heres wrote:
> This is really interesting Rémi!
> I like the interesting take on using "serverless" cloud components to build…

Re: [RUST] Implement value function with Array trait

2021-01-28 Thread Fernando Herrera
Thanks Andrew and Jorge for the help. I think the use of the ScalarValue enum is precisely what I want. I was worried that downcasting the column every time you need to get a value would be slow, but I can see that you are doing that with the ScalarValue enum (https://github.com/apache/arrow/blob/…

Re: [C++] Random table generator and table converter

2021-01-28 Thread Antoine Pitrou
Hi Ying, On 28/01/2021 at 08:15, Ying Zhou wrote:
> By the way I haven’t found any function that can directly generate an Arrow Table using a schema, size and null_probability. Is there any need for such functionality? If this is useful for purposes beyond ORC/Parquet/CSV/etc IO
> …

Re: [RUST] Implement value function with Array trait

2021-01-28 Thread Fernando Herrera
In the application I'm working on, I'm reading a parquet file and creating a table to keep the records in memory. This gist sketches the idea: https://gist.github.com/elferherrera/a2a796ae83a7203f58de704c178c44ef. I would like to keep it as pure Arrow because I have found that it is super fast to c…

Re: [RUST] Implement value function with Array trait

2021-01-28 Thread Jorge Cardoso Leitão
I agree with Andrew (as usual) :) Regardless, maybe it is easier if you could describe what you are trying to accomplish, Fernando. There are possibly other ways of going about this, and maybe someone can help by knowing more context. Best, Jorge. On Thu, Jan 28, 2021 at 1:06 PM Andrew Lamb…

Re: [RUST] Slow Parquet writer

2021-01-28 Thread Fernando Herrera
Yes, I'm running my code with the --release flag. I've been looking everywhere but I can't find a way to make the writing faster. I don't know if it is a mistake I'm making with the structs or whether the Parquet crate needs optimizations. Fernando. On Thu, Jan 28, 2021 at 12:02 PM Andrew Lamb wrote: >…

Re: [RUST] Implement value function with Array trait

2021-01-28 Thread Andrew Lamb
I think this approach would work (and we have something similar in DataFusion (ScalarValue): https://github.com/apache/arrow/blob/4b7cdcb9220b6d94b251aef32c21ef9b4097ecfa/rust/datafusion/src/scalar.rs#L46 -- though it is an enum rather than a trait, I think the idea is basically the same). I think t…

Re: [RUST] Slow Parquet writer

2021-01-28 Thread Andrew Lamb
The first thing I would check is that you are using a release build (`cargo build --release`). If you are, there may be additional optimizations needed in the Rust implementation. Andrew. On Thu, Jan 28, 2021 at 6:19 AM Fernando Herrera <fernando.j.herr...@gmail.com> wrote:
> Hi,
> What is the…

[RUST] Slow Parquet writer

2021-01-28 Thread Fernando Herrera
Hi, What is the writing speed that we should expect from the Arrow Parquet writer? I'm writing a RecordBatch with two columns and 1,000,000 records and it takes a lot of time to write the batch to the file (close to 2 secs). This is what I'm doing:

    let schema = Schema::new(vec![
        Field::new…

Re: [RUST] Implement value function with Array trait

2021-01-28 Thread Fernando Herrera
Hi Jorge, What about making Array::value return a &dyn ValueTrait? This new ValueTrait would have to be implemented for all the possible values that can be returned from the arrays. Fernando. On Thu, 28 Jan 2021, 08:42 Jorge Cardoso Leitão, wrote:
> Hi Fernando,
> I tried that some time ag…

[NIGHTLY] Arrow Build Report for Job nightly-2021-01-28-0

2021-01-28 Thread Crossbow
Arrow Build Report for Job nightly-2021-01-28-0
All tasks: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-28-0
Failed Tasks:
- centos-8-aarch64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-01-28-0-travis-centos-8-aarch64
- con…

Re: [RUST] Implement value function with Array trait

2021-01-28 Thread Jorge Cardoso Leitão
Hi Fernando, I tried that some time ago, but I was unable to do so. The reason is that Array is a trait that also needs to support being a trait object (i.e. support `&dyn Array`). Let's try here: what type should `Array::value` return? One option is to make Array generic. But if Array is a gen…

Re: [RUST] Implement value function with Array trait

2021-01-28 Thread Fernando Herrera
I see what you mean. I was thinking that the function signature would have to be something like this:

    trait Array {
        fn value(&self) -> T
    }

Where T would have to implement another trait, call it ValueTrait, in order to define how to extract the different value types, e.g. &str, u32, etc.