Re: Java dataframe library for arrow suggestions

2021-03-16 Thread Micah Kornfield
There was a little bit of effort previously in Arrow to start building this out (see the algorithms package), but we tabled it due to the large scope and availability of maintainers for it. On Tue, Mar 16, 2021 at 4:36 PM Wes McKinney wrote: > This has been asked several times in the past but

Re: [C++] - How to parallelize parquet column read operation

2021-03-16 Thread Yeshwanth Sriram
This worked out well. I’m able to see multiple `ReadAt` calls running concurrently, and of course it waits for these calls to complete. The overall end-to-end latency of the job is much lower now. For anyone else wanting to know, the rough sequence to create a parquet reader with parallelized column readers is

Re: [C++] - How to parallelize parquet column read operation

2021-03-16 Thread Weston Pace
The parquet::arrow::FileReader class takes in parquet::ArrowReaderProperties, which has a use_threads option. If true, the reader will parallelize column reads. This flag is used in parquet/arrow/reader.cc to parallelize column reads (search for OptionalParallelFor). This may or may not
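As a rough illustration of the pattern described above (run the per-column work in a thread pool when a use_threads flag is set, otherwise run it serially), here is a minimal Python sketch. It is not Arrow's implementation: `optional_parallel_for`, `read_column`, and `max_workers` are hypothetical names invented for this example, and the "column read" is a stand-in that only returns a label.

```python
from concurrent.futures import ThreadPoolExecutor


def optional_parallel_for(use_threads, num_tasks, func, max_workers=4):
    """Call func(i) for i in range(num_tasks), in threads if use_threads is true.

    Hypothetical helper for illustration; results are returned in task order.
    """
    if not use_threads:
        # Serial path: one task after another on the calling thread.
        return [func(i) for i in range(num_tasks)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with column indices.
        # Leaving the with-block waits for all submitted tasks to finish.
        return list(pool.map(func, range(num_tasks)))


def read_column(i):
    # Hypothetical stand-in for reading and decoding column i.
    return f"column-{i}"


columns = optional_parallel_for(True, 3, read_column)
```

Both paths return the same list, so callers do not need to care which one ran; only the latency differs when the per-column work is I/O bound.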

Re: Java dataframe library for arrow suggestions

2021-03-16 Thread Chris Nuernberger
There is a JVM-based dataframe library: https://github.com/techascent/tech.ml.dataset There are dplyr-like bindings for it: https://github.com/scicloj/tablecloth It supports mmap/in-place loading of Arrow files (which the Java SDK does not): https://techascent.com/blog/memory-mapping-arrow.html

Re: Java dataframe library for arrow suggestions

2021-03-16 Thread Andy Grove
This isn't directly related to the question, but I was reading about the newly released JDK 16 today and there is initial support for explicit vectorized operations, which might be interesting to explore for anyone considering building a Java DataFrame implementation.

Re: Java dataframe library for arrow suggestions

2021-03-16 Thread Andrew Melo
I can't speak to how complete it is, but I looked earlier for something similar and ran across https://github.com/deeplearning4j/nd4j. It's probably not an exact fit, but it does appear to be able to consume Arrow buffers and expose them to Java. Cheers Andrew On Tue, Mar 16, 2021 at 6:36 PM

Java dataframe library for arrow suggestions

2021-03-16 Thread Paul Whalen
Hi, I've been using Arrow for some time now, mostly in the context of Arrow Flight between Java and Python. While it's quite easy to convert Arrow data in Python to a pandas dataframe and manipulate it, I'm struggling to find an obvious analogue on the Java side. VectorSchemaRoot is useful for

Re: [Python] Why is the access to values in ChunkedArray O(n_chunks) ?

2021-03-16 Thread Quentin Lhoest
I just created https://issues.apache.org/jira/browse/ARROW-11989 > On Mar 16, 2021, at 6:54 PM, Wes McKinney wrote: > > Is there a Jira tracking this performance improvement? At minimum > getting to O(log k) indexing time where k is the number of chunks > would be a good goal > > On Mon, Mar

Re: [Python] Why is the access to values in ChunkedArray O(n_chunks) ?

2021-03-16 Thread Wes McKinney
Is there a Jira tracking this performance improvement? At minimum getting to O(log k) indexing time where k is the number of chunks would be a good goal On Mon, Mar 15, 2021 at 8:05 PM Micah Kornfield wrote: > > One more micro optimization would be to use interpolation search instead of >
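To make the O(log k) goal concrete, here is a minimal Python sketch that precomputes cumulative chunk offsets once and then binary-searches them on each lookup. It is an illustration only, not pyarrow's code: `ChunkedLookup` is a hypothetical class, and plain lists stand in for Arrow chunks. It uses ordinary binary search rather than the interpolation search mentioned in the quoted reply.

```python
from bisect import bisect_right
from itertools import accumulate


class ChunkedLookup:
    """Hypothetical wrapper giving O(log k) element access over k chunks."""

    def __init__(self, chunks):
        self.chunks = chunks
        # offsets[i] is the logical index of the first element of chunk i;
        # offsets[-1] is the total length. Built once in O(k).
        self.offsets = [0] + list(accumulate(len(c) for c in chunks))

    def __len__(self):
        return self.offsets[-1]

    def __getitem__(self, index):
        if not 0 <= index < self.offsets[-1]:
            raise IndexError(index)
        # Last chunk whose starting offset is <= index. bisect_right skips
        # past empty chunks that share the same starting offset.
        chunk_idx = bisect_right(self.offsets, index) - 1
        return self.chunks[chunk_idx][index - self.offsets[chunk_idx]]


lookup = ChunkedLookup([[10, 20], [], [30, 40, 50]])
third = lookup[2]  # first element of the last chunk
```

The linear alternative walks the chunks and subtracts lengths until the index fits, which is the O(n_chunks) behavior the thread title asks about.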

[C++] - How to parallelize parquet column read operation

2021-03-16 Thread Yeshwanth Sriram
Hello, I’ve managed to implement an ADLFS/gen2 filesystem with readers/writers. I’m also able to read through data from ADLFS via the parquet reader using my implementation. It is modeled on the s3fs implementation. Question. - Is there a way to parallelize the column read operation using multiple threads