Thanks for referencing this, Antoine. The concepts and principles seem pretty concrete, so I will take some time to read it in detail.
BTW, I noticed from the current discussion in ticket ARROW-7272 [1] that it is still unclear whether this one or the IPC Flatbuffers approach would be the better fit for Java/C++ interchange, is it not?

Best,
Hongze

[1] https://issues.apache.org/jira/browse/ARROW-7272

On Wed, 2019-11-27 at 11:19 +0100, Antoine Pitrou wrote:
> To set up bridges between Java and C++, the C data interface
> specification may help:
> https://github.com/apache/arrow/pull/5442
>
> There's an implementation for C++ here, and it also includes a Python-R
> bridge able to share Arrow data between two different runtimes (i.e.
> PyArrow and R-Arrow potentially compiled with different toolchains and
> different ABIs):
> https://github.com/apache/arrow/pull/5608
>
> Regards
>
> Antoine.
>
>
> On 27/11/2019 11:16, Hongze Zhang wrote:
> > Hi Micah,
> >
> > Regarding our use cases, we'd use the API on Parquet files with some
> > pushed filters and projectors, and we'd extend the C++ Datasets code
> > to provide the necessary support for our own data formats.
> >
> > > If JNI is seen as too cumbersome, another possible avenue to pursue
> > > is writing a gRPC wrapper around the DataSet metadata capabilities.
> > > One could then create a facade on top of that for Java. For data
> > > reads, I can see either building a Flight server or directly using
> > > the JNI readers.
> >
> > Thanks for your suggestion, but I'm not entirely getting it. Does this
> > mean starting a separate gRPC/Flight server process to deal with the
> > metadata/data exchange problem between the Java and C++ Datasets? If
> > yes, then in some cases doesn't it easily introduce bigger problems
> > around the life cycle and resource management of those processes?
> > Please correct me if I misunderstood your point.
> >
> > And IMHO I am not too concerned about the possible inconsistencies and
> > bugs brought by a Java port of something like the Datasets framework.
> > Inconsistencies are usually, in a way, inevitable between two
> > different languages' implementations of the same component, but there
> > is supposed to be a trade-off based on whether the implementations are
> > worth providing. I haven't had the chance to fully investigate the
> > requirements of Datasets-Java from other projects, so I'm not 100%
> > sure, but functionality such as source discovery, predicate pushdown
> > and multi-format support could be powerful for many scenarios. Anyway,
> > I'm totally with you that the amount of work could be huge and bugs
> > might be introduced, so my goal is to start from a small piece of the
> > APIs to minimize the initial work. What do you think?
> >
> > Thanks,
> > Hongze
> >
> >
> > At 2019-11-27 16:00:35, "Micah Kornfield" <emkornfi...@gmail.com> wrote:
> > > Hi Hongze,
> > > I have a strong preference for not porting non-trivial logic from
> > > one language to another, especially if the main goal is performance.
> > > I think this will replicate bugs and cause confusion if
> > > inconsistencies occur. It is also a non-trivial amount of work to
> > > develop, review, set up CI, etc.
> > >
> > > If JNI is seen as too cumbersome, another possible avenue to pursue
> > > is writing a gRPC wrapper around the DataSet metadata capabilities.
> > > One could then create a facade on top of that for Java. For data
> > > reads, I can see either building a Flight server or directly using
> > > the JNI readers.
> > >
> > > In either case this is a non-trivial amount of work, so I, at least,
> > > would appreciate a short write-up (1-2 pages) explicitly stating the
> > > goals/use cases for the library and a high-level design (a component
> > > overview, the relationships between components, and how it will
> > > co-exist with existing Java code). If I understand correctly, one
> > > goal is to use this as a basis for a new Spark DataSet API with
> > > better performance than the vectorized Spark Parquet reader?
> > > Are there others?
> > >
> > > Wes, what are your thoughts on this?
> > >
> > > Thanks,
> > > Micah
> > >
> > >
> > > On Tue, Nov 26, 2019 at 10:51 PM Hongze Zhang <notify...@126.com> wrote:
> > > >
> > > > Hi Wes and Micah,
> > > >
> > > > Thanks for your kind reply.
> > > >
> > > > Micah: We don't use the Spark (vectorized) Parquet reader because
> > > > it is a pure Java implementation; performance could be worse than
> > > > doing the same work natively. Another reason is that we may need
> > > > to integrate some other specific data sources with Arrow Datasets;
> > > > to limit the workload, we would like to maintain a common read
> > > > pipeline for both those and other widely used data sources like
> > > > Parquet and CSV.
> > > >
> > > > Wes: Yes, the Datasets framework along with the Parquet/CSV/...
> > > > reader implementations is totally native, so a JNI bridge will be
> > > > needed and we won't actually read files in Java.
> > > >
> > > > Another concern of mine is how many C++ Datasets components should
> > > > be bridged via JNI. For example, bridge the ScanTask only? Or
> > > > bridge more components, including Scanner, Table, even the
> > > > DataSource discovery system? Or just bridge the C++ Arrow Parquet
> > > > and ORC readers (as Micah said, orc-jni is already there) and
> > > > reimplement everything needed by Datasets in Java? This might not
> > > > be that easy to decide, but currently, based on my limited
> > > > perspective, I would prefer to start from the ScanTask layer; as a
> > > > result we could leverage some valuable work finished in the C++
> > > > Datasets and wouldn't have to maintain too much tedious JNI code.
> > > > The real IO process still takes place inside the C++ readers when
> > > > we do a scan operation.
> > > >
> > > > So Wes, Micah, is this similar to your consideration?
> > > >
> > > > Thanks,
> > > > Hongze
> > > >
> > > > At 2019-11-27 12:39:52, "Micah Kornfield" <emkornfi...@gmail.com> wrote:
> > > > > Hi Hongze,
> > > > > To add to Wes's point, there are already some efforts to do JNI
> > > > > for ORC (which needs to be integrated with CI) and some open PRs
> > > > > for Parquet in the project. However, given that you are using
> > > > > Spark, I would expect there is already dataset functionality
> > > > > that is equivalent to the dataset API to do rowgroup/partition
> > > > > level filtering. Can you elaborate on what problems you are
> > > > > seeing with those and what additional use cases you have?
> > > > >
> > > > > Thanks,
> > > > > Micah
> > > > >
> > > > >
> > > > > On Tue, Nov 26, 2019 at 1:10 PM Wes McKinney <wesmck...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > hi Hongze,
> > > > > >
> > > > > > The Datasets functionality is indeed extremely useful, and it
> > > > > > may make sense to have it available in many languages
> > > > > > eventually. With Java, I would raise the issue that things are
> > > > > > comparatively weaker there when it comes to actually reading
> > > > > > the files themselves. Whereas we have reasonably fast
> > > > > > Arrow-based interfaces to CSV, JSON, ORC, and Parquet in C++,
> > > > > > the same is not true in Java. Not a deal breaker but worth
> > > > > > taking into consideration.
> > > > > >
> > > > > > I wonder aloud whether it might be worth investing in a
> > > > > > JNI-based interface to the C++ libraries as one potential
> > > > > > approach to save on development time.
> > > > > >
> > > > > > - Wes
> > > > > >
> > > > > > On Tue, Nov 26, 2019 at 5:54 AM Hongze Zhang <notify...@126.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Recently the Datasets API has been improved a lot, and I
> > > > > > > found some of the new features very useful for my own work.
> > > > > > > For example, an important one to me is the fix of ARROW-6952
> > > > > > > [1]. And as I currently work on Java/Scala projects like
> > > > > > > Spark, I am now investigating a way to call some of the
> > > > > > > Datasets APIs in Java so that I could gain performance
> > > > > > > improvements from native dataset filters/projectors.
> > > > > > > Meanwhile, I am also interested in the ability to scan
> > > > > > > different data sources that the Dataset API provides.
> > > > > > >
> > > > > > > Regarding using Datasets in Java, my initial idea is to port
> > > > > > > (by writing Java-version implementations) some of the
> > > > > > > high-level concepts in Java, such as
> > > > > > > DataSourceDiscovery/DataSet/Scanner/FileFormat, then create
> > > > > > > and call lower-level record batch iterators via JNI. This
> > > > > > > way we seem to retain the performance advantages of the C++
> > > > > > > dataset code.
> > > > > > >
> > > > > > > Is anyone else interested in this topic? Or is this
> > > > > > > something already on the development plan? Any feedback or
> > > > > > > thoughts would be much appreciated.
> > > > > > >
> > > > > > > Best,
> > > > > > > Hongze
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/ARROW-6952