Thanks for referencing this, Antoine. The concepts and principles seem pretty concrete, so I will take some time to read it in detail.
BTW, I noticed from the current discussion in ticket ARROW-7272 [1] that it is still unclear whether this one or the IPC Flatbuffers approach would be the better fit for Java/C++ interchange, is it not?

Best,
Hongze

[1] https://issues.apache.org/jira/browse/ARROW-7272

On Wed, 2019-11-27 at 11:19 +0100, Antoine Pitrou wrote:
> To set up bridges between Java and C++, the C data interface
> specification may help:
> https://github.com/apache/arrow/pull/5442
>
> There's an implementation for C++ here, and it also includes a Python-R
> bridge able to share Arrow data between two different runtimes (i.e.
> PyArrow and R-Arrow potentially compiled with different toolchains and
> different ABIs):
> https://github.com/apache/arrow/pull/5608
>
> Regards
>
> Antoine.
>
>
> On 27/11/2019 11:16, Hongze Zhang wrote:
> > Hi Micah,
> >
> > Regarding our use cases, we'd use the API on Parquet files with some
> > pushed filters and projectors, and we'd extend the C++ Datasets code
> > to provide the necessary support for our own data formats.
> >
> > > If JNI is seen as too cumbersome, another possible avenue to pursue
> > > is writing a gRPC wrapper around the DataSet metadata capabilities.
> > > One could then create a facade on top of that for Java. For data
> > > reads, I can see either building a Flight server or directly using
> > > the JNI readers.
> >
> > Thanks for your suggestion, but I'm not entirely getting it. Does this
> > mean starting a separate gRPC/Flight server process to deal with the
> > metadata/data exchange problem between the Java and C++ Datasets? If
> > yes, then in some cases doesn't it easily introduce bigger problems
> > around the life cycle and resource management of those processes?
> > Please correct me if I misunderstood your point.
> >
> > And IMHO I am not too concerned about the possible inconsistencies and
> > bugs brought by a Java port of something like the Datasets framework.
> > Inconsistencies are usually, in a way, inevitable between two
> > different languages' implementations of the same component, but there
> > is supposed to be a trade-off based on whether the implementations are
> > worth providing. I haven't had the chance to fully investigate the
> > requirements of Datasets-Java from other projects, so I'm not 100%
> > sure, but functionality such as source discovery, predicate pushdown
> > and multi-format support could be powerful for many scenarios. Anyway,
> > I'm totally with you that the amount of work could be huge and bugs
> > might be introduced, so my goal is to start from a small piece of the
> > APIs to minimize the initial work. What do you think?
> >
> > Thanks,
> > Hongze
> >
> >
> > At 2019-11-27 16:00:35, "Micah Kornfield" <emkornfi...@gmail.com> wrote:
> > > Hi Hongze,
> > > I have a strong preference for not porting non-trivial logic from
> > > one language to another, especially if the main goal is performance.
> > > I think this will replicate bugs and cause confusion if
> > > inconsistencies occur. It is also a non-trivial amount of work to
> > > develop, review, set up CI, etc.
> > >
> > > If JNI is seen as too cumbersome, another possible avenue to pursue
> > > is writing a gRPC wrapper around the DataSet metadata capabilities.
> > > One could then create a facade on top of that for Java. For data
> > > reads, I can see either building a Flight server or directly using
> > > the JNI readers.
> > >
> > > In either case this is a non-trivial amount of work, so I, at least,
> > > would appreciate a short write-up (1-2 pages) explicitly stating the
> > > goals/use cases for the library and a high-level design (a component
> > > overview, the relationships between components, and how it will
> > > co-exist with existing Java code). If I understand correctly, one
> > > goal is to use this as a basis for a new Spark DataSet API with
> > > better performance than the vectorized Spark Parquet reader?
> > > Are there others?
> > >
> > > Wes, what are your thoughts on this?
> > >
> > > Thanks,
> > > Micah
> > >
> > >
> > > On Tue, Nov 26, 2019 at 10:51 PM Hongze Zhang <notify...@126.com> wrote:
> > > >
> > > > Hi Wes and Micah,
> > > >
> > > > Thanks for your kind reply.
> > > >
> > > > Micah: We don't use the Spark (vectorized) Parquet reader because
> > > > it is a pure Java implementation; performance could be worse than
> > > > doing the same work natively. Another reason is that we may need
> > > > to integrate some other specific data sources with Arrow Datasets;
> > > > to limit the workload, we would like to maintain a common read
> > > > pipeline for both those and other widely used data sources like
> > > > Parquet and CSV.
> > > >
> > > > Wes: Yes, the Datasets framework along with the Parquet/CSV/...
> > > > reader implementations is totally native, so a JNI bridge will be
> > > > needed and we won't actually read files in Java.
> > > >
> > > > Another concern of mine is how many C++ Datasets components should
> > > > be bridged via JNI. For example, bridge the ScanTask only? Or
> > > > bridge more components, including Scanner, Table, even the
> > > > DataSource discovery system? Or just bridge the C++ Arrow Parquet
> > > > and ORC readers (as Micah said, orc-jni is already there) and
> > > > reimplement everything needed by Datasets in Java? This might not
> > > > be that easy to decide, but currently, based on my limited
> > > > perspective, I would prefer to start from the ScanTask layer; as a
> > > > result we could leverage some valuable work finished in the C++
> > > > Datasets and wouldn't have to maintain too much tedious JNI code.
> > > > The real IO process still takes place inside the C++ readers when
> > > > we do a scan operation.
> > > >
> > > > So Wes, Micah, is this similar to your consideration?
> > > >
> > > > Thanks,
> > > > Hongze
> > > >
> > > > At 2019-11-27 12:39:52, "Micah Kornfield" <emkornfi...@gmail.com> wrote:
> > > > > Hi Hongze,
> > > > > To add to Wes's point, there are already some efforts to do JNI
> > > > > for ORC (which needs to be integrated with CI) and some open PRs
> > > > > for Parquet in the project. However, given that you are using
> > > > > Spark, I would expect there is already dataset functionality
> > > > > that is equivalent to the dataset API to do rowgroup/partition
> > > > > level filtering. Can you elaborate on what problems you are
> > > > > seeing with those and what additional use cases you have?
> > > > >
> > > > > Thanks,
> > > > > Micah
> > > > >
> > > > >
> > > > > On Tue, Nov 26, 2019 at 1:10 PM Wes McKinney <wesmck...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > hi Hongze,
> > > > > >
> > > > > > The Datasets functionality is indeed extremely useful, and it
> > > > > > may make sense to have it available in many languages
> > > > > > eventually. With Java, I would raise the issue that things are
> > > > > > comparatively weaker there when it comes to actually reading
> > > > > > the files themselves. Whereas we have reasonably fast
> > > > > > Arrow-based interfaces to CSV, JSON, ORC, and Parquet in C++,
> > > > > > the same is not true in Java. Not a deal breaker but worth
> > > > > > taking into consideration.
> > > > > >
> > > > > > I wonder aloud whether it might be worth investing in a
> > > > > > JNI-based interface to the C++ libraries as one potential
> > > > > > approach to save on development time.
> > > > > >
> > > > > > - Wes
> > > > > >
> > > > > > On Tue, Nov 26, 2019 at 5:54 AM Hongze Zhang <notify...@126.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Recently the Datasets API has been improved a lot, and I
> > > > > > > found some of the new features very useful for my own work.
> > > > > > > For example, an important one to me is the fix of ARROW-6952
> > > > > > > [1]. And as I currently work on Java/Scala projects like
> > > > > > > Spark, I am now investigating a way to call some of the
> > > > > > > Datasets APIs in Java so that I could gain performance
> > > > > > > improvements from native dataset filters/projectors.
> > > > > > > Meanwhile, I am also interested in the ability to scan
> > > > > > > different data sources that the Dataset API provides.
> > > > > > >
> > > > > > > Regarding using Datasets in Java, my initial idea is to port
> > > > > > > (by writing Java-version implementations) some of the
> > > > > > > high-level concepts in Java, such as
> > > > > > > DataSourceDiscovery/DataSet/Scanner/FileFormat, then create
> > > > > > > and call lower-level record batch iterators via JNI. This
> > > > > > > way we seem to retain the performance advantages of the C++
> > > > > > > dataset code.
> > > > > > >
> > > > > > > Is anyone else interested in this topic? Or is this
> > > > > > > something already on the development plan? Any feedback or
> > > > > > > thoughts would be much appreciated.
> > > > > > >
> > > > > > > Best,
> > > > > > > Hongze
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/ARROW-6952