[ https://issues.apache.org/jira/browse/ARROW-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074387#comment-17074387 ]
Hongze Zhang commented on ARROW-7808:
-------------------------------------

Thanks, everyone, for your suggestions, and sorry for such a late reply. I've been busy with other work and am now picking this up again.

In my organization we have been maintaining a runnable implementation[1] for several months. It may not be completely ready for an upstream PR, but it does show my main design.

I see that you prefer a high-level approach, and I agree with that. My current implementation has some classes that may look "lower level" in Java, such as DataFragment[2] and ScanTask[3], but developers never have to implement them for specific source formats: NativeDataFragment[4] and NativeScanTask[5] cover all cases. The same design is applied to DataSource[6][7], so further development only has to bridge the C++ DataSourceDiscovery implementations. Here is an example[8] from us that adds an arrow::dataset::SingleFileDataSource and uses it from Java.

I also know that the C++ API has been reworked a lot in the newest upstream code (the removal of DataSource, the renaming of DataSourceDiscovery, and so on), so I will have some extra work to do to make things match when rebasing.

Sorry again for the delay. Please let me know if you have any thoughts. Thanks.
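
To make the layering described above concrete, here is a minimal, self-contained sketch. It is not the code from the linked branch. Only the names DataFragment, ScanTask, NativeDataFragment and NativeScanTask come from the comment; the method names, NativeBridgeStub, and scanAll are hypothetical. Strings stand in for record batches, and an in-memory stub stands in for the JNI calls so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DatasetSketch {

    /** A unit of scanning work. Strings stand in for record batches (assumption). */
    interface ScanTask {
        List<String> scan();
    }

    /** A piece of a dataset that can be split into scan tasks. */
    interface DataFragment {
        List<ScanTask> scanTasks();
    }

    /**
     * Hypothetical stand-in for the JNI layer. A real bridge would pass the
     * handle to C++; here a handle is just an index into an in-memory list.
     */
    static final class NativeBridgeStub {
        private final List<List<String>> tables = new ArrayList<>();

        long register(List<String> batches) {
            tables.add(new ArrayList<>(batches));
            return tables.size() - 1;
        }

        List<String> read(long handle) {
            return tables.get((int) handle);
        }
    }

    /** Single format-agnostic ScanTask: it only forwards a handle to the bridge. */
    static final class NativeScanTask implements ScanTask {
        private final NativeBridgeStub bridge;
        private final long handle;

        NativeScanTask(NativeBridgeStub bridge, long handle) {
            this.bridge = bridge;
            this.handle = handle;
        }

        @Override
        public List<String> scan() {
            return bridge.read(handle);
        }
    }

    /** Single format-agnostic DataFragment: it wraps the handles of its scan tasks. */
    static final class NativeDataFragment implements DataFragment {
        private final NativeBridgeStub bridge;
        private final long[] taskHandles;

        NativeDataFragment(NativeBridgeStub bridge, long[] taskHandles) {
            this.bridge = bridge;
            this.taskHandles = taskHandles.clone();
        }

        @Override
        public List<ScanTask> scanTasks() {
            List<ScanTask> tasks = new ArrayList<>();
            for (long handle : taskHandles) {
                tasks.add(new NativeScanTask(bridge, handle));
            }
            return tasks;
        }
    }

    /** Caller code depends only on the interfaces, never on a source format. */
    static List<String> scanAll(DataFragment fragment) {
        List<String> out = new ArrayList<>();
        for (ScanTask task : fragment.scanTasks()) {
            out.addAll(task.scan());
        }
        return out;
    }

    public static void main(String[] args) {
        NativeBridgeStub bridge = new NativeBridgeStub();
        long a = bridge.register(Arrays.asList("batch-0", "batch-1"));
        long b = bridge.register(Arrays.asList("batch-2"));
        DataFragment fragment = new NativeDataFragment(bridge, new long[] {a, b});
        System.out.println(scanAll(fragment));
    }
}
```

The point of the sketch is that supporting a new source format would change only what sits behind the bridge; the Java-side classes stay the same.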

[1] [https://github.com/zhztheplayer/arrow-1/commits/ARROW-7808]
[2] [https://github.com/zhztheplayer/arrow-1/blob/b2e98bfaf24565a6b23ecd36b9c3c2c264be51f8/java/dataset/src/main/java/org/apache/arrow/dataset/fragment/DataFragment.java]
[3] [https://github.com/zhztheplayer/arrow-1/blob/b2e98bfaf24565a6b23ecd36b9c3c2c264be51f8/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanTask.java]
[4] [https://github.com/zhztheplayer/arrow-1/blob/b2e98bfaf24565a6b23ecd36b9c3c2c264be51f8/java/dataset/src/main/java/org/apache/arrow/dataset/jni/NativeDataFragment.java]
[5] [https://github.com/zhztheplayer/arrow-1/blob/b2e98bfaf24565a6b23ecd36b9c3c2c264be51f8/java/dataset/src/main/java/org/apache/arrow/dataset/jni/NativeScanTask.java]
[6] [https://github.com/zhztheplayer/arrow-1/commit/b2e98bfaf24565a6b23ecd36b9c3c2c264be51f8#diff-deea6cb88ea63d76f71b7b4cfd173206]
[7] [https://github.com/zhztheplayer/arrow-1/blob/b2e98bfaf24565a6b23ecd36b9c3c2c264be51f8/java/dataset/src/main/java/org/apache/arrow/dataset/jni/NativeDataSource.java]
[8] [https://github.com/zhztheplayer/arrow-1/commit/7cb13b96e81fd153c4ad9c68aff00f032abb5110]

> [Java][Dataset] Implement Datasets Java API
> --------------------------------------------
>
>                 Key: ARROW-7808
>                 URL: https://issues.apache.org/jira/browse/ARROW-7808
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++ - Dataset, Java
>            Reporter: Hongze Zhang
>            Priority: Major
>              Labels: dataset
>
> Porting following C++ Datasets APIs to Java:
> * DataSource
> * DataSourceDiscovery
> * DataFragment
> * Dataset
> * Scanner
> * ScanTask
> * ScanOptions

--
This message was sent by Atlassian Jira
(v8.3.4#803005)