[ 
https://issues.apache.org/jira/browse/ARROW-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074387#comment-17074387
 ] 

Hongze Zhang commented on ARROW-7808:
-------------------------------------

Thanks, everyone, for your suggestions, and sorry for such a late reply. I've 
been busy with other work and am now continuing to work on this. In my 
organization we have been maintaining a runnable implementation[1] for several 
months. It may not be completely ready for an upstream PR, but it shows my main 
design.

I see that you prefer a high-level approach, and I agree with that. In my 
current implementation, some classes may look "lower level" in Java, such as 
DataFragment[2] or ScanTask[3], but future developers never have to write 
implementations for specific source formats: NativeDataFragment[4] and 
NativeScanTask[5] cover all cases. The same design is applied to 
DataSource[6][7], so in further development we only have to bridge the C++ 
DataSourceDiscovery implementations. Here is an example[8] from us that adds an 
arrow::dataset::SingleFileDataSource and uses it from Java.
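
To make the layering above a bit more concrete, here is a rough, simplified 
sketch of the idea. It is not the code from the linked branch: the method 
signatures, the NativeBridge interface and the FakeBridge class are all 
hypothetical, and plain strings stand in for Arrow record batches. It only 
shows how one generic "native" implementation can serve every source format by 
holding an opaque ID and delegating to the C++ side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: type names follow the comment, signatures are assumed.
interface ScanTask {
    // Real code would yield Arrow record batches; strings stand in for them.
    List<String> scan();
}

interface DataFragment {
    List<ScanTask> scanTasks();
}

// Assumed stand-in for the JNI layer; a real bridge would call into C++.
interface NativeBridge {
    long[] createScanTasks(long fragmentId);
    List<String> readBatches(long scanTaskId);
}

// One generic scan task: it only knows an opaque ID of a C++ object.
final class NativeScanTask implements ScanTask {
    private final NativeBridge bridge;
    private final long scanTaskId;

    NativeScanTask(NativeBridge bridge, long scanTaskId) {
        this.bridge = bridge;
        this.scanTaskId = scanTaskId;
    }

    @Override
    public List<String> scan() {
        return bridge.readBatches(scanTaskId);
    }
}

// One generic fragment: format-specific logic stays on the C++ side.
final class NativeDataFragment implements DataFragment {
    private final NativeBridge bridge;
    private final long fragmentId;

    NativeDataFragment(NativeBridge bridge, long fragmentId) {
        this.bridge = bridge;
        this.fragmentId = fragmentId;
    }

    @Override
    public List<ScanTask> scanTasks() {
        List<ScanTask> tasks = new ArrayList<>();
        for (long id : bridge.createScanTasks(fragmentId)) {
            tasks.add(new NativeScanTask(bridge, id));
        }
        return tasks;
    }
}

// In-memory fake so the sketch runs without any native library.
final class FakeBridge implements NativeBridge {
    @Override
    public long[] createScanTasks(long fragmentId) {
        return new long[] {fragmentId * 10, fragmentId * 10 + 1};
    }

    @Override
    public List<String> readBatches(long scanTaskId) {
        return Arrays.asList("batch-" + scanTaskId);
    }
}

final class DatasetSketchDemo {
    public static void main(String[] args) {
        DataFragment fragment = new NativeDataFragment(new FakeBridge(), 7L);
        for (ScanTask task : fragment.scanTasks()) {
            System.out.println(task.scan());
        }
    }
}
```

In this sketch, supporting a new source format would only mean exposing another 
C++ object through the bridge; the Java classes stay unchanged.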

I also know that the C++ API has been reworked a lot in the latest upstream 
code (the removal of DataSource, the renaming of DataSourceDiscovery, and so 
on), so I will need to do some extra work to make things match when rebasing.

Sorry again for the delay. Please let me know if you have any thoughts. Thanks.

[1] https://github.com/zhztheplayer/arrow-1/commits/ARROW-7808
[2] https://github.com/zhztheplayer/arrow-1/blob/b2e98bfaf24565a6b23ecd36b9c3c2c264be51f8/java/dataset/src/main/java/org/apache/arrow/dataset/fragment/DataFragment.java
[3] https://github.com/zhztheplayer/arrow-1/blob/b2e98bfaf24565a6b23ecd36b9c3c2c264be51f8/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanTask.java
[4] https://github.com/zhztheplayer/arrow-1/blob/b2e98bfaf24565a6b23ecd36b9c3c2c264be51f8/java/dataset/src/main/java/org/apache/arrow/dataset/jni/NativeDataFragment.java
[5] https://github.com/zhztheplayer/arrow-1/blob/b2e98bfaf24565a6b23ecd36b9c3c2c264be51f8/java/dataset/src/main/java/org/apache/arrow/dataset/jni/NativeScanTask.java
[6] https://github.com/zhztheplayer/arrow-1/commit/b2e98bfaf24565a6b23ecd36b9c3c2c264be51f8#diff-deea6cb88ea63d76f71b7b4cfd173206
[7] https://github.com/zhztheplayer/arrow-1/blob/b2e98bfaf24565a6b23ecd36b9c3c2c264be51f8/java/dataset/src/main/java/org/apache/arrow/dataset/jni/NativeDataSource.java
[8] https://github.com/zhztheplayer/arrow-1/commit/7cb13b96e81fd153c4ad9c68aff00f032abb5110

> [Java][Dataset] Implement Datasets Java API 
> --------------------------------------------
>
>                 Key: ARROW-7808
>                 URL: https://issues.apache.org/jira/browse/ARROW-7808
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++ - Dataset, Java
>            Reporter: Hongze Zhang
>            Priority: Major
>              Labels: dataset
>
> Port the following C++ Datasets APIs to Java:
> * DataSource 
> * DataSourceDiscovery 
> * DataFragment 
> * Dataset
> * Scanner 
> * ScanTask 
> * ScanOptions 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
