[ https://issues.apache.org/jira/browse/ARROW-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041100#comment-17041100 ]

Ben Kietzman commented on ARROW-7808:
-------------------------------------

The dataset API is not stable, so a full 1:1 mapping will be *more* work to 
maintain. For example, https://issues.apache.org/jira/browse/ARROW-7886 would 
remove Source and SourceFactory altogether, which would necessitate refactoring 
both the JNI binding and the Java code that uses it. I recommend exposing only 
the classes that are directly useful for a minimal use case, then exposing 
further classes as they become necessary in follow-ups.

[~fsaintjacques]'s recommendation on the mailing list would be an excellent 
starting point. Alternatively, I recommend following the initial R binding 
work: 
https://github.com/romainfrancois/arrow/blob/9dfba2ea8949a0a0a17393976a97d3a34dc63d39/r/R/dataset.R
This minimally exposes Source, Dataset, Scanner, and the corresponding 
factories. Scans result in a materialized Table (so ScanTasks, Fragments, etc. 
may remain hidden) and take full advantage of predicate/projection pushdown.
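To illustrate how small such a surface could be, here is a rough sketch in plain Java. Every name in it (MinimalDatasetSketch, Dataset, ScannerBuilder, Table, sample) is hypothetical and chosen for illustration only; none of it is the existing Arrow Java or C++ API. It uses in-memory rows where a real binding would hold a native handle and call through JNI. The point is only the shape: the caller sets a projection and a filter, gets back a materialized table, and never sees scan tasks or fragments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch only: these are NOT Arrow classes, and nothing here calls JNI.
final class MinimalDatasetSketch {

    /** Stand-in for a materialized Table: one map of column name to value per row. */
    static final class Table {
        final List<Map<String, Object>> rows;
        Table(List<Map<String, Object>> rows) { this.rows = rows; }
        int numRows() { return rows.size(); }
    }

    /** Stand-in for a Dataset; a real binding would wrap a native pointer instead. */
    static final class Dataset {
        private final List<Map<String, Object>> rows;
        Dataset(List<Map<String, Object>> rows) { this.rows = rows; }
        ScannerBuilder newScan() { return new ScannerBuilder(rows); }
    }

    /** Collects the projection and filter up front so both could be pushed down. */
    static final class ScannerBuilder {
        private final List<Map<String, Object>> rows;
        private List<String> columns = null; // null means "all columns"
        private Predicate<Map<String, Object>> filter = r -> true;

        ScannerBuilder(List<Map<String, Object>> rows) { this.rows = rows; }

        ScannerBuilder project(List<String> columns) { this.columns = columns; return this; }

        ScannerBuilder filter(Predicate<Map<String, Object>> filter) { this.filter = filter; return this; }

        /** Runs the scan and returns a Table; tasks and fragments are never exposed. */
        Table toTable() {
            List<Map<String, Object>> out = new ArrayList<>();
            for (Map<String, Object> row : rows) {
                if (!filter.test(row)) {
                    continue; // row rejected by the predicate
                }
                Iterable<String> selected = (columns == null) ? row.keySet() : columns;
                Map<String, Object> projected = new LinkedHashMap<>();
                for (String column : selected) {
                    projected.put(column, row.get(column));
                }
                out.add(projected);
            }
            return new Table(out);
        }
    }

    /** Builds a tiny three-row dataset with columns "id" and "name". */
    static Dataset sample() {
        List<Map<String, Object>> rows = new ArrayList<>();
        List<String> names = Arrays.asList("a", "b", "c");
        for (int i = 0; i < names.size(); i++) {
            Map<String, Object> row = new LinkedHashMap<>();
            row.put("id", i + 1);
            row.put("name", names.get(i));
            rows.add(row);
        }
        return new Dataset(rows);
    }
}
```

With this sketch, a call such as MinimalDatasetSketch.sample().newScan().project(Arrays.asList("name")).filter(r -> (Integer) r.get("id") > 1).toTable() yields a table holding only the "name" column for the rows that pass the filter. Because the caller depends on just three types, a refactoring on the C++ side like the one proposed in ARROW-7886 would stay contained behind them.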

> [Java][Dataset] Implement Datasets Java API 
> --------------------------------------------
>
>                 Key: ARROW-7808
>                 URL: https://issues.apache.org/jira/browse/ARROW-7808
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++ - Dataset, Java
>            Reporter: Hongze Zhang
>            Priority: Major
>              Labels: dataset
>
> Porting the following C++ Datasets APIs to Java: 
> * DataSource 
> * DataSourceDiscovery 
> * DataFragment 
> * Dataset
> * Scanner 
> * ScanTask 
> * ScanOptions 



