[ https://issues.apache.org/jira/browse/ARROW-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041100#comment-17041100 ]
Ben Kietzman commented on ARROW-7808:
-------------------------------------

The dataset API is not stable; a full 1:1 mapping will be *more* work to maintain. For example, https://issues.apache.org/jira/browse/ARROW-7886 would remove Source and SourceFactory altogether, which would necessitate refactoring both the JNI binding and the Java code that uses it.

I recommend exposing only classes which are directly useful for a minimal use case, then exposing further classes as they become necessary in follow-ups. [~fsaintjacques]'s recommendation on the mailing list would be an excellent starting point.

Alternatively, I recommend following the initial R binding work: https://github.com/romainfrancois/arrow/blob/9dfba2ea8949a0a0a17393976a97d3a34dc63d39/r/R/dataset.R

This minimally exposes Source, Dataset, Scanner, and the corresponding factories. Scans result in a materialized Table (so ScanTasks, Fragments, etc. may remain hidden) and take full advantage of predicate/projection push down.

> [Java][Dataset] Implement Datasets Java API
> -------------------------------------------
>
>                 Key: ARROW-7808
>                 URL: https://issues.apache.org/jira/browse/ARROW-7808
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++ - Dataset, Java
>            Reporter: Hongze Zhang
>            Priority: Major
>              Labels: dataset
>
> Porting following C++ Datasets APIs to Java:
> * DataSource
> * DataSourceDiscovery
> * DataFragment
> * Dataset
> * Scanner
> * ScanTask
> * ScanOptions

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
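
To make the "minimal surface" recommendation in the comment concrete, here is a rough Java sketch of what such an API shape could look like. Every type in it (MinimalDatasetSketch, Dataset, ScannerBuilder, Table) is hypothetical and backed by plain in-memory rows; none of it is the actual Arrow Java or JNI API. It only illustrates the idea that a caller sets a projection and a filter on a scan and gets back one materialized table, while fragments and scan tasks stay out of the public surface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch only: these are NOT the real Arrow Java classes.
public class MinimalDatasetSketch {

    /** Stand-in for a materialized Arrow Table: rows of column name -> value. */
    static final class Table {
        final List<Map<String, Object>> rows;
        Table(List<Map<String, Object>> rows) { this.rows = rows; }
        int numRows() { return rows.size(); }
    }

    /** The only entry point a caller needs; fragments and scan tasks stay hidden. */
    static final class Dataset {
        private final List<Map<String, Object>> data;
        Dataset(List<Map<String, Object>> data) { this.data = data; }
        ScannerBuilder newScan() { return new ScannerBuilder(data); }
    }

    /** Collects the projection and filter that a real binding would hand to C++. */
    static final class ScannerBuilder {
        private final List<Map<String, Object>> data;
        private List<String> columns = null; // null means "all columns"
        private Predicate<Map<String, Object>> filter = row -> true;

        ScannerBuilder(List<Map<String, Object>> data) { this.data = data; }

        ScannerBuilder project(List<String> columns) {
            this.columns = columns;
            return this;
        }

        ScannerBuilder filter(Predicate<Map<String, Object>> filter) {
            this.filter = filter;
            return this;
        }

        /** Runs the scan eagerly and returns the whole result as one Table. */
        Table toTable() {
            List<Map<String, Object>> out = new ArrayList<>();
            for (Map<String, Object> row : data) {
                if (!filter.test(row)) {
                    continue; // row rejected by the predicate
                }
                Iterable<String> names = (columns == null) ? row.keySet() : columns;
                Map<String, Object> projected = new LinkedHashMap<>();
                for (String name : names) {
                    projected.put(name, row.get(name));
                }
                out.add(projected);
            }
            return new Table(out);
        }
    }

    /** Builds one in-memory row; a placeholder for data read from files. */
    static Map<String, Object> row(int id, String name) {
        Map<String, Object> r = new LinkedHashMap<>();
        r.put("id", id);
        r.put("name", name);
        return r;
    }

    /** A tiny three-row dataset used by the demo below. */
    static Dataset exampleDataset() {
        return new Dataset(Arrays.asList(row(1, "a"), row(2, "b"), row(3, "c")));
    }

    public static void main(String[] args) {
        Table t = exampleDataset().newScan()
                .project(Arrays.asList("id"))
                .filter(r -> (Integer) r.get("id") > 1)
                .toTable();
        System.out.println(t.numRows() + " rows: " + t.rows);
    }
}
```

In this shape, only Dataset, the scan builder, and Table would need to cross the JNI boundary, so an upstream change such as the removal of Source and SourceFactory would touch far less binding code than a full 1:1 mapping.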