[ https://issues.apache.org/jira/browse/ARROW-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485005#comment-17485005 ]
Weston Pace commented on ARROW-15317:
-------------------------------------

If we go this route are we effectively defining yet another table format? Albeit a rather limited one. For example, we could say that a dataset can be created from a vector of (path, filesystem, partition_expression, row_groups). But then, in what format, as R objects? So, for example, if we had an "iceberg R package" then either:

 * The Iceberg R package imports the Arrow R package to get the R types
 * There is a third package the "Iceberg-R-Arrow-Adapter" package which converts from Iceberg R objects to Arrow R objects

Also wandering along this path you also might brainstorm/encounter "A stable C ABI for datasets".

On the other hand, Arrow's C++ lib could pick a table format (e.g. Iceberg) that we use to define our "datasets API". So then we could have an export "Arrow dataset -> Iceberg Table" and "Iceberg Table -> Arrow dataset" in the C++ lib.

> [R] Expose API to create Dataset from Fragments
> -----------------------------------------------
>
>                 Key: ARROW-15317
>                 URL: https://issues.apache.org/jira/browse/ARROW-15317
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 6.0.1
>            Reporter: Will Jones
>            Priority: Minor
>
> Third-party packages may define dataset factories for table formats like
> Delta Lake and Apache Iceberg. These formats store metadata like schema, file
> lists, and file-level statistics on the side, and can construct a dataset
> without a discovery process needed. Python exposed enough API to do this
> successfully for [a Delta Lake dataset reader
> here|https://github.com/delta-io/delta-rs/blob/6a8195d6e3cbdcb0c58a14a3ffccc472dd094de0/python/deltalake/table.py#L267-L280].
> I propose adding the following to the R API:
> * Expose {{Fragment}} as an R6 object
> * Add the {{MakeFragment}} method to various file format objects. It's key
> that {{partition_expression}} is included as an argument. ([See Python
> equivalent
> here|https://github.com/apache/arrow/blob/ab86daf3f7c8a67bee6a175a749575fd40417d27/python/pyarrow/_dataset_parquet.pyx#L209-L210])
> * Add a dataset constructor that takes a list of {{Fragments}}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)