[ https://issues.apache.org/jira/browse/ARROW-15281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Li updated ARROW-15281:
-----------------------------
    Labels: dataset query-engine  (was: )

> [C++] Implement ability to retrieve fragment filename
> -----------------------------------------------------
>
>                 Key: ARROW-15281
>                 URL: https://issues.apache.org/jira/browse/ARROW-15281
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nicola Crane
>            Priority: Major
>              Labels: dataset, query-engine
>
> A user has requested the ability to include the filename of the CSV in the
> dataset output - see discussion on ARROW-15260 for more context.
> Relevant info from that ticket:
>
> "From a C++ perspective we've got many of the pieces needed already. One
> challenge is that the datasets API is written to work with "fragments" and
> not "files". For example, a dataset might be an in-memory table in which case
> we are working with InMemoryFragment and not FileFragment so there is no
> concept of "filename".
>
> That being said, the low level ScanBatchesAsync method actually returns a
> generator of TaggedRecordBatch for this very purpose. A TaggedRecordBatch is
> a struct with the record batch as well as the source fragment for that record
> batch.
>
> So if you were to execute scan, you could inspect the fragment and, if it is
> a FileFragment, you could extract the filename.
>
> Another challenge is that R is moving towards more and more access through an
> exec plan and not directly using a scanner. In order for that to work we
> would need to augment the scan results with the filename in C++ before
> sending into the exec plan. Luckily, we already do this a bit as well. We
> currently augment the scan results with fragment index, batch index, and
> whether the batch is the last batch in the fragment.
>
> Since ExecBatch can work with constants efficiently I don't think there will
> be much performance cost in always including the filename.
>
> So the work remaining is simply to add a new augmented field
> __fragment_source_name which is always attached if the underlying fragment
> is a filename. Then users can get this field if they want by including
> "__fragment_source_name" in the list of columns they query for."

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
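
For readers following the proposal, here is a minimal sketch of the idea quoted above: each scanned batch is tagged with its source fragment, and a `__fragment_source_name` column is attached only when that fragment is file-backed. The classes below are hypothetical stand-ins written in plain Python to illustrate the flow; they are not Arrow's actual C++ types, and the helper names `source_name`, `augment`, and `scan` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Optional


@dataclass
class Fragment:
    """Hypothetical stand-in for a dataset fragment (not Arrow's class)."""


@dataclass
class InMemoryFragment(Fragment):
    """A fragment backed by an in-memory table, so it has no filename."""


@dataclass
class FileFragment(Fragment):
    """A fragment backed by a file."""
    path: str


@dataclass
class TaggedRecordBatch:
    """A record batch paired with the fragment it was read from."""
    record_batch: Dict[str, List]  # column name -> values (simplified batch)
    fragment: Fragment


def source_name(fragment: Fragment) -> Optional[str]:
    """Return the path for file-backed fragments, otherwise None."""
    if isinstance(fragment, FileFragment):
        return fragment.path
    return None


def augment(tagged: TaggedRecordBatch) -> Dict[str, List]:
    """Attach __fragment_source_name when the fragment has a filename."""
    batch = dict(tagged.record_batch)
    name = source_name(tagged.fragment)
    if name is not None:
        num_rows = len(next(iter(batch.values()), []))
        # Simplification: the value is repeated per row here, whereas the
        # ticket suggests an ExecBatch could hold it as a single constant.
        batch["__fragment_source_name"] = [name] * num_rows
    return batch


def scan(tagged_batches: Iterable[TaggedRecordBatch],
         columns: List[str]) -> Iterator[Dict[str, List]]:
    """Yield augmented batches projected onto the requested columns."""
    for tagged in tagged_batches:
        augmented = augment(tagged)
        yield {c: augmented[c] for c in columns if c in augmented}


batches = [
    TaggedRecordBatch({"x": [1, 2]}, FileFragment("data/part-0.csv")),
    TaggedRecordBatch({"x": [3]}, InMemoryFragment()),
]
results = list(scan(batches, ["x", "__fragment_source_name"]))
```

In this sketch the file-backed batch carries the extra column only because the caller asked for it, and the in-memory batch simply omits it, which mirrors the "attached if the underlying fragment is a file" behaviour described in the ticket.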