Nicola Crane created ARROW-15281:
------------------------------------

             Summary: [C++] Implement ability to retrieve fragment filename
                 Key: ARROW-15281
                 URL: https://issues.apache.org/jira/browse/ARROW-15281
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Nicola Crane


A user has requested the ability to include the filename of the CSV in the 
dataset output - see discussion on ARROW-15260 for more context.

Relevant info from that ticket:

 
"From a C++ perspective we've got many of the pieces needed already. One 
challenge is that the datasets API is written to work with "fragments" and not 
"files". For example, a dataset might be an in-memory table in which case we 
are working with InMemoryFragment and not FileFragment so there is no concept 
of "filename".

That being said, the low level ScanBatchesAsync method actually returns a 
generator of TaggedRecordBatch for this very purpose. A TaggedRecordBatch is a 
struct with the record batch as well as the source fragment for that record 
batch.

So if you were to execute scan, you could inspect the fragment and, if it is a 
FileFragment, you could extract the filename.

Another challenge is that R is moving towards more and more access through an 
exec plan and not directly using a scanner. In order for that to work we would 
need to augment the scan results with the filename in C++ before sending into 
the exec plan. Luckily, we already do this a bit as well. We currently augment 
the scan results with fragment index, batch index, and whether the batch is the 
last batch in the fragment.

Since ExecBatch can work with constants efficiently I don't think there will be 
much performance cost in always including the filename. So the work remaining 
is simply to add a new augmented field __fragment_source_name which is always 
attached if the underlying fragment is a filename. Then users can get this 
field if they want by including "__fragment_source_name" in the list of 
columns they query for."
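
To make the quoted proposal concrete, here is a minimal, self-contained sketch 
of the idea. It does not use Arrow's real classes: Fragment, InMemoryFragment, 
FileFragment, RecordBatch and TaggedRecordBatch below are simplified stand-ins 
that only borrow the names from the discussion, and SourceFilename, 
AugmentedBatch and Augment are hypothetical helpers invented for illustration. 
The sketch assumes the filename would be carried once per batch as a constant 
and left empty when the fragment is not file-backed.

```cpp
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Simplified stand-ins for illustration only; NOT Arrow's real classes.
struct Fragment {
  virtual ~Fragment() = default;
};

// An in-memory source has no notion of a filename.
struct InMemoryFragment : Fragment {};

// A file-backed source knows the path it was read from.
struct FileFragment : Fragment {
  explicit FileFragment(std::string p) : path(std::move(p)) {}
  std::string path;
};

// Stand-in for a record batch: only a row count is modelled.
struct RecordBatch {
  int num_rows = 0;
};

// Mirrors the idea of a tagged batch: the data plus its source fragment.
struct TaggedRecordBatch {
  RecordBatch batch;
  std::shared_ptr<Fragment> fragment;
};

// Hypothetical helper: returns the filename if the source fragment is
// file-backed, otherwise an empty optional.
std::optional<std::string> SourceFilename(const TaggedRecordBatch& tagged) {
  if (auto* file = dynamic_cast<const FileFragment*>(tagged.fragment.get())) {
    return file->path;
  }
  return std::nullopt;
}

// Hypothetical augmented batch: the filename is stored once per batch as a
// constant (the proposed "__fragment_source_name"), not repeated per row.
struct AugmentedBatch {
  RecordBatch batch;
  std::optional<std::string> fragment_source_name;
};

// Hypothetical helper: attaches the source name before the batch would be
// handed on to later processing.
AugmentedBatch Augment(const TaggedRecordBatch& tagged) {
  return AugmentedBatch{tagged.batch, SourceFilename(tagged)};
}
```

With these stand-ins, a batch tagged with a FileFragment for 
"data/part-0.csv" yields that path from SourceFilename, while a batch tagged 
with an InMemoryFragment yields an empty optional. The actual change would 
need to be made inside Arrow's scanner, whose types and signatures differ from 
this sketch.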



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
