[ 
https://issues.apache.org/jira/browse/ARROW-15281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace resolved ARROW-15281.
---------------------------------
    Fix Version/s: 8.0.0
       Resolution: Fixed

Issue resolved by pull request 12560
[https://github.com/apache/arrow/pull/12560]

> [C++] Implement ability to retrieve fragment filename
> -----------------------------------------------------
>
>                 Key: ARROW-15281
>                 URL: https://issues.apache.org/jira/browse/ARROW-15281
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nicola Crane
>            Assignee: Sanjiban Sengupta
>            Priority: Major
>              Labels: dataset, pull-request-available, query-engine
>             Fix For: 8.0.0
>
>          Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> A user has requested the ability to include the filename of the CSV in the 
> dataset output - see discussion on ARROW-15260 for more context.
> Relevant info from that ticket:
>  
> "From a C++ perspective we've got many of the pieces needed already. One 
> challenge is that the datasets API is written to work with "fragments" and 
> not "files". For example, a dataset might be an in-memory table in which case 
> we are working with InMemoryFragment and not FileFragment so there is no 
> concept of "filename".
> That being said, the low level ScanBatchesAsync method actually returns a 
> generator of TaggedRecordBatch for this very purpose. A TaggedRecordBatch is 
> a struct with the record batch as well as the source fragment for that record 
> batch.
> So if you were to execute scan, you could inspect the fragment and, if it is 
> a FileFragment, you could extract the filename.
> Another challenge is that R is moving towards more and more access through an 
> exec plan and not directly using a scanner. In order for that to work we 
> would need to augment the scan results with the filename in C++ before 
> sending into the exec plan. Luckily, we already do this a bit as well. We 
> currently augment the scan results with fragment index, batch index, and 
> whether the batch is the last batch in the fragment.
> Since ExecBatch can work with constants efficiently I don't think there will 
> be much performance cost in always including the filename. So the work 
> remaining is simply to add a new augmented field __fragment_source_name 
> which is always attached if the underlying fragment is a FileFragment. Then 
> users can get this field if they want by including "__fragment_source_name" 
> in the list of columns they query for."
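
The approach described in the quoted comment can be sketched roughly as follows. This is a minimal, self-contained illustration (C++17) and not the Arrow implementation: the Fragment, InMemoryFragment, FileFragment and TaggedRecordBatch types below are simplified stand-ins that only borrow the names from the description, and SourceName, AugmentedBatch and Augment are hypothetical helpers introduced for this example.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for the dataset types named in the description.
// These are NOT the real Arrow classes.
struct Fragment {
  virtual ~Fragment() = default;
};

// A fragment backed by an in-memory table: it has no filename.
struct InMemoryFragment : Fragment {};

// A fragment backed by a file: it carries the path it was read from.
struct FileFragment : Fragment {
  explicit FileFragment(std::string p) : path(std::move(p)) {}
  std::string path;
};

// A batch paired with the fragment it came from.
struct TaggedRecordBatch {
  std::vector<int> rows;  // placeholder for real column data
  std::shared_ptr<Fragment> fragment;
};

// Hypothetical helper: returns the filename only when the source fragment
// is file-backed, and an empty optional otherwise.
std::optional<std::string> SourceName(const TaggedRecordBatch& tagged) {
  assert(tagged.fragment != nullptr);
  if (auto* file = dynamic_cast<const FileFragment*>(tagged.fragment.get())) {
    return file->path;
  }
  return std::nullopt;
}

// Hypothetical augmented batch: the source name is stored once per batch
// (a constant), not once per row.
struct AugmentedBatch {
  std::vector<int> rows;
  std::optional<std::string> fragment_source_name;
};

// Attaches the source name to a batch before it is handed downstream.
AugmentedBatch Augment(const TaggedRecordBatch& tagged) {
  return AugmentedBatch{tagged.rows, SourceName(tagged)};
}
```

Storing the name once per batch mirrors the point made above that a constant value is cheap to carry along, so it can be attached unconditionally and ignored by consumers that do not ask for it.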



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
