jonuchauhan opened a new issue, #35773:
URL: https://github.com/apache/beam/issues/35773

   ### What would you like to happen?
   
   **Is your feature request related to a problem? Please describe.**
   While using the `ReadFromCsv` YAML transform in Apache Beam, I noticed that 
there is no way to access the source file name for each record. For data 
lineage, debugging, or downstream processing, having the originating file name 
attached to each row is very helpful.
   
   **Describe the solution you'd like**
   A parameter or option in the YAML transform, such as 
`include_filename: true`, which would add a field (e.g., `filename`) to each 
output row containing the name or path of the source CSV file.
   
   **Describe alternatives you've considered**
   - Reading files separately and adding the file name manually via additional 
steps in YAML, but this doesn't scale for many files.
   - Using the Python SDK with a custom `DoFn`, which is more flexible, but 
less declarative and harder to maintain compared to YAML.
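
To make the requested row shape concrete, here is a minimal plain-Python sketch of the per-file logic a custom `DoFn` workaround would need. It uses only the standard library and is not Beam's implementation. The helper name `rows_with_filename` and the `gs://bucket/...` paths are hypothetical; the `filename` field name follows the suggestion above. In a Beam pipeline, the path and text would plausibly come from elements produced by `fileio.MatchFiles` followed by `fileio.ReadMatches`.

```python
import csv
import io
from typing import Dict, Iterator


def rows_with_filename(
    path: str, text: str, field: str = "filename"
) -> Iterator[Dict[str, str]]:
    """Parse one CSV file's text and tag every row with its source path.

    Hypothetical helper: inside a Beam DoFn, `path` and `text` could be taken
    from a fileio.ReadMatches element (metadata.path and read_utf8()).
    """
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        tagged = dict(row)
        tagged[field] = path  # the extra column the feature request asks for
        yield tagged


# Hypothetical in-memory "files" standing in for matched CSV inputs.
files = {
    "gs://bucket/a.csv": "id,name\n1,ann\n",
    "gs://bucket/b.csv": "id,name\n2,bob\n",
}

rows = [
    row
    for path, text in files.items()
    for row in rows_with_filename(path, text)
]
```

Note that this sketch overwrites any existing CSV column that is already called `filename`, so a built-in option would probably need a configurable field name or a collision check.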
   
   **Additional context**
   - This is especially useful in data lakes or pipelines with multiple files 
as sources.
   - Similar features are supported in other data processing frameworks (like 
Spark's `input_file_name()`).
   
   **Links**
   - [ReadFromCsv YAML transform 
docs](https://beam.apache.org/releases/yamldoc/current/#readfromcsv)
   - [Related StackOverflow 
question](https://stackoverflow.com/questions/46946191/apache-beam-get-file-name-while-reading-files)
   
   
   ### Issue Priority
   
   Priority: 2 (default / most feature requests should be filed as P2)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [x] Component: Beam YAML
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Infrastructure
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner

