jonuchauhan opened a new issue, #35773: URL: https://github.com/apache/beam/issues/35773
### What would you like to happen?

**Is your feature request related to a problem? Please describe.**

While using the `ReadFromCsv` YAML transform in Apache Beam, I noticed that there is no way to access the source file name for each record. For data lineage, debugging, or downstream processing, having the originating file name attached to each row is very helpful.

**Describe the solution you'd like**

A parameter or option in the YAML transform, such as `include_filename: true`, which would add a field (e.g., `filename`) to each output row, containing the name or path of the source CSV file.

**Describe alternatives you've considered**

- Reading files separately and adding the file name manually via additional steps in YAML, but this doesn't scale for many files.
- Using the Python SDK with a custom `DoFn`, which is more flexible, but less declarative and harder to maintain compared to YAML.

**Additional context**

- This is especially useful in data lakes or pipelines with multiple files as sources.
- Similar features are supported in other data processing frameworks (like Spark's `input_file_name()`).
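For reference, the Python SDK alternative mentioned above could look roughly like the sketch below. It is only an illustration of the requested behavior, not an existing Beam feature: the helper names `rows_with_filename` and `read_csv_with_filename` and the default field name `filename` are hypothetical, and the sketch assumes each CSV file has a header row and is small enough to read into memory in one piece.

```python
import csv
import io
from typing import Dict, Iterator


def rows_with_filename(
    path: str, text: str, field: str = "filename"
) -> Iterator[Dict[str, str]]:
    """Parse CSV text (with a header row) and tag each row with its source path.

    Hypothetical helper: an existing column named `field` would be overwritten.
    """
    for row in csv.DictReader(io.StringIO(text)):
        tagged = dict(row)
        tagged[field] = path
        yield tagged


def read_csv_with_filename(pipeline, file_pattern: str):
    """Attach the Beam steps to `pipeline`.

    apache_beam is imported lazily so the parsing helper above stays
    usable (and testable) with the standard library alone.
    """
    import apache_beam as beam
    from apache_beam.io import fileio

    return (
        pipeline
        | "MatchFiles" >> fileio.MatchFiles(file_pattern)
        | "OpenFiles" >> fileio.ReadMatches()
        # Each element is a ReadableFile; read the whole file, then emit tagged rows.
        | "ParseAndTag" >> beam.FlatMap(
            lambda f: rows_with_filename(f.metadata.path, f.read_utf8())
        )
    )
```

This works, but it replaces `ReadFromCsv` with hand-written parsing and loses its options, which is why a declarative `include_filename`-style option in YAML would be preferable.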
**Links**

- [ReadFromCsv YAML transform docs](https://beam.apache.org/releases/yamldoc/current/#readfromcsv)
- [Related StackOverflow question](https://stackoverflow.com/questions/46946191/apache-beam-get-file-name-while-reading-files)

### Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

### Issue Components

- [ ] Component: Python SDK
- [ ] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [x] Component: Beam YAML
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Infrastructure
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner
