This would have to flow through to the other IO wrappers as well, perhaps outputting a KV<Filename, Record>.
I recently wrote an AvroIO parseAllGenericRecord() equivalent transform,
because I was reading files with various schemas and needed the parseFn to
know both the filename currently being read and to use some side input. It
ended up being quite complex - especially as I wanted to shard the file
read, as AvroIO already does - and I basically re-implemented part of
AvroIO for my use case.

@Chaim, one simpler option could be to use parseGenericRecord and use the
*name* of the Avro schema in the GenericRecord as a way to determine the
table name. This may mean that you have to change the way your Avro files
are being written.

On Sun, 10 Feb 2019, 07:03 Reuven Lax, <re...@google.com> wrote:

> I think we could definitely add an option to FileIO to add the filename to
> every record. It would come at a (performance) cost - often the filename
> is much larger than the actual record.
>
> On Thu, Feb 7, 2019 at 6:29 AM Kenneth Knowles <k...@apache.org> wrote:
>
>> This comes up a lot: wanting file names alongside the data that came
>> from the file. It is a historical quirk that none of our connectors used
>> to have the file names. What is the change needed for FileIO + parse
>> Avro to be really easy to use?
>>
>> Kenn
>>
>> On Thu, Feb 7, 2019 at 6:18 AM Jeff Klukas <jklu...@mozilla.com> wrote:
>>
>>> I haven't needed to do this with Beam before, but I've definitely had
>>> similar needs in the past. Spark, for example, provides an
>>> input_file_name function that can be applied to a dataframe to add the
>>> input file as an additional column. It's not clear to me how that's
>>> implemented, though.
>>>
>>> Perhaps others have suggestions, but I'm not aware of a way to do this
>>> conveniently in Beam today. To my knowledge, today you would have to
>>> use FileIO.match() and FileIO.readMatches() to get a collection of
>>> ReadableFile. You'd then have to FlatMapElements to pull out the
>>> metadata and the bytes of the file, and you'd be responsible for
>>> parsing those bytes into Avro records. You'd be able to output
>>> something like a KV<String, T> that groups the file name together with
>>> the parsed Avro record.
>>>
>>> Seems like something worth providing better support for in Beam itself
>>> if this indeed doesn't already exist.
>>>
>>> On Thu, Feb 7, 2019 at 7:29 AM Chaim Turkel <ch...@behalf.com> wrote:
>>>
>>>> Hi,
>>>> I am working on a pipeline that listens to a topic on Pub/Sub to get
>>>> files that have changed in storage. I then read the Avro files and
>>>> would like to write them to BigQuery based on the file name (to
>>>> different tables).
>>>> My problem is that the transform that reads the Avro does not give me
>>>> back the file name (as a tuple or something like that). I seem to run
>>>> into this pattern a lot.
>>>> Can you think of any solutions?
>>>>
>>>> Chaim
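[For the archive, the FileIO.match()/readMatches() approach described in the thread might be sketched roughly like this in Beam Java. This is an untested sketch: the class name, the baseName helper, and reading each file fully into memory are illustrative assumptions, not anything from the thread.]

```java
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class ReadAvroWithFilenames {

  // Plain helper (illustrative): strip the directory part so the key is
  // just the file name rather than the full gs:// path.
  static String baseName(String path) {
    int slash = path.lastIndexOf('/');
    return slash < 0 ? path : path.substring(slash + 1);
  }

  public static PCollection<KV<String, GenericRecord>> readWithFilenames(
      Pipeline p, String filepattern) {
    return p
        .apply(FileIO.match().filepattern(filepattern))
        .apply(FileIO.readMatches())
        .apply(ParDo.of(
            new DoFn<FileIO.ReadableFile, KV<String, GenericRecord>>() {
              @ProcessElement
              public void process(
                  @Element FileIO.ReadableFile file,
                  OutputReceiver<KV<String, GenericRecord>> out)
                  throws IOException {
                String name =
                    baseName(file.getMetadata().resourceId().toString());
                // Read the whole file and decode it as an Avro container
                // file; the writer's schema is taken from the file itself.
                byte[] bytes = file.readFullyAsBytes();
                try (DataFileReader<GenericRecord> reader =
                    new DataFileReader<>(
                        new SeekableByteArrayInput(bytes),
                        new GenericDatumReader<>())) {
                  for (GenericRecord record : reader) {
                    out.output(KV.of(name, record));
                  }
                }
              }
            }));
  }
}
```

[Note that this reads each matched file in a single bundle, so it loses the sharded reads AvroIO provides - exactly the complexity mentioned earlier in the thread. You would also need to set a coder on the output, e.g. a KvCoder of StringUtf8Coder and an AvroCoder with a known schema.]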
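[Once each record is paired with its filename, the per-table routing Chaim asks about can use BigQueryIO's per-element table destinations. Another untested sketch: the my_dataset prefix, the filename-to-table mapping, and a prior conversion of each record to TableRow are all illustrative assumptions.]

```java
import com.google.api.services.bigquery.model.TableRow;

import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class RouteByFilename {

  // Hypothetical mapping: "orders-00001.avro" -> "my_dataset.orders".
  // Strips the .avro extension and any trailing numeric shard suffix.
  static String tableForFile(String filename) {
    String base = filename
        .replaceAll("\\.avro$", "")
        .replaceAll("-\\d+$", "");
    return "my_dataset." + base;
  }

  // Input: each element is (filename, row already converted to TableRow).
  static void writeRouted(PCollection<KV<String, TableRow>> rows) {
    rows.apply(BigQueryIO.<KV<String, TableRow>>write()
        // The destination function is evaluated per element, so each
        // record lands in the table derived from its source file.
        .to(value -> new TableDestination(
            tableForFile(value.getValue().getKey()), null))
        .withFormatFunction(KV::getValue));
  }
}
```

[Alternatively, per the parseGenericRecord suggestion earlier in the thread, the destination could be derived from record.getSchema().getName() instead of the filename.]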