The same holds true in Python: read the files with TextIO (ReadFromText in the Python SDK) and follow with a Map operation that splits each line into a record.
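For example, a minimal sketch of that approach (the bucket path and the header option are placeholders, not something from this thread):

    import csv
    import apache_beam as beam

    with beam.Pipeline() as p:
        records = (
            p
            # File pattern is a placeholder; ReadFromText emits one
            # element per line, with the trailing newline stripped.
            | beam.io.ReadFromText('gs://my-bucket/data/*.csv',
                                   skip_header_lines=1)
            # Parse each line with the stdlib csv module rather than a
            # naive split(','), so quoted fields containing commas survive.
            | beam.Map(lambda line: next(csv.reader([line])))
        )

Parsing line-by-line this way copes with quoted commas, but not with quoted newlines, which is exactly the caveat below.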
This, of course, only works if you don't have newlines within your records. If you do, you may need a DoFn that takes each filename as input and reads the entire file (e.g. using the standard library's csv parser), emitting the records, possibly followed by a Reshuffle, e.g.

    (p
     | beam.Create([list of filenames])
     | beam.FlatMap(lambda path: csv.reader(open(path)))
     | beam.Reshuffle()
     | ...)

(A runnable version of this sketch follows the quoted thread below.)

If your files are too big to read in a single mapper *and* have newlines, you may have to implement something like
https://blog.etleap.com/2016/11/27/distributed-csv-parsing/

On Sun, Nov 25, 2018 at 2:29 PM Unais T <[email protected]> wrote:

> Python
>
> On Sun, Nov 25, 2018 at 4:54 PM Jean-Baptiste Onofré <[email protected]>
> wrote:
>
>> Hi Unais,
>>
>> What SDK do you plan to use? Java or Python?
>>
>> Regarding Java, I would use TextIO directly.
>>
>> Regards
>> JB
>>
>> On 25/11/2018 13:09, Unais T wrote:
>> > Hey guys,
>> >
>> > One doubt:
>> >
>> > I want to read a CSV file from Google Cloud Storage into Dataflow.
>> > Which is the best method?
>> >
>> > 1. Read the CSV, sync it to BQ, and then use the BigQuerySource method.
>> > 2. Read from Cloud Storage directly into Dataflow (is there any source
>> >    method for CSV from Cloud Storage, like `ReadFromText`?).
>> >
>> > What's the best way to read a CSV from Cloud Storage into Dataflow?
>>
>> --
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
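As promised above, a runnable version of the per-file sketch (the file names are placeholders; note that a plain open(path) only reads local files, so this swaps in Beam's filesystem-agnostic FileSystems.open, which understands gs:// paths as well):

    import csv
    import io
    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems

    def read_csv_file(path):
        # FileSystems.open works for gs:// as well as local paths, but
        # returns a binary stream, hence the TextIOWrapper.
        with FileSystems.open(path) as raw:
            for record in csv.reader(io.TextIOWrapper(raw, encoding='utf-8')):
                yield record

    with beam.Pipeline() as p:
        records = (
            p
            # Placeholder file names; any iterable of paths works here.
            | beam.Create(['gs://my-bucket/a.csv', 'gs://my-bucket/b.csv'])
            # Each file is read whole by a single worker, so newlines
            # embedded in quoted fields are parsed correctly.
            | beam.FlatMap(read_csv_file)
            # Break fusion so downstream work spreads across workers.
            | beam.Reshuffle()
        )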
