The same holds true in Python: read the files with ReadFromText (the Python
counterpart of TextIO) and follow with a Map operation that splits each line
into a record.
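
For example, a minimal sketch (the bucket path and the skip_header_lines
value below are assumptions, not from this thread):

import csv

import apache_beam as beam

with beam.Pipeline() as p:
    records = (
        p
        # Hypothetical GCS path; any file pattern ReadFromText accepts works.
        | beam.io.ReadFromText('gs://my-bucket/data.csv', skip_header_lines=1)
        # Each element is one line of text; parse it as a one-row CSV.
        | beam.Map(lambda line: next(csv.reader([line]))))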

This, of course, only works if you don't have newlines within your records.
In that case, you may need to use a DoFn that takes each filename as input
and reads the entire file (e.g. using the standard library csv parser),
emitting the records (possibly followed by a Reshuffle), e.g.

(p
 | beam.Create([list of filenames])
 # Note: the builtin open() only works for local paths; for GCS, use
 # beam.io.filesystems.FileSystems.open (see the fuller sketch below).
 | beam.FlatMap(lambda path: csv.reader(open(path)))
 | beam.Reshuffle()
 | ...)
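
A fuller sketch of the same pattern, assuming the files live on GCS (the
bucket and file names below are hypothetical) and using FileSystems.open so
the same code handles both local and gs:// paths:

import csv
import io

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

def read_csv_records(path):
    # FileSystems.open understands gs:// (and local) paths and returns a
    # binary stream; wrap it so the csv module sees text.
    with FileSystems.open(path) as f:
        for record in csv.reader(io.TextIOWrapper(f)):
            yield record

with beam.Pipeline() as p:
    records = (
        p
        | beam.Create(['gs://my-bucket/a.csv', 'gs://my-bucket/b.csv'])
        | beam.FlatMap(read_csv_records)
        | beam.Reshuffle())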

If your files are too big to read in a single mapper *and* contain newlines
within records, you may have to implement something like
https://blog.etleap.com/2016/11/27/distributed-csv-parsing/


On Sun, Nov 25, 2018 at 2:29 PM Unais T <[email protected]> wrote:

> Python
>
> On Sun, Nov 25, 2018 at 4:54 PM Jean-Baptiste Onofré <[email protected]>
> wrote:
>
>> Hi Unais,
>>
>> What SDK do you plan to use? Java or Python?
>>
>> Regarding Java, I would use TextIO directly.
>>
>> Regards
>> JB
>>
>> On 25/11/2018 13:09, Unais T wrote:
>> > Hey guys,
>> >
>> > A quick question:
>> >
>> > I want to read a CSV file from Google Cloud Storage into Dataflow.
>> > Which is the best method?
>> >
>> > 1.   Read the CSV, sync it to BQ, and then use the BigQuerySource method
>> > 2.   Read from Cloud Storage directly into Dataflow (Is there any source
>> > method for CSV from Cloud Storage - like `ReadFromText`?)
>> >
>> > What's the best way to read a CSV from Cloud Storage into Dataflow?
>>
>> --
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
