Hi Ankur,

Thank you. I am trying this approach now and will let you know if I have any
issues implementing it.
Best,
Eila

On Tue, Sep 25, 2018 at 7:19 PM Ankur Goenka <[email protected]> wrote:

> Hi Eila,
>
> If I understand correctly, the objective is to read a large number of CSV
> files, each of which contains a single data row with multiple columns,
> deduplicate the columns across the files, and write them to BQ.
> You are using a pandas DataFrame to deduplicate the columns for a small set
> of files, which might not work for a large number of files.
>
> You can use Beam's GroupByKey to deduplicate the columns and write them to
> BigQuery. Beam is capable of reading and managing a large number of files
> when given the path to the directory containing those files.
> So the approach would be:
> Read files => Parse lines => Generate a PCollection of (column, value)
> pairs => GroupBy column name => Write to BQ
> For reference, here is an example of reading a file and doing a GroupBy:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py
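>
> A rough, untested sketch of that flow in the Python SDK could look like the
> following. The file pattern, the parse logic, and the use of the fileio
> module are placeholders/assumptions, and the BQ write is left out (see the
> note below):
>
> import csv
>
> import apache_beam as beam
> from apache_beam.io import fileio
> from apache_beam.options.pipeline_options import PipelineOptions
>
>
> def file_to_column_pairs(readable_file):
>     # Each file is expected to hold one header row plus one data row;
>     # emit a (column_name, value) pair for every column in the file.
>     for row in csv.DictReader(readable_file.read_utf8().splitlines()):
>         for column, value in row.items():
>             yield (column, value)
>
>
> with beam.Pipeline(options=PipelineOptions()) as p:
>     _ = (
>         p
>         | 'MatchFiles' >> fileio.MatchFiles('gs://my-bucket/csvs/*.csv')  # placeholder path
>         | 'ReadMatches' >> fileio.ReadMatches()
>         | 'ParseLines' >> beam.FlatMap(file_to_column_pairs)
>         | 'GroupByColumnName' >> beam.GroupByKey()  # one entry per distinct column
>     )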
>
> Note: I am not very familiar with BQ, so I can't think of any direct
> approach to dump the data to BQ.
>
> Thanks,
> Ankur
>
>
> On Tue, Sep 25, 2018 at 12:13 PM OrielResearch Eila Arich-Landkof <
> [email protected]> wrote:
>
>> Hello,
>> I would like to write a large number of CSV files to BQ, where the headers
>> from all of them are aggregated into one common header. Any advice is much
>> appreciated.
>>
>> The details are:
>> 1. 2.5M CSV files
>> 2. Each CSV file: a header of 50-60 columns
>> 3. Each CSV file: one data row
>>
>> There are common columns between the CSV files, but I don't know them in
>> advance. I would like to have all the CSV files in one BigQuery table.
>>
>> My current method:
>> When there was a smaller number of files, I read the CSV files and appended
>> them to one pandas DataFrame that was written to a file (total.csv).
>> total.csv was the input to the Beam pipeline.
>>
>> small CSVs => Pandas DF => total CSV => pCollection => Big Query
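>>
>> In code, that in-memory step looks roughly like this (simplified; the
>> paths are placeholders):
>>
>> import glob
>>
>> import pandas as pd
>>
>> # Load every small CSV (header + one data row) into memory, concatenate
>> # into a single frame (pandas unions the differing headers, filling the
>> # missing columns with NaN), and write the total.csv that feeds Beam.
>> frames = [pd.read_csv(path) for path in glob.glob('csvs/*.csv')]
>> total = pd.concat(frames, ignore_index=True, sort=False)
>> total.to_csv('total.csv', index=False)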
>>
>> The challenge with that approach is that pandas requires a large amount of
>> memory to hold the 2.5M CSV files before writing them to BQ.
>>
>> Is there a different way to pipe the CSVs to BQ? One option would be to
>> split the CSVs into batches and write them to different BQ tables, or
>> append them to one table.
>>
>> Any thoughts on how to do this without extra coding?
>>
>> Many thanks,
>> --
>> Eila
>> www.orielresearch.org
>> https://www.meetup.com/Deep-Learning-In-Production/
>>
>>
>>

-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/
