Hi Ankur, Thank you. Trying this approach now. Will let you know if I have any issue implementing it. Best, Eila
On Tue, Sep 25, 2018 at 7:19 PM Ankur Goenka <[email protected]> wrote:
> Hi Eila,
>
> If I understand correctly, the objective is to read a large number of CSV
> files, each of which contains a single row with multiple columns,
> deduplicate the columns across the files, and write them to BQ.
> You are using a pandas DF to deduplicate the columns for a small set of
> files, which might not work for a large number of files.
>
> You can use Beam's GroupBy to deduplicate the columns and write them to
> BigQuery. Beam is capable of reading and managing a large number of files
> when given the path to the directory containing those files.
> So the approach would be:
> Read files => Parse lines => Generate a pCollection for each column =>
> GroupBy column name => Write to BQ
> For reference, here is an example of reading a file and doing a GroupBy:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py
>
> Note: I am not very familiar with BQ, so I can't think of any direct
> approach to dump data to BQ.
>
> Thanks,
> Ankur
>
>
> On Tue, Sep 25, 2018 at 12:13 PM OrielResearch Eila Arich-Landkof <
> [email protected]> wrote:
>
>> Hello,
>> I would like to write a large number of CSV files to BQ, where the
>> headers from all of them are aggregated into one common header. Any
>> advice is very appreciated.
>>
>> The details are:
>> 1. 2.5M CSV files
>> 2. Each CSV file: a header of 50-60 columns
>> 3. Each CSV file: one data row
>>
>> There are common columns between the CSV files, but I don't know them in
>> advance. I would like to have all the CSV files in one BigQuery table.
>>
>> My current method:
>> When the number of files was smaller, I read the CSV files and appended
>> them to one pandas dataframe that was written to a single file
>> (total.csv). total.csv was the input to the Beam pipeline.
>>
>> small CSVs => pandas DF => total CSV => pCollection => BigQuery
>>
>> The challenge with that approach is that pandas would require a large
>> amount of memory to hold the 2.5M CSV files before writing them to BQ.
>>
>> Is there a different way to pipe the CSVs to BQ? One option would be to
>> split the CSVs into batches and write them to different BQ tables, or
>> append them to one table.
>>
>> Any thoughts on how to do it without extra coding?
>>
>> Many thanks,
>> --
>> Eila
>> www.orielresearch.org
>> https://www.meetup.com/Deep-Learning-In-Production/

--
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/
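
A minimal sketch of the approach Ankur outlines above (read files => parse => group/deduplicate by column name => write to BQ), using the Beam Python SDK with the GCP extras (apache-beam[gcp]). The bucket path, project/dataset/table names, and the schema string are placeholders, not values from this thread, and the step that turns the deduplicated column names into the BigQuery schema is only indicated, not implemented:

import csv
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
from apache_beam.options.pipeline_options import PipelineOptions


def read_one_csv(path):
    """Read a single small CSV (one header line + one data line) and
    return a dict mapping column name -> value, i.e. one BQ row."""
    with FileSystems.open(path) as f:
        lines = f.read().decode('utf-8').splitlines()
    reader = csv.reader(lines)
    header = next(reader)
    values = next(reader)
    return dict(zip(header, values))


options = PipelineOptions()  # add --runner, --project, --temp_location, etc.
with beam.Pipeline(options=options) as p:
    # Expand the glob once at construction time and fan the matched paths
    # out as a PCollection so the files are read in parallel by the workers.
    # 'gs://my-bucket/csvs/*.csv' is a placeholder pattern.
    file_paths = [m.path for m in
                  FileSystems.match(['gs://my-bucket/csvs/*.csv'])[0].metadata_list]

    rows = (p
            | 'CreatePaths' >> beam.Create(file_paths)
            | 'ReadAndParse' >> beam.Map(read_one_csv))

    # The dedup/GroupBy step: collect the union of distinct column names
    # across all files. Inspecting this output gives the common header the
    # BQ table needs; here it is just written out as a text file.
    (rows
     | 'ExtractColumns' >> beam.FlatMap(lambda row: row.keys())
     | 'DedupColumns' >> beam.Distinct()
     | 'WriteColumnList' >> beam.io.WriteToText('gs://my-bucket/columns'))

    # WriteToBigQuery needs a schema covering the union of all column
    # names, e.g. 'colA:STRING,colB:STRING,...'; the value below is a
    # placeholder and would be derived from the column list above.
    rows | 'WriteToBQ' >> beam.io.WriteToBigQuery(
        'my-project:my_dataset.my_table',
        schema='colA:STRING,colB:STRING',
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)

Because each file becomes one row dict keyed by its own header, files with missing columns simply leave those BQ fields NULL, which avoids holding all 2.5M rows in memory the way the pandas approach does.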
