Hi Eila,

If I understand correctly, the objective is to read a large number of CSV files, each of which contains a single row with multiple columns, deduplicate the columns across files, and write them to BQ. You are using a pandas DF to deduplicate the columns, which works for a small set of files but might not work for a large number of files.

You can use Beam's GroupByKey to deduplicate the columns and write them to BigQuery. Beam is capable of reading and managing a large number of files when given the path to the directory containing those files. So the approach would be:

Read files => Parse lines => Generate (column, value) pairs for each column => GroupByKey on column name => Write to BQ
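Something along these lines should work with the Python SDK. This is an untested sketch: the gs://my-bucket/csvs/*.csv pattern is a placeholder for your actual location, and it assumes every file holds exactly one header line and one data line.

import csv

import apache_beam as beam
from apache_beam.io import fileio

def to_column_pairs(readable_file):
    # Assumes each file holds exactly two lines: a header row and a data row.
    lines = readable_file.read_utf8().splitlines()
    header = next(csv.reader([lines[0]]))
    row = next(csv.reader([lines[1]]))
    return list(zip(header, row))  # [(column_name, value), ...]

with beam.Pipeline() as p:
    grouped = (
        p
        | 'MatchFiles' >> fileio.MatchFiles('gs://my-bucket/csvs/*.csv')
        | 'ReadFiles' >> fileio.ReadMatches()
        | 'ToColumnPairs' >> beam.FlatMap(to_column_pairs)
        | 'GroupByColumn' >> beam.GroupByKey()  # -> (column_name, [values])
    )
    # From here the grouped values could be reshaped into rows and written
    # out, e.g. with beam.io.WriteToBigQuery (schema handling not shown).

Because the runner reads and groups the files in parallel, nothing has to hold all 2.5M files in memory the way a single pandas DataFrame does.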
For reference, here is an example of reading a file and doing a group-by:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py

Note: I am not very familiar with BQ, so I can't think of any direct approach to dump the data to BQ.

Thanks,
Ankur

On Tue, Sep 25, 2018 at 12:13 PM OrielResearch Eila Arich-Landkof <[email protected]> wrote:

> Hello,
>
> I would like to write a large number of CSV files to BQ, where the headers
> from all of them are aggregated into one common header. Any advice is very
> appreciated.
>
> The details are:
> 1. 2.5M CSV files
> 2. Each CSV file: header of 50-60 columns
> 3. Each CSV file: one data row
>
> There are common columns between the CSV files, but I don't know them in
> advance. I would like to have all the CSV files in one BigQuery table.
>
> My current method:
> When there was a smaller number of files, I read the CSV files and appended
> them to one pandas dataframe that was written to a file (total.csv).
> total.csv was the input to the Beam pipeline.
>
> small CSVs => Pandas DF => total CSV => PCollection => BigQuery
>
> The challenge with that approach is that pandas will require a large
> amount of memory in order to hold the 2.5M CSV files before writing them
> to BQ.
>
> Is there a different way to pipe the CSVs to BQ? One option would be to
> split the CSVs into batches and write them to different BQ tables, or
> append them to one table.
>
> Any thoughts on how to do it without extra coding?
>
> Many thanks,
> --
> Eila
> www.orielresearch.org
> https://www.meetup.com/Deep-Learning-In-Production/
