Re: Advice for piping many CSVs with different columns names to one bigQuery table

2018-09-28 Thread Ziyad Muhammed
Hi Eila, I'm not sure I understand the complexity of your problem. If you do not have to perform any transformation on the data inside the CSVs and just need to load them into BigQuery, isn't it enough to use bq load with schema autodetect? https://cloud.google.com/bigquery/docs/schema-detect Best Z

Re: Advice for piping many CSVs with different columns names to one bigQuery table

2018-09-27 Thread OrielResearch Eila Arich-Landkof
Thank you! Probably around 50.
Best, Eila

On Thu, Sep 27, 2018 at 1:23 AM Ankur Goenka wrote:
> Hi Eila,
>
> That seems reasonable to me.
>
> Here is a reference on writing to BQ
> https://github.com/apache/beam/blob/1ffba44f7459307f5a134b8f4ea47ddc5ca8affc/sdks/python/apache_beam/examples/comp

Re: Advice for piping many CSVs with different columns names to one bigQuery table

2018-09-26 Thread Ankur Goenka
Hi Eila, That seems reasonable to me. Here is a reference on writing to BQ: https://github.com/apache/beam/blob/1ffba44f7459307f5a134b8f4ea47ddc5ca8affc/sdks/python/apache_beam/examples/complete/game/leader_board.py#L326 May I know how many distinct columns you are expecting across all files? On
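
For readers following the linked example: Beam's Python `WriteToBigQuery` accepts a table schema as a comma-separated string of `name:TYPE` pairs. A minimal sketch of building such a string from the merged column names is below. The helper names, the sample column names, and the choice to load every column as `STRING` are assumptions for illustration, not something stated in the thread. The cleanup rule follows the conservative BigQuery naming convention of letters, digits, and underscores, not starting with a digit.

```python
import re


def to_bq_column(name):
    """Map an arbitrary CSV header to a conservative BigQuery column name."""
    # Replace anything that is not a letter, digit, or underscore.
    cleaned = re.sub(r"[^0-9A-Za-z_]", "_", name.strip())
    # Prefix an underscore if the name is empty or starts with a digit.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def schema_string(column_names, field_type="STRING"):
    """Build a 'name:TYPE,name:TYPE' schema string, skipping repeated names."""
    fields = []
    for name in column_names:
        column = to_bq_column(name)
        if column not in fields:
            fields.append(column)
    return ",".join(f"{column}:{field_type}" for column in fields)


# Hypothetical headers; real ones would come from the CSV files.
schema = schema_string(["sample id", "gene-a", "2nd_read", "gene-a"])
```

With roughly 50 distinct columns, a string built this way stays short enough to pass directly as the `schema` argument.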

Re: Advice for piping many CSVs with different columns names to one bigQuery table

2018-09-26 Thread OrielResearch Eila Arich-Landkof
Hi Ankur / users,
I would like to make sure that the suggested pipeline can work for my needs. So, additional clarification:
- The CSV files have a few common and a few different columns. Each CSV file represents a sample measurement record.
- When the CSVs are merged together, I expect to have one tabl

Re: Advice for piping many CSVs with different columns names to one bigQuery table

2018-09-26 Thread OrielResearch Eila Arich-Landkof
Hi Ankur,
Thank you. I am trying this approach now and will let you know if I have any issues implementing it.
Best, Eila

On Tue, Sep 25, 2018 at 7:19 PM Ankur Goenka wrote:
> Hi Eila,
>
> If I understand correctly, the objective is to read a large number of CSV
> files, each of which contains a single

Re: Advice for piping many CSVs with different columns names to one bigQuery table

2018-09-25 Thread Ankur Goenka
Hi Eila, If I understand correctly, the objective is to read a large number of CSV files, each of which contains a single row with multiple columns, deduplicate the columns in the file, and write them to BQ. You are using a pandas DataFrame to deduplicate the columns for a small set of files which might not
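
The per-file step described here (one header line, one data row, repeated column names collapsed) can be sketched with the standard library alone. The function name and the sample data are hypothetical, and keeping the first value when a column name repeats is an assumption, since the thread does not say how repeats should be resolved. A function of this shape could serve as the per-element mapping in a pipeline.

```python
import csv
import io


def single_row_csv_to_dict(text):
    """Parse a header line plus one data line into {column: value}.

    If a column name repeats within the file, the first value is kept
    (an assumption; the thread does not specify the rule).
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    values = next(reader)
    record = {}
    for name, value in zip(header, values):
        # setdefault leaves an existing key untouched, so the first value wins.
        record.setdefault(name.strip(), value)
    return record


# Hypothetical file content with a repeated "tissue" column.
example = "sample_id,tissue,tissue,age\ns1,liver,kidney,42\n"
record = single_row_csv_to_dict(example)
```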

Advice for piping many CSVs with different columns names to one bigQuery table

2018-09-25 Thread OrielResearch Eila Arich-Landkof
Hello,
I would like to write a large number of CSV files to BQ, where the headers from all of them are aggregated into one common header. Any advice is much appreciated. The details are:
1. 2.5M CSV files
2. Each CSV file: header of 50-60 columns
3. Each CSV file: one data row
There are common columns
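
A small sketch of the header aggregation asked about above, using only the standard library: a first pass collects every column name in first-seen order, and a second pass emits each data row against that common header, with `None` where a file lacks a column. The function names and sample contents are made up for illustration. At 2.5M files this two-pass idea would need to run in a distributed pipeline, not in one process.

```python
import csv
import io


def union_header(csv_texts):
    """Return every column name seen across the inputs, in first-seen order."""
    seen = {}
    for text in csv_texts:
        header = next(csv.reader(io.StringIO(text)), [])
        for name in header:
            seen.setdefault(name, None)
    return list(seen)


def normalize_rows(csv_texts, columns):
    """Yield one dict per data row, with None for columns a file lacks."""
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            yield {name: row.get(name) for name in columns}


# Two hypothetical single-row files sharing some columns.
samples = [
    "sample_id,gene_a,gene_b\ns1,0.5,1.2\n",
    "sample_id,gene_b,gene_c\ns2,3.4,7.0\n",
]
columns = union_header(samples)
rows = list(normalize_rows(samples, columns))
```

Each resulting dict has the same keys, so every row fits a single table whose columns are the union of all headers.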