Hello, I would like to write a large number of CSV files to BigQuery, aggregating the headers from all of them into one common header. Any advice is much appreciated.
The details are:
1. 2.5M CSV files.
2. Each CSV file has a header of 50-60 columns.
3. Each CSV file has one data row.

There are common columns between the CSV files, but I don't know them in advance. I would like to have all the CSV files in one BigQuery table.

My current method: when the number of files was smaller, I read the CSV files and appended them to one pandas DataFrame, which was written to a file (total.csv). total.csv was then the input to the Beam pipeline:

small CSVs => pandas DF => total.csv => PCollection => BigQuery

The challenge with that approach is that pandas requires a large amount of memory to hold the 2.5M CSV files before writing them to BQ. Is there a different way to pipe the CSVs to BQ? One option would be to split the CSVs into batches and write them to different BQ tables, or append them to one table. Any thoughts on how to do this without extra coding?

Many thanks,
-- Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/
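For concreteness, the current method described above can be sketched roughly as follows (a minimal illustration only; the glob pattern and function name are assumptions, and this is exactly the step that runs out of memory at 2.5M files, since every file is held in RAM at once):

```python
import glob

import pandas as pd


def build_total_csv(pattern: str, out_path: str) -> pd.DataFrame:
    """Concatenate many small CSVs (one header row, one data row each)
    into a single total.csv whose header is the union of all headers."""
    # Read every per-file CSV into its own DataFrame.
    frames = [pd.read_csv(path) for path in sorted(glob.glob(pattern))]
    # pd.concat aligns common columns by name and fills columns that are
    # missing from a given file with NaN, so the result has the
    # aggregated (union) header.
    total = pd.concat(frames, ignore_index=True, sort=False)
    total.to_csv(out_path, index=False)
    return total
```

The memory problem is that `frames` and `total` both live in RAM before anything is written out, which is why this works for a small number of files but not for 2.5M of them.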
