Hi Eila,

That seems reasonable to me.

Here is a reference on writing to BQ:
https://github.com/apache/beam/blob/1ffba44f7459307f5a134b8f4ea47ddc5ca8affc/sdks/python/apache_beam/examples/complete/game/leader_board.py#L326

May I know how many distinct columns you are expecting across all files?
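To illustrate that write step, here is a minimal sketch (an assumption of how the records might look, not code from the example) that treats each sample as a dict keyed by column name and declares every field as a nullable STRING. The table spec and column names below are placeholders:

    import apache_beam as beam

    # Hypothetical combined column list gathered across all CSV files;
    # every field is declared as a STRING for simplicity.
    all_columns = ['sample_id', 'gene_a', 'gene_b']
    schema = ','.join('%s:STRING' % name for name in all_columns)

    with beam.Pipeline() as p:
        (p
         | 'CreateRows' >> beam.Create([
             {'sample_id': 's1', 'gene_a': '0.5'},   # columns missing from a row
             {'sample_id': 's2', 'gene_b': '1.2'}])  # are written as NULL
         | 'WriteToBQ' >> beam.io.WriteToBigQuery(
             table='my_project:my_dataset.samples',  # placeholder table spec
             schema=schema,
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

Rows that are missing some of the columns are simply written with NULL in those fields, which is why knowing the full set of distinct columns up front matters.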
On Wed, Sep 26, 2018 at 8:06 PM OrielResearch Eila Arich-Landkof <[email protected]> wrote:

> Hi Ankur / users,
>
> I would like to make sure that the suggested pipeline can work for my
> needs.
>
> So, additional clarification:
> - The CSV files have a few common and a few different columns. Each CSV
> file represents a sample's measurement record.
>
> - When the CSVs are merged together, I expect to have one table with the
> combined columns from all samples. Will your suggested pipeline allow that?
>
> - If I understand the following pipeline correctly:
>
> *Read files => Parse lines => Generate pCollections for each column =>
> GroupBy column name => Write to BQ*
>
> *Read files:* will generate a pCollection from each CSV line (= file)
> *Parse lines:* will extract the column name and its matching value
> (generate a tuple)
> *Generate pCollection for each column:* will generate a pCollection from
> the tuple
> *GroupBy:* will merge each column name with all the relevant samples'
> values (does it need to know the column names to group by, or will it
> automatically use the tuple key?)
> *WriteToBQ:* will NOT (is that true?) be able to write the values
> matching the relevant sample. Samples that didn't have a value for a
> specific key will get values from other samples that are available in
> the column's collection record.
>
> - If the above understanding is correct, would extracting & merging the
> headers from all 2.5M CSV files in advance to derive the schema, and then
> using an additional pipeline to populate the data into BQ with that schema,
> be the "right" approach?
>
> Please let me know if I miss something here / what your thoughts are.
> Many thanks,
> Eila
>
> On Wed, Sep 26, 2018 at 12:04 PM OrielResearch Eila Arich-Landkof <
> [email protected]> wrote:
>
>> Hi Ankur,
>>
>> Thank you. Trying this approach now. Will let you know if I have any
>> issue implementing it.
>> Best,
>> Eila
>>
>> On Tue, Sep 25, 2018 at 7:19 PM Ankur Goenka <[email protected]> wrote:
>>
>>> Hi Eila,
>>>
>>> If I understand correctly, the objective is to read a large number of
>>> CSV files, each of which contains a single row with multiple columns,
>>> deduplicate the columns in the files, and write them to BQ.
>>> You are using a pandas DF to deduplicate the columns for a small set of
>>> files, which might not work for a large number of files.
>>>
>>> You can use a Beam GroupBy to deduplicate the columns and write them to
>>> BigQuery. Beam is capable of reading and managing a large number of files
>>> when given the path to the directory containing those files.
>>> So the approach would be:
>>> Read files => Parse lines => Generate pCollections for each column =>
>>> GroupBy column name => Write to BQ
>>> For reference, here is an example of reading files and doing a GroupBy:
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py
>>>
>>> Note: I am not very familiar with BQ, so I can't think of any direct
>>> approach to dump data to BQ.
>>>
>>> Thanks,
>>> Ankur
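To make the Read files => Parse lines => GroupBy column name steps above concrete, here is a minimal sketch under the assumption that every CSV holds one header line plus one data row; the gs://my-bucket/samples/*.csv pattern is a placeholder. How to turn the per-column groups back into per-sample rows for BQ is exactly the open question in this thread:

    import csv

    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems


    class EmitColumnValuePairs(beam.DoFn):
        """Reads one single-row CSV file and yields (column_name, value) tuples."""

        def process(self, path):
            with FileSystems.open(path) as f:
                header_line, data_line = f.read().decode('utf-8').splitlines()[:2]
            columns = next(csv.reader([header_line]))
            values = next(csv.reader([data_line]))
            for column, value in zip(columns, values):
                yield column, value


    # Placeholder pattern; FileSystems.match expands the glob to concrete paths.
    pattern = 'gs://my-bucket/samples/*.csv'
    paths = [m.path for m in FileSystems.match([pattern])[0].metadata_list]

    with beam.Pipeline() as p:
        per_column = (p
                      | 'FilePaths' >> beam.Create(paths)
                      | 'ParseLines' >> beam.ParDo(EmitColumnValuePairs())
                      | 'GroupByColumn' >> beam.GroupByKey())
        # per_column is a PCollection of (column_name, iterable_of_values) pairs.

(In newer SDKs the glob expansion can also be done inside the pipeline with apache_beam.io.fileio.MatchFiles, which avoids materializing 2.5M paths on the launcher machine.)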
>>> On Tue, Sep 25, 2018 at 12:13 PM OrielResearch Eila Arich-Landkof <
>>> [email protected]> wrote:
>>>
>>>> Hello,
>>>> I would like to write a large number of CSV files to BQ where the
>>>> headers from all of them are aggregated into one common header. Any
>>>> advice is very appreciated.
>>>>
>>>> The details are:
>>>> 1. 2.5M CSV files
>>>> 2. Each CSV file: header of 50-60 columns
>>>> 3. Each CSV file: one data row
>>>>
>>>> There are common columns between the CSV files but I don't know them in
>>>> advance. I would like to have all the CSV files in one BigQuery table.
>>>>
>>>> My current method:
>>>> When it was a smaller number of files, I read the CSV files and appended
>>>> them to one pandas dataframe that was written to a file (total.csv).
>>>> total.csv was the input to the Beam pipeline.
>>>>
>>>> small CSVs => Pandas DF => total CSV => pCollection => BigQuery
>>>>
>>>> The challenge with that approach is that pandas will require a large
>>>> amount of memory in order to hold the 2.5M CSV files before writing
>>>> them to BQ.
>>>>
>>>> Is there a different way to pipe the CSVs to BQ? One option would be to
>>>> split the CSVs into batches and write them to different BQ tables, or
>>>> append them to one table.
>>>>
>>>> Any thoughts on how to do it without extra coding?
>>>>
>>>> Many thanks,
>>>> --
>>>> Eila
>>>> www.orielresearch.org
>>>> https://www.meetup.com/Deep-Learning-In-Production/
>>>>
>>
>> --
>> Eila
>> www.orielresearch.org
>> https://www.meetup.com/Deep-Learning-In-Production/
>>
>
> --
> Eila
> www.orielresearch.org
> https://www.meetup.com/Deep-Learning-In-Production/
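As a footnote to the schema question raised above, a minimal sketch of a possible first pass that collects the union of column names across all files, assuming each file's first line is its header; the input pattern and output path are placeholders. The resulting column list could then feed the BigQuery schema for a second pipeline that writes the actual rows, as in the earlier write sketch:

    import csv

    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems


    def read_header(path):
        """Returns the list of column names from the first line of one CSV file."""
        with FileSystems.open(path) as f:
            first_line = f.readline().decode('utf-8')
        return next(csv.reader([first_line]))


    pattern = 'gs://my-bucket/samples/*.csv'   # placeholder input pattern
    paths = [m.path for m in FileSystems.match([pattern])[0].metadata_list]

    with beam.Pipeline() as p:
        (p
         | 'FilePaths' >> beam.Create(paths)
         | 'ExtractColumns' >> beam.FlatMap(read_header)  # one element per column name
         | 'Dedup' >> beam.Distinct()
         | 'WriteColumns' >> beam.io.WriteToText('gs://my-bucket/combined_columns'))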
