Thank you! Probably around 50.
Best,
Eila
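As a rough illustration of the "derive the schema first, then load" idea discussed in the thread below, here is a minimal, untested sketch. It assumes the ~50 distinct column names have already been collected; the bucket path, project/dataset/table names, and the whole-file read via Beam's fileio transforms are placeholders rather than anything from this thread:

# Untested sketch: load the one-row CSVs into a single BigQuery table once
# the union of column names (~50) is known. All names/paths are made up.
import csv

import apache_beam as beam
from apache_beam.io import fileio

# Assumed to have been derived in an earlier pass over the CSV headers.
ALL_COLUMNS = ['sample_id', 'col_a', 'col_b']  # ...~50 names in practice
BQ_SCHEMA = ','.join('{}:STRING'.format(c) for c in ALL_COLUMNS)


def csv_file_to_row(readable_file):
    """Parse one whole single-row CSV file into a BigQuery row dict."""
    header_line, data_line = readable_file.read_utf8().splitlines()[:2]
    header = next(csv.reader([header_line]))
    values = next(csv.reader([data_line]))
    # Columns missing from this file are simply absent and load as NULL.
    return dict(zip(header, values))


with beam.Pipeline() as p:
    (p
     | 'MatchCSVs' >> fileio.MatchFiles('gs://my-bucket/csvs/*.csv')
     | 'ReadEachFile' >> fileio.ReadMatches()
     | 'ParseOneRowFile' >> beam.Map(csv_file_to_row)
     | 'WriteToBQ' >> beam.io.WriteToBigQuery(
           table='my-project:my_dataset.samples',
           schema=BQ_SCHEMA,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

Because each file holds only one data row, reading the whole file per element keeps each header together with its values, which a plain line-oriented read would not do.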
On Thu, Sep 27, 2018 at 1:23 AM Ankur Goenka <[email protected]> wrote:

> Hi Eila,
>
> That seems reasonable to me.
>
> Here is a reference on writing to BQ:
> https://github.com/apache/beam/blob/1ffba44f7459307f5a134b8f4ea47ddc5ca8affc/sdks/python/apache_beam/examples/complete/game/leader_board.py#L326
>
> May I know how many distinct columns you are expecting across all files?
>
> On Wed, Sep 26, 2018 at 8:06 PM OrielResearch Eila Arich-Landkof
> <[email protected]> wrote:
>
>> Hi Ankur / users,
>>
>> I would like to make sure that the suggested pipeline can work for my
>> needs.
>>
>> So, some additional clarification:
>> - The CSV files have a few common and a few different columns. Each CSV
>> file represents one sample's measurement record.
>>
>> - When the CSVs are merged together, I expect to have one table with the
>> combined columns from all samples. Will your suggested pipeline allow
>> that?
>>
>> - If I understand the following pipeline correctly:
>>
>> *Read files => Parse lines => Generate pCollections for each column =>
>> GroupBy column name => Write to BQ*
>>
>> *Read files:* will generate a pCollection from each CSV line (= file)
>> *Parse lines:* will extract each column name and its matching value
>> (generate a tuple)
>> *Generate pCollection for each column:* will generate a pCollection from
>> the tuple
>> *GroupBy:* will merge each column name with all the relevant samples'
>> values (does it need to know the column names to group by, or will it
>> automatically use the tuple key?)
>> *WriteToBQ:* will NOT (is that true?) be able to write each value against
>> its matching sample. A sample that has no value for a specific key would
>> end up next to other samples' values that are present in that column's
>> collection record.
>>
>> - If the above understanding is correct, would extracting & merging the
>> headers from all 2.5M CSV files in advance to derive the schema, and then
>> using an additional pipeline to populate the data into BQ with that
>> schema, be the "right" approach?
>>
>> Please let me know if I am missing something here / what your thoughts
>> are.
>> Many thanks,
>> Eila
>>
>> On Wed, Sep 26, 2018 at 12:04 PM OrielResearch Eila Arich-Landkof
>> <[email protected]> wrote:
>>
>>> Hi Ankur,
>>>
>>> Thank you. Trying this approach now. Will let you know if I have any
>>> issues implementing it.
>>> Best,
>>> Eila
>>>
>>> On Tue, Sep 25, 2018 at 7:19 PM Ankur Goenka <[email protected]> wrote:
>>>
>>>> Hi Eila,
>>>>
>>>> If I understand correctly, the objective is to read a large number of
>>>> CSV files, each of which contains a single row with multiple columns,
>>>> deduplicate the columns across the files, and write them to BQ.
>>>> You are using a pandas DF to deduplicate the columns for a small set
>>>> of files, which might not work for a large number of files.
>>>>
>>>> You can use Beam's GroupBy to deduplicate the columns and write them
>>>> to BigQuery. Beam is capable of reading and managing a large number of
>>>> files when given the path to the directory containing those files.
>>>> So the approach would be:
>>>> Read files => Parse lines => Generate pCollections for each column =>
>>>> GroupBy column name => Write to BQ
>>>> For reference, here is an example of reading a file and doing a
>>>> GroupBy:
>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py
>>>>
>>>> Note: I am not very familiar with BQ, so I can't think of any direct
>>>> approach to dump data into BQ.
>>>>
>>>> Thanks,
>>>> Ankur
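A rough, untested sketch of the Read files => Parse lines => GroupBy column name shape described above, assuming Beam's Python fileio transforms; the file pattern, output path, and helper name are made up. GroupByKey groups on the tuple key automatically, so the column names do not need to be known in advance, and carrying the file name as a sample id keeps every value tied to its sample:

# Untested sketch of the suggested Read => Parse => GroupBy-by-column shape.
import csv

import apache_beam as beam
from apache_beam.io import fileio


def to_column_tuples(readable_file):
    """Emit one (column_name, (sample_id, value)) tuple per column of a file."""
    sample_id = readable_file.metadata.path  # the file name identifies the sample
    header_line, data_line = readable_file.read_utf8().splitlines()[:2]
    header = next(csv.reader([header_line]))
    values = next(csv.reader([data_line]))
    for column, value in zip(header, values):
        yield column, (sample_id, value)


with beam.Pipeline() as p:
    by_column = (p
                 | 'MatchCSVs' >> fileio.MatchFiles('gs://my-bucket/csvs/*.csv')
                 | 'ReadEachFile' >> fileio.ReadMatches()
                 | 'ParseColumns' >> beam.FlatMap(to_column_tuples)
                 | 'GroupByColumn' >> beam.GroupByKey())
    # The keys of the grouped collection are the deduplicated column names,
    # i.e. the combined header that the BigQuery schema needs.
    (by_column
     | 'ColumnNames' >> beam.Keys()
     | 'SaveColumns' >> beam.io.WriteToText('gs://my-bucket/derived/columns'))

Because every value still carries its sample id, grouping by sample_id instead of (or after) the column grouping can produce one dict per sample, which speaks to the question above about whether WriteToBQ can keep values matched to the right sample.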
>>>> On Tue, Sep 25, 2018 at 12:13 PM OrielResearch Eila Arich-Landkof
>>>> <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>> I would like to write a large number of CSV files to BQ, where the
>>>>> headers from all of them are aggregated into one common header. Any
>>>>> advice is very much appreciated.
>>>>>
>>>>> The details are:
>>>>> 1. 2.5M CSV files
>>>>> 2. Each CSV file: a header of 50-60 columns
>>>>> 3. Each CSV file: one data row
>>>>>
>>>>> There are common columns between the CSV files, but I don't know them
>>>>> in advance. I would like to have all the CSV files in one BigQuery
>>>>> table.
>>>>>
>>>>> My current method:
>>>>> When there was a smaller number of files, I read the CSV files and
>>>>> appended them to one pandas dataframe that was written to a file
>>>>> (total.csv). total.csv was the input to the Beam pipeline.
>>>>>
>>>>> small CSVs => Pandas DF => total CSV => pCollection => BigQuery
>>>>>
>>>>> The challenge with that approach is that pandas would require a large
>>>>> amount of memory in order to hold the 2.5M CSV files before writing
>>>>> them to BQ.
>>>>>
>>>>> Is there a different way to pipe the CSVs to BQ? One option would be
>>>>> to split the CSVs into batches and write them to different BQ tables,
>>>>> or append them to one table.
>>>>>
>>>>> Any thoughts on how to do it without extra coding?
>>>>>
>>>>> Many thanks,
>>>>> --
>>>>> Eila
>>>>> www.orielresearch.org
>>>>> https://www.meetup.com/Deep-Learning-In-Production/

--
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/
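For the batching option raised in the original message (splitting the CSVs into batches and appending them to one table), a small, untested sketch with pandas and pandas-gbq; the paths, table name, and batch size are made up, and it assumes the ~50-column union has already been derived so every batch can be reindexed to the same schema before appending:

# Untested sketch: append batches of one-row CSVs to a single BigQuery table
# with pandas + pandas-gbq. All paths and names are made up.
import glob

import pandas as pd

ALL_COLUMNS = ['sample_id', 'col_a', 'col_b']  # ...the derived ~50 names
BATCH_SIZE = 10000

csv_paths = sorted(glob.glob('/data/csvs/*.csv'))
for start in range(0, len(csv_paths), BATCH_SIZE):
    batch_paths = csv_paths[start:start + BATCH_SIZE]
    # pd.concat takes the union of columns and fills missing ones with NaN.
    batch_df = pd.concat((pd.read_csv(path) for path in batch_paths),
                         ignore_index=True, sort=False)
    # Reindex so every batch has the same ~50 columns in the same order.
    batch_df = batch_df.reindex(columns=ALL_COLUMNS)
    batch_df.to_gbq('my_dataset.samples', project_id='my-project',
                    if_exists='append')

This keeps only one batch in memory at a time, at the cost of one append job per batch, so it avoids building the 2.5M-row DataFrame in one piece.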
