Re: Returning dataframe from parDo and printing its value - advice?

OrielResearch Eila Arich-Landkof Tue, 19 Jun 2018 07:29:08 -0700

Thanks!!!

On Mon, Jun 18, 2018 at 4:41 PM, Chamikara Jayalath <chamik...@google.com>
wrote:


> A ParDo should always return an iterable, not a bare string. So if you want to
> output a single string, it should be either "return [str]" or "yield str".
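To illustrate the point above, here is a minimal plain-Python sketch (no Beam import) of why a bare string is a problem: whatever the processing function returns gets iterated, and iterating a string yields single characters. The function names and the `collect` helper are hypothetical stand-ins for illustration, not part of the Beam API; in a real DoFn the same three bodies would live in its `process` method.

```python
def process_returning_str(row):
    # Returns a bare string. Anything that iterates this result
    # sees one character at a time.
    return ",".join(str(v) for v in row)


def process_returning_list(row):
    # Wraps the string in a list, so exactly one element comes out.
    return [",".join(str(v) for v in row)]


def process_yielding(row):
    # Generator form: also produces exactly one element.
    yield ",".join(str(v) for v in row)


def collect(fn, row):
    """Hypothetical stand-in for a runner: iterate over whatever fn returns."""
    return list(fn(row))


row = [30, "Primary Tissue", None]
chars = collect(process_returning_str, row)     # one element per character
one_a = collect(process_returning_list, row)    # one element: the whole line
one_b = collect(process_yielding, row)          # same single element

print(len(chars), one_a, one_b)
```

The list and generator forms give the same single element, which is the behaviour the "return [str]" and "yield str" suggestions aim for.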
>
>
> On Mon, Jun 18, 2018 at 1:39 PM OrielResearch Eila Arich-Landkof <
> e...@orielresearch.org> wrote:
>
>> Thanks for the response.
>> I tried this within the current ParDo, CreateColForSampleFn. Apache Beam
>> returns a warning recommending not to return a string.
>>
>> So, my questions are:
>> - Is it essential to move this transformation into a separate ParDo?
>> - Should I ignore that message? When is this message relevant?
>>
>> Many thanks,
>> Eila
>>
>> On Mon, Jun 18, 2018 at 2:52 PM Lukasz Cwik <lc...@google.com> wrote:
>>
>>> User is the correct mailing list.
>>>
>>> beam.io.WriteToText takes 'strings' which means that you have to format
>>> the whole line yourself. You'll want to apply another ParDo
>>> after CreateColForSampleFn which takes the 1x164 record and concatenates
>>> each value with ',' in between.
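The formatting step described above could be sketched roughly as follows, assuming each record arrives as a flat sequence of values. `to_csv_line` and `format_record` are hypothetical names chosen for illustration. The standard `csv` module is used instead of a plain `','.join` so that values containing commas are quoted, and `None` is written as an empty cell; both choices are assumptions about the desired output.

```python
import csv
import io


def to_csv_line(values):
    """Join one record's values into a single CSV line (no trailing newline)."""
    buf = io.StringIO()
    # lineterminator="" leaves the newline to whatever writes the line out.
    writer = csv.writer(buf, lineterminator="")
    writer.writerow(["" if v is None else v for v in values])
    return buf.getvalue()


def format_record(values):
    # Emit exactly one string per record, not one element per value.
    yield to_csv_line(values)


record = [None, 30, "Primary Tissue", "lung, left"]
lines = list(format_record(record))
print(lines[0])  # ,30,Primary Tissue,"lung, left"
```

In a pipeline, a function like `to_csv_line` would typically be applied with `beam.Map(to_csv_line)` (or wrapped in a DoFn) between CreateColForSampleFn and WriteToText, so that each element reaching WriteToText is already one complete line.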
>>>
>>> On Mon, Jun 18, 2018 at 9:00 AM OrielResearch Eila Arich-Landkof <
>>> e...@orielresearch.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> Is anyone listening on the user@ mailing list? Or should I use a
>>>> different mailing list?
>>>>
>>>> I have made some progress:
>>>> - The ParDo returns a list now.
>>>> - I added a header to WriteToText.
>>>>
>>>> The pipeline looks like this:
>>>> ExploreData = (p
>>>>     | "Extract the rows from dataframe" >> beam.io.Read(
>>>>         beam.io.BigQuerySource('archs4.Debug_annotation'))
>>>>     | "create more columns" >> beam.ParDo(
>>>>         CreateColForSampleFn(colListSubset, outputPath)))
>>>>
>>>> (ExploreData | 'writing to CSV files' >> beam.io.WriteToText('gs://
>>>> dataExploration.txt',file_name_suffix='.csv',num_shards=
>>>> 1,append_trailing_newlines=True,header=colListStr))
>>>>
>>>>
>>>> The remaining issue is that the output has a newline after each value:
>>>>
>>>> *None
>>>> None
>>>> None
>>>> None
>>>> None
>>>>  30
>>>>  Primary Tissue
>>>> None
>>>> None
>>>> None*
>>>>
>>>> Please let me know how I can get rid of these newlines. I hope to be
>>>> able to open the output file with Google Sheets.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Eila
>>>>
>>>>
>>>>
>>>> On Fri, Jun 15, 2018 at 2:45 PM, OrielResearch Eila Arich-Landkof <
>>>> e...@orielresearch.org> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am running a pipeline where a table from BQ is processed line
>>>>> by line using a ParDo function.
>>>>> CreateColForSampleFn generates a data frame with headers and values
>>>>> (shape: 1x164) that I want to pass to WriteToText.
>>>>> See the following:
>>>>>
>>>>> ExploreData = (p
>>>>>     | "Extract the rows from dataframe" >> beam.io.Read(
>>>>>         beam.io.BigQuerySource('archs4.Debug_annotation'))
>>>>>     | "create more columns" >> beam.ParDo(
>>>>>         CreateColForSampleFn(colListSubset, outputPath)))
>>>>>
>>>>> (ExploreData | 'writing to CSV files' >> beam.io.WriteToText('gs://
>>>>> dataExploration.txt',num_shards=1))
>>>>>
>>>>> My questions are related to the returned DF and WriteToText:
>>>>> 1. When I pass the DF from CreateColForSampleFn to WriteToText, I
>>>>> get only the headers:
>>>>>
>>>>> Sample_contact_phone
>>>>> Sample_extract_protocol_ch1
>>>>> Sample_platform_id
>>>>> Sick
>>>>> Sample_title
>>>>> index
>>>>> Sample_last_update_date
>>>>> Sample_contact_country
>>>>> Sample_channel_count
>>>>> Sample_library_source
>>>>> Sample_taxid_ch1
>>>>>
>>>>>
>>>>> 2. When I return the DF in a list, [df], I get the following text for
>>>>> each row (including the dimensions):
>>>>>
>>>>>  Sample_contact_phone                        Sample_extract_protocol_ch1 
>>>>> Sample_platform_id  Sick
>>>>>
>>>>> 0                       Library construction protocol: Four µg of
>>>>> tota...           GPL11154  None
>>>>>
>>>>> [1 rows x 168 columns]
>>>>>
>>>>>
>>>>>
>>>>> I want to generate a text file that includes:
>>>>> - One header (if needed, I will add it after the pipeline completes)
>>>>> - All the values from each row that was processed into a DF
>>>>> - Full cell values, without ... in the middle
>>>>>
>>>>> What am I missing? Any advice?
>>>>>
>>>>> Thanks,
>>>>> --
>>>>> Eila
>>>>> www.orielresearch.org
>>>>> https://www.meetup.com/Deep-Learning-In-Production/
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Eila
>>>> www.orielresearch.org
>>>> https://www.meetup.com/Deep-Learning-In-Production/
>>>>
>>>>
>>>> --
>> Eila
>> www.orielresearch.org
>> https://www.meetup.com/Deep-Learning-In-Production/
>>
>>
>>


-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/
