Hi Cham,

Please see inline. If possible, code or pseudocode would help a lot.
Thanks,
Eila

On Tue, Mar 20, 2018 at 1:15 PM, Chamikara Jayalath <chamik...@google.com>
wrote:

> Hi Eila,
>
> Please find my comments inline.
>
> On Tue, Mar 20, 2018 at 8:02 AM OrielResearch Eila Arich-Landkof <
> e...@orielresearch.org> wrote:
>
>> Hello all,
>>
>> It was nice to meet you last week!!!
>>
>>
> It was nice to meet you as well :)
>
>
>> I am writing a genomic PCollection that is created from BigQuery to a
>> folder. Below is the code with its output, so you can run it with any
>> small BQ table and let me know what your thoughts are:
>>
>> This init is only for debugging. In production I will use the pipeline
>> syntax.

>> rows = [{u'index': u'GSM2313641', u'SNRPCP14': 0},
>>         {u'index': u'GSM2316666', u'SNRPCP14': 0},
>>         {u'index': u'GSM2312355', u'SNRPCP14': 0},
>>         {u'index': u'GSM2312372', u'SNRPCP14': 0}]
>>
>> rows[1].keys()
>> # output: [u'index', u'SNRPCP14']
>>
>> # you can change `archs4.results_20180308_*` to any other table name with
>> # an index column
>> queries2 = rows | beam.Map(lambda x: (
>>     beam.io.Read(beam.io.BigQuerySource(
>>         project='orielresearch-188115',
>>         use_standard_sql=False,
>>         query=str('SELECT * FROM `archs4.results_20180308_*` '
>>                   'where index=\'%s\'' % (x["index"])))),
>>     str('gs://archs4/output/' + x["index"] + '/')))
>>
>
> I don't think the above code will work (it's not portable across runners,
> at least). BigQuerySource (along with the Read transform) has to be applied
> to a Pipeline object. So probably change this to a for loop that creates a
> set of read transforms, and use Flatten to create a single PCollection.
>
For debugging, I am running on the local Datalab runner; for production, I
will be running only on the Dataflow runner. I think that I was able to query
the tables that way; I will double-check it. The indexes could go to the
millions, and my concern is that I will not be able to leverage Beam's
distribution capability when I use the loop option. Any thoughts on that?
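
To make sure I follow, is the sketch below roughly what you mean? It is only
a sketch: the project, table, and index values are from my example above, and
the (path, row) keying is something I added so a later write step can find
the destination. (With millions of indexes this would also create millions of
read transforms, which is part of what worries me.)

import apache_beam as beam

# Indexes from the example above; in production the full list would be
# computed before the pipeline is constructed.
indexes = [u'GSM2313641', u'GSM2316666', u'GSM2312355', u'GSM2312372']

with beam.Pipeline() as p:
    # One read transform per index, each applied to the Pipeline object.
    per_index = [
        (p
         | 'read %s' % idx >> beam.io.Read(beam.io.BigQuerySource(
               project='orielresearch-188115',
               use_standard_sql=False,
               query="SELECT * FROM `archs4.results_20180308_*` "
                     "where index='%s'" % idx))
         # Tag each row with its destination path for the write step.
         | 'key %s' % idx >> beam.Map(
               lambda row, path='gs://archs4/output/%s/' % idx: (path, row)))
        for idx in indexes]

    # Flatten the per-index PCollections into one PCollection of
    # (destination path, row) pairs.
    all_rows = per_index | beam.Flatten()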

>
>
>>
>> queries2
>> # output: a list of PCollections and the paths to write the PCollection
>> # data to
>>
>> [(<Read(PTransform) label=[Read] at 0x7fa6990fb7d0>,
>>   'gs://archs4/output/GSM2313641/'),
>>  (<Read(PTransform) label=[Read] at 0x7fa6990fb950>,
>>   'gs://archs4/output/GSM2316666/'),
>>  (<Read(PTransform) label=[Read] at 0x7fa6990fb9d0>,
>>   'gs://archs4/output/GSM2312355/'),
>>  (<Read(PTransform) label=[Read] at 0x7fa6990fbb50>,
>>   'gs://archs4/output/GSM2312372/')]
>>
>>
> What you got here is a PCollection of PTransform objects, which is not
> useful.
>
>
>>
>> *# this is my challenge*
>> queries2 | 'write to relevant path' >> beam.io.WriteToText("SECOND COLUMN")
>>
>>
> Once you update the above code, you will get a proper PCollection of
> elements read from BigQuery. You can transform and write it (to files, BQ,
> or any other sink) as needed.
>

It is a list of tuples with a PCollection and the path to write it to. The
path is not unique, and I might have more than one PCollection written to the
same destination. How do I pass the path from the tuple list as a parameter
to the text file name? Could you please add the code that you were thinking
about?
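
For example, continuing from the sketch above, would something along these
lines be reasonable? Grouping by path first would also take care of several
PCollections sharing one destination, since all their rows end up in the same
group. (The file name 'part-00000.txt' and the JSON encoding are placeholders
I made up, not anything from the Beam API.)

import json

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class WriteRowsToPath(beam.DoFn):
    """Writes all rows that share a destination path to one file there."""
    def process(self, element):
        path, rows = element  # rows: every dict grouped under this path
        writer = FileSystems.create(path + 'part-00000.txt')
        try:
            for row in rows:
                writer.write((json.dumps(row) + '\n').encode('utf-8'))
        finally:
            writer.close()

# (path, row) pairs -> one group per destination -> one file per destination.
(all_rows
 | 'group by destination' >> beam.GroupByKey()
 | 'write each group' >> beam.ParDo(WriteRowsToPath()))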

> Please see the programming guide on how to write to text files (section
> 5.3; click the Python tab):
> https://beam.apache.org/documentation/programming-guide/
>
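
I did look at section 5.3. If I read it right, the basic text sink takes a
single path prefix, along the lines of the sketch below (`output` stands for
whatever PCollection of strings I end up with), so on its own it would not
give me a per-index destination, right?

# Basic write from the guide: everything lands under one path prefix.
output | 'write' >> beam.io.WriteToText('gs://archs4/output/results',
                                        file_name_suffix='.txt')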
> Thanks,
> Cham
>
>
>> Do you have any idea how to sink the data to a text file? I have tried a
>> few other options and got stuck at the write transform.
>>
>> Any advice is very appreciated.
>>
>> Thanks,
>> Eila
>>
>>
>>
>> --
>> Eila
>> www.orielresearch.org
>> https://www.meetup.com/Deep-Learning-In-Production/
>>
>


-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/
