(moving dev to bcc)

Hi Eila,

On Mon, Apr 2, 2018 at 3:50 PM OrielResearch Eila Arich-Landkof <
[email protected]> wrote:

> Hi All,
>
> I was able to make it work by creating the PCollection with the numpy
> array. However, writing to BQ was impossible because it required a
> schema.
> The code:
> (p | "create all" >> beam.Create(expression[1:5,1:5])
>    | "write all text" >> beam.io.WriteToText('gs://archs4/output/',
> file_name_suffix='.txt'))
>
> Is there a workaround for providing a schema for beam.io.BigQuerySink?
>

Regarding your earlier question: a ParDo cannot be applied to the pipeline
object directly (that is what the 'PBegin' error below refers to). You need
at least one element in an input PCollection to trigger the ParDo to do any
work; that can be a Create with a single element that you ignore.
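For example, something like this (a rough sketch; it assumes "expression"
is your in-memory numpy array, and the step names are illustrative):

import apache_beam as beam
import numpy as np

class Read10kRowsDoFn(beam.DoFn):
  def process(self, row_start):
    # row_start arrives as the single element from the seed Create below
    d = expression[row_start:row_start + 10000, :]
    yield np.transpose(d)

(p | 'seed' >> beam.Create([0])  # one element just to trigger the DoFn
   | 'read rows' >> beam.ParDo(Read10kRowsDoFn()))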

Not sure if I fully understood the BigQuery question. You have to specify a
schema when writing to a new BigQuery table. See the following example:

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py#L1085
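
In short, it looks something like this (a sketch with made-up table and
field names; each element you write must be a dict keyed by those field
names):

(p | 'rows' >> beam.Create([{'sample': 's1', 'value': 0.5}])
   | 'write' >> beam.io.Write(beam.io.BigQuerySink(
       'my_dataset.my_table',
       schema='sample:STRING,value:FLOAT',
       create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
       write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)))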


Thanks,
Cham



>
> Many thanks,
> Eila
>
> On Mon, Apr 2, 2018 at 11:33 AM, OrielResearch Eila Arich-Landkof <
> [email protected]> wrote:
>
>> Hello all,
>>
>> I would like to try a different way to leverage Apache Beam for an H5 =>
>> BQ (file to table) transfer.
>>
>> For my use case, I would like to read the H5 data (in numpy array
>> format) 10K rows at a time, transpose each chunk, and write it to BQ as
>> 10K columns. 10K is BQ's column limit.
>>
>> My code is below and fires the following error (I might have missed
>> something basic). I am not using beam.Create; I am trying to create a
>> PCollection from the ParDo transform. Is this possible? If not, what is
>> the alternative (if any) for creating a PCollection from a numpy array?
>>
>> ERROR:root:Exception at bundle 
>> <apache_beam.runners.direct.bundle_factory._Bundle object at 
>> 0x7f00aad7b7a0>, due to an exception.
>>  Traceback (most recent call last):
>>   File 
>> "/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/direct/executor.py",
>>  line 307, in call
>>     side_input_values)
>>   File 
>> "/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/direct/executor.py",
>>  line 332, in attempt_call
>>     evaluator.start_bundle()
>>   File 
>> "/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/direct/transform_evaluator.py",
>>  line 540, in start_bundle
>>     self._applied_ptransform.inputs[0].windowing,
>> AttributeError: 'PBegin' object has no attribute 'windowing'
>>
>> ERROR:root:Giving up after 4 attempts.
>> WARNING:root:A task failed with exception: 'PBegin' object has no attribute 
>> 'windowing'
>> WARNING:root:A task failed with exception: 'PBegin' object has no attribute 
>> 'windowing'
>>
>>
>>
>> Code:
>>
>> options = PipelineOptions()
>> google_cloud_options = options.view_as(GoogleCloudOptions)
>> google_cloud_options.project = 'orielresearch-188115'
>> google_cloud_options.job_name = 'h5-to-bq-10K'
>> google_cloud_options.staging_location = 'gs://archs4/staging'
>> google_cloud_options.temp_location = 'gs://archs4/temp'
>> options.view_as(StandardOptions).runner = 'DirectRunner'
>>
>> p = beam.Pipeline(options=options)
>>
>> class read10kRowsDoFn(beam.DoFn):
>>   def __init__(self, index):
>>     self.index = index
>>   def process(self, element):
>>     row_start = self.index
>>     row_end = self.index + 10000
>>     # slice out 10K rows of the numpy array (numpy.ndarray)
>>     d = expression[row_start:row_end, :]
>>     # transpose returns a new array, so keep the result
>>     return [np.transpose(d)]
>>
>> #for i in range(0,expression.shape[0],10000):
>> k=210 # allows creating unique labels for the runner
>> for i in range(0,3,2): # test
>>   k+=1
>>   bigQuery_dataset_table_name = bigquery_dataset_name + '.' + bigquery_table_name + str(k)
>>   print(bigQuery_dataset_table_name)
>>   label_read_row = "read "+bigQuery_dataset_table_name
>>   label_write_col = "write "+bigQuery_dataset_table_name
>>   # is this possible to generate a PCollection with ParDo without create?
>>   (p | label_read_row >> beam.ParDo(read10kRowsDoFn(i))
>>      | label_write_col >> beam.io.Write(beam.io.BigQuerySink(bigQuery_dataset_table_name)))
>>
>> p.run().wait_until_finish()  # fires an error
>>
>> Many thanks,
>>
>> --
>> Eila
>> www.orielresearch.org
>> https://www.meetup.com/Deep-Learning-In-Production/
>>
>
>
>
> --
> Eila
> www.orielresearch.org
> https://www.meetup.com/Deep-Learning-In-Production/
>
