[ 
https://issues.apache.org/jira/browse/BEAM-9228?focusedWorklogId=389236&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-389236
 ]

ASF GitHub Bot logged work on BEAM-9228:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Feb/20 00:49
            Start Date: 19/Feb/20 00:49
    Worklog Time Spent: 10m 
      Work Description: robertwb commented on pull request #10847: [BEAM-9228] 
Support further partition for FnApi ListBuffer
URL: https://github.com/apache/beam/pull/10847#discussion_r381019006
 
 

 ##########
 File path: sdks/python/apache_beam/runners/portability/fn_api_runner.py
 ##########
 @@ -932,10 +998,14 @@ def get_buffer(buffer_id):
       return pcoll_buffers[buffer_id]
 
     def get_input_coder_impl(transform_id):
-      return context.coders[
-          safe_coders[beam_fn_api_pb2.RemoteGrpcPort.FromString(
-              process_bundle_descriptor.transforms[transform_id].spec.payload).
-                      coder_id]].get_impl()
+      coder_id = beam_fn_api_pb2.RemoteGrpcPort.FromString(
+          process_bundle_descriptor.transforms[transform_id].spec.payload).\
+        coder_id
+      assert coder_id is not None and coder_id != ''
 
 Review comment:
  OK, in the SDF case we can get the coder of the preceding transform (the one
that produces its input `ref_PCollection_PCollection_3_split`), whose payload
should always be a RemoteGrpcPort.
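
  A minimal sketch of that lookup, assuming a hypothetical helper name, and
assuming the producing transform can be found by matching the PCollection id
against each transform's outputs in the ProcessBundleDescriptor (an
illustration, not the PR's code):

{code:python}
from apache_beam.portability.api import beam_fn_api_pb2

def coder_id_for_input(process_bundle_descriptor, pcoll_id):
  # Hypothetical helper: find the transform that produces `pcoll_id` and read
  # the coder id from its RemoteGrpcPort payload.  Assumes the producer is a
  # data-channel read, i.e. its spec payload parses as a RemoteGrpcPort.
  for transform in process_bundle_descriptor.transforms.values():
    if pcoll_id in transform.outputs.values():
      port = beam_fn_api_pb2.RemoteGrpcPort.FromString(transform.spec.payload)
      assert port.coder_id, 'expected a non-empty coder id'
      return port.coder_id
  raise ValueError('No producing transform for %s' % pcoll_id)
{code}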
 
Issue Time Tracking
-------------------

    Worklog Id:     (was: 389236)
    Time Spent: 2h 10m  (was: 2h)

> _SDFBoundedSourceWrapper doesn't distribute data to multiple workers
> --------------------------------------------------------------------
>
>                 Key: BEAM-9228
>                 URL: https://issues.apache.org/jira/browse/BEAM-9228
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>    Affects Versions: 2.16.0, 2.18.0, 2.19.0
>            Reporter: Hannah Jiang
>            Assignee: Hannah Jiang
>            Priority: Major
>             Fix For: 2.20.0
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> A user reported the following issue.
> -------------------------------------------------
> I have a set of tfrecord files, obtained by converting parquet files with 
> Spark. Each file is roughly 1GB and I have 11 of those.
> I would expect simple statistics gathering (i.e. counting the number of items
> in all files) to scale linearly with respect to the number of cores on my
> system. I am able to reproduce the issue with the minimal snippet below.
> {code:python}
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.runners.portability import fn_api_runner
> from apache_beam.portability.api import beam_runner_api_pb2
> from apache_beam.portability import python_urns
> import sys
> pipeline_options = PipelineOptions(['--direct_num_workers', '4'])
> file_pattern = 'part-r-00*'
> runner = fn_api_runner.FnApiRunner(
>           default_environment=beam_runner_api_pb2.Environment(
>               urn=python_urns.SUBPROCESS_SDK,
>               payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
>                         % sys.executable.encode('ascii')))
> p = beam.Pipeline(runner=runner, options=pipeline_options)
> lines = (p | 'read' >> beam.io.tfrecordio.ReadFromTFRecord(file_pattern)
>                  | beam.combiners.Count.Globally()
>                  | beam.io.WriteToText('/tmp/output'))
> p.run()
> {code}
> Only one combination of apache_beam revision / worker type seems to work (I
> refer to https://beam.apache.org/documentation/runners/direct/ for the worker
> types; a sketch of the multithreaded setup follows the list):
> * beam 2.16: neither the multithreaded nor the multiprocess mode achieves
> high CPU usage on multiple cores
> * beam 2.17: able to achieve high CPU usage on all 4 cores
> * beam 2.18: I have not tested the multithreaded mode, but the multiprocess
> mode fails when trying to serialize the Environment instance, most likely
> because of a change from 2.17 to 2.18.
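> A minimal sketch of the multithreaded worker type from that page, reusing the
> imports from the snippet above (assuming python_urns.EMBEDDED_PYTHON_GRPC
> selects the in-process gRPC harness; an illustration, not part of the fix):
> {code:python}
> # Multithreaded workers: embed the SDK harness in the same process over gRPC
> # instead of spawning subprocess SDK workers.
> runner = fn_api_runner.FnApiRunner(
>     default_environment=beam_runner_api_pb2.Environment(
>         urn=python_urns.EMBEDDED_PYTHON_GRPC))
> {code}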
> I also briefly tried the SparkRunner with version 2.16 but was not able to
> achieve any throughput.
> What is the recommended way to achieve what I am trying to do? How can I
> troubleshoot?
> ----------------------------------------------------------------------------------------------------------------------------------------------
> This is caused by [this
> commit|https://github.com/apache/beam/commit/02f8ad4eee3ec0ea8cbdc0f99c1dad29f00a9f60].
> A [workaround|https://github.com/apache/beam/pull/10729] was tried: rolling
> back iobase.py so that it does not use _SDFBoundedSourceWrapper. This
> confirmed that data is distributed to multiple workers; however, there are
> some regressions with the SDF wrapper tests.


