[ https://issues.apache.org/jira/browse/BEAM-6064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chamikara Jayalath resolved BEAM-6064.
--------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.9.0

> Python BigQuery performance much worse than Java
> ------------------------------------------------
>
>                 Key: BEAM-6064
>                 URL: https://issues.apache.org/jira/browse/BEAM-6064
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>    Affects Versions: 2.8.0
>            Reporter: Jan Kuipers
>            Assignee: Chamikara Jayalath
>            Priority: Major
>             Fix For: 2.9.0
>
>         Attachments: results-java.png, results-python.png
>
>
> The performance of reading from BigQuery in Python seems to be much worse
> than in Java.
> To reproduce this, I've run the following two programs on Google Cloud.
> Each reads the weights from the public data set "natality" and outputs the
> 100 largest weights.
> Python:
> {code:java}
> # <cut imports>
> options = PipelineOptions()
> options.view_as(StandardOptions).runner = 'DataflowRunner'
> # <cut more options>
> pipeline = Pipeline(options=options)
> (pipeline
>  | 'Read' >> beam.io.Read(beam.io.BigQuerySource(query='SELECT weight_pounds FROM [bigquery-public-data:samples.natality]'))
>  | 'MapToFloat' >> beam.Map(lambda elem: elem['weight_pounds'])
>  | 'Top' >> beam.combiners.Top.Largest(100)
>  | 'MapToString' >> beam.Map(lambda elem: str(elem))
>  | 'Write' >> beam.io.WriteToText("<output-file>"))
> pipeline.run()
> {code}
> Java:
> {code:java}
> // <cut imports>
> public class Natality {
>   public static void main(String[] args) {
>     DataflowPipelineOptions options =
>         PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
>     options.setRunner(DataflowRunner.class);
>     // <cut more options>
>
>     Pipeline pipeline = Pipeline.create(options);
>     pipeline.apply("Read", BigQueryIO.readTableRows()
>             .fromQuery("SELECT weight_pounds FROM [bigquery-public-data:samples.natality]"))
>         .apply("MapToDouble", MapElements
>             .into(TypeDescriptors.doubles())
>             .via(row -> {
>               Object obj = row.get("weight_pounds");
>               return (obj == null ? 0.0 : (Double) obj);
>             }))
>         .apply("Top", Top.largest(100))
>         .apply("MapToString", MapElements
>             .into(TypeDescriptors.strings())
>             .via(weight -> weight.toString()))
>         .apply("Write", TextIO.write().to("<output-file>"));
>     pipeline.run().waitUntilFinish();
>   }
> }
> {code}
> The "<cut more options>" are basic options like project, job name, temp
> location, etc. Both programs produce identical outputs.
> Running these programs launches a Dataflow job on Google Cloud with the
> following results (data from the Google Cloud Platform web interface;
> screenshots attached).
> Python:
> {noformat}
> Read         Succeeded  1 hr 40 min 40 sec
> MapToFloat   Succeeded  2 min 43 sec
> Top          Succeeded  5 min 25 sec
> MapToString  Succeeded  0 sec
> Write        Succeeded  3 sec
> {noformat}
> Java:
> {noformat}
> Read         Succeeded  4 min 45 sec
> MapToDouble  Succeeded  45 sec
> Top          Succeeded  52 sec
> MapToString  Succeeded  0 sec
> Write        Succeeded  1 sec
> {noformat}
> As you can see, there is an enormous performance hit in Python when
> reading from BigQuery: 1h40m vs. less than 5 minutes.
> Furthermore, the other standard operations (like Top) are also much slower
> in Python than in Java.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
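For a rough sense of the per-step gap, the durations quoted in the report can be converted to seconds and compared. This is only an illustrative sketch: the helper names `to_seconds` and `slowdown` are hypothetical and are not part of Beam or of the original report. The durations are copied from the report, with the Python MapToFloat step paired against the Java MapToDouble step. MapToString is left out because both SDKs report 0 sec.

```python
import re

# Seconds per unit, for the units used in the Dataflow step timings.
UNIT_SECONDS = {"hr": 3600, "min": 60, "sec": 1}


def to_seconds(duration: str) -> int:
    """Convert a duration such as '1 hr 40 min 40 sec' to seconds."""
    parts = re.findall(r"(\d+)\s*(hr|min|sec)", duration)
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in parts)


def slowdown(python_duration: str, java_duration: str) -> float:
    """Return how many times longer the Python step took than the Java step."""
    return to_seconds(python_duration) / to_seconds(java_duration)


# (step label, Python duration, Java duration), copied from the report.
STEPS = [
    ("Read", "1 hr 40 min 40 sec", "4 min 45 sec"),
    ("MapToFloat / MapToDouble", "2 min 43 sec", "45 sec"),
    ("Top", "5 min 25 sec", "52 sec"),
    ("Write", "3 sec", "1 sec"),
]

for name, py, java in STEPS:
    print(f"{name}: Python took {slowdown(py, java):.1f}x as long as Java")
```

On these figures the Read step accounts for most of the difference: 6040 seconds in Python against 285 seconds in Java, roughly a factor of 21.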