kennknowles opened a new issue, #19239:
URL: https://github.com/apache/beam/issues/19239

   The performance of reading from BigQuery in Python seems to be much worse 
than the performance of it in Java.
   
   To reproduce this, I've run the following two programs on the Google Cloud, 
which basically read the weights from the public data set "natality" and 
outputs the top 100 largest weights.
   
   Python:
   ```
   
   # <cut imports>
   
   options = PipelineOptions()
   options.view_as(StandardOptions).runner = 'DataflowRunner'
   #
   <cut more options>
   
   pipeline = Pipeline(options=options)
   (pipeline
       | 'Read' >> beam.io.Read(beam.io.BigQuerySource(query='SELECT
   weight_pounds FROM [bigquery-public-data:samples.natality]'))
       | 'MapToFloat' >> beam.Map(lambda
   elem: elem['weight_pounds'])
       | 'Top' >> beam.combiners.Top.Largest(100)
       | 'MapToString' >>
   beam.Map(lambda elem: str(elem))
       | 'Write' >> beam.io.WriteToText("<output-file>"))
   
   pipeline.run()
   
   ```
   
    Java:
   ```
   
   // <cut imports>
   
   public class Natality {
       public static void main(String[] args) {
          
   DataflowPipelineOptions options = 
PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
   
          options.setRunner(DataflowRunner.class);
           // <cut more options>
           
           Pipeline
   pipeline = Pipeline.create(options);
   
           pipeline.apply("Read", BigQueryIO.readTableRows()
   
              .fromQuery("SELECT weight_pounds FROM 
[bigquery-public-data:samples.natality]"))
         
        .apply("MapToDouble", MapElements
                   .into(TypeDescriptors.doubles())
            
         .via(row -> {
                        Object obj = row.get("weight_pounds");
                    
      return (obj == null ? 0.0 : (Double) obj);
                   }))
               .apply("Top", Top.largest(100))
   
              .apply("MapToString", MapElements
                   .into(TypeDescriptors.strings())
      
               .via(weight -> weight.toString()))
               .apply("Write", TextIO.write().to("<output-file>"));
   
   
          pipeline.run().waitUntilFinish();
       }
   }
   
   ```
   
   The "<cut more options\>" are basic options like project, job name, temp 
location, etc. Both programs produce identical outputs.
   
   Running these programs launches a DataFlow job on the Google Cloud with the 
following results (data from the Google Cloud Platform web interface; 
screenshots attached).
   
   Python:
   ```
   
   Read Succeeded 1 hr 40 min 40 sec
   MapToFloat Succeeded 2 min 43 sec
   Top Succeeded 5 min 25 sec
   MapToString
   Succeeded 0 sec
   Write Succeeded 3 sec
   ```
   
   Java:
   ```
   
   Read Succeeded 4 min 45 sec
   MapToDouble Succeeded 45 sec
   Top Succeeded 52 sec
   MapToString Succeeded
   0 sec
   Write Succeeded 1 sec
   
   ```
   
   As you can see, there is an enormous performance hit in Python w.r.t. the 
reading from BigQuery: 1h40m vs less than 5 minutes.
   
   Furthermore the other standard operations (like Top) are also much slower in 
Python than in Java.
   
    
   
   Imported from Jira 
[BEAM-6064](https://issues.apache.org/jira/browse/BEAM-6064). Original Jira may 
contain additional context.
   Reported by: jankuipers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to