[ https://issues.apache.org/jira/browse/BEAM-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937806#comment-15937806 ]
Mike Lambert edited comment on BEAM-1787 at 3/23/17 6:22 AM: ------------------------------------------------------------- Nope, there is just a print statement at step 3. Using the datastore_wordcount example: {noformat} lines = p | 'read from datastore' >> ReadFromDatastore( project, query, user_options.namespace) # Count the occurrences of each word. counts = (lines | 'split' >> (beam.ParDo(WordExtractingDoFn()) .with_output_types(unicode)) | 'pair_with_one' >> beam.Map(lambda x: (x, 1)) | 'group' >> beam.GroupByKey() | 'count' >> beam.Map(lambda (word, ones): (word, sum(ones)))) {noformat} If I put logging statements inside WordExtractingDoFn, they are not printed until the very end of the script execution, and are printed all at once. was (Author: mlambert): Nope, there is just a print statement at step 3. Using the datastore_wordcount example: {code:none} lines = p | 'read from datastore' >> ReadFromDatastore( project, query, user_options.namespace) # Count the occurrences of each word. counts = (lines | 'split' >> (beam.ParDo(WordExtractingDoFn()) .with_output_types(unicode)) | 'pair_with_one' >> beam.Map(lambda x: (x, 1)) | 'group' >> beam.GroupByKey() | 'count' >> beam.Map(lambda (word, ones): (word, sum(ones)))) {code:none} If I put logging statements inside WordExtractingDoFn, they are not printed until the very end of the script execution, and are printed all at once. > Python DirectRunner silently blocks reading full query from Google Datastore > ---------------------------------------------------------------------------- > > Key: BEAM-1787 > URL: https://issues.apache.org/jira/browse/BEAM-1787 > Project: Beam > Issue Type: Bug > Components: sdk-py > Reporter: Mike Lambert > Assignee: Ahmet Altay > Priority: Minor > Labels: datastore, python > > When I run a query (even with many splits) against the production datastore > (such as in the datastore_wordcount demo), it operates as follows: > 1. split the query into a bunch of split queries > 2. run each split query, collecting the results > 3. then pass the results to the following stage / ParDo > However, 2 is run to completion with DirectRunner before starting 3. So a > large dataset must be fully downloaded before it attempts to run any of the > following stages. > While it may make sense and local parallelism/pipelining might be > impossible....there is no output or status messages. And debugging why my > code appeared to hang before processing results, took forever to dig through > code and instrument-log-debug all the beam code to figure out what was going > on. > See https://github.com/GoogleCloudPlatform/DataflowPythonSDK/issues/36 for > more details > This happens with github head 0.7.0-dev (there was no "version" tag for this > above). -- This message was sent by Atlassian JIRA (v6.3.15#6346)