[ https://issues.apache.org/jira/browse/BEAM-4796?focusedWorklogId=144873&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-144873 ]
ASF GitHub Bot logged work on BEAM-4796:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 17/Sep/18 15:22
            Start Date: 17/Sep/18 15:22
    Worklog Time Spent: 10m
      Work Description: nielm opened a new pull request #6409: [BEAM-4796] SpannerIO: Add option to wait for Schema to be ready.
URL: https://github.com/apache/beam/pull/6409

   The current behavior waits for the entire input PCollection to be read and closed before reading the schema.

   This can delay the pipeline for large inputs, and for small inputs it does not guarantee that the schema is ready (if the schema is created in the same pipeline).
   It also breaks streaming mode completely, because the input PCollection is never closed.

   This PR adds an optional parameter: a PCollection to wait for before reading the schema. If it is not specified, the schema is read immediately.

   This provides a partial -- but not complete -- fix for streaming mode (there are still issues with the partitioning/grouping in streaming mode, which mean that NPEs will be thrown under anything more than trivial load).
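The behavior described above (read the schema immediately unless an optional "ready" signal is supplied, in which case wait for that signal rather than for all of the input) can be modelled with plain Java futures. This is only an illustrative sketch under stated assumptions, not the code in the pull request and not Beam's API: the class `SchemaWaitSketch`, the method `readSchema`, and the use of `CompletableFuture` in place of a PCollection are all hypothetical stand-ins.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical model only: a CompletableFuture stands in for the optional
// "schema ready" PCollection; none of these names come from Beam.
public class SchemaWaitSketch {

    /**
     * Reads the schema immediately when no signal is given (null),
     * otherwise defers the read until the signal completes.
     */
    static <S> CompletableFuture<S> readSchema(
            CompletableFuture<?> readySignal, Supplier<S> schemaReader) {
        if (readySignal == null) {
            // No signal specified: read the schema right away.
            return CompletableFuture.completedFuture(schemaReader.get());
        }
        // Signal specified: read the schema only once the signal has fired.
        return readySignal.thenApply(ignored -> schemaReader.get());
    }

    public static void main(String[] args) {
        // Stand-in for an upstream step that creates the schema.
        CompletableFuture<Void> schemaCreated = new CompletableFuture<>();
        CompletableFuture<String> schema = readSchema(schemaCreated, () -> "schema");

        System.out.println("read before signal: " + schema.isDone());
        schemaCreated.complete(null);
        System.out.println("read after signal: " + schema.isDone());
        System.out.println("read without signal: "
                + readSchema(null, () -> "schema").isDone());
    }
}
```

In this model the schema read depends only on the signal, so it is independent of how much input is still pending, which is the decoupling the description is after.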
   @chamikaramj

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

    Worklog Id:     (was: 144873)
    Time Spent: 10m
    Remaining Estimate: 0h

> SpannerIO waits for all input before writing
> --------------------------------------------
>
>                 Key: BEAM-4796
>                 URL: https://issues.apache.org/jira/browse/BEAM-4796
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-gcp
>    Affects Versions: 2.5.0
>            Reporter: Niel Markwick
>            Assignee: Chamikara Jayalath
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> SpannerIO.Write waits for all input in the window to arrive before getting
> the schema:
> [https://github.com/apache/beam/blame/release-2.5.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L841]
>
> In streaming mode, this is not an issue, but in batch mode, this causes the
> pipeline to stall until all input is read, which could be a significant
> amount of time (and temp data).
>
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)