[jira] [Commented] (BEAM-758) Per-step, per-execution nonce
[ https://issues.apache.org/jira/browse/BEAM-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15736212#comment-15736212 ]

Sam McVeety commented on BEAM-758:
----------------------------------

Here's a sketch of how this could work, with default behavior that specific runners can override:

PipelineOptions {
  ...
  @Default.InstanceFactory(/* Class that returns StaticValueProvider.of(UUID), for UUID generated at construction time */)
  ValueProvider getPipelineId();
  ...
}

and a specific Runner can override this in fromOptions(), calling setPipelineId(RuntimeValueProvider rvp), with a Runner-specific RVP.

> Per-step, per-execution nonce
> -----------------------------
>
> Key: BEAM-758
> URL: https://issues.apache.org/jira/browse/BEAM-758
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-core
> Affects Versions: Not applicable
> Reporter: Daniel Halperin
> Assignee: Sam McVeety
>
> In the forthcoming runner API, a user will be able to save a pipeline to JSON
> and then run it repeatedly.
> Many pieces of code (e.g., BigQueryIO.Read or Write) rely on a single random
> value (nonce). These values are typically generated at apply time, so that
> they are deterministic (don't change across retries of DoFns) and global (are
> the same across all workers).
> However, once the runner API lands, the existing code would result in the same
> nonce being reused across jobs. Other possible solutions:
> * Generate nonce in {{Create(1) | ParDo}} then use this as a side input.
> Should work, as long as side inputs are actually checkpointed. But does not
> work for {{BoundedSource}}.
> * If a nonce is only needed for the lifetime of one bundle, can be generated
> in {{startBundle}} and used in {{finishBundle}} [or {{tearDown}}].
> * Add some context somewhere that lets user code access unique step name, and
> somehow generate a nonce consistently e.g. by hashing. Will usually work, but
> this is similarly not available to sources.
> Another Q: I'm not sure we have a good way to generate nonces in unbounded
> pipelines -- we probably need one. This would enable us to, e.g., use
> {{BigQueryIO.Write}} in an unbounded pipeline [if we had, e.g., exactly-once
> triggering per window]. Or generalizing to multiple firings...

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
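
Editorial illustration (not part of the original message): the default-plus-override idea in the comment above can be sketched in plain Java. This is only a minimal sketch under stated assumptions. The `ValueProvider`, `StaticValueProvider`, `RuntimeValueProvider`, and `Options` types below are hypothetical stand-ins written for this example, not Beam's actual classes, and the real proposal would go through `@Default.InstanceFactory` and `fromOptions()` rather than a field initializer.

```java
import java.util.UUID;
import java.util.function.Supplier;

// Hypothetical stand-ins for illustration only; these are NOT Beam's real classes.
final class PipelineIdSketch {

    /** A value that may be fixed at construction time or supplied at run time. */
    interface ValueProvider<T> {
        T get();
    }

    /** Holds a value that was fixed when the pipeline was constructed. */
    static final class StaticValueProvider<T> implements ValueProvider<T> {
        private final T value;

        private StaticValueProvider(T value) {
            this.value = value;
        }

        static <T> StaticValueProvider<T> of(T value) {
            return new StaticValueProvider<>(value);
        }

        @Override
        public T get() {
            return value;
        }
    }

    /** Defers to a runner-supplied source each time get() is called. */
    static final class RuntimeValueProvider<T> implements ValueProvider<T> {
        private final Supplier<T> source;

        RuntimeValueProvider(Supplier<T> source) {
            this.source = source;
        }

        @Override
        public T get() {
            return source.get();
        }
    }

    /** Options whose pipeline id defaults to a UUID generated at construction time. */
    static final class Options {
        private ValueProvider<String> pipelineId =
            StaticValueProvider.of(UUID.randomUUID().toString());

        ValueProvider<String> getPipelineId() {
            return pipelineId;
        }

        // A runner would call this (assumed here to happen in its fromOptions()).
        void setPipelineId(ValueProvider<String> pipelineId) {
            this.pipelineId = pipelineId;
        }
    }

    public static void main(String[] args) {
        Options options = new Options();
        System.out.println("default id: " + options.getPipelineId().get());

        // Runner-specific override: the id is resolved per execution, not per construction.
        options.setPipelineId(new RuntimeValueProvider<>(() -> "job-" + System.nanoTime()));
        System.out.println("runner id:  " + options.getPipelineId().get());
    }
}
```

With the default in place, code that needs a nonce sees a stable value without runner support; a runner that can do better swaps in a provider that resolves per job.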
[jira] [Commented] (BEAM-758) Per-step, per-execution nonce
[ https://issues.apache.org/jira/browse/BEAM-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15735995#comment-15735995 ]

Sam McVeety commented on BEAM-758:
----------------------------------

The runner-contingent option seems viable. Something like PipelineOptions.getPipelineId(), which would default to a nonce generated at construction time but which a given runner could override, seems doable with the ValueProvider architecture.

> Per-step, per-execution nonce
> -----------------------------
>
> Key: BEAM-758
> URL: https://issues.apache.org/jira/browse/BEAM-758
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-core
> Affects Versions: Not applicable
> Reporter: Daniel Halperin
>
> In the forthcoming runner API, a user will be able to save a pipeline to JSON
> and then run it repeatedly.
> Many pieces of code (e.g., BigQueryIO.Read or Write) rely on a single random
> value (nonce). These values are typically generated at apply time, so that
> they are deterministic (don't change across retries of DoFns) and global (are
> the same across all workers).
> However, once the runner API lands, the existing code would result in the same
> nonce being reused across jobs. Other possible solutions:
> * Generate nonce in {{Create(1) | ParDo}} then use this as a side input.
> Should work, as long as side inputs are actually checkpointed. But does not
> work for {{BoundedSource}}.
> * If a nonce is only needed for the lifetime of one bundle, can be generated
> in {{startBundle}} and used in {{finishBundle}} [or {{tearDown}}].
> * Add some context somewhere that lets user code access unique step name, and
> somehow generate a nonce consistently e.g. by hashing. Will usually work, but
> this is similarly not available to sources.
> Another Q: I'm not sure we have a good way to generate nonces in unbounded
> pipelines -- we probably need one. This would enable us to, e.g., use
> {{BigQueryIO.Write}} in an unbounded pipeline [if we had, e.g., exactly-once
> triggering per window]. Or generalizing to multiple firings...
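
Editorial illustration (not part of the original message): one way to read the quoted "hash the step name" bullet together with the getPipelineId() idea is to derive each step's nonce from the pair (pipeline id, step name). The sketch below assumes both strings are available to the step, which the issue notes is not currently true for sources. `stepNonce` and the "/" separator are hypothetical choices for this example; `UUID.nameUUIDFromBytes` is the standard JDK name-based (MD5) UUID factory.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper for illustration; not an existing Beam API.
final class StepNonceSketch {

    /**
     * Derives a nonce that is deterministic for a given (pipelineId, stepName) pair:
     * identical across retries and workers, different across executions as long as
     * the pipeline id differs per execution.
     *
     * Assumes pipelineId contains no "/" (true for a UUID string), so that distinct
     * pairs never concatenate to the same key.
     */
    static String stepNonce(String pipelineId, String stepName) {
        String key = pipelineId + "/" + stepName;
        return UUID.nameUUIDFromBytes(key.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        String runA = UUID.randomUUID().toString();
        String runB = UUID.randomUUID().toString();

        // Same run, same step: the nonce is stable across retries.
        System.out.println(stepNonce(runA, "WriteToBigQuery"));
        System.out.println(stepNonce(runA, "WriteToBigQuery"));

        // Different run, same step: the nonce changes.
        System.out.println(stepNonce(runB, "WriteToBigQuery"));
    }
}
```

Because the nonce is a pure function of its inputs, no state has to be checkpointed; the per-execution uniqueness comes entirely from the pipeline id.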
[jira] [Commented] (BEAM-1071) Support pre-existing tables with streaming BigQueryIO
[ https://issues.apache.org/jira/browse/BEAM-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15718777#comment-15718777 ]

Sam McVeety commented on BEAM-1071:
-----------------------------------

[~dhalp...@google.com], the option I like the most is allowing Unbounded sources to have CREATE_NEVER as a create disposition. What do you think about that?

> Support pre-existing tables with streaming BigQueryIO
> -----------------------------------------------------
>
> Key: BEAM-1071
> URL: https://issues.apache.org/jira/browse/BEAM-1071
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-gcp
> Reporter: Sam McVeety
> Priority: Minor
>
> Specifically, with a tableRef function, CREATE_NEVER should be allowed for
> pre-existing tables.
[jira] [Created] (BEAM-1071) Support pre-existing tables with streaming BigQueryIO
Sam McVeety created BEAM-1071:
------------------------------

Summary: Support pre-existing tables with streaming BigQueryIO
Key: BEAM-1071
URL: https://issues.apache.org/jira/browse/BEAM-1071
Project: Beam
Issue Type: Improvement
Reporter: Sam McVeety
Priority: Minor

Specifically, with a tableRef function, CREATE_NEVER should be allowed for pre-existing tables.
[jira] [Created] (BEAM-551) Support Dynamic PipelineOptions
Sam McVeety created BEAM-551:
-----------------------------

Summary: Support Dynamic PipelineOptions
Key: BEAM-551
URL: https://issues.apache.org/jira/browse/BEAM-551
Project: Beam
Issue Type: New Feature
Components: beam-model
Reporter: Sam McVeety
Assignee: Frances Perry
Priority: Minor

During the graph construction phase, the given SDK generates an initial execution graph for the program. At execution time, this graph is executed, either locally or by a service. Currently, Beam only supports parameterization at graph construction time. Both Flink and Spark supply functionality that allows a pre-compiled job to be run without SDK interaction with updated runtime parameters.

In its current incarnation, Dataflow can read values of PipelineOptions at job submission time, but this requires the presence of an SDK to properly encode these values into the job. We would like to build a common layer into the Beam model so that these dynamic options can be properly provided to jobs.

Please see
https://docs.google.com/document/d/1I-iIgWDYasb7ZmXbGBHdok_IK1r1YAJ90JG5Fz0_28o/edit
for the high-level model, and
https://docs.google.com/document/d/17I7HeNQmiIfOJi0aI70tgGMMkOSgGi8ZUH-MOnFatZ8/edit
for the specific API proposal.
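
Editorial illustration (not part of the original message): the distinction between construction-time and execution-time parameterization can be shown with a toy example. This is a rough sketch under assumptions, not the API from the linked proposal: `RuntimeParam`, `buildGraph`, the option name `inputFile`, and the file names are all hypothetical. The "graph" is built once and refers to the parameter by name; the value is only looked up when the graph is executed.

```java
import java.util.Collections;
import java.util.Map;
import java.util.function.Function;

// Toy model for illustration only; none of these names come from Beam.
final class DynamicOptionsSketch {

    /** A parameter referenced by name at construction time and resolved at execution time. */
    static final class RuntimeParam {
        private final String name;
        private final String defaultValue;

        RuntimeParam(String name, String defaultValue) {
            this.name = name;
            this.defaultValue = defaultValue;
        }

        String resolve(Map<String, String> runtimeOptions) {
            return runtimeOptions.getOrDefault(name, defaultValue);
        }
    }

    /** Builds the "graph" once; no runtime option values are read here. */
    static Function<Map<String, String>, String> buildGraph() {
        RuntimeParam input = new RuntimeParam("inputFile", "default.txt");
        return runtimeOptions -> "reading " + input.resolve(runtimeOptions);
    }

    public static void main(String[] args) {
        Function<Map<String, String>, String> graph = buildGraph();

        // The same pre-built graph is executed twice with different runtime options.
        System.out.println(graph.apply(Collections.emptyMap()));
        System.out.println(graph.apply(Collections.singletonMap("inputFile", "monday.txt")));
    }
}
```

The point of the sketch is only that the second execution needs no rebuild of the graph, which is the property the issue asks the Beam model to provide.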
[jira] [Created] (BEAM-93) Support Compute Engine Subnetworks
Sam McVeety created BEAM-93:
----------------------------

Summary: Support Compute Engine Subnetworks
Key: BEAM-93
URL: https://issues.apache.org/jira/browse/BEAM-93
Project: Beam
Issue Type: Improvement
Components: runner-dataflow
Reporter: Sam McVeety
Assignee: Davor Bonaci
Priority: Minor

The Dataflow runner for Beam should support Compute Engine subnetworks for the workers:
https://cloud.google.com/compute/docs/subnetworks