[jira] [Commented] (BEAM-758) Per-step, per-execution nonce

2016-12-09 Thread Sam McVeety (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15736212#comment-15736212
 ] 

Sam McVeety commented on BEAM-758:
--

Here's a sketch of how this could work, with default behavior that specific 
runners can override:

PipelineOptions {
...
@Default.InstanceFactory(/* Class that returns StaticValueProvider.of(UUID), 
for UUID generated at construction time */)
ValueProvider getPipelineId();
...
}

and a specific Runner can override this in fromOptions(), calling 
setPipelineId(RuntimeValueProvider rvp), with a Runner-specific RVP.

> Per-step, per-execution nonce
> -
>
> Key: BEAM-758
> URL: https://issues.apache.org/jira/browse/BEAM-758
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Affects Versions: Not applicable
>Reporter: Daniel Halperin
>Assignee: Sam McVeety
>
> In the forthcoming runner API, a user will be able to save a pipeline to JSON 
> and then run it repeatedly.
> Many pieces of code (e.g., BigQueryIO.Read or Write) rely on a single random 
> value (nonce). These values are typically generated at apply time, so that 
> they are deterministic (don't change across retries of DoFns) and global (are 
> the same across all workers).
> However, once the runner API lands the existing code would result in the same 
> nonce being reused across jobs. Other possible solutions:
> * Generate nonce in {{Create(1) | ParDo}} then use this as a side input. 
> Should work, as along as side inputs are actually checkpointed. But does not 
> work for {{BoundedSource}}.
> * If a nonce is only needed for the lifetime of one bundle, can be generated 
> in {{startBundle}} and used in {{finishBundle}} [or {{tearDown}}].
> * Add some context somewhere that lets user code access unique step name, and 
> somehow generate a nonce consistently e.g. by hashing. Will usually work, but 
> this is similarly not available to sources.
> Another Q: I'm not sure we have a good way to generate nonces in unbounded 
> pipelines -- we probably need one. This would enable us to, e.g., use 
> {{BigQueryIO.Write}} in an unbounded pipeline [if we had, e.g., exactly-once 
> triggering per window]. Or generalizing to multiple firings...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-758) Per-step, per-execution nonce

2016-12-09 Thread Sam McVeety (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15735995#comment-15735995
 ] 

Sam McVeety commented on BEAM-758:
--

The runner-contingent option seems viable.  Something like: 
PipelineOptions.getPipelineId(), which would default to a nonce generated at 
construction time, but which a given runner could override seems doable with 
the ValueProvider architecture. 

> Per-step, per-execution nonce
> -
>
> Key: BEAM-758
> URL: https://issues.apache.org/jira/browse/BEAM-758
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Affects Versions: Not applicable
>Reporter: Daniel Halperin
>
> In the forthcoming runner API, a user will be able to save a pipeline to JSON 
> and then run it repeatedly.
> Many pieces of code (e.g., BigQueryIO.Read or Write) rely on a single random 
> value (nonce). These values are typically generated at apply time, so that 
> they are deterministic (don't change across retries of DoFns) and global (are 
> the same across all workers).
> However, once the runner API lands the existing code would result in the same 
> nonce being reused across jobs. Other possible solutions:
> * Generate nonce in {{Create(1) | ParDo}} then use this as a side input. 
> Should work, as along as side inputs are actually checkpointed. But does not 
> work for {{BoundedSource}}.
> * If a nonce is only needed for the lifetime of one bundle, can be generated 
> in {{startBundle}} and used in {{finishBundle}} [or {{tearDown}}].
> * Add some context somewhere that lets user code access unique step name, and 
> somehow generate a nonce consistently e.g. by hashing. Will usually work, but 
> this is similarly not available to sources.
> Another Q: I'm not sure we have a good way to generate nonces in unbounded 
> pipelines -- we probably need one. This would enable us to, e.g., use 
> {{BigQueryIO.Write}} in an unbounded pipeline [if we had, e.g., exactly-once 
> triggering per window]. Or generalizing to multiple firings...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-1071) Support pre-existing tables with streaming BigQueryIO

2016-12-03 Thread Sam McVeety (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15718777#comment-15718777
 ] 

Sam McVeety commented on BEAM-1071:
---

[~dhalp...@google.com], the option I like the most is allowing Unbounded 
sources to have CREATE_NEVER as a create disposition.  What do you think about 
that?

> Support pre-existing tables with streaming BigQueryIO
> -
>
> Key: BEAM-1071
> URL: https://issues.apache.org/jira/browse/BEAM-1071
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-gcp
>Reporter: Sam McVeety
>Priority: Minor
>
> Specifically, with a tableRef function, CREATE_NEVER should be allowed for 
> pre-existing tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-1071) Support pre-existing tables with streaming BigQueryIO

2016-12-01 Thread Sam McVeety (JIRA)
Sam McVeety created BEAM-1071:
-

 Summary: Support pre-existing tables with streaming BigQueryIO
 Key: BEAM-1071
 URL: https://issues.apache.org/jira/browse/BEAM-1071
 Project: Beam
  Issue Type: Improvement
Reporter: Sam McVeety
Priority: Minor


Specifically, with a tableRef function, CREATE_NEVER should be allowed for 
pre-existing tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-551) Support Dynamic PipelineOptions

2016-08-15 Thread Sam McVeety (JIRA)
Sam McVeety created BEAM-551:


 Summary: Support Dynamic PipelineOptions
 Key: BEAM-551
 URL: https://issues.apache.org/jira/browse/BEAM-551
 Project: Beam
  Issue Type: New Feature
  Components: beam-model
Reporter: Sam McVeety
Assignee: Frances Perry
Priority: Minor


During the graph construction phase, the given SDK generates an initial
execution graph for the program.  At execution time, this graph is
executed, either locally or by a service.  Currently, Beam only supports
parameterization at graph construction time.  Both Flink and Spark supply
functionality that allows a pre-compiled job to be run without SDK
interaction with updated runtime parameters.

In its current incarnation, Dataflow can read values of PipelineOptions at
job submission time, but this requires the presence of an SDK to properly
encode these values into the job.  We would like to build a common layer
into the Beam model so that these dynamic options can be properly provided
to jobs.

Please see
https://docs.google.com/document/d/1I-iIgWDYasb7ZmXbGBHdok_IK1r1YAJ90JG5Fz0_28o/edit
for the high-level model, and
https://docs.google.com/document/d/17I7HeNQmiIfOJi0aI70tgGMMkOSgGi8ZUH-MOnFatZ8/edit
for
the specific API proposal.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-93) Support Compute Engine Subnetworks

2016-03-03 Thread Sam McVeety (JIRA)
Sam McVeety created BEAM-93:
---

 Summary: Support Compute Engine Subnetworks
 Key: BEAM-93
 URL: https://issues.apache.org/jira/browse/BEAM-93
 Project: Beam
  Issue Type: Improvement
  Components: runner-dataflow
Reporter: Sam McVeety
Assignee: Davor Bonaci
Priority: Minor


The Dataflow runner for Beam should support Compute Engine subnetworks for the 
workers: https://cloud.google.com/compute/docs/subnetworks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)