Re: Understanding GenerateSequence and SideInputs

2018-05-25 Thread Carlos Alonso
Many thanks for your comments. Really appreciate it!!

I'm not sure I understood the GenerateSequence guarantees explanation. By
"it will generate at most a given number..." you mean that it won't
generate more than the given number per period, right? If that was the case
maybe we need to review it closely as that's exactly what I've seen (I
configured 1 element every 5 minutes and it was consistently emitting every
5 minutes plus a few millis, except for a time that it emitted after 4
minutes and some seconds.

About the Square Enix link, many thanks Reza, that's something we are
following for our implementation.

Thanks everyone again!

On Fri, May 25, 2018 at 1:34 AM Reza Rokni  wrote:

> Hi,
>
> Not sure if this is useful for your use case but as you are using BQ with
> a changing schema the following may also be a interesting read ...
>
>
> https://cloud.google.com/blog/big-data/2018/02/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix
>
> Cheers
>
> Reza
>
>
>
> On Fri, May 25, 2018, 5:50 AM Raghu Angadi  wrote:
>
>>
>> On Thu, May 24, 2018 at 1:11 PM Carlos Alonso 
>> wrote:
>>
>>> Hi everyone!!
>>>
>>> I'm building a pipeline to store streaming data into BQ and I'm using
>>> the pattern: Slowly changing lookup cache described here:
>>> https://cloud.google.com/blog/big-data/2017/06/guide-to-common-cloud-dataflow-use-case-patterns-part-1
>>>  to
>>> hold and refresh the table schemas (as they may change from time to time).
>>>
>>> Now I'd like to understand how that is scheduled on a distributed
>>> system. Who is running that code? One random node? One node but always the
>>> same? All nodes?
>>>
>>
>> GenerateSequence() is uses an unbounded source. Like any unbounded
>> source, it can has a set of 'splits' ('desiredNumSplits' [1] is set by
>> runtime). Each of the splits run in parallel.. a typical runtime
>> distributes these across the workers. Typically they stay on a worker
>> unless there is a reason to redistribute (autoscaling, workers unresponsive
>> etc). W.r.t. user application there are no guarantees about affinity.
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CountingSource.java#L337
>>
>>
>>>
>>> Also, what are the GenerateSequence guarantees in terms of precision? I
>>> have it configured to generate 1 element every 5 minutes and most of the
>>> time it works exact, but sometimes it doesn't... Is that expected?
>>>
>>
>> Each of the splits mentioned above essentially runs 'advance() [2]' in a
>> loop. It check current walltime to decide if it needs to emit next element.
>> If the value you see off by a few seconds, it would imply 'advance()' was
>> not called during that time by the framework. Runtime frameworks usually
>> don't provide any hard or soft deadlines for scheduling work.
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CountingSource.java#L337
>>
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CountingSource.java#L426
>>
>>
>>> Regards
>>>
>>


Re: Generate Beam Pipeline from JSON

2018-05-25 Thread Kenneth Knowles
I would mention two other options to keep in mind:

 - In Python, the driver program can be tiny (much smaller than JSON
probably)
 - Beam SQL (tied to the Java SDK) is experimenting with a SQL shell/CLI.
It is not quite done, but you could follow
https://issues.apache.org/jira/browse/BEAM-3773.

Kenn

On Fri, May 25, 2018 at 8:51 AM Jean-Baptiste Onofré 
wrote:

> Hi,
>
> That's on the Declarative DSLs scope: XML and JSON.
>
> It's not yet ready but it's a work in progress.
>
> You can follow:
>
> https://issues.apache.org/jira/browse/BEAM-14
>
> Regards
> JB
>
> On 25/05/2018 17:15, S. Sahayaraj wrote:
> > Hello,
> >
> > I would like to create the Beam pipeline (in Java) from
> > the definitions given in JSON file. Is there any Beam Compiler
> > available? Is any specification that guides me on how to do it?  I don’t
> > want to write the driver program for every pipeline. Please suggest.
> >
> >
> >
> > Cheers,
> >
> > S. Sahayaraj
> >
>
> --
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Generate Beam Pipeline from JSON

2018-05-25 Thread Jean-Baptiste Onofré
Hi,

That's on the Declarative DSLs scope: XML and JSON.

It's not yet ready but it's a work in progress.

You can follow:

https://issues.apache.org/jira/browse/BEAM-14

Regards
JB

On 25/05/2018 17:15, S. Sahayaraj wrote:
> Hello,
> 
>     I would like to create the Beam pipeline (in Java) from
> the definitions given in JSON file. Is there any Beam Compiler
> available? Is any specification that guides me on how to do it?  I don’t
> want to write the driver program for every pipeline. Please suggest.
> 
>  
> 
> Cheers,
> 
> S. Sahayaraj
> 

-- 
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Generate Beam Pipeline from JSON

2018-05-25 Thread S. Sahayaraj
Hello,
I would like to create the Beam pipeline (in Java) from the 
definitions given in JSON file. Is there any Beam Compiler available? Is any 
specification that guides me on how to do it?  I don't want to write the driver 
program for every pipeline. Please suggest.

Cheers,
S. Sahayaraj