Many thanks for your comments. Really appreciate it!!
I'm not sure I understood the GenerateSequence guarantees explanation. By
"it will generate at most a given number..." you mean that it won't
generate more than the given number per period, right? If that was the case
maybe we need to review it closely as that's exactly what I've seen (I
configured 1 element every 5 minutes and it was consistently emitting every
5 minutes plus a few millis, except for a time that it emitted after 4
minutes and some seconds.
About the Square Enix link, many thanks Reza, that's something we are
following for our implementation.
Thanks everyone again!
On Fri, May 25, 2018 at 1:34 AM Reza Rokni wrote:
> Hi,
>
> Not sure if this is useful for your use case but as you are using BQ with
> a changing schema the following may also be a interesting read ...
>
>
> https://cloud.google.com/blog/big-data/2018/02/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix
>
> Cheers
>
> Reza
>
>
>
> On Fri, May 25, 2018, 5:50 AM Raghu Angadi wrote:
>
>>
>> On Thu, May 24, 2018 at 1:11 PM Carlos Alonso
>> wrote:
>>
>>> Hi everyone!!
>>>
>>> I'm building a pipeline to store streaming data into BQ and I'm using
>>> the pattern: Slowly changing lookup cache described here:
>>> https://cloud.google.com/blog/big-data/2017/06/guide-to-common-cloud-dataflow-use-case-patterns-part-1
>>> to
>>> hold and refresh the table schemas (as they may change from time to time).
>>>
>>> Now I'd like to understand how that is scheduled on a distributed
>>> system. Who is running that code? One random node? One node but always the
>>> same? All nodes?
>>>
>>
>> GenerateSequence() is uses an unbounded source. Like any unbounded
>> source, it can has a set of 'splits' ('desiredNumSplits' [1] is set by
>> runtime). Each of the splits run in parallel.. a typical runtime
>> distributes these across the workers. Typically they stay on a worker
>> unless there is a reason to redistribute (autoscaling, workers unresponsive
>> etc). W.r.t. user application there are no guarantees about affinity.
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CountingSource.java#L337
>>
>>
>>>
>>> Also, what are the GenerateSequence guarantees in terms of precision? I
>>> have it configured to generate 1 element every 5 minutes and most of the
>>> time it works exact, but sometimes it doesn't... Is that expected?
>>>
>>
>> Each of the splits mentioned above essentially runs 'advance() [2]' in a
>> loop. It check current walltime to decide if it needs to emit next element.
>> If the value you see off by a few seconds, it would imply 'advance()' was
>> not called during that time by the framework. Runtime frameworks usually
>> don't provide any hard or soft deadlines for scheduling work.
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CountingSource.java#L337
>>
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CountingSource.java#L426
>>
>>
>>> Regards
>>>
>>