Re: Dealing with "large" checkpoint state of a Beam pipeline in Flink

2019-02-12 Thread Kaymak, Tobias
Thank you! I am using similar values, but my problem was that my FILE_LOADS
jobs were sometimes failing, and this led to the behavior I saw. The pipeline
didn't fail though (which I assumed it would); it simply retried the load
forever. (Retries were set to the default, -1.)

On Tue, Feb 12, 2019 at 6:11 PM Juan Carlos Garcia 
wrote:

> I forgot to mention that we use HDFS as storage for checkpoints /
> savepoints.
>
> Juan Carlos Garcia  wrote on Tue, Feb 12, 2019,
> 18:03:
>
>> Hi Tobias,
>>
>> I think this can happen when there is a lot of backpressure on the
>> pipeline.
>>
>> I don't know if it's normal, but I have a pipeline reading from KafkaIO and
>> pushing to BigQuery in streaming mode. I have seen checkpoints of almost
>> 1 GB, and whenever I take a savepoint to update the pipeline it can
>> grow to 8 GB of data.
>>
>> I am on Flink 1.5.x, on premises, also using RocksDB and incremental
>> checkpoints.
>>
>> So far my only solution to avoid errors while checkpointing or
>> savepointing is to make sure the checkpoint timeout is high enough, e.g.
>> 20 or 30 minutes.
>>
>>
>> Kaymak, Tobias  wrote on Tue, Feb 12, 2019,
>> 17:33:
>>
>>> Hi,
>>>
>>> my Beam 2.10-SNAPSHOT pipeline has a KafkaIO as input and a BigQueryIO
>>> configured with FILE_LOADS as output. What bothers me is that even with
>>> the following in my Flink 1.6 configuration:
>>>
>>> state.backend: rocksdb
>>> state.backend.incremental: true
>>>
>>> I see states that are as big as 230 MiB and checkpoint timeouts, or
>>> checkpoints that take longer than 10 minutes to complete (I just saw one
>>> that took longer than 30 minutes).
>>>
>>> Am I missing something? Is there some room for improvement? Should I use
>>> a different storage backend for the checkpoints? (Currently they are stored
>>> on GCS).
>>>
>>> Best,
>>> Tobi
>>>
>>


Re: Some questions about ensuring correctness with windowing and triggering

2019-02-12 Thread Steve Niemitz
Wow, that's super unexpected and dangerous, thanks for clarifying!  Time to
go re-think how we do some of our writes w/ early firings then.

Are there any workarounds to make things happen in order in Dataflow?  E.g.
if the sink gets fused to the output of the GBK operation, will it always
happen effectively in order (per key), even though it's not a guarantee?  I
also guess I could keep track of the last pane index my sink has seen and
ignore earlier ones (using state to keep track)?
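The last-seen-pane-index workaround can be sketched outside of Beam as plain Java (class and method names here are hypothetical; in a real pipeline the per-key map would live in keyed state, e.g. a `ValueState<Long>`):

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: drop panes that arrive out of order, per key. */
final class PaneIndexFilter {
    private final Map<String, Long> lastSeen = new HashMap<>();

    /** Returns true if this pane is newer than any pane already written for the key. */
    boolean shouldWrite(String key, long paneIndex) {
        Long last = lastSeen.get(key);
        if (last != null && paneIndex <= last) {
            return false; // a later (or the same) pane was already written; ignore this one
        }
        lastSeen.put(key, paneIndex);
        return true;
    }
}
```

Early panes that arrive after a later pane has already been written are simply dropped, which keeps the sink monotonic per key even if pane delivery is reordered.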


On Tue, Feb 12, 2019 at 1:28 PM Robert Bradshaw  wrote:

> Correct: even within the same key there's no promise that the event-time
> ordering of panes maps to the real-time order in which they are delivered,
> because the downstream operations *may* happen on a different machine.
> Multiply-triggered windows add an element of non-determinism to the process.
>
> You're also correct that triggering with multiple panes requires lots of
> care, especially when it comes to operations with side effects (like
> sinks). The safest approach is to write only the final pane to the sink, and handle
> early triggering in a different way.
> https://s.apache.org/beam-sink-triggers is a proposal to make this easier
> to reason about.
>
>
> On Tue, Feb 12, 2019 at 7:19 PM Steve Niemitz  wrote:
>
>> Also to clarify here (I re-read this and realized it could be slightly
>> unclear): my question is only about in-order delivery of panes, i.e. will
>> pane P always be delivered before pane P+1?
>>
>> I realize the use of "in-order" before could be confusing; I don't care
>> about the ordering of the elements per se, just the ordering of the pane
>> delivery.
>>
>> I want to make sure that, given a GBK that produces 3 panes (P0, P1, P2)
>> for a key, a downstream PCollection could never see P0, P2, P1. Or, at
>> least, that the final firing is always guaranteed to be delivered after
>> all early firings (e.g. we could have P0, P2, P1, but then always PLast).
>>
>> On Tue, Feb 12, 2019 at 11:48 AM Steve Niemitz 
>> wrote:
>>
>>> Are you also saying that even in the first example (Source ->
>>> CombineByKey (Sum) -> Sink) there's no guarantee that events would be
>>> delivered in order from the Combine -> Sink transforms?  This seems like a
>>> pretty big gotcha for correctness if you ever use accumulating
>>> triggering.
>>>
>>> I'd also like to point out I'm not talking about a global ordering
>>> across the entire PCollection, I'm talking about within the same key after
>>> a GBK transform.
>>>
>>> On Tue, Feb 12, 2019 at 11:35 AM Robert Bradshaw 
>>> wrote:
>>>
 Due to the nature of distributed processing, order is not preserved.
 You can, however, inspect the PaneInfo to determine if an element was
 early, on-time, or late and act accordingly.

 On Tue, Feb 12, 2019 at 5:15 PM Juan Carlos Garcia 
 wrote:

> In my experience ordering is not guaranteed; you may need to apply a
> transformation that sorts the elements and then dispatches them in order.
>
> Or use the Sorter extension for this:
>
> https://github.com/apache/beam/tree/master/sdks/java/extensions/sorter
>
> Steve Niemitz  wrote on Tue, Feb 12, 2019,
> 16:31:
>
>> Hi everyone, I have some questions I want to ask about how windowing,
>> triggering, and panes work together, and how to ensure correctness
>> throughout a pipeline.
>>
>> Let's assume I have a very simple streaming pipeline that looks like:
>> Source -> CombineByKey (Sum) -> Sink
>>
>> Given fixed windows of 1 hour, early firings every minute, and
>> accumulating panes, this is pretty straightforward.  However, this can get
>> more complicated if we add steps after the CombineByKey, for instance
>> (using the same windowing strategy):
>>
>> Say I want to buffer the results of the CombineByKey into batches of
>> N elements.  I can do this with the built-in GroupIntoBatches [1]
>> transform, now my pipeline looks like:
>>
>> Source -> CombineByKey (Sum) -> GroupIntoBatches -> Sink
>>
>> *This leads to my main question:*
>> Is ordering preserved somehow here?  ie: is it possible that the
>> result from early firing F+1 now comes BEFORE the firing F (because it 
>> was
>> re-ordered in the GroupIntoBatches).  This would mean that the sink then
>> gets F+1 before F, which means my resulting store has incorrect data
>> (possibly forever if F+1 was the final firing).
>>
>> If ordering is not preserved, it seems as if I'd need to introduce my
>> own ordering back in after GroupIntoBatches.  GIB is an example here, 
>> but I
>> imagine this could happen with any GBK type operation.
>>
>> Am I thinking about this the correct way?  Thanks!
>>
>> [1]
>> https://beam.apache.org/releases/javadoc/2.10.0/org/apache/beam/sdk/transforms/GroupIntoBatches.html
>>
>


Re: Some questions about ensuring correctness with windowing and triggering

2019-02-12 Thread Robert Bradshaw
Correct: even within the same key there's no promise that the event-time
ordering of panes maps to the real-time order in which they are delivered,
because the downstream operations *may* happen on a different machine.
Multiply-triggered windows add an element of non-determinism to the process.

You're also correct that triggering with multiple panes requires lots of
care, especially when it comes to operations with side effects (like
sinks). The safest approach is to write only the final pane to the sink, and handle
early triggering in a different way. https://s.apache.org/beam-sink-triggers
is a proposal to make this easier to reason about.


On Tue, Feb 12, 2019 at 7:19 PM Steve Niemitz  wrote:

> Also to clarify here (I re-read this and realized it could be slightly
> unclear): my question is only about in-order delivery of panes, i.e. will
> pane P always be delivered before pane P+1?
>
> I realize the use of "in-order" before could be confusing; I don't care
> about the ordering of the elements per se, just the ordering of the pane
> delivery.
>
> I want to make sure that, given a GBK that produces 3 panes (P0, P1, P2)
> for a key, a downstream PCollection could never see P0, P2, P1. Or, at
> least, that the final firing is always guaranteed to be delivered after
> all early firings (e.g. we could have P0, P2, P1, but then always PLast).
>
> On Tue, Feb 12, 2019 at 11:48 AM Steve Niemitz 
> wrote:
>
>> Are you also saying that even in the first example (Source ->
>> CombineByKey (Sum) -> Sink) there's no guarantee that events would be
>> delivered in order from the Combine -> Sink transforms?  This seems like a
>> pretty big gotcha for correctness if you ever use accumulating
>> triggering.
>>
>> I'd also like to point out I'm not talking about a global ordering across
>> the entire PCollection, I'm talking about within the same key after a GBK
>> transform.
>>
>> On Tue, Feb 12, 2019 at 11:35 AM Robert Bradshaw 
>> wrote:
>>
>>> Due to the nature of distributed processing, order is not preserved. You
>>> can, however, inspect the PaneInfo to determine if an element was early,
>>> on-time, or late and act accordingly.
>>>
>>> On Tue, Feb 12, 2019 at 5:15 PM Juan Carlos Garcia 
>>> wrote:
>>>
 In my experience ordering is not guaranteed; you may need to apply a
 transformation that sorts the elements and then dispatches them in order.

 Or use the Sorter extension for this:

 https://github.com/apache/beam/tree/master/sdks/java/extensions/sorter

 Steve Niemitz  wrote on Tue, Feb 12, 2019,
 16:31:

> Hi everyone, I have some questions I want to ask about how windowing,
> triggering, and panes work together, and how to ensure correctness
> throughout a pipeline.
>
> Let's assume I have a very simple streaming pipeline that looks like:
> Source -> CombineByKey (Sum) -> Sink
>
> Given fixed windows of 1 hour, early firings every minute, and
> accumulating panes, this is pretty straightforward.  However, this can get
> more complicated if we add steps after the CombineByKey, for instance
> (using the same windowing strategy):
>
> Say I want to buffer the results of the CombineByKey into batches of N
> elements.  I can do this with the built-in GroupIntoBatches [1] transform,
> now my pipeline looks like:
>
> Source -> CombineByKey (Sum) -> GroupIntoBatches -> Sink
>
> *This leads to my main question:*
> Is ordering preserved somehow here?  ie: is it possible that the
> result from early firing F+1 now comes BEFORE the firing F (because it was
> re-ordered in the GroupIntoBatches).  This would mean that the sink then
> gets F+1 before F, which means my resulting store has incorrect data
> (possibly forever if F+1 was the final firing).
>
> If ordering is not preserved, it seems as if I'd need to introduce my
> own ordering back in after GroupIntoBatches.  GIB is an example here, but 
> I
> imagine this could happen with any GBK type operation.
>
> Am I thinking about this the correct way?  Thanks!
>
> [1]
> https://beam.apache.org/releases/javadoc/2.10.0/org/apache/beam/sdk/transforms/GroupIntoBatches.html
>



Re: Some questions about ensuring correctness with windowing and triggering

2019-02-12 Thread Steve Niemitz
Also to clarify here (I re-read this and realized it could be slightly
unclear): my question is only about in-order delivery of panes, i.e. will
pane P always be delivered before pane P+1?

I realize the use of "in-order" before could be confusing; I don't care
about the ordering of the elements per se, just the ordering of the pane
delivery.

I want to make sure that, given a GBK that produces 3 panes (P0, P1, P2) for
a key, a downstream PCollection could never see P0, P2, P1. Or, at least,
that the final firing is always guaranteed to be delivered after all
early firings (e.g. we could have P0, P2, P1, but then always PLast).

On Tue, Feb 12, 2019 at 11:48 AM Steve Niemitz  wrote:

> Are you also saying that even in the first example (Source ->
> CombineByKey (Sum) -> Sink) there's no guarantee that events would be
> delivered in order from the Combine -> Sink transforms?  This seems like a
> pretty big gotcha for correctness if you ever use accumulating
> triggering.
>
> I'd also like to point out I'm not talking about a global ordering across
> the entire PCollection, I'm talking about within the same key after a GBK
> transform.
>
> On Tue, Feb 12, 2019 at 11:35 AM Robert Bradshaw 
> wrote:
>
>> Due to the nature of distributed processing, order is not preserved. You
>> can, however, inspect the PaneInfo to determine if an element was early,
>> on-time, or late and act accordingly.
>>
>> On Tue, Feb 12, 2019 at 5:15 PM Juan Carlos Garcia 
>> wrote:
>>
>>> In my experience ordering is not guaranteed; you may need to apply a
>>> transformation that sorts the elements and then dispatches them in order.
>>>
>>> Or use the Sorter extension for this:
>>>
>>> https://github.com/apache/beam/tree/master/sdks/java/extensions/sorter
>>>
>>> Steve Niemitz  wrote on Tue, Feb 12, 2019,
>>> 16:31:
>>>
 Hi everyone, I have some questions I want to ask about how windowing,
 triggering, and panes work together, and how to ensure correctness
 throughout a pipeline.

 Let's assume I have a very simple streaming pipeline that looks like:
 Source -> CombineByKey (Sum) -> Sink

 Given fixed windows of 1 hour, early firings every minute, and
 accumulating panes, this is pretty straightforward.  However, this can get
 more complicated if we add steps after the CombineByKey, for instance
 (using the same windowing strategy):

 Say I want to buffer the results of the CombineByKey into batches of N
 elements.  I can do this with the built-in GroupIntoBatches [1] transform,
 now my pipeline looks like:

 Source -> CombineByKey (Sum) -> GroupIntoBatches -> Sink

 *This leads to my main question:*
 Is ordering preserved somehow here?  ie: is it possible that the result
 from early firing F+1 now comes BEFORE the firing F (because it was
 re-ordered in the GroupIntoBatches).  This would mean that the sink then
 gets F+1 before F, which means my resulting store has incorrect data
 (possibly forever if F+1 was the final firing).

 If ordering is not preserved, it seems as if I'd need to introduce my
 own ordering back in after GroupIntoBatches.  GIB is an example here, but I
 imagine this could happen with any GBK type operation.

 Am I thinking about this the correct way?  Thanks!

 [1]
 https://beam.apache.org/releases/javadoc/2.10.0/org/apache/beam/sdk/transforms/GroupIntoBatches.html

>>>


Re: Dealing with "large" checkpoint state of a Beam pipeline in Flink

2019-02-12 Thread Juan Carlos Garcia
I forgot to mention that we use HDFS as storage for checkpoints /
savepoints.

Juan Carlos Garcia  wrote on Tue, Feb 12, 2019,
18:03:

> Hi Tobias,
>
> I think this can happen when there is a lot of backpressure on the
> pipeline.
>
> I don't know if it's normal, but I have a pipeline reading from KafkaIO and
> pushing to BigQuery in streaming mode. I have seen checkpoints of almost
> 1 GB, and whenever I take a savepoint to update the pipeline it can
> grow to 8 GB of data.
>
> I am on Flink 1.5.x, on premises, also using RocksDB and incremental
> checkpoints.
>
> So far my only solution to avoid errors while checkpointing or savepointing
> is to make sure the checkpoint timeout is high enough, e.g. 20 or 30 minutes.
>
>
> Kaymak, Tobias  wrote on Tue, Feb 12, 2019,
> 17:33:
>
>> Hi,
>>
>> my Beam 2.10-SNAPSHOT pipeline has a KafkaIO as input and a BigQueryIO
>> configured with FILE_LOADS as output. What bothers me is that even with
>> the following in my Flink 1.6 configuration:
>>
>> state.backend: rocksdb
>> state.backend.incremental: true
>>
>> I see states that are as big as 230 MiB and checkpoint timeouts, or
>> checkpoints that take longer than 10 minutes to complete (I just saw one
>> that took longer than 30 minutes).
>>
>> Am I missing something? Is there some room for improvement? Should I use
>> a different storage backend for the checkpoints? (Currently they are stored
>> on GCS).
>>
>> Best,
>> Tobi
>>
>


Re: Dealing with "large" checkpoint state of a Beam pipeline in Flink

2019-02-12 Thread Juan Carlos Garcia
Hi Tobias,

I think this can happen when there is a lot of backpressure on the
pipeline.

I don't know if it's normal, but I have a pipeline reading from KafkaIO and
pushing to BigQuery in streaming mode. I have seen checkpoints of almost
1 GB, and whenever I take a savepoint to update the pipeline it can
grow to 8 GB of data.

I am on Flink 1.5.x, on premises, also using RocksDB and incremental
checkpoints.

So far my only solution to avoid errors while checkpointing or savepointing
is to make sure the checkpoint timeout is high enough, e.g. 20 or 30 minutes.
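That timeout can be set when launching the pipeline; as a sketch, assuming the Flink runner's `FlinkPipelineOptions` flag names (verify them against your Beam version, and note the jar name is only a placeholder):

```shell
# Checkpoint every 5 minutes; allow each checkpoint up to 30 minutes
# before Flink considers it failed (both values in milliseconds).
java -jar my-beam-pipeline.jar \
  --runner=FlinkRunner \
  --checkpointingInterval=300000 \
  --checkpointTimeoutMillis=1800000
```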


Kaymak, Tobias  wrote on Tue, Feb 12, 2019,
17:33:

> Hi,
>
> my Beam 2.10-SNAPSHOT pipeline has a KafkaIO as input and a BigQueryIO
> configured with FILE_LOADS as output. What bothers me is that even with
> the following in my Flink 1.6 configuration:
>
> state.backend: rocksdb
> state.backend.incremental: true
>
> I see states that are as big as 230 MiB and checkpoint timeouts, or
> checkpoints that take longer than 10 minutes to complete (I just saw one
> that took longer than 30 minutes).
>
> Am I missing something? Is there some room for improvement? Should I use a
> different storage backend for the checkpoints? (Currently they are stored
> on GCS).
>
> Best,
> Tobi
>


Re: Some questions about ensuring correctness with windowing and triggering

2019-02-12 Thread Steve Niemitz
Are you also saying that even in the first example (Source ->
CombineByKey (Sum) -> Sink) there's no guarantee that events would be
delivered in order from the Combine -> Sink transforms?  This seems like a
pretty big gotcha for correctness if you ever use accumulating
triggering.

I'd also like to point out I'm not talking about a global ordering across
the entire PCollection, I'm talking about within the same key after a GBK
transform.

On Tue, Feb 12, 2019 at 11:35 AM Robert Bradshaw 
wrote:

> Due to the nature of distributed processing, order is not preserved. You
> can, however, inspect the PaneInfo to determine if an element was early,
> on-time, or late and act accordingly.
>
> On Tue, Feb 12, 2019 at 5:15 PM Juan Carlos Garcia 
> wrote:
>
>> In my experience ordering is not guaranteed; you may need to apply a
>> transformation that sorts the elements and then dispatches them in order.
>>
>> Or use the Sorter extension for this:
>>
>> https://github.com/apache/beam/tree/master/sdks/java/extensions/sorter
>>
>> Steve Niemitz  wrote on Tue, Feb 12, 2019, 16:31:
>>
>>> Hi everyone, I have some questions I want to ask about how windowing,
>>> triggering, and panes work together, and how to ensure correctness
>>> throughout a pipeline.
>>>
>>> Let's assume I have a very simple streaming pipeline that looks like:
>>> Source -> CombineByKey (Sum) -> Sink
>>>
>>> Given fixed windows of 1 hour, early firings every minute, and
>>> accumulating panes, this is pretty straightforward.  However, this can get
>>> more complicated if we add steps after the CombineByKey, for instance
>>> (using the same windowing strategy):
>>>
>>> Say I want to buffer the results of the CombineByKey into batches of N
>>> elements.  I can do this with the built-in GroupIntoBatches [1] transform,
>>> now my pipeline looks like:
>>>
>>> Source -> CombineByKey (Sum) -> GroupIntoBatches -> Sink
>>>
>>> *This leads to my main question:*
>>> Is ordering preserved somehow here?  ie: is it possible that the result
>>> from early firing F+1 now comes BEFORE the firing F (because it was
>>> re-ordered in the GroupIntoBatches).  This would mean that the sink then
>>> gets F+1 before F, which means my resulting store has incorrect data
>>> (possibly forever if F+1 was the final firing).
>>>
>>> If ordering is not preserved, it seems as if I'd need to introduce my
>>> own ordering back in after GroupIntoBatches.  GIB is an example here, but I
>>> imagine this could happen with any GBK type operation.
>>>
>>> Am I thinking about this the correct way?  Thanks!
>>>
>>> [1]
>>> https://beam.apache.org/releases/javadoc/2.10.0/org/apache/beam/sdk/transforms/GroupIntoBatches.html
>>>
>>


Re: Some questions about ensuring correctness with windowing and triggering

2019-02-12 Thread Robert Bradshaw
Due to the nature of distributed processing, order is not preserved. You
can, however, inspect the PaneInfo to determine if an element was early,
on-time, or late and act accordingly.

On Tue, Feb 12, 2019 at 5:15 PM Juan Carlos Garcia 
wrote:

> In my experience ordering is not guaranteed; you may need to apply a
> transformation that sorts the elements and then dispatches them in order.
>
> Or use the Sorter extension for this:
>
> https://github.com/apache/beam/tree/master/sdks/java/extensions/sorter
>
> Steve Niemitz  wrote on Tue, Feb 12, 2019, 16:31:
>
>> Hi everyone, I have some questions I want to ask about how windowing,
>> triggering, and panes work together, and how to ensure correctness
>> throughout a pipeline.
>>
>> Let's assume I have a very simple streaming pipeline that looks like:
>> Source -> CombineByKey (Sum) -> Sink
>>
>> Given fixed windows of 1 hour, early firings every minute, and
>> accumulating panes, this is pretty straightforward.  However, this can get
>> more complicated if we add steps after the CombineByKey, for instance
>> (using the same windowing strategy):
>>
>> Say I want to buffer the results of the CombineByKey into batches of N
>> elements.  I can do this with the built-in GroupIntoBatches [1] transform,
>> now my pipeline looks like:
>>
>> Source -> CombineByKey (Sum) -> GroupIntoBatches -> Sink
>>
>> *This leads to my main question:*
>> Is ordering preserved somehow here?  ie: is it possible that the result
>> from early firing F+1 now comes BEFORE the firing F (because it was
>> re-ordered in the GroupIntoBatches).  This would mean that the sink then
>> gets F+1 before F, which means my resulting store has incorrect data
>> (possibly forever if F+1 was the final firing).
>>
>> If ordering is not preserved, it seems as if I'd need to introduce my own
>> ordering back in after GroupIntoBatches.  GIB is an example here, but I
>> imagine this could happen with any GBK type operation.
>>
>> Am I thinking about this the correct way?  Thanks!
>>
>> [1]
>> https://beam.apache.org/releases/javadoc/2.10.0/org/apache/beam/sdk/transforms/GroupIntoBatches.html
>>
>


Dealing with "large" checkpoint state of a Beam pipeline in Flink

2019-02-12 Thread Kaymak, Tobias
Hi,

my Beam 2.10-SNAPSHOT pipeline has a KafkaIO as input and a BigQueryIO
configured with FILE_LOADS as output. What bothers me is that even with
the following in my Flink 1.6 configuration:

state.backend: rocksdb
state.backend.incremental: true

I see states that are as big as 230 MiB and checkpoint timeouts, or
checkpoints that take longer than 10 minutes to complete (I just saw one
that took longer than 30 minutes).

Am I missing something? Is there some room for improvement? Should I use a
different storage backend for the checkpoints? (Currently they are stored
on GCS).
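For reference, the settings above plus an explicit checkpoint location in Flink 1.6's flink-conf.yaml might look like this (a sketch; the bucket paths are assumed examples, and writing to `gs://` requires the GCS filesystem connector on the classpath):

```yaml
# flink-conf.yaml: RocksDB state backend with incremental checkpoints,
# checkpoint/savepoint data kept outside the TaskManagers (example paths)
state.backend: rocksdb
state.backend.incremental: true
state.checkpoints.dir: gs://my-bucket/flink/checkpoints
state.savepoints.dir: gs://my-bucket/flink/savepoints
```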

Best,
Tobi


Re: Some questions about ensuring correctness with windowing and triggering

2019-02-12 Thread Juan Carlos Garcia
In my experience ordering is not guaranteed; you may need to apply a
transformation that sorts the elements and then dispatches them in order.

Or use the Sorter extension for this:

https://github.com/apache/beam/tree/master/sdks/java/extensions/sorter
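The sort-then-dispatch idea can be sketched in plain Java over assumed (paneIndex, value) pairs (hypothetical class names; in a real pipeline the buffer would live in keyed state and be flushed by a timer or at window close):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: buffer (paneIndex, value) pairs and emit them sorted. */
final class SortedDispatcher {
    static final class Entry {
        final long paneIndex;
        final String value;
        Entry(long paneIndex, String value) {
            this.paneIndex = paneIndex;
            this.value = value;
        }
    }

    private final List<Entry> buffer = new ArrayList<>();

    void add(long paneIndex, String value) {
        buffer.add(new Entry(paneIndex, value));
    }

    /** Emits buffered values ordered by pane index and clears the buffer. */
    List<String> flush() {
        buffer.sort(Comparator.comparingLong(e -> e.paneIndex));
        List<String> out = new ArrayList<>();
        for (Entry e : buffer) {
            out.add(e.value);
        }
        buffer.clear();
        return out;
    }
}
```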

Steve Niemitz  wrote on Tue, Feb 12, 2019, 16:31:

> Hi everyone, I have some questions I want to ask about how windowing,
> triggering, and panes work together, and how to ensure correctness
> throughout a pipeline.
>
> Let's assume I have a very simple streaming pipeline that looks like:
> Source -> CombineByKey (Sum) -> Sink
>
> Given fixed windows of 1 hour, early firings every minute, and
> accumulating panes, this is pretty straightforward.  However, this can get
> more complicated if we add steps after the CombineByKey, for instance
> (using the same windowing strategy):
>
> Say I want to buffer the results of the CombineByKey into batches of N
> elements.  I can do this with the built-in GroupIntoBatches [1] transform,
> now my pipeline looks like:
>
> Source -> CombineByKey (Sum) -> GroupIntoBatches -> Sink
>
> *This leads to my main question:*
> Is ordering preserved somehow here?  ie: is it possible that the result
> from early firing F+1 now comes BEFORE the firing F (because it was
> re-ordered in the GroupIntoBatches).  This would mean that the sink then
> gets F+1 before F, which means my resulting store has incorrect data
> (possibly forever if F+1 was the final firing).
>
> If ordering is not preserved, it seems as if I'd need to introduce my own
> ordering back in after GroupIntoBatches.  GIB is an example here, but I
> imagine this could happen with any GBK type operation.
>
> Am I thinking about this the correct way?  Thanks!
>
> [1]
> https://beam.apache.org/releases/javadoc/2.10.0/org/apache/beam/sdk/transforms/GroupIntoBatches.html
>


Re: Visual Beam - First demonstration - London

2019-02-12 Thread Maximilian Michels
Yes, you can use Flink's local execution mode, which is the default if 
you don't provide any settings. A cluster should not be necessary to 
complete the integration. Ideally, it should work out of the box :)



However, I'm first trying to solve the complicated issue of grouping records 
together in Beam in a safe way so that they can be batched up


I'm not sure what your use case is but Beam does batching by default. 
The batches are called bundles. The Flink Runner supports setting the 
bundle size.
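As a sketch, bundle sizing on the Flink runner is exposed through pipeline options (flag names assumed from `FlinkPipelineOptions`; verify the exact spelling against your Beam version):

```shell
# Cap each bundle at 1000 elements or 1000 ms, whichever is hit first
--maxBundleSize=1000 \
--maxBundleTimeMills=1000
```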


Cheers,
Max

On 12.02.19 12:20, Matt Casters wrote:
Yes, Flink is obviously the next target.  I'm not expecting too many 
issues there beyond getting a cluster set up to test on.  I read you can 
run the Flink Runner locally so that will help a lot in testing.


However, I'm first trying to solve the complicated issue of grouping 
records together in Beam in a safe way so that they can be batched up.
Batching up is really important for fast loading into a lot of output 
targets.  I'll probably use some group by behind the scenes or something 
like that, need to think about that.
Having the ability to re-use the existing Kettle steps without having to 
write new code is really key.


Once that is done (in a few weeks) I'll give Flink a shot.

Cheers,

Matt

On Tue, Feb 12, 2019 at 12:02, Maximilian Michels  wrote:


@Dan: Thanks for sharing the presentation. Kettle is a great way to make
Beam more accessible.

@Matt: Thanks for the plug. It's good to hear you enjoyed it. I think
the link to your slides got messed up: http://beam.kettle.be

Are you planning to add execution via the Flink Runner to Kettle? Saw in
the presentation that you already support Direct, Spark, and Dataflow.

On 11.02.19 20:50, Matt Casters wrote:
 > By the way, Maximilian, I linked and plugged your wonderful FOSDEM
 > presentation in my slides http://beam.kettle.be, slide
 > 19. If you mind, let me know and I'll get it out of the slides. In any
 > case, great content worth promoting I thought.
 >
 > On Wed, Feb 6, 2019 at 18:03, Maximilian Michels <m...@apache.org> wrote:
 >
 >     Hi Dan,
 >
 >     Thanks for the info. Would be great to share a video of the
 >     presentation.
 >
 >     Cheers,
 >     Max
 >
 >     On 30.01.19 10:00, Dan wrote:
 >      > Hi, in just over a week you're all welcome to come and see the very
 >      > first public reveal of Kettle running on beam! (Including spark,
 >      > dataflow and flink support)
 >      >
 >      >
https://www.meetup.com/Pentaho-London-User-Group/events/256773962/
 >      >
 >      > So this ingenious integration combines the power of visual development,
 >      > with the platform agnostic benefits of beam - impressive stuff. No
 >      > vendor lock-in here!
 >      >
 >      >
 >      > See you there!
 >      > Dan
 >



Some questions about ensuring correctness with windowing and triggering

2019-02-12 Thread Steve Niemitz
Hi everyone, I have some questions I want to ask about how windowing,
triggering, and panes work together, and how to ensure correctness
throughout a pipeline.

Let's assume I have a very simple streaming pipeline that looks like:
Source -> CombineByKey (Sum) -> Sink

Given fixed windows of 1 hour, early firings every minute, and accumulating
panes, this is pretty straightforward.  However, this can get more
complicated if we add steps after the CombineByKey, for instance (using the
same windowing strategy):

Say I want to buffer the results of the CombineByKey into batches of N
elements.  I can do this with the built-in GroupIntoBatches [1] transform,
now my pipeline looks like:

Source -> CombineByKey (Sum) -> GroupIntoBatches -> Sink

*This leads to my main question:*
Is ordering preserved somehow here?  ie: is it possible that the result
from early firing F+1 now comes BEFORE the firing F (because it was
re-ordered in the GroupIntoBatches).  This would mean that the sink then
gets F+1 before F, which means my resulting store has incorrect data
(possibly forever if F+1 was the final firing).

If ordering is not preserved, it seems as if I'd need to introduce my own
ordering back in after GroupIntoBatches.  GIB is an example here, but I
imagine this could happen with any GBK type operation.

Am I thinking about this the correct way?  Thanks!

[1]
https://beam.apache.org/releases/javadoc/2.10.0/org/apache/beam/sdk/transforms/GroupIntoBatches.html


Re: Visual Beam - First demonstration - London

2019-02-12 Thread Matt Casters
Yes, Flink is obviously the next target.  I'm not expecting too many issues
there beyond getting a cluster set up to test on.  I read you can run the
Flink Runner locally so that will help a lot in testing.

However, I'm first trying to solve the complicated issue of grouping
records together in Beam in a safe way so that they can be batched up.
Batching up is really important for fast loading into a lot of output
targets.  I'll probably use some group by behind the scenes or something
like that, need to think about that.
Having the ability to re-use the existing Kettle steps without having to
write new code is really key.

Once that is done (in a few weeks) I'll give Flink a shot.

Cheers,

Matt

On Tue, Feb 12, 2019 at 12:02, Maximilian Michels  wrote:

> @Dan: Thanks for sharing the presentation. Kettle is a great way to make
> Beam more accessible.
>
> @Matt: Thanks for the plug. It's good to hear you enjoyed it. I think
> the link to your slides got messed up: http://beam.kettle.be
>
> Are you planning to add execution via the Flink Runner to Kettle? Saw in
> the presentation that you already support Direct, Spark, and Dataflow.
>
> On 11.02.19 20:50, Matt Casters wrote:
> > By the way, Maximilian, I linked and plugged your wonderful FOSDEM
> > presentation in my slides http://beam.kettle.be, slide
> > 19. If you mind, let me know and I'll get it out of the slides. In any
> > case, great content worth promoting I thought.
> >
> > On Wed, Feb 6, 2019 at 18:03, Maximilian Michels  wrote:
> >
> > Hi Dan,
> >
> > Thanks for the info. Would be great to share a video of the
> > presentation.
> >
> > Cheers,
> > Max
> >
> > On 30.01.19 10:00, Dan wrote:
> >  > Hi, in just over a week you're all welcome to come and see the very
> >  > first public reveal of Kettle running on beam! (Including spark,
> >  > dataflow and flink support)
> >  >
> >  >
> https://www.meetup.com/Pentaho-London-User-Group/events/256773962/
> >  >
> >  > So this ingenious integration combines the power of visual development,
> >  > with the platform agnostic benefits of beam - impressive stuff. No
> >  > vendor lock-in here!
> >  >
> >  >
> >  > See you there!
> >  > Dan
> >
>


Re: Visual Beam - First demonstration - London

2019-02-12 Thread Maximilian Michels
@Dan: Thanks for sharing the presentation. Kettle is a great way to make 
Beam more accessible.


@Matt: Thanks for the plug. It's good to hear you enjoyed it. I think 
the link to your slides got messed up: http://beam.kettle.be


Are you planning to add execution via the Flink Runner to Kettle? Saw in 
the presentation that you already support Direct, Spark, and Dataflow.


On 11.02.19 20:50, Matt Casters wrote:
By the way, Maximilian, I linked and plugged your wonderful FOSDEM 
presentation in my slides http://beam.kettle.be, slide 
19. If you mind, let me know and I'll get it out of the slides. In any 
case, great content worth promoting I thought.


On Wed, Feb 6, 2019 at 18:03, Maximilian Michels  wrote:


Hi Dan,

Thanks for the info. Would be great to share a video of the
presentation.

Cheers,
Max

On 30.01.19 10:00, Dan wrote:
 > Hi, in just over a week you're all welcome to come and see the very
 > first public reveal of Kettle running on beam! (Including spark,
 > dataflow and flink support)
 >
 > https://www.meetup.com/Pentaho-London-User-Group/events/256773962/
 >
 > So this ingenious integration combines the power of visual
development,
 > with the platform agnostic benefits of beam - impressive stuff. No
 > vendor lock-in here!
 >
 >
 > See you there!
 > Dan