Thanks Lukasz,

The model I'm pushing towards is an entirely reactive one, i.e.:

- live data shows up, it gets processed immediately
- a correction shows up for historical data, it gets processed immediately

Imagine I have a flow that's calculating daily aggregates (e.g. an average).
When a new live data point shows up, I want the aggregate for today to be
recalculated and emitted. When a correction data point shows up for yesterday,
I want the aggregate for yesterday to be recalculated and emitted.
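Concretely, I'm picturing something like the sketch below. It's untested, I'm
assuming the SDK's windowing/triggering API here, and the 30-day allowed
lateness is just a placeholder for whatever correction horizon I settle on:

    // Untested sketch: daily mean per key, re-emitted on every new or late point.
    // Assumes the SDK's Window/trigger transforms, Mean, and joda-time Duration.
    PCollection<KV<String, Double>> points = ...;  // timestamped readings

    PCollection<KV<String, Double>> dailyAverages = points
        .apply(Window.<KV<String, Double>>into(FixedWindows.of(Duration.standardDays(1)))
            .triggering(AfterWatermark.pastEndOfWindow()
                // re-emit today's aggregate as each live point arrives
                .withEarlyFirings(AfterPane.elementCountAtLeast(1))
                // re-emit a corrected aggregate when a late correction arrives
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardDays(30))
            .accumulatingFiredPanes())
        .apply(Mean.<String, Double>perKey());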
The simplest way I can think of to implement this is to have Beam read all the
data points off Kafka, with an allowed lateness of some upper bound, and do
all the processing in streaming mode. I don't think that will work, though, if
I need to restart the Beam component, as I'll lose the historical windows. So
I think I need a persistent store too, from which I can re-read all the
historical data for which I may need to handle corrections.

Does that answer your question? Is there a more rational way to reactively
handle corrections to old data such as this?

Bill

> On Mar 30, 2016, at 12:19 PM, Lukasz Cwik <[email protected]> wrote:
>
> Your mental model is not off.
>
> Conceptually, within Beam you should be able to read from a bounded and an
> unbounded PCollection as you describe. With Flatten you would create one
> PCollection which contains the contents of both. It's important that the
> watermark tracking for the bounded source follows the timestamps in your old
> data, otherwise that data may be considered "late" and may interact poorly
> with the triggers specified on your Window.into.
>
> Curious as to why you want to reprocess the old data?
> I have seen use cases where people want old data to affect their streaming
> pipeline. Depending on how much old data there is, it may be worthwhile to
> record the post-processed version somewhere and load that directly vs.
> reprocessing the entire old data set.
>
>> On Wed, Mar 30, 2016 at 7:37 AM, Bill McCarthy <[email protected]>
>> wrote:
>> Thanks Max,
>>
>> I'm unsure what you mean by the "batch part" and "streaming execution"...
>> Are you saying that I have to run my entire pipeline in either batch or
>> streaming mode?
>>
>> Perhaps a brief description of what I'm trying to do would help:
>>
>> 1. I've got some historical intervalized timeseries data.
>> 2. I've also got some live intervalized timeseries data.
>> 3. I want to process both those flows of data in a uniform fashion:
>>    calculating windowed statistics.
>> 4. My thought was that I'd have the historical data stored in some data
>>    store that's easy for Beam to read (e.g. HDFS).
>> 5. I'd put live data on Kafka.
>> 6. Then load up two PCollections, one from HDFS and one from Kafka.
>> 7. Then perform a Flatten transform to get them into one PCollection.
>> 8. Then run my windowing over the flattened PCollection (see the sketch
>>    after this list).
>> 9. ...
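>>
>> In code, steps 4-8 would look roughly like this. It's a pure sketch: Point,
>> historicalSource and kafkaSource are placeholders, and the HDFS and Kafka
>> sources are exactly the parts I don't know how to wire up yet:
>>
>>     Pipeline p = Pipeline.create(options);
>>
>>     // bounded historical points from some Beam-friendly store, e.g. HDFS
>>     PCollection<Point> historical = p.apply(Read.from(historicalSource));
>>
>>     // unbounded live points from Kafka
>>     PCollection<Point> live = p.apply(Read.from(kafkaSource));
>>
>>     // one uniform PCollection covering both flows
>>     PCollection<Point> all = PCollectionList.of(historical).and(live)
>>         .apply(Flatten.<Point>pCollections());
>>
>>     // windowed statistics over the combined flow
>>     all.apply(Window.<Point>into(FixedWindows.of(Duration.standardDays(1))))
>>        .apply(/* my statistics transforms */);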
>>
>> I like this approach, because my transforms don't have to care about
>> whether data is live or historical, and I can have one processing pipeline
>> for both.
>>
>> Am I barking up the wrong tree? Is my mental model way off with respect to
>> how to combine historical and live data?
>>
>> Thanks,
>>
>> Bill
>>
>>> On Mar 30, 2016, at 5:59 AM, Maximilian Michels <[email protected]> wrote:
>>>
>>> Hi Bill,
>>>
>>> The batch part of the Flink runner supports reading from a finite
>>> collection, but I'm assuming you're working with the streaming execution
>>> of the Flink runner. We haven't implemented support in the runner yet, but
>>> Flink natively supports reading from finite sources, so it looks fairly
>>> easy to implement. I've filed a JIRA issue and would like to look into
>>> this later today.
>>>
>>> I've recently added a test case (UnboundedSourceITCase) which throws a
>>> custom exception when the end of the output has been reached. Admittedly,
>>> that is not a very nice approach, but it works. All the more reason to
>>> support finite sources in streaming mode.
>>>
>>> Best,
>>> Max
>>>
>>>> On Wed, Mar 30, 2016 at 2:43 AM, William McCarthy
>>>> <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> I want to get access to a bounded PCollection in Beam on the Flink
>>>> runner. Ideally, I’d pull from HDFS. Is that supported? If not, what
>>>> would be the best way for me to get something bounded, for testing
>>>> purposes?
>>>>
>>>> Thanks,
>>>>
>>>> Bill
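>>>>
>>>> P.S. For testing I could even live with something trivial, e.g. this
>>>> sketch (assuming the SDK's Create transform):
>>>>
>>>>     Pipeline p = Pipeline.create(options);
>>>>     // a tiny bounded PCollection, just to exercise the pipeline
>>>>     PCollection<String> bounded = p.apply(Create.of("a", "b", "c"));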
