Re: Correlation between data streams/operators and threads

Piotr Nowojski Mon, 13 Nov 2017 05:52:39 -0800

Sure, let us know if you have other questions or encounter some issues.

Thanks, Piotrek


> On 13 Nov 2017, at 14:49, Shailesh Jain <shailesh.j...@stellapps.com> wrote:
> 
> Thanks, Piotr. I'll try it out and will get back in case of any further 
> questions.
> 
> Shailesh
> 
> On Fri, Nov 10, 2017 at 5:52 PM, Piotr Nowojski <pi...@data-artisans.com 
> <mailto:pi...@data-artisans.com>> wrote:
> 1.  It’s a little bit more complicated then that. Each operator chain/task 
> will be executed in separate thread (parallelism
>  Multiplies that). You can check in web ui how was your job split into tasks.
> 
> 3. Yes that’s true, this is an issue. To preserve the individual 
> watermarks/latencies (assuming that you have some way to calculate them 
> individually per each device), you could either:
> 
> a) have separate jobs per each device with parallelism 1. Pros: independent 
> failures/checkpoints, Cons: resource usage (number of threads increases with 
> number of devices, there are also other resources consumed by each job), 
> efficiency, 
> b) have one job with multiple data streams. Cons: resource usage (threads)
> c) ignore Flink’s watermarks, and implement your own code in place of it. You 
> could read all of your data in single data stream, keyBy partition/device and 
> manually handle watermarks logic. You could either try to wrap CEP/Window 
> operators or copy/paste and modify them to suite your needs. 
> 
> I would start and try out from a). If it work for your cluster/scale then 
> that’s fine. If not try b) (would share most of the code with a), and as a 
> last resort try c).
> 
> Kostas, would you like to add something?
> 
> Piotrek
> 
>> On 9 Nov 2017, at 19:16, Shailesh Jain <shailesh.j...@stellapps.com 
>> <mailto:shailesh.j...@stellapps.com>> wrote:
>> 
>> On 1. - is it tied specifically to the number of source operators or to the 
>> number of Datastream objects created. I mean does the answer change if I 
>> read all the data from a single Kafka topic, get a Datastream of all events, 
>> and the apply N filters to create N individual streams?
>> 
>> On 3. - the problem with partitions is that watermarks cannot be different 
>> per partition, and since in this use case, each stream is from a device, the 
>> latency could be different (but order will be correct almost always) and 
>> there are high chances of loosing out on events on operators like Patterns 
>> which work with windows. Any ideas for workarounds here?
>> 
>> 
>> Thanks,
>> Shailesh
>> 
>> On 09-Nov-2017 8:48 PM, "Piotr Nowojski" <pi...@data-artisans.com 
>> <mailto:pi...@data-artisans.com>> wrote:
>> Hi,
>> 
>> 1. 
>> https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/parallel.html
>>  
>> <https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/parallel.html>
>> 
>> Number of threads executing would be roughly speaking equal to of the number 
>> of input data streams multiplied by the parallelism.
>> 
>> 2. 
>> Yes, you could dynamically create more data streams at the job startup.
>> 
>> 3.
>> Running 10000 independent data streams on a small cluster (couple of nodes) 
>> will definitely be an issue, since even with parallelism set to 1, there 
>> would be quite a lot of unnecessary threads. 
>> 
>> It would be much better to treat your data as a single data input stream 
>> with multiple partitions. You could assign partitions between source 
>> instances based on parallelism. For example with parallelism 6:
>> - source 0 could get partitions 0, 6, 12, 18
>> - source 1, could get partitions 1, 7, …
>> …
>> - source 5, could get partitions 5, 11, ...
>> 
>> Piotrek
>> 
>>> On 9 Nov 2017, at 10:18, Shailesh Jain <shailesh.j...@stellapps.com 
>>> <mailto:shailesh.j...@stellapps.com>> wrote:
>>> 
>>> Hi,
>>> 
>>> I'm trying to understand the runtime aspect of Flink when dealing with 
>>> multiple data streams and multiple operators per data stream.
>>> 
>>> Use case: N data streams in a single flink job (each data stream 
>>> representing 1 device - with different time latencies), and each of these 
>>> data streams gets split into two streams, of which one goes into a bunch of 
>>> CEP operators, and one into a process function.
>>> 
>>> Questions:
>>> 1. At runtime, will the engine create one thread per data stream? Or one 
>>> thread per operator?
>>> 2. Is it possible to dynamically create a data stream at runtime when the 
>>> job starts? (i.e. if N is read from a file when the job starts and 
>>> corresponding N streams need to be created)
>>> 3. Are there any specific performance impacts when a large number of 
>>> streams (N ~ 10000) are created, as opposed to N partitions within a single 
>>> stream?
>>> 
>>> Are there any internal (design) documents which can help understanding the 
>>> implementation details? Any references to the source will also be really 
>>> helpful.
>>> 
>>> Thanks in advance.
>>> 
>>> Shailesh
>>> 
>>> 
>> 
>> 
> 
>

Re: Correlation between data streams/operators and threads

Reply via email to