Re: Correlation between data streams/operators and threads

Shailesh Jain Mon, 13 Nov 2017 05:50:37 -0800

Thanks, Piotr. I'll try it out and will get back in case of any further
questions.


Shailesh

On Fri, Nov 10, 2017 at 5:52 PM, Piotr Nowojski <pi...@data-artisans.com>
wrote:

> 1.  It’s a little bit more complicated then that. Each operator chain/task
> will be executed in separate thread (parallelism
>  Multiplies that). You can check in web ui how was your job split into
> tasks.
>
> 3. Yes that’s true, this is an issue. To preserve the individual
> watermarks/latencies (assuming that you have some way to calculate them
> individually per each device), you could either:
>
> a) have separate jobs per each device with parallelism 1. Pros:
> independent failures/checkpoints, Cons: resource usage (number of threads
> increases with number of devices, there are also other resources consumed
> by each job), efficiency,
> b) have one job with multiple data streams. Cons: resource usage (threads)
> c) ignore Flink’s watermarks, and implement your own code in place of it.
> You could read all of your data in single data stream, keyBy
> partition/device and manually handle watermarks logic. You could either try
> to wrap CEP/Window operators or copy/paste and modify them to suite your
> needs.
>
> I would start and try out from a). If it work for your cluster/scale then
> that’s fine. If not try b) (would share most of the code with a), and as a
> last resort try c).
>
> Kostas, would you like to add something?
>
> Piotrek
>
> On 9 Nov 2017, at 19:16, Shailesh Jain <shailesh.j...@stellapps.com>
> wrote:
>
> On 1. - is it tied specifically to the number of source operators or to
> the number of Datastream objects created. I mean does the answer change if
> I read all the data from a single Kafka topic, get a Datastream of all
> events, and the apply N filters to create N individual streams?
>
> On 3. - the problem with partitions is that watermarks cannot be different
> per partition, and since in this use case, each stream is from a device,
> the latency could be different (but order will be correct almost always)
> and there are high chances of loosing out on events on operators like
> Patterns which work with windows. Any ideas for workarounds here?
>
>
> Thanks,
> Shailesh
>
> On 09-Nov-2017 8:48 PM, "Piotr Nowojski" <pi...@data-artisans.com> wrote:
>
> Hi,
>
> 1.
> https://ci.apache.org/projects/flink/flink-docs-release-1.3/
> dev/parallel.html
>
> Number of threads executing would be roughly speaking equal to of the
> number of input data streams multiplied by the parallelism.
>
> 2.
> Yes, you could dynamically create more data streams at the job startup.
>
> 3.
> Running 10000 independent data streams on a small cluster (couple of
> nodes) will definitely be an issue, since even with parallelism set to 1,
> there would be quite a lot of unnecessary threads.
>
> It would be much better to treat your data as a single data input stream
> with multiple partitions. You could assign partitions between source
> instances based on parallelism. For example with parallelism 6:
> - source 0 could get partitions 0, 6, 12, 18
> - source 1, could get partitions 1, 7, …
> …
> - source 5, could get partitions 5, 11, ...
>
> Piotrek
>
> On 9 Nov 2017, at 10:18, Shailesh Jain <shailesh.j...@stellapps.com>
> wrote:
>
> Hi,
>
> I'm trying to understand the runtime aspect of Flink when dealing with
> multiple data streams and multiple operators per data stream.
>
> Use case: N data streams in a single flink job (each data stream
> representing 1 device - with different time latencies), and each of these
> data streams gets split into two streams, of which one goes into a bunch of
> CEP operators, and one into a process function.
>
> Questions:
> 1. At runtime, will the engine create one thread per data stream? Or one
> thread per operator?
> 2. Is it possible to dynamically create a data stream at runtime when the
> job starts? (i.e. if N is read from a file when the job starts and
> corresponding N streams need to be created)
> 3. Are there any specific performance impacts when a large number of
> streams (N ~ 10000) are created, as opposed to N partitions within a single
> stream?
>
> Are there any internal (design) documents which can help understanding the
> implementation details? Any references to the source will also be really
> helpful.
>
> Thanks in advance.
>
> Shailesh
>
>
>
>
>
>

Re: Correlation between data streams/operators and threads

Reply via email to