Hi Smirnov,

There is actually a way to tell Flink that the data is already partitioned: have a look at the reinterpretAsKeyedStream [1] method. I must warn you, though, that this is an experimental feature.
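A minimal sketch of how it could look (with an in-memory source standing in for your Kafka consumer; class and field names are illustrative):

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class ReinterpretAsKeyedStreamExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for your Kafka source: (clientId, amount) pairs.
        DataStream<Tuple2<String, Long>> transactions = env.fromElements(
            Tuple2.of("client-1", 100L),
            Tuple2.of("client-1", 250L),
            Tuple2.of("client-2", 75L));

        // Reinterpret the stream as already keyed by clientId, so Flink
        // inserts no network shuffle before the keyed window operator.
        KeyedStream<Tuple2<String, Long>, String> keyed =
            DataStreamUtils.reinterpretAsKeyedStream(
                transactions,
                new KeySelector<Tuple2<String, Long>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Long> txn) {
                        return txn.f0;
                    }
                });

        // Max transaction per client in sliding windows. With a real
        // (unbounded) Kafka source these windows fire continuously; with
        // this toy bounded input the job may finish before they do.
        keyed.timeWindow(Time.minutes(5), Time.minutes(1))
             .max(1)
             .print();

        env.execute("max-transaction-per-client");
    }
}

Note the caveat from the docs page: the stream must already be partitioned exactly the way Flink's keyBy would partition it, which Kafka's own key-based partitioning does not guarantee by itself.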
Best,
Dawid

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/stream/experimental.html#experimental-features

On 19/04/2019 11:48, Smirnov Sergey Vladimirovich (39833) wrote:
>
> Hi Ken,
>
> That is bad news for us: even for a small window we have tens of
> thousands of events per job, with 10x that at peak or even more, and
> the number of jobs is expected to be high. So instead of N operations
> (our producer/consumer mechanism), the shuffle/re-sort that Flink
> currently does makes it N*ln(N), roughly a tenfold loss of execution
> speed!
>
> For all: what should my next step be? Contribute to Apache Flink?
> File an issue in the backlog?
>
> With best regards,
> Sergey
>
> *From:* Ken Krugler [mailto:kkrugler_li...@transpac.com]
> *Sent:* Wednesday, April 17, 2019 9:23 PM
> *To:* Smirnov Sergey Vladimirovich (39833) <s.smirn...@tinkoff.ru>
> *Subject:* Re: kafka partitions, data locality
>
> Hi Sergey,
>
> As you surmised, once you do a keyBy/max on the Kafka topic to group
> by clientId and find the max, the topology will have a
> partition/shuffle in it.
>
> This is because Flink doesn't know that client ids don't span Kafka
> partitions.
>
> I don't know of any way to tell Flink that the data doesn't need to
> be shuffled. There was a discussion
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-keying-sub-keying-a-stream-without-repartitioning-td12745.html>
> about adding a keyByWithoutPartitioning a while back, but I don't
> think that support was ever added.
>
> A simple ProcessFunction with MapState (clientId -> max) should let
> you do the same thing without too much custom code. To support
> windowing, you'd use timers to flush state/emit results.
>
> — Ken
>
> On Apr 17, 2019, at 2:33 AM, Smirnov Sergey Vladimirovich (39833)
> <s.smirn...@tinkoff.ru <mailto:s.smirn...@tinkoff.ru>> wrote:
>
> Hello,
>
> We are planning to use Apache Flink as a core component of our new
> streaming system for internal processes (finance, banking business)
> based on Apache Kafka.
>
> So we started some research with Apache Flink, and one of the
> questions that arose during that work is how Flink handles data
> locality.
>
> I'll try to explain: suppose we have a Kafka topic with some kind of
> events, and these events are grouped by topic partition so that the
> handler (or job worker) consuming messages from a partition has all
> the information it needs for further processing.
>
> As an example, say we have clients' payment transactions in a Kafka
> topic. We group by clientId (transactions with the same clientId go
> to the same Kafka topic partition), and the task is to find the max
> transaction per client in sliding windows. In map/reduce terms there
> is no need to shuffle data between all topic consumers; at most it
> may be worth doing so within each consumer, to gain some speedup by
> increasing the number of executors over each partition's data.
>
> My question is how Flink will behave in this case. Does it shuffle
> all the data, or does it have some setting to avoid this extra,
> unnecessary shuffle/sort?
>
> Thanks in advance!
>
> With best regards,
> Sergey Smirnov
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> Custom big data solutions & training
> Flink, Solr, Hadoop, Cascading & Cassandra
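As for the ProcessFunction + MapState idea Ken mentions above: MapState is keyed state and needs a keyed stream, so without a keyBy a sketch would fall back to a plain per-subtask HashMap. Something like this (not checkpointed, so not fault tolerant; illustrative only):

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

/**
 * Tracks the max transaction per clientId on each parallel subtask,
 * with no keyBy and therefore no shuffle. It relies on the upstream
 * Kafka partitioning to keep each clientId on a single subtask. The
 * state is a plain in-memory map, so it is NOT checkpointed; a
 * production version would need operator state (or the
 * reinterpretAsKeyedStream route) for fault tolerance.
 */
public class MaxPerClient
        extends ProcessFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient Map<String, Long> maxByClient;

    @Override
    public void open(Configuration parameters) {
        maxByClient = new HashMap<>();
    }

    @Override
    public void processElement(
            Tuple2<String, Long> txn,
            Context ctx,
            Collector<Tuple2<String, Long>> out) {
        long current = maxByClient.getOrDefault(txn.f0, Long.MIN_VALUE);
        if (txn.f1 > current) {
            maxByClient.put(txn.f0, txn.f1);
            // Emit the new running max for this client.
            out.collect(Tuple2.of(txn.f0, txn.f1));
        }
    }
}

You would apply it directly to the source, e.g. transactions.process(new MaxPerClient()), with no keyBy in between.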