Hi Sergey,

The reason I flagged this as beta is actually less about stability and more 
about the fact that this is meant to be a "power user" feature, because bad 
things can happen if your data is not partitioned in exactly the same way as 
Flink would partition it. This is why you should typically only use it if the 
data was partitioned by Flink and you are very sure of what you are doing: 
there is not really anything we can do at the API level to protect you from 
mistakes in using this feature. Eventually some runtime exceptions might show 
you that something is going wrong, but that is not exactly a good user 
experience.
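
To make the pitfall concrete: Flink assigns every key to a key group and every 
key group to a parallel subtask, so an external producer would have to 
reproduce exactly this mapping. Here is a minimal sketch of the computation 
involved, using Flink's internal KeyGroupRangeAssignment from flink-runtime 
(the key and the numbers are placeholders, and the Kafka partition consumed by 
a subtask would also have to line up with its index):

    import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

    public class KeyGroupCheck {
        public static void main(String[] args) {
            int maxParallelism = 128; // must match the job's max parallelism
            int parallelism = 4;      // must match the operator's parallelism
            String key = "client-42"; // placeholder record key

            // The subtask index Flink would route this key to; the producer
            // would have to write the record to the matching partition.
            int subtask = KeyGroupRangeAssignment.assignKeyToParallelOperator(
                    key, maxParallelism, parallelism);
            System.out.println(key + " -> subtask " + subtask);
        }
    }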

On a different note, there is currently one open issue [1] to be aware of in 
connection with this feature and operator chaining, but at the same time this 
is something that should not be hard to fix for the next minor release.

Best,
Stefan

[1] 
https://issues.apache.org/jira/browse/FLINK-12296?focusedCommentId=16824945&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16824945
  

> On 26. Apr 2019, at 09:48, Smirnov Sergey Vladimirovich (39833) 
> <s.smirn...@tinkoff.ru> wrote:
> 
> Hi,
>  
> Dawid, great, thanks!
> Any plans to make it stable? 1.9?
>  
>  
> Regards,
> Sergey
>  
> From: Dawid Wysakowicz [mailto:dwysakow...@apache.org] 
> Sent: Thursday, April 25, 2019 10:54 AM
> To: Smirnov Sergey Vladimirovich (39833) <s.smirn...@tinkoff.ru>; Ken Krugler 
> <kkrugler_li...@transpac.com>
> Cc: user@flink.apache.org; d...@flink.apache.org
> Subject: Re: kafka partitions, data locality
>  
> Hi Smirnov,
> 
> Actually, there is a way to tell Flink that data is already partitioned. You 
> can try the reinterpretAsKeyedStream[1] method, though I must warn you that 
> this is an experimental feature.
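> 
> For illustration, a minimal sketch of how it is wired up (assuming a 
> hypothetical Transaction POJO with getClientId()/getAmount() and a 'source' 
> DataStream<Transaction> read from Kafka that is already partitioned by 
> clientId):
> 
>     import org.apache.flink.streaming.api.datastream.DataStreamUtils;
>     import org.apache.flink.streaming.api.datastream.KeyedStream;
>     import org.apache.flink.streaming.api.windowing.time.Time;
> 
>     // Reinterpret the pre-partitioned stream as keyed. No shuffle is added;
>     // Flink simply trusts that the data is already partitioned exactly as
>     // keyBy(clientId) would partition it.
>     KeyedStream<Transaction, Long> keyed =
>         DataStreamUtils.reinterpretAsKeyedStream(source, Transaction::getClientId);
> 
>     keyed.timeWindow(Time.minutes(5), Time.minutes(1))
>          .max("amount");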
> 
> Best,
> 
> Dawid
> 
> [1] 
> https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/stream/experimental.html#experimental-features
>  
> On 19/04/2019 11:48, Smirnov Sergey Vladimirovich (39833) wrote:
> Hi Ken,
>  
> It’s a bad story for us: even for a small window we have tens of thousands of 
> events per job, with 10x that in peaks or even more. And the number of jobs 
> is known to be high. So instead of N operations (our producer/consumer 
> mechanism) we get a shuffle/resort (the current Flink implementation), i.e. 
> N*ln(N) - a tenfold loss of execution speed!
> To all: what should my next step be? Contribute to Apache Flink? File an 
> issue in the backlog?
>  
>  
> With best regards,
> Sergey
> From: Ken Krugler [mailto:kkrugler_li...@transpac.com] 
> Sent: Wednesday, April 17, 2019 9:23 PM
> To: Smirnov Sergey Vladimirovich (39833) <s.smirn...@tinkoff.ru>
> Subject: Re: kafka partitions, data locality
>  
> Hi Sergey,
>  
> As you surmised, once you do a keyBy/max on the Kafka topic to group by 
> clientId and find the max, the topology will include a partition/shuffle 
> step.
>  
> This is because Flink doesn’t know that client ids don’t span Kafka 
> partitions.
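> 
> For reference, a sketch of the kind of pipeline I mean (Transaction and 
> TransactionSchema are placeholder names):
> 
>     Properties props = new Properties();
>     props.setProperty("bootstrap.servers", "localhost:9092");
>     props.setProperty("group.id", "payments");
> 
>     DataStream<Transaction> txns = env.addSource(
>         new FlinkKafkaConsumer<>("payments", new TransactionSchema(), props));
> 
>     // keyBy hashes each record's key and redistributes records over the
>     // network to the matching parallel subtask -- this is the shuffle.
>     txns.keyBy(Transaction::getClientId)
>         .timeWindow(Time.minutes(5), Time.minutes(1))
>         .max("amount");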
>  
> I don’t know of any way to tell Flink that the data doesn’t need to be 
> shuffled. There was a discussion 
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-keying-sub-keying-a-stream-without-repartitioning-td12745.html>
>  about adding a keyByWithoutPartitioning a while back, but I don’t think that 
> support was ever added.
>  
> A simple ProcessFunction with MapState (clientId -> max) should allow you to 
> do the same thing without too much custom code. In order to support 
> windowing, you’d use triggers to flush state/emit results.
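> 
> Something along these lines (a rough sketch; note that Flink's MapState is 
> only available on keyed streams, so this non-keyed version keeps a plain 
> HashMap and would need to snapshot it via CheckpointedFunction for fault 
> tolerance, which is omitted here):
> 
>     public class MaxPerClient
>             extends ProcessFunction<Transaction, Tuple2<Long, Double>> {
> 
>         // clientId -> running max; not checkpointed in this sketch
>         private transient Map<Long, Double> maxByClient;
> 
>         @Override
>         public void open(Configuration parameters) {
>             maxByClient = new HashMap<>();
>         }
> 
>         @Override
>         public void processElement(Transaction txn, Context ctx,
>                                    Collector<Tuple2<Long, Double>> out) {
>             Double current = maxByClient.get(txn.getClientId());
>             if (current == null || txn.getAmount() > current) {
>                 maxByClient.put(txn.getClientId(), txn.getAmount());
>                 out.collect(Tuple2.of(txn.getClientId(), txn.getAmount()));
>             }
>         }
>     }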
>  
> — Ken
>  
>  
> On Apr 17, 2019, at 2:33 AM, Smirnov Sergey Vladimirovich (39833) 
> <s.smirn...@tinkoff.ru> wrote:
>  
> Hello,
>  
> We are planning to use Apache Flink as a core component of our new streaming 
> system for internal processes (finance, banking business) based on Apache 
> Kafka.
> So we are starting some research with Apache Flink, and one of the questions 
> that arose during that work is how Flink handles data locality.
> I'll try to explain: suppose we have a Kafka topic with some kind of events, 
> and these events are grouped by topic partition so that the handler (or job 
> worker) consuming messages from a partition has all the information necessary 
> for further processing. 
> As an example, say we have clients’ payment transactions in a Kafka topic. We 
> group by clientId (transactions with the same clientId go to the same Kafka 
> topic partition), and the task is to find the max transaction per client in 
> sliding windows. In terms of map/reduce there is no need to shuffle data 
> between all topic consumers; maybe it is worth doing so within each consumer 
> to gain some speedup by increasing the number of executors over each 
> partition’s data.
> And my question is how Flink will work in this case. Does it shuffle all the 
> data, or does it have some settings to avoid these extra, unnecessary 
> shuffle/sorting operations?
> Thanks in advance!
>  
>  
> With best regards,
> Sergey Smirnov
>  
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> Custom big data solutions & training
> Flink, Solr, Hadoop, Cascading & Cassandra
>  
