Hi Smirnov,

There is actually a way to tell Flink that the data is already partitioned: have a look at the reinterpretAsKeyedStream [1] method. I must warn you, though, that this is an experimental feature.
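A minimal sketch of how it could look (with an in-memory source standing in for your Kafka consumer; class and field names are illustrative):

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class ReinterpretAsKeyedStreamExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for your Kafka source: (clientId, amount) pairs.
        DataStream<Tuple2<String, Long>> transactions = env.fromElements(
            Tuple2.of("client-1", 100L),
            Tuple2.of("client-1", 250L),
            Tuple2.of("client-2", 75L));

        // Reinterpret the stream as already keyed by clientId, so Flink
        // inserts no network shuffle before the keyed window operator.
        KeyedStream<Tuple2<String, Long>, String> keyed =
            DataStreamUtils.reinterpretAsKeyedStream(
                transactions,
                new KeySelector<Tuple2<String, Long>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Long> txn) {
                        return txn.f0;
                    }
                });

        // Max transaction per client in sliding windows. With a real
        // (unbounded) Kafka source these windows fire continuously; with
        // this toy bounded input the job may finish before they do.
        keyed.timeWindow(Time.minutes(5), Time.minutes(1))
             .max(1)
             .print();

        env.execute("max-transaction-per-client");
    }
}

Note the caveat from the docs page: the stream must already be partitioned exactly the way Flink's keyBy would partition it, which Kafka's own key-based partitioning does not guarantee by itself.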
Best,
Dawid

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/stream/experimental.html#experimental-features

On 19/04/2019 11:48, Smirnov Sergey Vladimirovich (39833) wrote:
>
> Hi Ken,
>
> That is bad news for us: even for a small window we have tens of
> thousands of events per job, with 10x that at peak or even more, and
> the number of jobs is expected to be high. So instead of N operations
> (our producer/consumer mechanism), the shuffle/re-sort that Flink
> currently does makes it N*ln(N), roughly a tenfold loss of execution
> speed!
>
> For all: what should my next step be? Contribute to Apache Flink?
> File an issue in the backlog?
>
> With best regards,
> Sergey
>
> *From:* Ken Krugler [mailto:kkrugler_li...@transpac.com]
> *Sent:* Wednesday, April 17, 2019 9:23 PM
> *To:* Smirnov Sergey Vladimirovich (39833) <s.smirn...@tinkoff.ru>
> *Subject:* Re: kafka partitions, data locality
>
> Hi Sergey,
>
> As you surmised, once you do a keyBy/max on the Kafka topic to group
> by clientId and find the max, the topology will have a
> partition/shuffle in it.
>
> This is because Flink doesn't know that client ids don't span Kafka
> partitions.
>
> I don't know of any way to tell Flink that the data doesn't need to
> be shuffled. There was a discussion
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-keying-sub-keying-a-stream-without-repartitioning-td12745.html>
> about adding a keyByWithoutPartitioning a while back, but I don't
> think that support was ever added.
>
> A simple ProcessFunction with MapState (clientId -> max) should let
> you do the same thing without too much custom code. To support
> windowing, you'd use timers to flush state/emit results.
>
> — Ken
>
> On Apr 17, 2019, at 2:33 AM, Smirnov Sergey Vladimirovich (39833)
> <s.smirn...@tinkoff.ru <mailto:s.smirn...@tinkoff.ru>> wrote:
>
> Hello,
>
> We are planning to use Apache Flink as a core component of our new
> streaming system for internal processes (finance, banking business)
> based on Apache Kafka.
>
> So we started some research with Apache Flink, and one of the
> questions that arose during that work is how Flink handles data
> locality.
>
> I'll try to explain: suppose we have a Kafka topic with some kind of
> events, and these events are grouped by topic partition so that the
> handler (or job worker) consuming messages from a partition has all
> the information it needs for further processing.
>
> As an example, say we have clients' payment transactions in a Kafka
> topic. We group by clientId (transactions with the same clientId go
> to the same Kafka topic partition), and the task is to find the max
> transaction per client in sliding windows. In map/reduce terms there
> is no need to shuffle data between all topic consumers; at most it
> may be worth doing so within each consumer, to gain some speedup by
> increasing the number of executors over each partition's data.
>
> My question is how Flink will behave in this case. Does it shuffle
> all the data, or does it have some setting to avoid this extra,
> unnecessary shuffle/sort?
>
> Thanks in advance!
>
> With best regards,
> Sergey Smirnov
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> Custom big data solutions & training
> Flink, Solr, Hadoop, Cascading & Cassandra
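As for the ProcessFunction + MapState idea Ken mentions above: MapState is keyed state and needs a keyed stream, so without a keyBy a sketch would fall back to a plain per-subtask HashMap. Something like this (not checkpointed, so not fault tolerant; illustrative only):

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

/**
 * Tracks the max transaction per clientId on each parallel subtask,
 * with no keyBy and therefore no shuffle. It relies on the upstream
 * Kafka partitioning to keep each clientId on a single subtask. The
 * state is a plain in-memory map, so it is NOT checkpointed; a
 * production version would need operator state (or the
 * reinterpretAsKeyedStream route) for fault tolerance.
 */
public class MaxPerClient
        extends ProcessFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient Map<String, Long> maxByClient;

    @Override
    public void open(Configuration parameters) {
        maxByClient = new HashMap<>();
    }

    @Override
    public void processElement(
            Tuple2<String, Long> txn,
            Context ctx,
            Collector<Tuple2<String, Long>> out) {
        long current = maxByClient.getOrDefault(txn.f0, Long.MIN_VALUE);
        if (txn.f1 > current) {
            maxByClient.put(txn.f0, txn.f1);
            // Emit the new running max for this client.
            out.collect(Tuple2.of(txn.f0, txn.f1));
        }
    }
}

You would apply it directly to the source, e.g. transactions.process(new MaxPerClient()), with no keyBy in between.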