Re: questions about Flink's HashJoin performance

2017-05-18 Thread Fabian Hueske
Hi, I'm not aware of a performance report for this feature. I don't think it is well known or used a lot. The classes to check out for prepartitioned / presorted data are SplitDataProperties [1], DataSource [2], and as an example PropertyDataSourceTest [3]. [1] https://github.com/apache/flink/blo

Re: questions about Flink's HashJoin performance

2017-05-18 Thread weijie tong
Thanks for the tip, @Stephan. Regarding [1], it discusses "I’ve got sooo much data to join, do I really need to ship it?". How do I configure Flink to achieve that? Is there a performance report? [1]: https://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html

Re: questions about Flink's HashJoin performance

2017-05-16 Thread Stephan Ewen
Hi! Be aware that the "Row" and "Record" types are not high-performance data types. You might be measuring the data type overhead rather than the hash table performance. Also, the build measurements include the data generation, which influences the results. If you want to purely benchmark t
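To illustrate the point about keeping data generation out of the measurement, here is a minimal sketch of a timing harness. It does not use Flink's MutableHashTable; it joins with a plain java.util.HashMap on primitive long keys, and the class and method names (HashJoinTimingSketch, generateKeys, build, probe) are hypothetical. The row counts mirror the original question, assuming "w" there means 10,000. It performs no JIT warm-up, so the printed timings are only a rough indication.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical harness: a HashMap-based join, NOT Flink's MutableHashTable.
class HashJoinTimingSketch {

    /** Returns n consecutive keys starting at offset (synthetic data). */
    public static long[] generateKeys(int n, long offset) {
        long[] keys = new long[n];
        for (int i = 0; i < n; i++) {
            keys[i] = offset + i;
        }
        return keys;
    }

    /** Build phase: maps each join key to the indices of the build rows carrying it. */
    public static Map<Long, List<Integer>> build(long[] buildKeys) {
        Map<Long, List<Integer>> table = new HashMap<>(buildKeys.length * 2);
        for (int i = 0; i < buildKeys.length; i++) {
            table.computeIfAbsent(buildKeys[i], k -> new ArrayList<>(1)).add(i);
        }
        return table;
    }

    /** Probe phase: counts the joined (probe row, build row) pairs. */
    public static long probe(Map<Long, List<Integer>> table, long[] probeKeys) {
        long matches = 0;
        for (long key : probeKeys) {
            List<Integer> rows = table.get(key);
            if (rows != null) {
                matches += rows.size();
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Data generation finishes before any timer starts.
        // Assumed sizes: 140,000 build rows, 3,200,000 probe rows,
        // with the offset chosen so that 120,000 probe keys find a match.
        long[] buildKeys = generateKeys(140_000, 0);
        long[] probeKeys = generateKeys(3_200_000, 20_000);

        long t0 = System.nanoTime();
        Map<Long, List<Integer>> table = build(buildKeys);
        long t1 = System.nanoTime();
        long matches = probe(table, probeKeys);
        long t2 = System.nanoTime();

        System.out.printf("build: %d ms, probe: %d ms, matches: %d%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, matches);
    }
}
```

Timing the build and probe phases separately shows which one dominates. For numbers that are meant to be compared against another engine, a harness such as JMH with warm-up iterations would be a more reliable choice than a single System.nanoTime() pass.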

Re: questions about Flink's HashJoin performance

2017-05-16 Thread weijie tong
Thanks for all your enthusiastic responses. Yes, my goal was to find the best in-memory performance. I understand now.
On Tue, 16 May 2017 at 4:10 PM Fabian Hueske wrote:
> Hi,
>
> Flink's HashJoin implementation was designed to gracefully handle inputs
> that exceed the main memory.
> It is no

Re: questions about Flink's HashJoin performance

2017-05-16 Thread Fabian Hueske
Hi, Flink's HashJoin implementation was designed to gracefully handle inputs that exceed the main memory. It is not explicitly optimized for in-memory processing and does not play fancy tricks like optimizing cache accesses or batching. I assume your benchmark is about in-memory joins only. This w

Re: questions about Flink's HashJoin performance

2017-05-15 Thread weijie tong
The Flink version is 1.2.0.
On Mon, May 15, 2017 at 10:24 PM, weijie tong wrote:
> @Till thanks for your reply.
>
> My code is similar to HashTableITCase.testInMemoryMutableHashTable().
> It just uses the MutableHashTable class; there's no other Flink
> configuration. The main code body is

Re: questions about Flink's HashJoin performance

2017-05-15 Thread weijie tong
@Till, thanks for your reply. My code is similar to HashTableITCase.testInMemoryMutableHashTable(). It just uses the MutableHashTable class; there is no other Flink configuration. The main code body is:
this.recordBuildSideAccessor = RecordSerializer.get();
this.recordProbeSideAccessor =

Re: questions about Flink's HashJoin performance

2017-05-15 Thread Till Rohrmann
Hi Weijie, it might be the case that batching the processing of multiple rows gives you better performance than single-row processing. Maybe you could share with us the exact baseline benchmark results and the code you use to test Flink's MutableHashTable. Also the Flink configura

questions about Flink's HashJoin performance

2017-05-13 Thread weijie tong
I have a test case that uses Flink's MutableHashTable class to do a hash join on a local machine with 64 GB of memory and 64 cores. The test case has one build table with 14w (140,000) rows and one probe table with 320w (3,200,000) rows; the join produces 12w (120,000) matched rows. It takes 2.2 seconds to complete the join. The performance seems ba