[ https://issues.apache.org/jira/browse/FLINK-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144732#comment-15144732 ]
Greg Hogan edited comment on FLINK-3291 at 2/12/16 3:48 PM: ------------------------------------------------------------ I started looking at {{MergeIterator}} last year since I had object reuse enabled and was still seeing a flood of newly created objects. The complex code was already written, but tying it together and passing the configuration through resulted in ~5% reduction in overall runtime for the algorithms I was benchmarking. So clearly this is a performance critical section of the code, and as evidenced by this ticket a source of potential errors. [~ggevay], at this point are the changes to {{MergeIterator}} fixing a bug? Do you want to fix up and clarify the documentation for {{MutableObjectIterator}} and verify the implementing classes? My only question on [~StephanEwen]'s comments would be, isn't copying sometimes necessary in non-reusing mode but avoidable when reusing objects? Copies would be required any time an object is presented to the user multiple times, as in joins / cross / cogroup. [~StephanEwen]'s comment on cheap vs expensive copies would be excellent for the user documentation. As I understand, if you're making expensive copies (with object reuse enabled) in a user defined function, you may be better off disabling object reuse and allowing the deserialization to use a fresh object instead. was (Author: greghogan): I started looking at `MergeIterator` last year since I had object reuse enabled and was still seeing a flood of newly created objects. The complex code was already written, but tying it together and passing the configuration through resulted in ~5% reduction in overall runtime for the algorithms I was benchmarking. So clearly this is a performance critical section of the code, and as evidenced by this ticket a source of potential errors. [~ggevay], at this point are the changes to `MergeIterator` fixing a bug? Do you want to fix up and clarify the documentation for `MutableObjectIterator` and verify the implementing classes? My only question on [~StephanEwen]'s comments would be, isn't copying sometimes necessary in non-reusing mode but avoidable when reusing objects? Copies would be required any time an object is presented to the user multiple times, as in joins / cross / cogroup. [~StephanEwen]'s comment on cheap vs expensive copies would be excellent for the user documentation. As I understand, if you're making expensive copies (with object reuse enabled) in a user defined function, you may be better off disabling object reuse and allowing the deserialization to use a fresh object instead. > Object reuse bug in MergeIterator.HeadStream.nextHead > ----------------------------------------------------- > > Key: FLINK-3291 > URL: https://issues.apache.org/jira/browse/FLINK-3291 > Project: Flink > Issue Type: Bug > Components: Distributed Runtime > Affects Versions: 1.0.0 > Reporter: Gabor Gevay > Assignee: Gabor Gevay > Priority: Critical > > MergeIterator.HeadStream.nextHead saves a reference into `this.head` of the > `reuse` object that it got as an argument. This object might be modified > later by the caller. > This actually happens when ReduceDriver.run calls input.next (which will > actually be MergeIterator.next(E reuse)) in the inner while loop of the > objectReuseEnabled branch, and that calls top.nextHead with the reference > that it got from ReduceDriver, which erroneously saves the reference, and > then ReduceDriver later uses that same object for doing the reduce. > Another way in which this fails is when MergeIterator.next(E reuse) gives > `reuse` to different `top`s in different calls, and then the heads end up > being the same object. > You can observe the latter situation in action by running ReducePerformance > here: > https://github.com/ggevay/flink/tree/merge-iterator-object-reuse-bug > Set memory to -Xmx200m (so that the MergeIterator actually has merging to > do), put a breakpoint at the beginning of MergeIterator.next(reuse), and then > watch `reuse`, and the heads of the first two elements of `this.heap` in the > debugger. They will get to be the same object after hitting continue about 6 > times. > You can also look at the count that is printed at the end, which shouldn't be > larger than the key range. Also, if you look into the output file > /tmp/xxxobjectreusebug, for example the key 999977 appears twice. > The good news is that I think I can see an easy fix that doesn't affect > performance: MergeIterator.HeadStream could have a reuse object of its own as > a member, and give that to iterator.next in nextHead(E reuse). And then we > wouldn't need the overload of nextHead that has the reuse parameter, and > MergeIterator.next(E reuse) could just call its other overload. -- This message was sent by Atlassian JIRA (v6.3.4#6332)