[ https://issues.apache.org/jira/browse/FLINK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736617#comment-14736617 ]
ASF GitHub Bot commented on FLINK-1297:
---------------------------------------

Github user mxm commented on the pull request:

    https://github.com/apache/flink/pull/605#issuecomment-138866235

I tried again, this works:

```java
@Override
public OperatorStatistics clone() {
    OperatorStatistics clone = new OperatorStatistics(config);
    clone.min = min;
    clone.max = max;
    clone.cardinality = cardinality;

    // Copy the count-distinct sketch by merging it into a fresh, empty one.
    try {
        ICardinality copy;
        if (countDistinct instanceof LinearCounting) {
            copy = new LinearCounting(config.getCountDbitmap());
        } else if (countDistinct instanceof HyperLogLog) {
            copy = new HyperLogLog(config.getCountDlog2m());
        } else {
            throw new IllegalStateException("Unsupported counter.");
        }
        clone.countDistinct = copy.merge(countDistinct);
    } catch (CardinalityMergeException e) {
        throw new RuntimeException("Failed to clone OperatorStatistics!", e);
    }

    // Copy the heavy-hitter sketch the same way.
    try {
        HeavyHitter copy;
        if (heavyHitter instanceof LossyCounting) {
            copy = new LossyCounting(config.getHeavyHitterFraction(),
                    config.getHeavyHitterError());
        } else if (heavyHitter instanceof CountMinHeavyHitter) {
            copy = new CountMinHeavyHitter(config.getHeavyHitterFraction(),
                    config.getHeavyHitterError(),
                    config.getHeavyHitterConfidence(),
                    config.getHeavyHitterSeed());
        } else {
            throw new IllegalStateException("Unsupported counter.");
        }
        copy.merge(heavyHitter);
        clone.heavyHitter = copy;
    } catch (HeavyHitterMergeException e) {
        throw new RuntimeException("Failed to clone OperatorStatistics!", e);
    }

    return clone;
}
```

Do you think we could merge your pull request with this change?

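For readers without the pull request at hand, the copy-by-merge idea in the snippet above can be shown in isolation. The following is only a minimal sketch: `BitmapCounter` is a hypothetical stand-in written for this illustration, not a Flink or stream-lib class, and it assumes (as with `countDistinct` above) that `merge` returns a new object and leaves both inputs untouched.

```java
import java.util.BitSet;

/** Hypothetical mergeable distinct-count bitmap; illustration only. */
final class BitmapCounter {
    private final int numBits;
    private final BitSet bits;

    BitmapCounter(int numBits) {
        this.numBits = numBits;
        this.bits = new BitSet(numBits);
    }

    /** Marks the bit selected by the value's hash code. */
    void offer(Object value) {
        bits.set(Math.floorMod(value.hashCode(), numBits));
    }

    /** Returns a new counter holding the union; neither input is modified. */
    BitmapCounter merge(BitmapCounter other) {
        if (other.numBits != numBits) {
            throw new IllegalArgumentException("Incompatible bitmap sizes.");
        }
        BitmapCounter result = new BitmapCounter(numBits);
        result.bits.or(bits);
        result.bits.or(other.bits);
        return result;
    }

    /** Copies this counter by merging it into a fresh, empty one. */
    BitmapCounter copy() {
        return new BitmapCounter(numBits).merge(this);
    }

    /** Number of bits currently set. */
    int setBits() {
        return bits.cardinality();
    }
}
```

Because the copy is built from an empty instance with the same configuration, later updates to the copy do not affect the original, which is the property `clone()` needs.
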
> Add support for tracking statistics of intermediate results
> -----------------------------------------------------------
>
>                 Key: FLINK-1297
>                 URL: https://issues.apache.org/jira/browse/FLINK-1297
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Runtime
>            Reporter: Alexander Alexandrov
>            Assignee: Alexander Alexandrov
>             Fix For: 0.10
>
>   Original Estimate: 1,008h
>  Remaining Estimate: 1,008h
>
> One of the major problems related to the optimizer at the moment is the lack
> of proper statistics.
> With the introduction of staged execution, it is possible to instrument the
> runtime code with a statistics facility that collects the required
> information for optimizing the next execution stage.
> I would therefore like to contribute code that can be used to gather basic
> statistics for the (intermediate) result of dataflows (e.g. min, max, count,
> count distinct) and make them available to the job manager.
> Before I start, I would like to hear some feedback from the other users.
> In particular, to handle skew (e.g. on grouping) it might be good to have
> some sort of detailed sketch about the key distribution of an intermediate
> result. I am not sure whether a simple histogram is the most effective way to
> go. Maybe somebody would propose another lightweight sketch that provides
> better accuracy.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)