[jira] [Commented] (FLINK-1297) Add support for tracking statistics of intermediate results

ASF GitHub Bot (JIRA) Tue, 21 Apr 2015 08:28:02 -0700

    [ https://issues.apache.org/jira/browse/FLINK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505108#comment-14505108 ]


ASF GitHub Bot commented on FLINK-1297:
---------------------------------------

Github user tammymendt commented on a diff in the pull request:

    https://github.com/apache/flink/pull/605#discussion_r28789034
  
    --- Diff: flink-contrib/src/test/java/org/apache/flink/contrib/operatorstatistics/OperatorStatsAccumulatorsTest.java ---
    @@ -0,0 +1,144 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.contrib.operatorstatistics;
    +
    +import org.apache.flink.api.common.JobExecutionResult;
    +import org.apache.flink.api.common.accumulators.Accumulator;
    +import org.apache.flink.api.common.functions.RichFlatMapFunction;
    +import org.apache.flink.api.java.ExecutionEnvironment;
    +import org.apache.flink.api.java.io.DiscardingOutputFormat;
    +import org.apache.flink.api.java.tuple.Tuple1;
    +import org.apache.flink.configuration.Configuration;
    +import org.apache.flink.test.util.AbstractTestBase;
    +import org.apache.flink.util.Collector;
    +import org.junit.Assert;
    +import org.junit.Test;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import java.io.Serializable;
    +import java.util.Map;
    +import java.util.Random;
    +
    +public class OperatorStatsAccumulatorsTest extends AbstractTestBase {
    +
    +   private static final Logger LOG = LoggerFactory.getLogger(OperatorStatsAccumulatorsTest.class);
    +
    +   private static final String ACCUMULATOR_NAME = "op-stats";
    +
    +   public OperatorStatsAccumulatorsTest(){
    +           super(new Configuration());
    +   }
    +
    +   @Test
    +   public void testAccumulator() throws Exception {
    +
    +           String input = "";
    +
    +           Random rand = new Random();
    +
    +           for (int i = 1; i < 1000; i++) {
    +                   if(rand.nextDouble()<0.2){
    +                           input+=String.valueOf(rand.nextInt(5))+"\n";
    +                   }else{
    +                           input+=String.valueOf(rand.nextInt(100))+"\n";
    +                   }
    +           }
    +
    +           String inputFile = createTempFile("datapoints.txt", input);
    +
    +           ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    +
    +           env.readTextFile(inputFile).
    +                           flatMap(new StringToInt()).
    +                           output(new DiscardingOutputFormat<Tuple1<Integer>>());
    +
    +           JobExecutionResult result = env.execute();
    +
    +           OperatorStatistics globalStats = result.getAccumulatorResult(ACCUMULATOR_NAME);
    +           LOG.debug("Global Stats");
    +           LOG.debug(globalStats.toString());
    +
    +           OperatorStatistics merged = null;
    +
    +           Map<String,Object> accResults = result.getAllAccumulatorResults();
    +           for (String accumulatorName:accResults.keySet()){
    +                   if (accumulatorName.contains(ACCUMULATOR_NAME+"-")){
    +                           OperatorStatistics localStats = (OperatorStatistics) accResults.get(accumulatorName);
    +                           if (merged == null){
    +                                   merged = localStats.clone();
    +                           }else {
    +                                   merged.merge(localStats);
    +                           }
    +                           LOG.debug("Local Stats: " + accumulatorName);
    +                           LOG.debug(localStats.toString());
    +                   }
    +           }
    +
    +           Assert.assertEquals(globalStats.cardinality,999);
    +           Assert.assertEquals(globalStats.estimateCountDistinct(),100);
    +           Assert.assertTrue(globalStats.getHeavyHitters().size()>0 && globalStats.getHeavyHitters().size()<=5);
    +           Assert.assertEquals(merged.getMin(),globalStats.getMin());
    +           Assert.assertEquals(merged.getMax(),globalStats.getMax());
    +           Assert.assertEquals(merged.estimateCountDistinct(),globalStats.estimateCountDistinct());
    +           Assert.assertEquals(merged.getHeavyHitters().size(),globalStats.getHeavyHitters().size());
    +
    +   }
    +
    +   public static class StringToInt extends RichFlatMapFunction<String, Tuple1<Integer>> {
    +
    +           // Is instantiated later since the runtime context is not yet initialized
    +           private Accumulator<Object, Serializable> globalAccumulator;
    +           private Accumulator<Object,Serializable>[] localAccumulators;
    --- End diff --
    
    This is the solution I found for temporarily storing both the local and
    the global accumulators. I think there should be a better way to do this
    than requiring the user to declare both and then update both inside the
    UDF.
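
One possible direction, as a minimal sketch only: a wrapper that updates the global and the per-subtask local statistics in a single call, so the UDF does not have to maintain both by hand. `DualAccumulator` and `MinMaxCount` are hypothetical names, and `MinMaxCount` is a toy stand-in for `OperatorStatistics`; none of this is Flink API or part of the pull request.

```java
public class DualAccumulatorSketch {

    /** Toy mergeable statistics (count, min, max); a stand-in for OperatorStatistics. */
    static final class MinMaxCount {
        long count = 0;
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;

        void add(int value) {
            count++;
            min = Math.min(min, value);
            max = Math.max(max, value);
        }

        void merge(MinMaxCount other) {
            count += other.count;
            min = Math.min(min, other.min);
            max = Math.max(max, other.max);
        }
    }

    /** Hypothetical wrapper: one add() call updates the global and the local statistics. */
    static final class DualAccumulator {
        final MinMaxCount global = new MinMaxCount();
        final MinMaxCount[] locals;

        DualAccumulator(int parallelism) {
            locals = new MinMaxCount[parallelism];
            for (int i = 0; i < parallelism; i++) {
                locals[i] = new MinMaxCount();
            }
        }

        /** Records the value once globally and once for the given subtask. */
        void add(int subtaskIndex, int value) {
            global.add(value);
            locals[subtaskIndex].add(value);
        }

        /** Merges all local statistics, mirroring the merge loop in the test above. */
        MinMaxCount mergeLocals() {
            MinMaxCount merged = new MinMaxCount();
            for (MinMaxCount local : locals) {
                merged.merge(local);
            }
            return merged;
        }
    }

    public static void main(String[] args) {
        DualAccumulator acc = new DualAccumulator(2);
        int[] values = {7, 3, 42, 3, 19};
        for (int i = 0; i < values.length; i++) {
            acc.add(i % 2, values[i]); // alternate between two simulated subtasks
        }
        MinMaxCount merged = acc.mergeLocals();
        System.out.println(merged.count + " " + merged.min + " " + merged.max);
    }
}
```

In a real job the subtask index would come from the runtime context and the wrapped objects would be registered accumulators; how that registration should work is exactly the open question here.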


> Add support for tracking statistics of intermediate results
> -----------------------------------------------------------
>
>                 Key: FLINK-1297
>                 URL: https://issues.apache.org/jira/browse/FLINK-1297
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Runtime
>            Reporter: Alexander Alexandrov
>            Assignee: Alexander Alexandrov
>             Fix For: 0.9
>
>   Original Estimate: 1,008h
>  Remaining Estimate: 1,008h
>
> One of the major problems related to the optimizer at the moment is the lack 
> of proper statistics.
> With the introduction of staged execution, it is possible to instrument the 
> runtime code with a statistics facility that collects the required 
> information for optimizing the next execution stage.
> I would therefore like to contribute code that can be used to gather basic 
> statistics for the (intermediate) result of dataflows (e.g. min, max, count, 
> count distinct) and make them available to the job manager.
> Before I start, I would like to hear some feedback from the other users.
> In particular, to handle skew (e.g. on grouping) it might be good to have 
> some sort of detailed sketch about the key distribution of an intermediate 
> result. I am not sure whether a simple histogram is the most effective way to 
> go. Maybe somebody would propose another lightweight sketch that provides 
> better accuracy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
