[ 
https://issues.apache.org/jira/browse/FLINK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14499816#comment-14499816
 ] 

ASF GitHub Bot commented on FLINK-1297:
---------------------------------------

Github user tammymendt commented on a diff in the pull request:

    https://github.com/apache/flink/pull/605#discussion_r28593077
  
    --- Diff: flink-core/pom.xml ---
    @@ -63,6 +63,19 @@ under the License.
                        <artifactId>guava</artifactId>
                        <version>${guava.version}</version>
                </dependency>
    +
    +        <dependency>
    +            <groupId>com.clearspring.analytics</groupId>
    +            <artifactId>stream</artifactId>
    +            <version>2.7.0</version>
    +            <exclusions>
    +                <exclusion>
    +                    <groupId>it.unimi.dsi</groupId>
    +                    <artifactId>fastutil</artifactId>
    --- End diff --
    
    The clearspring library we are using implements a series of streaming 
algorithms. We only use two of them: HyperLogLog to estimate count distinct 
and CountMinSketch to estimate heavy hitters. As far as I could see, fastutil 
is used only in the implementation of a separate streaming algorithm that 
tracks histograms (QDigest). Since we are not using that algorithm, it should 
be OK to exclude fastutil from the dependency. 
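
For readers unfamiliar with the technique, here is a minimal sketch of how a 
count-min sketch estimates item frequencies, which is the basis for finding 
heavy hitters. This is not the clearspring implementation and not the code in 
the pull request: the class name MiniCountMinSketch, its methods, and the hash 
mixing are all hypothetical and exist only for illustration. It uses plain 
arrays and needs no third-party dependency.

```java
/**
 * Hypothetical, minimal count-min sketch for illustration only.
 * This is NOT the stream-lib CountMinSketch class and does not mirror its API.
 */
final class MiniCountMinSketch {
    private final long[][] table; // depth rows, each with width counters
    private final int depth;
    private final int width;

    MiniCountMinSketch(int depth, int width) {
        if (depth <= 0 || width <= 0) {
            throw new IllegalArgumentException("depth and width must be positive");
        }
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
    }

    /** Maps an item to a counter index in the given row (assumed, simple hash mixing). */
    private int bucket(String item, int row) {
        int h = item.hashCode();
        h ^= (row + 1) * 0x9E3779B9; // row-specific seed so rows hash differently
        h ^= (h >>> 16);
        h *= 0x85EBCA6B;
        h ^= (h >>> 13);
        return Math.floorMod(h, width);
    }

    /** Records one occurrence of the item by incrementing one counter per row. */
    void add(String item) {
        for (int row = 0; row < depth; row++) {
            table[row][bucket(item, row)]++;
        }
    }

    /**
     * Returns the estimated count: the minimum over all rows.
     * Collisions can only inflate a counter, so the estimate never undercounts.
     */
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][bucket(item, row)]);
        }
        return min;
    }
}
```

After adding an item n times, estimate(item) returns at least n, and more 
rows or wider rows reduce the chance of overcounting. Items whose estimate 
exceeds a chosen threshold are the heavy hitter candidates.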


> Add support for tracking statistics of intermediate results
> -----------------------------------------------------------
>
>                 Key: FLINK-1297
>                 URL: https://issues.apache.org/jira/browse/FLINK-1297
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Runtime
>            Reporter: Alexander Alexandrov
>            Assignee: Alexander Alexandrov
>             Fix For: 0.9
>
>   Original Estimate: 1,008h
>  Remaining Estimate: 1,008h
>
> One of the major problems related to the optimizer at the moment is the lack 
> of proper statistics.
> With the introduction of staged execution, it is possible to instrument the 
> runtime code with a statistics facility that collects the required 
> information for optimizing the next execution stage.
> I would therefore like to contribute code that can be used to gather basic 
> statistics for the (intermediate) result of dataflows (e.g. min, max, count, 
> count distinct) and make them available to the job manager.
> Before I start, I would like to hear some feedback from the other users.
> In particular, to handle skew (e.g. on grouping) it might be good to have 
> some sort of detailed sketch about the key distribution of an intermediate 
> result. I am not sure whether a simple histogram is the most effective way to 
> go. Maybe somebody would propose another lightweight sketch that provides 
> better accuracy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
