[ 
https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559700#comment-14559700
 ] 

ASF GitHub Bot commented on FLINK-1319:
---------------------------------------

GitHub user twalthr opened a pull request:

    https://github.com/apache/flink/pull/729

    [FLINK-1319][core] Add static code analysis for UDFs

    This PR implements a Static Code Analyzer (SCA) that uses the ASM framework 
for interpreting Java bytecode of Flink UDFs. The analyzer is built on top of 
ASM's `BasicInterpreter`. Instead of ASM's `BasicValue`s, I introduced 
`TaggedValue`s, which extend `BasicValue` and allow attaching additional 
information to values. Interesting values such as inputs, collectors, or 
constants are tagged so that atomic input fields can be tracked through the 
entire UDF (until the function returns or calls `collect()`).
    
    The implementation is as conservative as possible: for cases or bytecode 
instructions that haven't been considered, the analyzer falls back to the ASM 
library (which removes `TaggedValue`s).
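
The tagging idea described above can be illustrated with a small, self-contained 
sketch. It is not the PR's code and does not use ASM: the instruction set 
(`LOAD_FIELD`, `PUSH_CONST`, `DUP`, `BINARY_OP`, `COLLECT`) and all class names 
are hypothetical stand-ins chosen for illustration. Values loaded from an input 
field carry a tag, an operation without a special rule drops the tag (the 
conservative fallback), and whatever still carries a tag at `COLLECT` counts as 
a forwarded field.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified model of tag tracking; NOT the PR's ASM-based code.
final class TaggedValueSketch {

    // What the analyzer knows about an abstract value on the operand stack.
    enum Tag { INPUT_FIELD, CONSTANT, UNKNOWN }

    static final class Value {
        final Tag tag;
        final int field; // input field index; -1 unless tag == INPUT_FIELD

        Value(Tag tag, int field) { this.tag = tag; this.field = field; }
        static Value input(int field) { return new Value(Tag.INPUT_FIELD, field); }
        static Value constant() { return new Value(Tag.CONSTANT, -1); }
        static Value unknown() { return new Value(Tag.UNKNOWN, -1); }
    }

    // Toy instruction set (an assumption for this sketch, not JVM bytecode).
    enum Op { LOAD_FIELD, PUSH_CONST, DUP, BINARY_OP, COLLECT }

    static final class Instr {
        final Op op;
        final int arg; // field index for LOAD_FIELD, record arity for COLLECT
        Instr(Op op, int arg) { this.op = op; this.arg = arg; }
    }

    /** Returns output position -> input field for values that reach COLLECT still tagged. */
    static Map<Integer, Integer> analyze(Instr... program) {
        Deque<Value> stack = new ArrayDeque<>();
        Map<Integer, Integer> forwarded = new TreeMap<>();
        for (Instr in : program) {
            switch (in.op) {
                case LOAD_FIELD:
                    stack.push(Value.input(in.arg)); // tag the value with its source field
                    break;
                case PUSH_CONST:
                    stack.push(Value.constant());
                    break;
                case DUP:
                    stack.push(stack.peek()); // a plain copy keeps the tag
                    break;
                case BINARY_OP:
                    // No special rule: consume both operands and drop all tags.
                    stack.pop();
                    stack.pop();
                    stack.push(Value.unknown());
                    break;
                case COLLECT:
                    // The top of the stack is the last field of the emitted record.
                    for (int pos = in.arg - 1; pos >= 0; pos--) {
                        Value v = stack.pop();
                        if (v.tag == Tag.INPUT_FIELD) {
                            forwarded.put(pos, v.field);
                        }
                    }
                    break;
            }
        }
        return forwarded;
    }
}
```

For a toy UDF that emits `(in.f1, in.f0 + 1, in.f0)`, i.e. the program 
`LOAD_FIELD 1, LOAD_FIELD 0, PUSH_CONST, BINARY_OP, LOAD_FIELD 0, COLLECT 3`, 
`analyze` reports that output position 0 comes from input field 1 and output 
position 2 from input field 0, while position 1 is not reported because the 
addition removed its tag.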
    
    61 JUnit tests cover the basic functionality. 18 more JUnit tests exercise 
the analyzer with code examples from the "real world".
    
    The analyzer has 3 modes: DISABLED, OPTIMIZE, HINTS
    
    The interpretation takes some time: analyzing a single UDF can take up to 
1 second. Therefore, I didn't enable the analyzer in TestEnvironment by default, 
to keep build times down. If you uncomment the lines, the analyzer supports all 
280 UDFs within the entire Flink code. 
    
    The analyzer gives hints about:
    - Main feature: ForwardedFields semantic properties for all types of 
Functions except for MapPartition and Combine
    - Warnings if static fields are modified by a Function
    - Warnings if a FilterFunction modifies its input objects
    - Warnings if a Function returns `null`
    - Warnings if a tuple access uses a wrong index
    - Information about the number of object creations within a UDF (for manual 
optimization)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/twalthr/flink sca

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/729.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #729
    
----
commit c384fc9740013ec1ae89a2817695078542c47dfe
Author: twalthr <twal...@apache.org>
Date:   2015-05-26T18:22:03Z

    [FLINK-1319][core] Add static code analysis for UDFs

----


> Add static code analysis for UDFs
> ---------------------------------
>
>                 Key: FLINK-1319
>                 URL: https://issues.apache.org/jira/browse/FLINK-1319
>             Project: Flink
>          Issue Type: New Feature
>          Components: Java API, Scala API
>            Reporter: Stephan Ewen
>            Assignee: Timo Walther
>            Priority: Minor
>
> Flink's Optimizer uses information about which fields of the input elements 
> a UDF accesses, modifies, or forwards/copies. This information frequently 
> helps to reuse partitionings, sorts, etc. It may speed up programs 
> significantly, as it can frequently eliminate sorts and shuffles, which are 
> costly.
> Right now, users can add lightweight annotations to UDFs to provide this 
> information (such as adding {{@ConstantFields("0->3, 1, 2->1")}}).
> We worked with static code analysis of UDFs before, to determine this 
> information automatically. This is an incredible feature, as it "magically" 
> makes programs faster.
> For record-at-a-time operations (Map, Reduce, FlatMap, Join, Cross), this 
> works surprisingly well in many cases. We used the "Soot" toolkit for the 
> static code analysis. Unfortunately, Soot is LGPL licensed and thus we did 
> not include any of the code so far.
> I propose to add this functionality to Flink, in the form of a drop-in 
> addition, to work around the LGPL incompatibility with ASL 2.0. Users could 
> simply download a special "flink-code-analysis.jar" and drop it into the 
> "lib" folder to enable this functionality. We may even add a script to 
> "tools" that downloads that library automatically into the lib folder. This 
> should be legally fine, since we do not redistribute LGPL code and only 
> dynamically link it (the incompatibility with ASL 2.0 is mainly in the 
> patentability, if I remember correctly).
> Prior work on this has been done by [~aljoscha] and [~skunert], which could 
> provide a code base to start with.
> *Appendix*
> Homepage of the Soot static analysis toolkit: http://www.sable.mcgill.ca/soot/
> Papers on static analysis for optimization: 
> http://stratosphere.eu/assets/papers/EnablingOperatorReorderingSCA_12.pdf and 
> http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf
> Quick introduction to the Optimizer: 
> http://stratosphere.eu/assets/papers/2014-VLDBJ_Stratosphere_Overview.pdf 
> (Section 6)
> Optimizer for Iterations: 
> http://stratosphere.eu/assets/papers/spinningFastIterativeDataFlows_12.pdf 
> (Sections 4.3 and 5.3)
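
To make the annotation string quoted in the issue description more concrete, 
here is a minimal sketch of how such a specification could be read. It assumes 
that `0->3` means "input field 0 is forwarded to output field 3" and that a bare 
index such as `1` means the field stays at the same position. The class 
`FieldForwardingSketch` and its `parse` method are hypothetical and are not part 
of Flink's API.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; Flink has its own parsing logic.
final class FieldForwardingSketch {

    /** Parses a spec such as "0->3, 1, 2->1" into a source-field -> target-field map. */
    static Map<Integer, Integer> parse(String spec) {
        Map<Integer, Integer> result = new TreeMap<>();
        for (String part : spec.split(",")) {
            String entry = part.trim();
            if (entry.isEmpty()) {
                continue; // tolerate stray commas
            }
            String[] sides = entry.split("->");
            int source = Integer.parseInt(sides[0].trim());
            // A bare index is assumed to mean "forwarded to the same position".
            int target = sides.length > 1 ? Integer.parseInt(sides[1].trim()) : source;
            result.put(source, target);
        }
        return result;
    }
}
```

Under these assumptions, `FieldForwardingSketch.parse("0->3, 1, 2->1")` yields a 
map with the entries 0 to 3, 1 to 1, and 2 to 1. This is the kind of information 
the static analysis is meant to derive automatically instead of relying on a 
hand-written annotation.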



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
