[ 
https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376899#comment-14376899
 ] 

Stephan Ewen commented on FLINK-1319:
-------------------------------------

Very nice result, a very much anticipated feature.

Can you tell us how many functions are currently analyzed by this? Does the 
basic mechanism work with record-at-a-time functions only, or also with 
group-at-a-time functions?

To proceed:
  - Do we need an extra project for this? I would actually not mind having this 
in core / java. It is sort of lightweight and we have the ASM dependency 
anyways (closure cleaning).
  - To activate or deactivate it, I would use the ExecutionConfig in the 
ExecutionEnvironment. From my experience with users, no one bothers to call any 
of the parametrization methods ever (withForwardFields, withName, analyzeUdf, 
...). If we make it dependent on that, it will effectively not be used.
  - I would have it deactivated by default initially. Users can activate it 
globally with the ExecutionConfig. We should have it activated in all tests 
to get code coverage with our test UDFs. This can be done centrally, 
where the test context environments are created.
  - We can activate it by default in the next release, once we have given this 
some testing and exposure.
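
A minimal sketch of the activation scheme proposed above: off by default, switched on through a global config, and switched on centrally for tests. All class and method names here (SketchConfig, SketchEnvironment, enableUdfAnalysis, createTestEnvironment) are hypothetical stand-ins for illustration, not Flink's actual ExecutionConfig or ExecutionEnvironment API.

```java
public class AnalysisToggleSketch {

    /** Hypothetical stand-in for a global execution config; not Flink's class. */
    static final class SketchConfig {
        // Off by default, as proposed for the initial release.
        private boolean udfAnalysisEnabled = false;

        SketchConfig enableUdfAnalysis() {
            udfAnalysisEnabled = true;
            return this;
        }

        boolean isUdfAnalysisEnabled() {
            return udfAnalysisEnabled;
        }
    }

    /** Hypothetical stand-in for an execution environment that owns the config. */
    static final class SketchEnvironment {
        private final SketchConfig config = new SketchConfig();

        SketchConfig getConfig() {
            return config;
        }
    }

    /** The single place where test environments are created, so every test UDF is analyzed. */
    static SketchEnvironment createTestEnvironment() {
        SketchEnvironment env = new SketchEnvironment();
        env.getConfig().enableUdfAnalysis();
        return env;
    }

    public static void main(String[] args) {
        SketchEnvironment userEnv = new SketchEnvironment();
        SketchEnvironment testEnv = createTestEnvironment();
        System.out.println("user environment: " + userEnv.getConfig().isUdfAnalysisEnabled());
        System.out.println("test environment: " + testEnv.getConfig().isUdfAnalysisEnabled());
    }
}
```

Keeping the switch on the config rather than on each operator means users opt in once per program, and flipping the default later is a one-line change.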

Other comments:
  - I would vote to throw an exception (or at least print a warning) if you 
detect that any path in the program returns a null value.
  - The ASM dependency version needs to be set by a variable (defined in the root POM, 
interaction with shading)
  - Can you format the POM XML like the other POMs (tabs)?


> Add static code analysis for UDFs
> ---------------------------------
>
>                 Key: FLINK-1319
>                 URL: https://issues.apache.org/jira/browse/FLINK-1319
>             Project: Flink
>          Issue Type: New Feature
>          Components: Java API, Scala API
>            Reporter: Stephan Ewen
>            Assignee: Timo Walther
>            Priority: Minor
>
> Flink's Optimizer takes information that tells it for UDFs which fields of 
> the input elements are accessed, modified, or forwarded/copied. This 
> information frequently helps to reuse partitionings, sorts, etc. It may speed 
> up programs significantly, as it can frequently eliminate sorts and shuffles, 
> which are costly.
> Right now, users can add lightweight annotations to UDFs to provide this 
> information (such as adding {{@ConstantFields("0->3, 1, 2->1")}}).
> We worked with static code analysis of UDFs before, to determine this 
> information automatically. This is an incredible feature, as it "magically" 
> makes programs faster.
> For record-at-a-time operations (Map, Reduce, FlatMap, Join, Cross), this 
> works surprisingly well in many cases. We used the "Soot" toolkit for the 
> static code analysis. Unfortunately, Soot is LGPL licensed and thus we did 
> not include any of the code so far.
> I propose to add this functionality to Flink, in the form of a drop-in 
> addition, to work around the LGPL incompatibility with ASL 2.0. Users could 
> simply download a special "flink-code-analysis.jar" and drop it into the 
> "lib" folder to enable this functionality. We may even add a script to 
> "tools" that downloads that library automatically into the lib folder. This 
> should be legally fine, since we do not redistribute LGPL code and only 
> dynamically link it (the incompatibility with ASL 2.0 is mainly in the 
> patentability, if I remember correctly).
> Prior work on this has been done by [~aljoscha] and [~skunert], which could 
> provide a code base to start with.
> *Appendix*
> Homepage of the Soot static analysis toolkit: http://www.sable.mcgill.ca/soot/
> Papers on static analysis for optimization: 
> http://stratosphere.eu/assets/papers/EnablingOperatorReorderingSCA_12.pdf and 
> http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf
> Quick introduction to the Optimizer: 
> http://stratosphere.eu/assets/papers/2014-VLDBJ_Stratosphere_Overview.pdf 
> (Section 6)
> Optimizer for Iterations: 
> http://stratosphere.eu/assets/papers/spinningFastIterativeDataFlows_12.pdf 
> (Sections 4.3 and 5.3)
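
The annotation string in the quoted description, "0->3, 1, 2->1", can be read as a list of field forwards. The sketch below parses such a string into source-to-target index pairs, assuming that "a->b" means input field a is copied to output position b and that a bare index means the field keeps its position. The class ForwardedFieldsSketch and its parse method are hypothetical helpers for illustration, not part of Flink, and the sketch does not check for conflicting targets.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ForwardedFieldsSketch {

    /** Parses a spec such as "0->3, 1, 2->1" into source-index to target-index pairs. */
    static Map<Integer, Integer> parse(String spec) {
        Map<Integer, Integer> forwards = new LinkedHashMap<>();
        for (String part : spec.split(",")) {
            String[] sides = part.trim().split("->");
            int source = Integer.parseInt(sides[0].trim());
            // A bare index (no "->") is assumed to mean "same position in the output".
            int target = sides.length == 2 ? Integer.parseInt(sides[1].trim()) : source;
            forwards.put(source, target);
        }
        return forwards;
    }

    public static void main(String[] args) {
        // Prints the parsed forwards for the example string from the issue description.
        System.out.println(parse("0->3, 1, 2->1"));
    }
}
```

A static analyzer would aim to derive the same kind of mapping from the UDF's bytecode, so the user never has to write the annotation by hand.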



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
