Greg,

We ran into this Issue when implementing the Mahout bindings for Flink [1].  It 
ended up being the major bottleneck for Mahout on Flink, and makes iterative 
algorithms basically unreasonable.  While it is understook that that Flink's 
Delta-iterations are intended for use when iterating over Flink DataSet, they 
are not always suitable to the task.  Eg. FlinkML's  ALS.scala [2][3] which 
must flush intermediate results to the file system.

In-memory caching is a must have for any type of declarative ML DSL riding on 
top of fink.  Eg, SystemML, Mahout, etc, or an internal Flink DSL.  The Mahout 
bindings are Linear Algebra based.  In close collaboration with the Flink 
community [4] and based on a contribution from an intern hired by Data Artisans 
 to complete this task and other members of the community, we recently finished 
up work on the Mahout Distributed Linear Algebra bindings.

Unfortunately the lack of an in-memory cache made the outcome of this effort 
very sub-optimal.  After the unfinished hand-off of the bindings from the Flink 
community to the Mahout community, we were forced to find a workaround for the 
caching.  We used the template of flushing and executing and flushing results 
to the File system [5], and had to release the bindings as "Experimental".

Anyone building a declarative ML interface using the Flink Dataset API as a 
backend will run into similar issues as has been reported on the stackoverflow 
thread u refer to and it would be great to have this feature.


Its great to see this being talked about.

Andy

[1] http://mahout.apache.org/users/flinkbindings/flink-internals.html
[2] 
https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala#L481
[3] 
https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/common/FlinkMLTools.scala#L84
[4]https://issues.apache.org/jira/browse/MAHOUT-1570
[5]https://github.com/apache/mahout/blob/master/flink/src/main/scala/org/apache/mahout/flinkbindings/drm/CheckpointedFlinkDrm.scala#L139

________________________________________
From: Greg Hogan <c...@greghogan.com>
Sent: Thursday, May 26, 2016 4:41:53 PM
To: dev@flink.apache.org
Subject: Iteration Intermediate Output

Hi y'all,

I think this is an oft-requested feature [0] and there are many graph
algorithms for which intermediate output is the desired result. I'd like to
take Stephan up on his offer [1] for pointers.

I have yet to get in deep, but I see that iteration tasks are treated
specially as IterationIntermediateTask for synchronization between
supersteps. Also, when OperatorTranslation and GraphCreatingVisitor are
walking the program DAG an iteration must be first reached through the tail.

Greg

[0]
http://stackoverflow.com/questions/37224140/possibility-of-saving-partial-outputs-from-bulk-iteration-in-flink-dataset
[1]
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Intermediate-output-during-delta-iterations-td436.html

Reply via email to