[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014985#comment-14014985
 ] 

Gokhan Capan commented on MAHOUT-1529:
--------------------------------------

[~dlyubimov], I imagine that in the near future we will want to add a matrix 
implementation with fast row and column access for in-memory algorithms such as 
neighborhood-based recommendation. This could be a new persistent storage 
engineered to preserve locality for kNN, the new Solr backend potentially 
cast to a Matrix, or something else. 

Anyway, my point is that we may want to add different types of distributed 
matrices with engine-specific (or data-structure-specific) strengths in the 
future. I suggest turning each behavior (such as Caching) into an additional 
trait, which the author of a distributed execution engine (or data structure) 
can mix in to her concrete implementation (for example, Spark's matrix would be 
one with Caching and Broadcasting). It might even make logical planning easier 
(if a matrix supports caching, cache it; if two matrices are partitioned the 
same way, do one thing, otherwise do another; if one matrix is small, broadcast 
it; etc.). 

So I suggest a base Matrix trait with nrows and ncols methods (as it 
currently is), a BatchExecution trait with methods for partitioning and 
parallel execution, a Caching trait with methods for caching and uncaching, 
and, in the future, a RandomAccess trait with methods for accessing rows and 
columns (and possibly cells). 

Then a concrete DRM (or DRM-like matrix) would be a Matrix with BatchOps and 
possibly CacheOps, a concrete RandomAccessMatrix would be a Matrix with 
RandomAccessOps, and so on. What do you think? If you and others are positive, 
how do you think this should be handled?
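
A minimal sketch of what this trait layout might look like follows. The trait 
names Matrix, BatchOps, CacheOps and RandomAccessOps come from the paragraph 
above; all method signatures, and the toy in-memory classes LocalDrm and 
LocalRandomAccessMatrix, are assumptions made for illustration and are not 
Mahout code.

```scala
// Hypothetical sketch: signatures and concrete classes are illustrative only.
trait Matrix {
  def nrows: Long
  def ncols: Int
}

// Partitioning and parallel execution (run sequentially in this toy version).
trait BatchOps { self: Matrix =>
  def numPartitions: Int
  def mapRows(f: Array[Double] => Array[Double]): Matrix with BatchOps
}

// Caching and uncaching.
trait CacheOps { self: Matrix =>
  def cache(): this.type
  def uncache(): this.type
  def isCached: Boolean
}

// Row, column and cell access.
trait RandomAccessOps { self: Matrix =>
  def row(i: Int): Array[Double]
  def col(j: Int): Array[Double]
  def apply(i: Int, j: Int): Double = row(i)(j)
}

// A DRM-like matrix: a Matrix with BatchOps and CacheOps.
final class LocalDrm(rows: Vector[Array[Double]], val numPartitions: Int = 1)
    extends Matrix with BatchOps with CacheOps {
  private var cached = false
  def nrows: Long = rows.length.toLong
  def ncols: Int = if (rows.isEmpty) 0 else rows.head.length
  def mapRows(f: Array[Double] => Array[Double]): LocalDrm =
    new LocalDrm(rows.map(f), numPartitions)
  def cache(): this.type = { cached = true; this }
  def uncache(): this.type = { cached = false; this }
  def isCached: Boolean = cached
}

// A random-access matrix: a Matrix with RandomAccessOps only.
final class LocalRandomAccessMatrix(data: Array[Array[Double]])
    extends Matrix with RandomAccessOps {
  def nrows: Long = data.length.toLong
  def ncols: Int = if (data.isEmpty) 0 else data(0).length
  def row(i: Int): Array[Double] = data(i).clone()
  def col(j: Int): Array[Double] = data.map(_(j))
}
```

With this layout, an algorithm that needs only batch execution can ask for a 
`Matrix with BatchOps`, and the compiler rejects a backend that lacks that 
capability.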

> Finalize abstraction of distributed logical plans from backend operations
> -------------------------------------------------------------------------
>
>                 Key: MAHOUT-1529
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1529
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 1.0
>
>
> We have a few situations where the algorithm-facing API has Spark 
> dependencies creeping in. 
> In particular, we know of the following cases:
> -(1) checkpoint() accepts Spark constant StorageLevel directly;-
> -(2) certain things in CheckpointedDRM;-
> -(3) drmParallelize etc. routines in the "drm" and "sparkbindings" package.-
> -(5) drmBroadcast returns a Spark-specific Broadcast object-
> (6) Stratosphere/Flink conceptual API changes.
> *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, 
> need new PR for remaining things once ready.
> *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
