[jira] [Comment Edited] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-06-01 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014985#comment-14014985
 ] 

Gokhan Capan edited comment on MAHOUT-1529 at 6/1/14 3:03 PM:
--

[~dlyubimov], I imagine that in the near future we will want to add a matrix 
implementation with fast row and column access for memory-based algorithms such 
as neighborhood-based recommendation. This could be a new persistent store 
engineered to preserve locality for kNN, the new Solr backend potentially cast 
to a Matrix, or something else. 

Anyway, my point is that we may want to add different types of distributed 
matrices with engine-specific (or data-structure-specific) strengths in the 
future. I suggest turning each behavior (such as caching) into an additional 
trait, which the author of a distributed execution engine (or data structure) 
can mix into her concrete implementation (for example, Spark's matrix is one 
with Caching and Broadcasting). It might even make logical planning easier: if 
a matrix supports caching, cache it; if two matrices are partitioned the same 
way, do one thing, otherwise another; if one matrix is small, broadcast it; and 
so on. 

So I suggest a base Matrix trait with nrows and ncols methods (as it currently 
is); a BatchExecution trait with methods for partitioning and parallel 
execution; a Caching trait with methods for caching/uncaching; and, in the 
future, a RandomAccess trait with methods for accessing rows and columns (and 
possibly cells). 

Then a concrete DRM-like matrix would be a Matrix with BatchExecution and 
possibly Caching, a concrete RandomAccessMatrix would be a Matrix with 
RandomAccess, and so on. What do you think? And if you and others are positive, 
how do you think this should be handled?
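A minimal Scala sketch of the trait layering proposed above; all names here 
(Matrix, BatchExecution, Caching, RandomAccess, SparkDrm, maybeCache) are 
illustrative assumptions, not actual Mahout API:

```scala
// Base capability: every matrix knows its geometry.
trait Matrix {
  def nrow: Long
  def ncol: Int
}

// Optional capabilities, expressed as mixin traits with a Matrix self-type.
trait Caching { this: Matrix =>
  def cache(): this.type
  def uncache(): this.type
}

trait BatchExecution { this: Matrix =>
  // Placeholder for partitioning / parallel-execution methods.
  def numPartitions: Int
}

trait RandomAccess { this: Matrix =>
  def row(i: Int): Array[Double]
  def col(j: Int): Array[Double]
}

// A Spark-style DRM mixes in batch execution and caching.
class SparkDrm(val nrow: Long, val ncol: Int, val numPartitions: Int)
    extends Matrix with BatchExecution with Caching {
  var cached = false
  def cache(): this.type = { cached = true; this }
  def uncache(): this.type = { cached = false; this }
}

// The logical planner can then branch on capabilities by pattern matching:
// "if it supports caching, cache it."
def maybeCache(m: Matrix): Boolean = m match {
  case c: Caching => c.cache(); true
  case _          => false
}
```

A planner rule that receives a plain Matrix thus degrades gracefully on 
backends that lack a capability, instead of requiring every engine to stub out 
unsupported operations.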


was (Author: gokhancapan):
[~dlyubimov], I imagine that in the near future we will want to add a matrix 
implementation with fast row and column access for in-memory algorithms such 
as neighborhood-based recommendation. This could be a new persistent store 
engineered to preserve locality for kNN, the new Solr backend potentially cast 
to a Matrix, or something else. 

Anyway, my point is that we may want to add different types of distributed 
matrices with engine-specific (or data-structure-specific) strengths in the 
future. I suggest turning each behavior (such as caching) into an additional 
trait, which the author of a distributed execution engine (or data structure) 
can mix into her concrete implementation (for example, Spark's matrix is one 
with Caching and Broadcasting). It might even make logical planning easier: if 
a matrix supports caching, cache it; if two matrices are partitioned the same 
way, do one thing, otherwise another; if one matrix is small, broadcast it; and 
so on. 

So I suggest a base Matrix trait with nrows and ncols methods (as it currently 
is); a BatchExecution trait with methods for partitioning and parallel 
execution; a Caching trait with methods for caching/uncaching; and, in the 
future, a RandomAccess trait with methods for accessing rows and columns (and 
possibly cells). 

Then a concrete DRM-like matrix would be a Matrix with BatchExecution and 
possibly Caching, a concrete RandomAccessMatrix would be a Matrix with 
RandomAccess, and so on. What do you think? And if you and others are positive, 
how do you think this should be handled?

> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> -(1) checkpoint() accepts Spark constant StorageLevel directly;-
> -(2) certain things in CheckpointedDRM;-
> -(3) drmParallelize etc. routines in the "drm" and "sparkbindings" package.-
> -(5) drmBroadcast returns a Spark-specific Broadcast object-
> (6) Stratosphere/Flink conceptual api changes.
> *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, 
> need new PR for remaining things once ready.
> *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-06-01 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014985#comment-14014985
 ] 

Gokhan Capan edited comment on MAHOUT-1529 at 6/1/14 2:55 PM:
--

[~dlyubimov], I imagine that in the near future we will want to add a matrix 
implementation with fast row and column access for in-memory algorithms such 
as neighborhood-based recommendation. This could be a new persistent store 
engineered to preserve locality for kNN, the new Solr backend potentially cast 
to a Matrix, or something else. 

Anyway, my point is that we may want to add different types of distributed 
matrices with engine-specific (or data-structure-specific) strengths in the 
future. I suggest turning each behavior (such as caching) into an additional 
trait, which the author of a distributed execution engine (or data structure) 
can mix into her concrete implementation (for example, Spark's matrix is one 
with Caching and Broadcasting). It might even make logical planning easier: if 
a matrix supports caching, cache it; if two matrices are partitioned the same 
way, do one thing, otherwise another; if one matrix is small, broadcast it; and 
so on. 

So I suggest a base Matrix trait with nrows and ncols methods (as it currently 
is); a BatchExecution trait with methods for partitioning and parallel 
execution; a Caching trait with methods for caching/uncaching; and, in the 
future, a RandomAccess trait with methods for accessing rows and columns (and 
possibly cells). 

Then a concrete DRM-like matrix would be a Matrix with BatchExecution and 
possibly Caching, a concrete RandomAccessMatrix would be a Matrix with 
RandomAccess, and so on. What do you think? And if you and others are positive, 
how do you think this should be handled?


was (Author: gokhancapan):
[~dlyubimov], I imagine that in the near future we will want to add a matrix 
implementation with fast row and column access for in-memory algorithms such 
as neighborhood-based recommendation. This could be a new persistent store 
engineered to preserve locality for kNN, the new Solr backend potentially cast 
to a Matrix, or something else. 

Anyway, my point is that we may want to add different types of distributed 
matrices with engine-specific (or data-structure-specific) strengths in the 
future. I suggest turning each behavior (such as caching) into an additional 
trait, which the author of a distributed execution engine (or data structure) 
can mix into her concrete implementation (for example, Spark's matrix is one 
with Caching and Broadcasting). It might even make logical planning easier: if 
a matrix supports caching, cache it; if two matrices are partitioned the same 
way, do one thing, otherwise another; if one matrix is small, broadcast it; and 
so on. 

So I suggest a base Matrix trait with nrows and ncols methods (as it currently 
is); a BatchExecution trait with methods for partitioning and parallel 
execution; a Caching trait with methods for caching/uncaching; and, in the 
future, a RandomAccess trait with methods for accessing rows and columns (and 
possibly cells). 

Then a concrete DRM-like matrix would be a Matrix with BatchOps and possibly 
CacheOps, a concrete RandomAccessMatrix would be a Matrix with RandomAccessOps, 
and so on. What do you think? And if you and others are positive, how do you 
think this should be handled?

> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> -(1) checkpoint() accepts Spark constant StorageLevel directly;-
> -(2) certain things in CheckpointedDRM;-
> -(3) drmParallelize etc. routines in the "drm" and "sparkbindings" package.-
> -(5) drmBroadcast returns a Spark-specific Broadcast object-
> (6) Stratosphere/Flink conceptual api changes.
> *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, 
> need new PR for remaining things once ready.
> *Pull requests are welcome*.





[jira] [Comment Edited] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-05-20 Thread Anand Avati (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002743#comment-14002743
 ] 

Anand Avati edited comment on MAHOUT-1529 at 5/20/14 7:06 AM:
--

[~dlyubimov], I had a quick look at the commits, and it looks like a much 
cleaner separation now. Some comments:

- Should DrmLike really be a generic class like DrmLike[T] where T is 
unbounded? For example, it does not make sense to have DrmLike[String]. The 
only meaningful ones are probably DrmLike[Int] and DrmLike[Double]. Is there 
some way we can restrict DrmLike to just Int and Double? Or fixate on just 
Double? While RDD supports arbitrary T, H2O supports only numeric types, which 
is sufficient for Mahout's needs.

UPDATE: I see that historically the DRM's row index need not be numerical. In 
practice, could this be anything other than a number or a string?

- I am toying around with the new separation to build a pure, from-scratch, 
local in-memory "backend" which communicates through Java serialization over a 
ByteArrayStream. I am hoping this will not only serve as a reference for future 
backend implementors, but also help keep the algorithms' test cases inside 
math-scala. Thoughts?

- 'type DrmTuple[K] = (K, Vector)' is probably better placed in 
spark/../package.scala, I think, as it is really an artifact of how the RDD is 
defined. However, BlockifiedDrmTuple[K] probably still belongs in math-scala.
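One way the row-key restriction asked about above could be expressed is with 
sealed implicit evidence, so that only a closed set of key types compiles. A 
hedged Scala sketch; DrmKey and drmParallelizeEmpty are illustrative names, not 
actual Mahout API:

```scala
// Sealed evidence: the only instances live in the companion object,
// so the set of legal key types is closed.
sealed trait DrmKey[K]
object DrmKey {
  implicit object IntKey    extends DrmKey[Int]
  implicit object LongKey   extends DrmKey[Long]
  implicit object StringKey extends DrmKey[String]
}

trait DrmLike[K] {
  def nrow: Long
  def ncol: Int
}

// The context bound [K: DrmKey] makes e.g. drmParallelizeEmpty[java.io.File]
// a compile-time error, while Int/Long/String keys remain usable.
def drmParallelizeEmpty[K: DrmKey](rows: Long, cols: Int): DrmLike[K] =
  new DrmLike[K] {
    def nrow: Long = rows
    def ncol: Int  = cols
  }
```

This keeps the API open to numbers and strings (matching the historical row 
indices noted in the UPDATE) without allowing arbitrary T.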


was (Author: avati):
[~dlyubimov], I had a quick look at the commits, and it looks like a much 
cleaner separation now. Some comments:

- Should DrmLike really be a generic class like DrmLike[T] where T is 
unbounded? For example, it does not make sense to have DrmLike[String]. The 
only meaningful ones are probably DrmLike[Int] and DrmLike[Double]. Is there 
some way we can restrict DrmLike to just Int and Double? Or fixate on just 
Double? While RDD supports arbitrary T, H2O supports only numeric types, which 
is sufficient for Mahout's needs.

UPDATE: I see that historically the DRM's row index need not be numerical. In 
practice, could this be anything other than a number or a string?

- I am toying around with the new separation to build a pure, from-scratch, 
local in-memory "backend" which communicates through Java serialization over a 
ByteArrayStream. I am hoping this will not only serve as a reference for future 
backend implementors, but also help keep the algorithms' test cases inside 
math-scala. Thoughts?

> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> -(1) checkpoint() accepts Spark constant StorageLevel directly;-
> (2) certain things in CheckpointedDRM;
> (3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 
> (5) drmBroadcast returns a Spark-specific Broadcast object
> *Current tracker:* 
> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
> *Pull requests are welcome*.





[jira] [Comment Edited] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-05-20 Thread Anand Avati (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002743#comment-14002743
 ] 

Anand Avati edited comment on MAHOUT-1529 at 5/20/14 7:03 AM:
--

[~dlyubimov], I had a quick look at the commits, and it looks like a much 
cleaner separation now. Some comments:

- Should DrmLike really be a generic class like DrmLike[T] where T is 
unbounded? For example, it does not make sense to have DrmLike[String]. The 
only meaningful ones are probably DrmLike[Int] and DrmLike[Double]. Is there 
some way we can restrict DrmLike to just Int and Double? Or fixate on just 
Double? While RDD supports arbitrary T, H2O supports only numeric types, which 
is sufficient for Mahout's needs.

UPDATE: I see that historically the DRM's row index need not be numerical. In 
practice, could this be anything other than a number or a string?

- I am toying around with the new separation to build a pure, from-scratch, 
local in-memory "backend" which communicates through Java serialization over a 
ByteArrayStream. I am hoping this will not only serve as a reference for future 
backend implementors, but also help keep the algorithms' test cases inside 
math-scala. Thoughts?


was (Author: avati):
[~dlyubimov], I had a quick look at the commits, and it looks like a much 
cleaner separation now. Some comments:

- Should DrmLike really be a generic class like DrmLike[T] where T is 
unbounded? For example, it does not make sense to have DrmLike[String]. The 
only meaningful ones are probably DrmLike[Int] and DrmLike[Double]. Is there 
some way we can restrict DrmLike to just Int and Double? Or fixate on just 
Double? While RDD supports arbitrary T, H2O supports only numeric types, which 
is sufficient for Mahout's needs.

- I am toying around with the new separation to build a pure, from-scratch, 
local in-memory "backend" which communicates through Java serialization over a 
ByteArrayStream. I am hoping this will not only serve as a reference for future 
backend implementors, but also help keep the algorithms' test cases inside 
math-scala. Thoughts?

> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> -(1) checkpoint() accepts Spark constant StorageLevel directly;-
> (2) certain things in CheckpointedDRM;
> (3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 
> (5) drmBroadcast returns a Spark-specific Broadcast object
> *Current tracker:* 
> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
> *Pull requests are welcome*.





[jira] [Comment Edited] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-05-02 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988516#comment-13988516
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1529 at 5/3/14 1:30 AM:
--

It is expensive in anybody's time (compared to REPL adaptation). I certainly 
won't try to do it at this point. If you want to do it, then yes, please file a 
new JIRA.

Also, the REPL cannot be used with Mahout as is, so yes, what we did is well 
warranted. 

I am not sure about re-branding; we did too little to warrant that, indeed. But 
the REPL can't work with Mahout, or at the very least it is awkward to do so 
manually (i.e., tracing all Mahout jar dependencies, adding them to the 
session, and making sure all the proper imports are done).






was (Author: dlyubimov):
It is expensive in anybody's time. I certainly won't try to do it at this 
point. If you want to do it, then yes, please file a new JIRA.





> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> (1) checkpoint() accepts Spark constant StorageLevel directly;
> (2) certain things in CheckpointedDRM;
> (3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 
> (5) drmBroadcast returns a Spark-specific Broadcast object
> *Current tracker:* 
> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
> *Pull requests are welcome*.





[jira] [Comment Edited] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-05-02 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988508#comment-13988508
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1529 at 5/3/14 1:15 AM:
--

Yes, it's cleaner, but it is not even clear whether it is achievable, and if it 
is, it is expensive. Like I said, you are welcome to try -- if it works with 
Spark identically to the REPL, there will be no argument against using it 
instead of the REPL. 

But my budget on this is very limited, so it is not the most pragmatic path for 
me to get things done.

Bottom line: something that works today beats something hypothetical. 


was (Author: dlyubimov):
Yes, it's cleaner, but it is not even clear whether it is achievable, and if it 
is, it is expensive. Like I said, you are welcome to try -- if it works with 
Spark identically to the REPL, there will be no argument against using it 
instead of the REPL. 

But my budget on this is very limited, so it is not the most pragmatic path for 
me to get things done.

> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> (1) checkpoint() accepts Spark constant StorageLevel directly;
> (2) certain things in CheckpointedDRM;
> (3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 
> (5) drmBroadcast returns a Spark-specific Broadcast object
> *Current tracker:* 
> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
> *Pull requests are welcome*.





[jira] [Comment Edited] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-05-02 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988088#comment-13988088
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1529 at 5/2/14 7:05 PM:
--

[~avati], the use patterns of optimizer checkpoints are discussed at length in 
my talk. Two basic use cases are explicit management of cache policies and 
sharing a common computational path. 

[~ssc] 
bq. Why would we need that explicit execute operator for Stratosphere?

Correct me if I am reading Stratosphere wrong (I still haven't run a single 
program on it, so please forgive me for being a bit superficial here). The 
Stratosphere programming API implies that we may define more than one sink in 
the graph (i.e., writeDRM() calls) without triggering a computational action. 
How would we trigger computation if sink definitions such as writeDRM() no 
longer trigger it?
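The multiple-sink concern can be sketched as a toy model in Scala. This is not 
the actual Stratosphere/Flink API; LazyEnv, writeDrm, and execute are made-up 
names used only to illustrate why an explicit execute operator seems needed:

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model: sink definitions only record work; nothing runs until an
// explicit execute(), unlike Spark, where an action itself triggers the job.
class LazyEnv {
  private val sinks = ArrayBuffer.empty[() => Unit]

  // Defining a sink is side-effect free; it just registers the write.
  def writeDrm(action: () => Unit): Unit = sinks += action

  // Only an explicit execute() runs every registered sink; returns their count.
  def execute(): Int = { sinks.foreach(f => f()); sinks.length }
}
```

With two writeDrm() calls registered, nothing has happened yet; a single 
execute() then runs both sinks in one job, which is the behavior the question 
above is probing.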

The collect() stuff is also unclear; I guess it doesn't have a direct mapping 
either, at least until Stephan finishes his promised piece on it.

-d


was (Author: dlyubimov):
[~avati], the use patterns of optimizer checkpoints are discussed at length in 
my talk. Two basic use cases are explicit management of cache policies and 
sharing a common computational path. 

[~ssc] 
bq. Why would we need that explicit execute operator for Stratosphere?

Correct me if I am reading Stratosphere wrong (I still haven't run a single 
program on it, so please forgive me for being a bit superfluous here). The 
Stratosphere programming API implies that we may define more than one sink in 
the graph (i.e., writeDRM() calls) without triggering a computational action. 
How would we trigger computation if sink definitions such as writeDRM() no 
longer trigger it?

The collect() stuff is also unclear; I guess it doesn't have a direct mapping 
either, at least until Stephan finishes his promised piece on it.

-d

> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> (1) checkpoint() accepts Spark constant StorageLevel directly;
> (2) certain things in CheckpointedDRM;
> (3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 
> (5) drmBroadcast returns a Spark-specific Broadcast object


