[jira] [Comment Edited] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

Anand Avati (JIRA) Tue, 20 May 2014 00:08:25 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002743#comment-14002743
 ]


Anand Avati edited comment on MAHOUT-1529 at 5/20/14 7:06 AM:
--------------------------------------------------------------

[~dlyubimov], I had a quick look at the commits, and the separation looks a lot 
cleaner now. Some comments:

- Should DrmLike really be a generic class like DrmLike[T] where T is 
unbounded? For example, it does not make sense to have DrmLike[String]. The only 
meaningful ones are probably DrmLike[Int] and DrmLike[Double]. Is there some way 
we can restrict DrmLike to just Int and Double? Or settle on just Double? While 
RDD supports arbitrary T, H2O supports only numeric types, which is sufficient 
for Mahout's needs.

UPDATE: I see that historically a DRM's row index need not be numerical. In 
practice, could it be anything other than a number or a string?
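
One way such a restriction could be expressed in Scala is a sealed type class 
with a context bound on the type parameter. The sketch below is only an 
illustration of the question above: DrmKey and DrmLikeSketch are hypothetical 
names, not Mahout API, and the admitted key types (Int and String) are an 
assumption taken from the UPDATE.

```scala
// Hypothetical sketch: DrmKey and DrmLikeSketch are illustrative names,
// not part of Mahout. The admitted key types are an assumption.
sealed trait DrmKey[K]

object DrmKey {
  // Only these instances exist; because the trait is sealed, no other
  // source file can add more.
  implicit object IntKey extends DrmKey[Int]
  implicit object StringKey extends DrmKey[String]
}

// The context bound demands an implicit DrmKey[K] wherever an instance
// is constructed, so unsupported key types are rejected at compile time.
final class DrmLikeSketch[K: DrmKey](val keys: Seq[K]) {
  def nrow: Int = keys.size
}

object DrmKeyDemo {
  val byInt = new DrmLikeSketch(Seq(0, 1, 2))
  val byString = new DrmLikeSketch(Seq("a", "b"))
  // new DrmLikeSketch(Seq(1.5, 2.5)) // would not compile: no DrmKey[Double]
}
```

With this shape the class stays generic for backends like RDD, while the set 
of usable key types is fixed in one place.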

- I am toying around with the new separation to build a pure, from-scratch, 
local in-memory "backend" that communicates through Java serialization over 
byte-array streams. I am hoping this will not only serve as a reference for 
future backend implementors, but also help to keep the test cases of the 
algorithms inside math-scala. Thoughts?
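
To make the idea concrete, here is a minimal sketch of a round trip through 
standard Java serialization over in-memory byte arrays. LocalWire and RowBlock 
are hypothetical names used only for illustration; nothing here is taken from 
the actual commits.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream,
  ObjectInputStream, ObjectOutputStream}

// Hypothetical helper: not Mahout API, only an illustration of the idea.
object LocalWire {
  // Serialize a value into a byte array using standard Java serialization.
  def toBytes(value: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    buffer.toByteArray
  }

  // Read a value back; the caller states the expected type.
  def fromBytes[A](bytes: Array[Byte]): A = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[A] finally in.close()
  }
}

// Hypothetical message type: a block of rows with Int keys.
// Case classes are Serializable by default.
final case class RowBlock(keys: Array[Int], values: Array[Array[Double]])

object LocalWireDemo {
  val sent = RowBlock(Array(0, 1), Array(Array(1.0, 2.0), Array(3.0, 4.0)))
  val received = LocalWire.fromBytes[RowBlock](LocalWire.toBytes(sent))
}
```

Forcing every exchange through bytes, even in a single JVM, would surface 
serialization problems in tests without needing a cluster.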

- 'type DrmTuple[K] = (K, Vector)' is probably better placed in 
spark/../package.scala, as it is really an artifact of how the RDD is 
defined. However, BlockifiedDrmTuple[K] probably still belongs to math-scala.
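
A rough sketch of the proposed split follows. Plain objects stand in for the 
two package objects, Vector and Matrix are local stand-ins for the Mahout 
types, and the shape given to BlockifiedDrmTuple (an array of keys plus a 
matrix block) is an assumption made for illustration.

```scala
object AliasPlacementSketch {
  // Stand-ins for Mahout's Vector and Matrix so the sketch compiles alone.
  trait Vector
  trait Matrix

  // Backend-neutral side (math-scala in the comment above).
  object MathScalaSide {
    // Assumed shape: a block of row keys together with a matrix block.
    type BlockifiedDrmTuple[K] = (Array[K], Matrix)
  }

  // Spark-specific side, where the row-wise RDD element type would live.
  object SparkSide {
    type DrmTuple[K] = (K, Vector)
  }

  // Example value of the Spark-side alias.
  val row: SparkSide.DrmTuple[Int] = (7, new Vector {})
}
```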


> Finalize abstraction of distributed logical plans from backend operations
> -------------------------------------------------------------------------
>
>                 Key: MAHOUT-1529
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1529
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Dmitriy Lyubimov
>             Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> -(1) checkpoint() accepts Spark constant StorageLevel directly;-
> (2) certain things in CheckpointedDRM;
> (3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 
> (5) drmBroadcast returns a Spark-specific Broadcast object
> *Current tracker:* 
> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
> *Pull requests are welcome*.



