[ https://issues.apache.org/jira/browse/SPARK-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326101#comment-15326101 ]

Matthias Boehm commented on SPARK-15882:
----------------------------------------

I really like this direction and think it has the potential to become a 
higher-level API for Spark ML, just as DataFrames and Datasets have become for 
Spark SQL.

If there is interest, we'd like to help contribute to this feature by porting 
over a subset of distributed linear algebra operations from SystemML.

General Goals: From my perspective, we should aim for an API that hides the 
underlying data representation (e.g., RDD/Dataset, sparse/dense, blocking 
configurations, block/row/coordinate layouts, partitioning, etc.). Furthermore, 
it would be great to make it easy to swap out the local matrix library that is 
used. This approach would allow people to plug in their custom operations 
(e.g., native BLAS libraries/kernels or compressed block operations) while 
still relying on a common API and a common scheme for distributing blocks.
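
To make the separation concrete, here is a rough sketch of what I have in 
mind. All names (LocalMatrixOps, DistributedMatrix) are purely illustrative, 
not existing Spark APIs: the public handle hides blocking, sparsity, and 
partitioning decisions, while a pluggable local backend lets people swap in 
native BLAS or compressed block operations.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.linalg.Matrix

    // Pluggable local backend: implementations could wrap netlib/native
    // BLAS kernels or compressed block operations.
    trait LocalMatrixOps {
      def multiply(a: Matrix, b: Matrix): Matrix
      def add(a: Matrix, b: Matrix): Matrix
    }

    // Public handle: callers never see whether the data is blocked or how
    // it is partitioned; an implementation binds one representation
    // (e.g., RDD[((Int, Int), Matrix)]) together with a LocalMatrixOps.
    trait DistributedMatrix {
      def numRows: Long
      def numCols: Long
      def multiply(other: DistributedMatrix): DistributedMatrix
      def transpose: DistributedMatrix
    }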

RDDs over Datasets: For the internal implementation, I would favor RDDs over 
Datasets because (1) RDDs allow for more flexibility (e.g., reduceByKey, 
combineByKey, and partitioning-preserving operations), and (2) encoders don't 
offer much benefit for blocked representations, as the per-block serialization 
overhead is typically negligible relative to the block size.
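
As a small illustration of why these primitives matter, here is a sketch of a 
shuffle-based blocked multiply over an RDD[((Int, Int), Matrix)] layout (the 
helper names are my assumptions, not existing APIs): the final reduceByKey 
performs map-side partial aggregation of partial products, a pattern with no 
direct Dataset equivalent.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.linalg.{DenseMatrix, Matrices, Matrix}

    // Elementwise block addition (a pluggable backend could override this).
    def addBlocks(x: Matrix, y: Matrix): Matrix = {
      val sums = x.toArray.zip(y.toArray).map { case (u, v) => u + v }
      Matrices.dense(x.numRows, x.numCols, sums)
    }

    // CPMM-style multiply: join A and B blocks on the shared dimension k,
    // multiply locally, then sum partial products per output block (i, j).
    def blockMultiply(
        a: RDD[((Int, Int), Matrix)],
        b: RDD[((Int, Int), Matrix)]): RDD[((Int, Int), Matrix)] = {
      val aByK = a.map { case ((i, k), m) => (k, (i, m)) }
      val bByK = b.map { case ((k, j), m) => (k, (j, m)) }
      aByK.join(bByK)
        .map { case (_, ((i, ma), (j, mb))) =>
          val mbDense = new DenseMatrix(mb.numRows, mb.numCols, mb.toArray)
          ((i, j), ma.multiply(mbDense): Matrix)
        }
        .reduceByKey(addBlocks) // map-side combining before the shuffle
    }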

Basic Operations: Initially, I would start with a small, well-defined set of 
operations, including matrix multiplications; unary and binary operations 
(e.g., arithmetic/comparison); unary aggregates (e.g., sum/rowSums/colSums, 
min/max/mean/sd); reorg operations (transpose/diag/reshape/order); and 
cumulative aggregates (e.g., cumsum).
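
For the unary aggregates, the blocked implementation pattern is simple. As a 
sketch (again assuming the illustrative RDD[((Int, Int), Matrix)] layout from 
above), colSums could compute per-block column sums locally and merge partial 
results per column-block index:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector, Vectors}

    // colSums(M) = M^T * 1, so each block reuses the local matrix-vector
    // multiply; partial sums are then merged per column-block index j.
    def colSums(blocks: RDD[((Int, Int), Matrix)]): RDD[(Int, Vector)] = {
      blocks
        .map { case ((_, j), m) =>
          val ones = new DenseVector(Array.fill(m.numRows)(1.0))
          (j, m.transpose.multiply(ones).toArray)
        }
        .reduceByKey { (x, y) =>
          val merged = new Array[Double](x.length)
          var c = 0
          while (c < x.length) { merged(c) = x(c) + y(c); c += 1 }
          merged
        }
        .mapValues(v => Vectors.dense(v))
    }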

Towards Optimization: Internally, we could implement alternative physical 
operators but hide them behind a common interface. For example, matrix 
multiplication would be exposed as 'multiply' (consistent with local linalg); 
internally, however, we would select among alternative operators (see 
https://github.com/apache/incubator-systemml/blob/master/docs/devdocs/MatrixMultiplicationOperators.txt)
 based on a simple rule set or user-provided hints, as done in Spark SQL. 
Later, we could think about a more sophisticated optimizer, potentially 
relying on the existing Catalyst infrastructure. What do you think?
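
To sketch what such a rule set might look like (the operator names follow the 
SystemML dev doc linked above; the threshold logic is an assumption of mine, 
analogous to Spark SQL's broadcast-join threshold, not an existing API):

    sealed trait MatMulOperator
    case object MapMM extends MatMulOperator // broadcast the smaller input
    case object CpMM extends MatMulOperator  // join on the common dimension
    case object RMM extends MatMulOperator   // replicate blocks of both sides

    // Pick a physical operator from estimated input sizes, with an optional
    // user-provided hint taking precedence.
    def selectOperator(
        leftBytes: Long,
        rightBytes: Long,
        broadcastThreshold: Long,
        hint: Option[MatMulOperator] = None): MatMulOperator =
      hint.getOrElse {
        if (math.min(leftBytes, rightBytes) <= broadcastThreshold) MapMM
        else CpMM
      }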

> Discuss distributed linear algebra in spark.ml package
> ------------------------------------------------------
>
>                 Key: SPARK-15882
>                 URL: https://issues.apache.org/jira/browse/SPARK-15882
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: ML
>            Reporter: Joseph K. Bradley
>
> This JIRA is for discussing how org.apache.spark.mllib.linalg.distributed.* 
> should be migrated to org.apache.spark.ml.
> Initial questions:
> * Should we use Datasets or RDDs underneath?
> * If Datasets, are there missing features needed for the migration?
> * Do we want to redesign any aspects of the distributed matrices during this 
> move?


