[jira] [Commented] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency

Xiangrui Meng (JIRA) Fri, 15 Apr 2016 16:04:06 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243793#comment-15243793
 ]


Xiangrui Meng commented on SPARK-13944:
---------------------------------------

By its name, `mllib-local` is not scoped to local linear algebra alone. But 
let's talk about the linear algebra library first. MLlib provides 
implementations of standard machine learning algorithms on Spark. Our goal is 
to cover the common use cases rather than every possible one, so some companies 
and developers need to build their own algorithms or modify the implementations 
in MLlib to meet their use cases. To implement algorithms on Spark, a natural 
choice is MLlib's linear algebra library, which integrates well with the 
built-in MLlib algorithms and DataFrames. The issue is: which local linear 
algebra library should they use in online serving? MLlib's local linear algebra 
library is not an option because of its dependency on Spark Core, so people 
have to pick another library or maintain a fork. Neither is ideal because of 
the resulting offline/online inconsistency. Separating the linear algebra 
library out is a clear benefit to those developers. By the way, we have to 
provide linear algebra abstractions in MLlib because we cannot expose 
third-party APIs in Spark's public APIs. I think we are on the same page about 
that. 
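
To illustrate what a Spark-free local vector could look like, here is a 
minimal sketch in plain Java. The names `LocalVector`, `LocalDenseVector`, and 
`LocalSparseVector` are hypothetical and are not MLlib classes; the point is 
only that dense and sparse dot products need nothing beyond the JDK, so the 
same code could run in offline training and in online serving.

```java
// Hypothetical, dependency-free vector types (NOT part of MLlib).
interface LocalVector {
    int size();

    // Dot product of this vector with a dense array of the same size.
    double dot(double[] dense);
}

final class LocalDenseVector implements LocalVector {
    private final double[] values;

    LocalDenseVector(double[] values) {
        this.values = values.clone(); // defensive copy
    }

    public int size() {
        return values.length;
    }

    public double dot(double[] dense) {
        if (dense.length != values.length) {
            throw new IllegalArgumentException("size mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i] * dense[i];
        }
        return sum;
    }
}

final class LocalSparseVector implements LocalVector {
    private final int size;
    private final int[] indices;   // positions of the stored entries
    private final double[] values; // stored entries, aligned with indices

    LocalSparseVector(int size, int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices/values length mismatch");
        }
        for (int index : indices) {
            if (index < 0 || index >= size) {
                throw new IllegalArgumentException("index out of range: " + index);
            }
        }
        this.size = size;
        this.indices = indices.clone();
        this.values = values.clone();
    }

    public int size() {
        return size;
    }

    public double dot(double[] dense) {
        if (dense.length != size) {
            throw new IllegalArgumentException("size mismatch");
        }
        double sum = 0.0;
        // Only the stored entries contribute to the sum.
        for (int k = 0; k < indices.length; k++) {
            sum += values[k] * dense[indices[k]];
        }
        return sum;
    }
}
```

A serving system could depend on types like these alone, which is the kind of 
separation this JIRA asks for.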

Next, let's talk about other linear algebra libraries. If there existed a good 
Java linear algebra implementation that met our requirements, I would be more 
than happy to use it. For the requirements and the libraries you listed, please 
see my comment at 
https://issues.apache.org/jira/browse/SPARK-6442?focusedCommentId=14629182&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14629182.
 One thing I didn't mention there is what API compatibility those libraries 
promise. We picked breeze in Spark 1.0 because it was the best candidate at 
that time. If MTJ had a compatible license, I would go with it because it is a 
pure Java library. With breeze, we now have to deal with Scala 2.12 
compatibility; see https://issues.apache.org/jira/browse/SPARK-14438. More 
linear algebra libraries have come out since Spark 1.0. It would be great if 
someone could spend time redoing the comparison and benchmark.

On the maintenance side, we have been trying to keep the linear algebra library 
lightweight. DB's PR contains only 4000 lines, including test code. While we 
have done a good job of keeping it thin, we have also received lots of 
complaints about its lack of features. There is always this trade-off. We 
favored keeping it thin in Spark 1.x because of limited resources. Now, with 
more users and contributors, I think we should adjust the balance and provide 
more features to make both users and developers happier. This is mainly about 
the public APIs. Underneath, we can use an existing implementation to avoid 
duplicate work. But there are issues with this approach too, as I mentioned in 
the previous paragraph.

The reasons listed above should justify the motivation for this JIRA. We can 
briefly talk about model serving and what we need for local models. Do we need 
to implement local training? Perhaps not. We need model import, transform, and 
maybe online updates. To support other model-serving systems, we just need to 
open up the format we use for pipeline persistence so that it is readable by 
other systems. There is definitely work to do to stabilize that format, but 
making exported MLlib models and pipelines readable by other systems is 
certainly what we want to achieve. PMML is really not a good option here. XML 
doesn't seem to be the right format for this purpose, and importing PMML is 
hard (at least there is no easy, Apache-compatible way to do it). We use 
Parquet and JSON, both of which are exchangeable formats. On our side, I still 
think it is valuable for us to provide a lightweight solution for online 
serving, e.g., local models or code generation. It would also make things 
easier for other systems like Prediction.IO because they could use the local 
models from MLlib directly. We can discuss ideas when we start 2.1 development. 
This is beyond the scope of this JIRA.
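
As a rough sketch of what "transform, and maybe online updates" could mean for 
a local model, the plain-Java class below scores a feature vector with a 
logistic model and applies one stochastic-gradient step per labeled example. 
`LocalLogisticModel` and its methods are hypothetical names, not MLlib APIs, 
and the coefficients are assumed to have been read already from whatever 
persisted format is used.

```java
// Hypothetical local model (NOT part of MLlib): no Spark dependency.
final class LocalLogisticModel {
    private final double[] coefficients;
    private double intercept;

    // "Import": coefficients are assumed to be already parsed from storage.
    LocalLogisticModel(double[] coefficients, double intercept) {
        this.coefficients = coefficients.clone();
        this.intercept = intercept;
    }

    // "Transform": probability of the positive class for one feature vector.
    double predictProbability(double[] features) {
        if (features.length != coefficients.length) {
            throw new IllegalArgumentException("feature size mismatch");
        }
        double margin = intercept;
        for (int i = 0; i < coefficients.length; i++) {
            margin += coefficients[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-margin));
    }

    // "Online update": one gradient step on the log loss for a single
    // example, where label is 0.0 or 1.0.
    void update(double[] features, double label, double learningRate) {
        double error = predictProbability(features) - label;
        for (int i = 0; i < coefficients.length; i++) {
            coefficients[i] -= learningRate * error * features[i];
        }
        intercept -= learningRate * error;
    }
}
```

Nothing here trains a model from scratch; it only covers the serving-side 
operations listed above.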

> Separate out local linear algebra as a standalone module without Spark 
> dependency
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-13944
>                 URL: https://issues.apache.org/jira/browse/SPARK-13944
>             Project: Spark
>          Issue Type: New Feature
>          Components: Build, ML
>    Affects Versions: 2.0.0
>            Reporter: Xiangrui Meng
>            Assignee: DB Tsai
>            Priority: Blocker
>
> Separate out linear algebra as a standalone module without Spark dependency 
> to simplify production deployment. We can call the new module 
> spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined types.
> The package name will be changed from mllib to ml. For example, Vector will 
> be changed from `org.apache.spark.mllib.linalg.Vector` to 
> `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML 
> pipeline will be the one in ML package; however, the existing mllib code will 
> not be touched. As a result, this will potentially break the API. Also, when 
> a vector is loaded from an mllib vector by Spark SQL, it will automatically 
> be converted into the one in the ml package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
