[ https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243793#comment-15243793 ]
Xiangrui Meng commented on SPARK-13944:
---------------------------------------

As its name suggests, `mllib-local` is not scoped to local linear algebra alone. But let's talk about the linear algebra library first.

MLlib provides implementations of standard machine learning algorithms on Spark. Our goal is to cover common use cases rather than every possible one, so some companies and developers need to build their own algorithms or modify the implementations in MLlib to fit their use cases. To implement algorithms on Spark, a natural choice is MLlib's linear algebra library, which integrates well with the built-in MLlib algorithms and DataFrames. The issue is: which local linear algebra library should they use for online serving? MLlib's local linear algebra library is not an option because of its dependency on Spark Core. So people have to pick another library or maintain a fork. Neither is ideal because of the resulting offline/online inconsistency. Separating the linear algebra library out is a clear benefit to those developers. By the way, we have to provide linear algebra abstractions in MLlib because we cannot expose third-party APIs in Spark's public APIs. I think we are on the same page about that.

Next, let's talk about other linear algebra libraries. If there existed a good Java linear algebra implementation that met our requirements, I would be more than happy to use it. For the requirements and the libraries you listed, please see my comment at https://issues.apache.org/jira/browse/SPARK-6442?focusedCommentId=14629182&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14629182. One thing I didn't mention there is what API compatibility those libraries promise. We picked Breeze in Spark 1.0 because it was the best candidate at that time. If MTJ had a compatible license, I would go with it because it is a pure Java library. With Breeze, we now have to deal with Scala 2.12 compatibility. See https://issues.apache.org/jira/browse/SPARK-14438.
More linear algebra libraries have come out since Spark 1.0. It would be great if someone could spend time redoing the comparison and benchmark.

On the maintenance side, we have been trying to keep the linear algebra library lightweight. DB's PR contains only 4000 lines, including test code. While we have done a good job of keeping it thin, we have also received lots of complaints about its lack of features. There is always this trade-off. We chose to keep it thin in Spark 1.x because of limited resources. Now, with more users and contributors, I think we should adjust the balance and provide more features to make both users and developers happier. This is mainly about the public APIs. Underneath, we can use an existing implementation to avoid duplicate work. But there are issues with this approach too, as I mentioned above.

The reasons listed above should justify the motivation of this JIRA.

We can briefly talk about model serving and what we need for local models. Do we need to implement local training? Perhaps not. We need model import, transform, and maybe online updates. To support other model-serving systems, we just need to open up the format we use for pipeline persistence so that it is readable by other systems. There is definitely work to do to stabilize that format. But making exported MLlib models and pipelines readable by other systems is certainly what we want to achieve. PMML is really not a good option here: XML doesn't seem to be the right format for this purpose, and importing PMML is hard (at least there is no easy, Apache-compatible way to do it). We use Parquet and JSON, both of which are exchangeable formats. On our side, I still think it is valuable for us to provide a lightweight solution for online serving, e.g., local models or code generation. It would also make things easier for other systems like Prediction.IO because they can use the local models from MLlib directly. We can discuss ideas when we start 2.1 development.
This is beyond the scope of this JIRA.

> Separate out local linear algebra as a standalone module without Spark dependency
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-13944
>                 URL: https://issues.apache.org/jira/browse/SPARK-13944
>             Project: Spark
>          Issue Type: New Feature
>          Components: Build, ML
>    Affects Versions: 2.0.0
>            Reporter: Xiangrui Meng
>            Assignee: DB Tsai
>            Priority: Blocker
>
> Separate out linear algebra as a standalone module without Spark dependency
> to simplify production deployment. We can call the new module
> spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined types.
> The package name will be changed from mllib to ml. For example, Vector will
> be changed from `org.apache.spark.mllib.linalg.Vector` to
> `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML
> pipeline will be the one in ML package; however, the existing mllib code will
> not be touched. As a result, this will potentially break the API. Also, when
> the vector is loaded from mllib vector by Spark SQL, the vector will
> automatically be converted into the one in ml package.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
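
To illustrate the motivation discussed in this thread (one vector type shared by offline training and online serving, with no Spark dependency), here is a minimal Java sketch. It is not MLlib code and not the API of the proposed module: `LocalVector`, `DenseLocalVector`, `SparseLocalVector`, `LocalLogisticModel`, and `ServingExample` are hypothetical names invented for this example, and the logistic model is only an assumed stand-in for "a local model that supports transform".

```java
/** Hypothetical Spark-free vector type; NOT the actual spark-mllib-local API. */
interface LocalVector {
    int size();

    /** Dot product with a dense weight array of the same size. */
    double dot(double[] weights);
}

/** Dense storage: one value per index. */
final class DenseLocalVector implements LocalVector {
    private final double[] values;

    DenseLocalVector(double... values) {
        this.values = values.clone();
    }

    @Override
    public int size() {
        return values.length;
    }

    @Override
    public double dot(double[] weights) {
        if (weights.length != values.length) {
            throw new IllegalArgumentException("size mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i] * weights[i];
        }
        return sum;
    }
}

/** Sparse storage: only the non-zero entries are kept. */
final class SparseLocalVector implements LocalVector {
    private final int size;
    private final int[] indices;
    private final double[] values;

    SparseLocalVector(int size, int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values differ in length");
        }
        this.size = size;
        this.indices = indices.clone();
        this.values = values.clone();
    }

    @Override
    public int size() {
        return size;
    }

    @Override
    public double dot(double[] weights) {
        if (weights.length != size) {
            throw new IllegalArgumentException("size mismatch");
        }
        double sum = 0.0;
        for (int k = 0; k < indices.length; k++) {
            sum += values[k] * weights[indices[k]];
        }
        return sum;
    }
}

/** Hypothetical local model: scoring only, no training, no cluster runtime. */
final class LocalLogisticModel {
    private final double[] weights;
    private final double intercept;

    LocalLogisticModel(double[] weights, double intercept) {
        this.weights = weights.clone();
        this.intercept = intercept;
    }

    /** Logistic function applied to the linear margin. */
    double predictProbability(LocalVector features) {
        double margin = features.dot(weights) + intercept;
        return 1.0 / (1.0 + Math.exp(-margin));
    }
}

class ServingExample {
    public static void main(String[] args) {
        LocalLogisticModel model =
            new LocalLogisticModel(new double[] {1.0, -2.0, 0.5}, 0.0);
        // The same feature vector in dense and sparse form scores identically.
        LocalVector dense = new DenseLocalVector(2.0, 1.0, 0.0);
        LocalVector sparse =
            new SparseLocalVector(3, new int[] {0, 1}, new double[] {2.0, 1.0});
        System.out.println(model.predictProbability(dense));
        System.out.println(model.predictProbability(sparse));
    }
}
```

Because nothing in the sketch depends on a cluster runtime, the same classes could in principle be linked into both a training job and a serving process, which is the offline/online consistency argued for in the comment.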