[ 
https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232006#comment-15232006
 ] 

Sean Owen commented on SPARK-13944:
-----------------------------------

I can understand the idea of being able to use the simple API classes (vectors, 
models, etc.) as a separable module. However, the theory behind classes like 
{{Vector}} was that they're just Spark-specific wrappers, and not something 
you'd use outside the context of a Spark app. The model classes likewise depend 
on Spark classes like {{RDD}} since they're primarily there to score 
Spark-specific representations of data.

For general interchange, I'd assume one would use general representations -- 
PMML, JSON, etc. And then for scoring not related to Spark, use libraries that 
already exist for this purpose like JPMML.

I still understand that, well, it seems funny to have this Spark 
NaiveBayesModel that can score a single input vector but tell people they can't 
use it unless they drag spark-core into their app. It does mean you're pushing 
Spark towards also becoming a general non-distributed model representation and 
scoring library. Right now, it isn't that, and taking steps to advertise its 
half-baked state as such seems problematic. Same with PMML support -- seems like 
it's better than nothing, but supporting a little PMML has turned out to be 
almost worse than not at all.

Maybe I'm over-thinking this and this is mostly about the {{.linalg}} classes? 
While it's kind of unfortunate that these internal wrapper classes have 
"leaked", that cat may be out of the bag. I recognize that's already a problem 
in that you have to depend on all of mllib to use VectorUDT for example.

> Separate out local linear algebra as a standalone module without Spark 
> dependency
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-13944
>                 URL: https://issues.apache.org/jira/browse/SPARK-13944
>             Project: Spark
>          Issue Type: New Feature
>          Components: Build, ML
>    Affects Versions: 2.0.0
>            Reporter: Xiangrui Meng
>            Assignee: DB Tsai
>            Priority: Blocker
>
> Separate out linear algebra as a standalone module without Spark dependency 
> to simplify production deployment. We can call the new module 
> spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined types.
> The package name will be changed from mllib to ml. For example, Vector will 
> be changed from `org.apache.spark.mllib.linalg.Vector` to 
> `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML 
> pipeline will be the one in ML package; however, the existing mllib code will 
> not be touched. As a result, this will potentially break the API. Also, when 
> a vector is loaded from an mllib vector by Spark SQL, the vector will be 
> automatically converted into the one in the ml package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
