[
https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231948#comment-15231948
]
Nick Pentreath edited comment on SPARK-13944 at 4/8/16 11:09 AM:
-
What's the reasoning behind making breaking changes in the {{ml}} API but not
in {{mllib}}? It seems to me that if we're breaking one API, we may as well
break both and make a clean break, rather than keep a bunch of essentially
deprecated cruft around (though I guess we could deprecate in 2.0 and remove
in, say, 2.2 or 2.3). If we break explicitly without trying to "half-maintain"
back compat, it's also very clear to everyone what's broken, whereas converting
back and forth may be more error-prone in the long run.
Also, in practice the actual breaking change mostly affects (a) third-party
developers writing their own models and Pipeline components, and (b) users
creating input datasets (data -> {{LabeledPoint}} or {{Vector}} in the {{mllib}}
API, or creating raw {{Vector}} DataFrame columns or working with UDFs over
{{Vector}} in the {{ml}} API). Across the board, the only change required is
replacing {{mllib}} with {{ml}} in the imports.
Now, if we use the type alias, Scala users don't even need to make that change!
As for Java users, ALL of them need to change {{DataFrame}} -> {{Dataset}}
in ALL their code. Changing imports for linalg components from {{mllib}} ->
{{ml}} seems just as onerous (or just as "not very onerous").
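To make the type-alias point concrete, here is a rough sketch of what I have in mind. It does not use Spark at all: the {{ml}} and {{mllib}} objects below are hypothetical stand-ins for the two packages, and {{UserCode}}, {{NewPipelineCode}} and {{Demo}} are names made up for illustration. It only shows that code still importing the old names keeps compiling, and that the values it produces are accepted by code written against the new names without any conversion.

```scala
// Hypothetical stand-ins for org.apache.spark.ml.linalg and
// org.apache.spark.mllib.linalg. These are NOT the real Spark classes.
object ml {
  object linalg {
    trait Vector {
      def toArray: Array[Double]
      def size: Int = toArray.length
    }
    final class DenseVector(val values: Array[Double]) extends Vector {
      def toArray: Array[Double] = values
    }
    object Vectors {
      def dense(values: Double*): Vector = new DenseVector(values.toArray)
    }
  }
}

object mllib {
  object linalg {
    // Assumed approach: the old names are only aliases for the new ones.
    type Vector = ml.linalg.Vector
    type DenseVector = ml.linalg.DenseVector
    val Vectors = ml.linalg.Vectors
  }
}

object UserCode {
  // Existing user code keeps its old import unchanged.
  import mllib.linalg.{Vector, Vectors}

  def sum(v: Vector): Double = v.toArray.sum

  val features: Vector = Vectors.dense(1.0, 2.0, 3.0)
}

object NewPipelineCode {
  // Code written against the new package accepts the same value directly,
  // because both names refer to one and the same type.
  def describe(v: ml.linalg.Vector): String = s"vector of size ${v.size}"
}

object Demo {
  def main(args: Array[String]): Unit = {
    println(UserCode.sum(UserCode.features))
    println(NewPipelineCode.describe(UserCode.features))
  }
}
```

Since an alias is resolved at compile time, there is only one vector type at runtime and nothing to convert back and forth. Java has no equivalent of a type alias, which is why Java users would still have to change the import.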
> Separate out local linear algebra as a standalone module without Spark
> dependency
> -
>
> Key: SPARK-13944
> URL: https://issues.apache.org/jira/browse/SPARK-13944
> Project: Spark
> Issue Type: New Feature
> Components: Build, ML
> Affects Versions: 2.0.0
> Reporter: Xiangrui Meng
> Assignee: DB Tsai
> Priority: Blocker
>
> Separate out linear algebra as a standalone module without Spark dependency
> to simplify production deployment. We can call the new module
> spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined types.
> The package name will be changed from mllib to ml. For example, Vector will
> be changed from `org.apache.spark.mllib.linalg.Vector` to
> `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML
> pipeline will be the one in ML package; however, the existing mllib code will
> not be touched. As a result, this will potentially break the API. Also, when
> the vector is loaded from an mllib vector by Spark SQL, the vector will be
> automatically converted into the one in the ml package.