Re: [Apache Spark Jenkins] build system shutting down Dec 23rd, 2021

2021-12-06 Thread Nick Pentreath
Wow! end of an era

Thanks so much to you Shane for all your work over 10 (!!) years. And to
AMPLab also!

Farewell Spark Jenkins!

N

On Tue, Dec 7, 2021 at 6:49 AM Nicholas Chammas 
wrote:

> Farewell to Jenkins and its classic weather forecast build status icons:
>
> [image: health-80plus.png][image: health-60to79.png][image:
> health-40to59.png][image: health-20to39.png][image: health-00to19.png]
>
> And thank you Shane for all the help over these years.
>
> Will you be nuking all the Jenkins-related code in the repo after the 23rd?
>
> On Mon, Dec 6, 2021 at 3:02 PM shane knapp ☠  wrote:
>
>> hey everyone!
>>
>> after a marathon run of nearly a decade, we're finally going to be
>> shutting down {amp|rise}lab jenkins at the end of this month...
>>
>> the earliest snapshot i could find is from 2013 with builds for spark 0.7:
>>
>> https://web.archive.org/web/20130426155726/https://amplab.cs.berkeley.edu/jenkins/
>>
>> it's been a hell of a run, and i'm gonna miss randomly tweaking the build
>> system, but technology has moved on and running a dedicated set of servers
>> for just one open source project is just too expensive for us here at uc
>> berkeley.
>>
>> if there's interest, i'll fire up a zoom session and all y'alls can watch
>> me type the final command:
>>
>> systemctl stop jenkins
>>
>> feeling bittersweet,
>>
>> shane
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>


Re: Welcoming six new Apache Spark committers

2021-03-29 Thread Nick Pentreath
Congratulations to all the new committers. Welcome!


On Fri, 26 Mar 2021 at 22:22, Matei Zaharia  wrote:

> Hi all,
>
> The Spark PMC recently voted to add several new committers. Please join me
> in welcoming them to their new role! Our new committers are:
>
> - Maciej Szymkiewicz (contributor to PySpark)
> - Max Gekk (contributor to Spark SQL)
> - Kent Yao (contributor to Spark SQL)
> - Attila Zsolt Piros (contributor to decommissioning and Spark on
> Kubernetes)
> - Yi Wu (contributor to Spark Core and SQL)
> - Gabor Somogyi (contributor to Streaming and security)
>
> All six of them contributed to Spark 3.1 and we’re very excited to have
> them join as committers.
>
> Matei and the Spark PMC
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Nick Pentreath
Congratulations and welcome as Apache Spark committers!

On Wed, 15 Jul 2020 at 06:59, Prashant Sharma  wrote:

> Congratulations all! It's great to have such committed folks as
> committers. :)
>
> On Wed, Jul 15, 2020 at 9:24 AM Yi Wu  wrote:
>
>> Congrats!!
>>
>> On Wed, Jul 15, 2020 at 8:02 AM Hyukjin Kwon  wrote:
>>
>>> Congrats!
>>>
>>> On Wed, Jul 15, 2020 at 7:56 AM Takeshi Yamamuro wrote:
>>>
 Congrats, all!

 On Wed, Jul 15, 2020 at 5:15 AM Takuya UESHIN 
 wrote:

> Congrats and welcome!
>
> On Tue, Jul 14, 2020 at 1:07 PM Bryan Cutler 
> wrote:
>
>> Congratulations and welcome!
>>
>> On Tue, Jul 14, 2020 at 12:36 PM Xingbo Jiang 
>> wrote:
>>
>>> Welcome, Huaxin, Jungtaek, and Dilip!
>>>
>>> Congratulations!
>>>
>>> On Tue, Jul 14, 2020 at 10:37 AM Matei Zaharia <
>>> matei.zaha...@gmail.com> wrote:
>>>
 Hi all,

 The Spark PMC recently voted to add several new committers. Please
 join me in welcoming them to their new roles! The new committers are:

 - Huaxin Gao
 - Jungtaek Lim
 - Dilip Biswal

 All three of them contributed to Spark 3.0 and we’re excited to
 have them join the project.

 Matei and the Spark PMC

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>
> --
> Takuya UESHIN
>
>

 --
 ---
 Takeshi Yamamuro

>>>


Re: [EXTERNAL] - Re: Problem with the ML ALS algorithm

2019-06-26 Thread Nick Pentreath
Generally I would say tens of items is a bit low, while a few hundred or more
starts to make sense. Of course it depends a lot on the specific use case, item
catalogue, user experience / platform, and so on.

On Wed, Jun 26, 2019 at 3:57 PM Steve Pruitt  wrote:

> I should have mentioned this is a synthetic dataset I create using some
> likelihood distributions of the rating values.  I am only experimenting /
> learning.  In practice though, the list of items is likely to be at least
> in the 10’s if not 100’s.  Are even these item numbers too low?
>
>
>
> Thanks.
>
>
>
> -S
>
>
>
> *From:* Nick Pentreath 
> *Sent:* Wednesday, June 26, 2019 9:09 AM
> *To:* user@spark.apache.org
> *Subject:* Re: [EXTERNAL] - Re: Problem with the ML ALS algorithm
>
>
>
> If the number of items is indeed 4, then another issue is the rank of the
> factors defaults to 10. Setting the "rank" parameter < 4 will help.
>
>
>
> However, if you only have 4 items, then I would propose that using ALS (or
> any recommendation model, in fact) is not really necessary. There is not
> enough information, and too much sparsity, to make collaborative
> filtering useful. You could simply recommend every item a user has not
> rated and the result would be essentially the same.
>
>
>
>
>
> On Wed, Jun 26, 2019 at 3:03 PM Steve Pruitt  wrote:
>
> Number of users is 1055
>
> Number of items is 4
>
> Ratings values are either 120, 20, 0
>
>
>
>
>
> *From:* Nick Pentreath 
> *Sent:* Wednesday, June 26, 2019 6:03 AM
> *To:* user@spark.apache.org
> *Subject:* [EXTERNAL] - Re: Problem with the ML ALS algorithm
>
>
>
> This means that the matrix that ALS is trying to factor is not positive
> definite. Try increasing regParam (try 0.1, 1.0 for example).
>
>
>
> What does the data look like? e.g. number of users, number of items,
> number of ratings, etc?
>
>
>
> On Wed, Jun 26, 2019 at 12:06 AM Steve Pruitt 
> wrote:
>
> I get an inexplicable exception when trying to build an ALSModel with
> implicitPrefs set to true.  I can’t find any help online.
>
>
>
> Thanks in advance.
>
>
>
> My code is:
>
>
>
> ALS als = new ALS()
> .setMaxIter(5)
> .setRegParam(0.01)
> .setUserCol("customer")
> .setItemCol("item")
> .setImplicitPrefs(true)
> .setRatingCol("rating");
>
> ALSModel model = als.fit(training);
>
>
>
> The exception is:
>
> org.apache.spark.ml.optim.SingularMatrixException: LAPACK.dppsv returned 6
> because A is not positive definite. Is A derived from a singular matrix
> (e.g. collinear column values)?
>
> at
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.checkReturnValue(CholeskyDecomposition.scala:65)
> ~[spark-mllib_2.11-2.3.1.jar:2.3.1]
>
> at
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:41)
> ~[spark-mllib_2.11-2.3.1.jar:2.3.1]
>
> at
> org.apache.spark.ml.recommendation.ALS$CholeskySolver.solve(ALS.scala:747)
> ~[spark-mllib_2.11-2.3.1.jar:2.3.1]
>
>
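The advice in the quoted replies (increase regParam) can be illustrated with a small NumPy sketch of the underlying linear algebra. This is a hedged illustration only, not Spark's actual CholeskySolver code:

```python
import numpy as np

# ALS's Cholesky solver factors Gram matrices of the form A^T A + regParam*I.
# If factor columns are collinear and regParam is near zero, the Gram matrix
# is singular and LAPACK's dppsv/dpotrf fails, as in the trace above.
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])  # second column is an exact multiple of the first
gram = A.T @ A

def is_positive_definite(m):
    try:
        np.linalg.cholesky(m)
        return True
    except np.linalg.LinAlgError:
        return False

print(is_positive_definite(gram))                    # False: singular Gram matrix
print(is_positive_definite(gram + 0.1 * np.eye(2)))  # True: regularization fixes it
```

Adding regParam * I shifts every eigenvalue of the Gram matrix up by regParam, which is why values like 0.1 or 1.0 make the factorization succeed.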


Re: [EXTERNAL] - Re: Problem with the ML ALS algorithm

2019-06-26 Thread Nick Pentreath
If the number of items is indeed 4, then another issue is the rank of the
factors defaults to 10. Setting the "rank" parameter < 4 will help.

However, if you only have 4 items, then I would propose that using ALS (or
any recommendation model, in fact) is not really necessary. There is not
enough information, and too much sparsity, to make collaborative
filtering useful. You could simply recommend every item a user has not
rated and the result would be essentially the same.


On Wed, Jun 26, 2019 at 3:03 PM Steve Pruitt  wrote:

> Number of users is 1055
>
> Number of items is 4
>
> Ratings values are either 120, 20, 0
>
>
>
>
>
> *From:* Nick Pentreath 
> *Sent:* Wednesday, June 26, 2019 6:03 AM
> *To:* user@spark.apache.org
> *Subject:* [EXTERNAL] - Re: Problem with the ML ALS algorithm
>
>
>
> This means that the matrix that ALS is trying to factor is not positive
> definite. Try increasing regParam (try 0.1, 1.0 for example).
>
>
>
> What does the data look like? e.g. number of users, number of items,
> number of ratings, etc?
>
>
>
> On Wed, Jun 26, 2019 at 12:06 AM Steve Pruitt 
> wrote:
>
> I get an inexplicable exception when trying to build an ALSModel with
> implicitPrefs set to true.  I can’t find any help online.
>
>
>
> Thanks in advance.
>
>
>
> My code is:
>
>
>
> ALS als = new ALS()
> .setMaxIter(5)
> .setRegParam(0.01)
> .setUserCol("customer")
> .setItemCol("item")
> .setImplicitPrefs(true)
> .setRatingCol("rating");
>
> ALSModel model = als.fit(training);
>
>
>
> The exception is:
>
> org.apache.spark.ml.optim.SingularMatrixException: LAPACK.dppsv returned 6
> because A is not positive definite. Is A derived from a singular matrix
> (e.g. collinear column values)?
>
> at
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.checkReturnValue(CholeskyDecomposition.scala:65)
> ~[spark-mllib_2.11-2.3.1.jar:2.3.1]
>
> at
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:41)
> ~[spark-mllib_2.11-2.3.1.jar:2.3.1]
>
> at
> org.apache.spark.ml.recommendation.ALS$CholeskySolver.solve(ALS.scala:747)
> ~[spark-mllib_2.11-2.3.1.jar:2.3.1]
>
>


Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-10-03 Thread Nick Pentreath
For ONNX you may be interested in https://github.com/onnx/onnxmltools -
which supports conversion of a few sklearn models to ONNX already.

However as far as I am aware, none of the ONNX backends actually support
the ONNX-ML extended spec (in open-source at least). So you would not be
able to actually do prediction I think...

As for PFA, to my current knowledge there is no library that does it yet.
Our own Aardpfark project (https://github.com/CODAIT/aardpfark) focuses on
SparkML export to PFA for now but would like to add sklearn support in the
future.
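The init-versus-fitted split by trailing underscore described in the quoted discussion below can be sketched in a few lines of plain Python. TinyModel and export are hypothetical stand-ins that only mimic the scikit-learn convention; this is not a proposed sklearn API:

```python
import json

class TinyModel:
    """Stand-in estimator following the sklearn convention: constructor
    ("init") params are plain attributes, fitted params end with "_"."""
    def __init__(self, alpha=0.5):
        self.alpha = alpha

    def fit(self, xs):
        self.mean_ = sum(xs) / len(xs)  # trailing "_" marks a fitted attribute
        return self

def export(model):
    attrs = vars(model)
    return json.dumps({
        "class": type(model).__name__,
        "module": type(model).__module__,
        "init_params": {k: v for k, v in attrs.items() if not k.endswith("_")},
        "fit_params": {k: v for k, v in attrs.items() if k.endswith("_")},
    })

blob = export(TinyModel(alpha=0.1).fit([1.0, 2.0, 3.0]))
print(blob)  # human-readable, unlike a pickle, and inspectable before loading
```

Unlike a pickle, the resulting JSON can be inspected before anything is instantiated, which addresses the arbitrary-code-execution concern raised below.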


On Wed, 3 Oct 2018 at 20:07 Sebastian Raschka 
wrote:

> The ONNX-approach sounds most promising, esp. because it will also allow
> library interoperability but I wonder if this is for parametric models only
> and not for the nonparametric ones like KNN, tree-based classifiers, etc.
>
> All-in-all I can definitely see the appeal for having a way to export
> sklearn estimators in a text-based format (e.g., via JSON), since it would
> make sharing code easier. This doesn't even have to be compatible with
> multiple sklearn versions. A typical use case would be to include these
> JSON exports as e.g., supplemental files of a research paper for other
> people to run the models etc. (here, one can just specify which sklearn
> version it would require; of course, one could also share pickle files, by
> I am personally always hesitant reg. running/trusting other people's pickle
> files).
>
> Unfortunately though, as Gael pointed out, this "feature" would be a huge
> burden for the devs, and it would probably also negatively impact the
> development of scikit-learn itself because it imposes another design
> constraint.
>
> However, I do think this sounds like an excellent case for a contrib
> project. Like scikit-export, scikit-serialize or sth like that.
>
> Best,
> Sebastian
>
>
>
> > On Oct 3, 2018, at 5:49 AM, Javier López  wrote:
> >
> >
> > On Tue, Oct 2, 2018 at 5:07 PM Gael Varoquaux <
> gael.varoqu...@normalesup.org> wrote:
> > The reason that pickles are brittle and that sharing pickles is a bad
> > practice is that pickle use an implicitly defined data model, which is
> > defined via the internals of objects.
> >
> > Plus the fact that loading a pickle can execute arbitrary code, and
> there is no way to know
> > if any malicious code is in there in advance because the contents of the
> pickle cannot
> > be easily inspected without loading/executing it.
> >
> > So, the problems of pickle are not specific to pickle, but rather
> > intrinsic to any generic persistence code [*]. Writing persistence code
> that
> > does not fall in these problems is very costly in terms of developer time
> > and makes it harder to add new methods or improve existing one. I am not
> > excited about it.
> >
> > My "text-based serialization" suggestion was nowhere near as ambitious
> as that,
> > as I have already explained, and wasn't aiming at solving the versioning
> issues, but
> > rather at having something which is "about as good" as pickle but in a
> human-readable
> > format. I am not asking for a Turing-complete language to reproduce the
> prediction
> > function, but rather something simple in the spirit of the output
> produced by the gist code I linked above, just for the model families where
> it is reasonable:
> >
> > https://gist.github.com/jlopezpena/2cdd09c56afda5964990d5cf278bfd31
> >
> > The code I posted mostly works (specific cases of nested models need to
> be addressed
> > separately, as well as pipelines), and we have been using (a version of)
> it in production
> > for quite some time. But there are hackish aspects to it that we are not
> happy with,
> > such as the manual separation of init and fitted parameters by checking
> if the name ends with "_", having to infer class name and location using
> > "model.__class__.__name__" and "model.__module__", and the wacky use of
> "__import__".
> >
> > My suggestion was more along the lines of adding some metadata to
> sklearn estimators so
> > that a code in a similar style would be nicer to write; little things
> like having a `init_parameters` and `fit_parameters` properties that would
> return the lists of named parameters,
> > or a `model_info` method that would return data like sklearn version,
> class name and location, or a package level dictionary pointing at the
> estimator classes by a string name, like
> >
> > from sklearn.linear_models import LogisticRegression
> > estimator_classes = {"LogisticRegression": LogisticRegression, ...}
> >
> > so that one can load the appropriate class from the string description
> without calling __import__ or eval; that sort of stuff.
> >
> > I am aware this would not address the common complaint of "perfect
> prediction reproducibility"
> > across versions, but I think we can all agree that this utopia of
> perfect reproducibility is not
> > feasible.
> >
> > And in the long, long run, I agree that PFA/onnx or whichever similar
> format that emerges, is
> > the 

[jira] [Resolved] (SPARK-25412) FeatureHasher would change the value of output feature

2018-09-13 Thread Nick Pentreath (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-25412.

Resolution: Not A Bug

> FeatureHasher would change the value of output feature
> --
>
> Key: SPARK-25412
> URL: https://issues.apache.org/jira/browse/SPARK-25412
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Vincent
>Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo on 
> the hashed value is used to determine the vector index; it's suggested to use 
> a large integer value as the numFeatures parameter.
> We found several issues with the current implementation: 
>  # The feature name cannot be recovered from its index after FeatureHasher 
> transforms the data, for example when getting feature importances from 
> decision tree training that follows a FeatureHasher.
>  # When indices collide, which is very likely to happen when 'numFeatures' is 
> relatively small, a feature's value is replaced by a new value (the sum of 
> the current and old values).
>  # To avoid collisions, 'numFeatures' must be set to a large number, but the 
> resulting highly sparse vectors increase the computational complexity of 
> model training.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25412) FeatureHasher would change the value of output feature

2018-09-13 Thread Nick Pentreath (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613160#comment-16613160
 ] 

Nick Pentreath commented on SPARK-25412:


(1) is by design. Feature hashing does not store the exact mapping from feature 
values to vector indices and so is a one-way transform. Hashing gives you speed 
and requires almost no memory, but you give up the reverse mapping and you have 
the potential for hash collisions.

(2) is again by design for now. There are ways to have the sign of the feature 
value be determined also as part of a hash function, and in expectation the 
collisions zero each other out. This may be added in future work.

The impact of hash collisions can be reduced by increasing the {{numFeatures}} 
parameter. The default is probably reasonable for small to medium feature 
dimensions but should be increased when working with very 
high-cardinality features.

 

I don't think this can be classed as a bug, as these are all design 
tradeoffs of using feature hashing.
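Points (1) and (2) can be made concrete with a toy version of the hashing trick in plain Python. CRC32 here stands in for the MurmurHash3 that Spark actually uses; this is an illustration of the technique, not Spark's implementation:

```python
import zlib

def hash_features(features, num_features):
    # index = hash(name) % num_features; nothing stores the reverse mapping,
    # and features whose indices collide simply have their values summed.
    vec = [0.0] * num_features
    for name, value in features.items():
        idx = zlib.crc32(name.encode()) % num_features
        vec[idx] += value
    return vec

feats = {"a": 1.0, "b": 2.0, "c": 3.0}
print(hash_features(feats, 1))        # num_features=1: total collision -> [6.0]
print(sum(hash_features(feats, 64)))  # bigger table: value mass preserved -> 6.0
```

With a larger table the colliding mass is spread over separate slots, which is exactly the effect of raising numFeatures.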

> FeatureHasher would change the value of output feature
> --
>
> Key: SPARK-25412
> URL: https://issues.apache.org/jira/browse/SPARK-25412
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Vincent
>Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo on 
> the hashed value is used to determine the vector index; it's suggested to use 
> a large integer value as the numFeatures parameter.
> We found several issues with the current implementation: 
>  # The feature name cannot be recovered from its index after FeatureHasher 
> transforms the data, for example when getting feature importances from 
> decision tree training that follows a FeatureHasher.
>  # When indices collide, which is very likely to happen when 'numFeatures' is 
> relatively small, a feature's value is replaced by a new value (the sum of 
> the current and old values).
>  # To avoid collisions, 'numFeatures' must be set to a large number, but the 
> resulting highly sparse vectors increase the computational complexity of 
> model training.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24467) VectorAssemblerEstimator

2018-06-19 Thread Nick Pentreath (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516861#comment-16516861
 ] 

Nick Pentreath commented on SPARK-24467:


One option is to do the same as we did for the one-hot encoder: we could create a 
new Estimator/Model pair, and deprecate the old one, for 2.4.0. Then for 3.0, 
we could remove the old one.

> VectorAssemblerEstimator
> 
>
> Key: SPARK-24467
> URL: https://issues.apache.org/jira/browse/SPARK-24467
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> In [SPARK-22346], I believe I made a wrong API decision: I recommended adding 
> `VectorSizeHint` instead of making `VectorAssembler` into an Estimator since 
> I thought the latter option would break most workflows.  However, I should 
> have proposed:
> * Add a Param to VectorAssembler for specifying the sizes of Vectors in the 
> inputCols.  This Param can be optional.  If not given, then VectorAssembler 
> will behave as it does now.  If given, then VectorAssembler can use that info 
> instead of figuring out the Vector sizes via metadata or examining Rows in 
> the data (though it could do consistency checks).
> * Add a VectorAssemblerEstimator which gets the Vector lengths from data and 
> produces a VectorAssembler with the vector lengths Param specified.
> This will not break existing workflows.  Migrating to 
> VectorAssemblerEstimator will be easier than adding VectorSizeHint since it 
> will not require users to manually input Vector lengths.
> Note: Even with this Estimator, VectorSizeHint might prove useful for other 
> things in the future which require vector length metadata, so we could 
> consider keeping it rather than deprecating it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24467) VectorAssemblerEstimator

2018-06-08 Thread Nick Pentreath (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506334#comment-16506334
 ] 

Nick Pentreath edited comment on SPARK-24467 at 6/8/18 5:59 PM:


Yeah the estimator would return a {{Model}} from {{fit}} right? So I don't 
think a new estimator could return the existing {{VectorAssembler}} but would 
probably need to return a new {{VectorAssemblerModel}}. Though perhaps the 
existing one can be made a {{Model}} without breaking things.


was (Author: mlnick):
Yeah the estimator would return a {{Model}} from {{fit}} right? So I don't 
think a new estimator could return the existing {{VectorAssembler}} but would 
probably need to return a new {{VectorAssemblerModel}}

> VectorAssemblerEstimator
> 
>
> Key: SPARK-24467
> URL: https://issues.apache.org/jira/browse/SPARK-24467
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> In [SPARK-22346], I believe I made a wrong API decision: I recommended adding 
> `VectorSizeHint` instead of making `VectorAssembler` into an Estimator since 
> I thought the latter option would break most workflows.  However, I should 
> have proposed:
> * Add a Param to VectorAssembler for specifying the sizes of Vectors in the 
> inputCols.  This Param can be optional.  If not given, then VectorAssembler 
> will behave as it does now.  If given, then VectorAssembler can use that info 
> instead of figuring out the Vector sizes via metadata or examining Rows in 
> the data (though it could do consistency checks).
> * Add a VectorAssemblerEstimator which gets the Vector lengths from data and 
> produces a VectorAssembler with the vector lengths Param specified.
> This will not break existing workflows.  Migrating to 
> VectorAssemblerEstimator will be easier than adding VectorSizeHint since it 
> will not require users to manually input Vector lengths.
> Note: Even with this Estimator, VectorSizeHint might prove useful for other 
> things in the future which require vector length metadata, so we could 
> consider keeping it rather than deprecating it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24467) VectorAssemblerEstimator

2018-06-08 Thread Nick Pentreath (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506334#comment-16506334
 ] 

Nick Pentreath commented on SPARK-24467:


Yeah the estimator would return a {{Model}} from {{fit}} right? So I don't 
think a new estimator could return the existing {{VectorAssembler}} but would 
probably need to return a new {{VectorAssemblerModel}}

> VectorAssemblerEstimator
> 
>
> Key: SPARK-24467
> URL: https://issues.apache.org/jira/browse/SPARK-24467
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> In [SPARK-22346], I believe I made a wrong API decision: I recommended adding 
> `VectorSizeHint` instead of making `VectorAssembler` into an Estimator since 
> I thought the latter option would break most workflows.  However, I should 
> have proposed:
> * Add a Param to VectorAssembler for specifying the sizes of Vectors in the 
> inputCols.  This Param can be optional.  If not given, then VectorAssembler 
> will behave as it does now.  If given, then VectorAssembler can use that info 
> instead of figuring out the Vector sizes via metadata or examining Rows in 
> the data (though it could do consistency checks).
> * Add a VectorAssemblerEstimator which gets the Vector lengths from data and 
> produces a VectorAssembler with the vector lengths Param specified.
> This will not break existing workflows.  Migrating to 
> VectorAssemblerEstimator will be easier than adding VectorSizeHint since it 
> will not require users to manually input Vector lengths.
> Note: Even with this Estimator, VectorSizeHint might prove useful for other 
> things in the future which require vector length metadata, so we could 
> consider keeping it rather than deprecating it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Revisiting Online serving of Spark models?

2018-06-05 Thread Nick Pentreath
I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it.

On Sun, 3 Jun 2018 at 00:24 Holden Karau  wrote:

> On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice <
> maximilianofel...@gmail.com> wrote:
>
>> Hi!
>>
>> We're already in San Francisco waiting for the summit. We even think that
>> we spotted @holdenk this afternoon.
>>
> Unless you happened to be walking by my garage probably not super likely,
> spent the day working on scooters/motorcycles (my style is a little less
> unique in SF :)). Also if you see me feel free to say hi unless I look like
> I haven't had my first coffee of the day, love chatting with folks IRL :)
>
>>
>> @chris, we're really interested in the Meetup you're hosting. My team
>> will probably join it from the beginning if you have room for us, and I'll
>> join it later after discussing the topics on this thread. I'll send you an
>> email regarding this request.
>>
>> Thanks
>>
>> El vie., 1 de jun. de 2018 7:26 AM, Saikat Kanjilal 
>> escribió:
>>
>>> @Chris This sounds fantastic, please send summary notes for Seattle
>>> folks
>>>
>>> @Felix I work in downtown Seattle, am wondering if we should host a tech
>>> meetup around model serving in Spark at my work or elsewhere close,
>>> thoughts?  I’m actually in the midst of building microservices to manage
>>> models and when I say models I mean much more than machine learning models
>>> (think OR, process models as well)
>>>
>>> Regards
>>>
>>> Sent from my iPhone
>>>
>>> On May 31, 2018, at 10:32 PM, Chris Fregly  wrote:
>>>
>>> Hey everyone!
>>>
>>> @Felix:  thanks for putting this together.  i sent some of you a quick
>>> calendar event - mostly for me, so i don’t forget!  :)
>>>
>>> Coincidentally, this is the focus of June 6th's *Advanced Spark and
>>> TensorFlow Meetup*
>>> 
>>>  @5:30pm
>>> on June 6th (same night) here in SF!
>>>
>>> Everybody is welcome to come.  Here’s the link to the meetup that
>>> includes the signup link:
>>> *https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/*
>>> 
>>>
>>> We have an awesome lineup of speakers covering a lot of deep, technical
>>> ground.
>>>
>>> For those who can’t attend in person, we’ll be broadcasting live - and
>>> posting the recording afterward.
>>>
>>> All details are in the meetup link above…
>>>
>>> @holden/felix/nick/joseph/maximiliano/saikat/leif:  you’re more than
>>> welcome to give a talk. I can move things around to make room.
>>>
>>> @joseph:  I’d personally like an update on the direction of the
>>> Databricks proprietary ML Serving export format which is similar to PMML
>>> but not a standard in any way.
>>>
>>> Also, the Databricks ML Serving Runtime is only available to Databricks
>>> customers.  This seems in conflict with the community efforts described
>>> here.  Can you comment on behalf of Databricks?
>>>
>>> Look forward to your response, joseph.
>>>
>>> See you all soon!
>>>
>>> —
>>>
>>>
>>> *Chris Fregly *Founder @ *PipelineAI*  (100,000
>>> Users)
>>> Organizer @ *Advanced Spark and TensorFlow Meetup*
>>>  (85,000
>>> Global Members)
>>>
>>>
>>>
>>> *San Francisco - Chicago - Austin -  Washington DC - London - Dusseldorf
>>> *
>>> *Try our PipelineAI Community Edition with GPUs and TPUs!!
>>> *
>>>
>>>
>>> On May 30, 2018, at 9:32 AM, Felix Cheung 
>>> wrote:
>>>
>>> Hi!
>>>
>>> Thank you! Let’s meet then
>>>
>>> June 6 4pm
>>>
>>> Moscone West Convention Center
>>> 800 Howard Street, San Francisco, CA 94103
>>> 
>>>
>>> Ground floor (outside of conference area - should be available for all)
>>> - we will meet and decide where to go
>>>
>>> (Would not send invite because that would be too much noise for dev@)
>>>
>>> To paraphrase Joseph, we will use this to kick off the discussion and
>>> post notes after and follow up online. As for Seattle, I would be very
>>> interested to meet in person later and discuss ;)
>>>
>>>
>>> _
>>> From: Saikat Kanjilal 
>>> Sent: Tuesday, May 29, 2018 11:46 AM
>>> Subject: Re: Revisiting Online serving of Spark models?
>>> To: Maximiliano Felice 
>>> Cc: Felix Cheung , Holden Karau <
>>> hol...@pigscanfly.ca>, Joseph Bradley , Leif
>>> Walsh , dev 
>>>
>>>
>>> Would love to join but am in Seattle, thoughts on how to make this work?
>>>
>>> Regards
>>>
>>> Sent from my iPhone
>>>
>>> On May 29, 2018, at 10:35 AM, Maximiliano Felice <
>>> maximilianofel...@gmail.com> wrote:
>>>
>>> Big +1 to a meeting with fresh air.
>>>
>>> Could anyone send the invites? I don't really know which is the place
>>> Holden is talking about.
>>>
>>> 2018-05-29 14:27 GMT-03:00 Felix Cheung :
>>>
 You had me at blue bottle!


Re: How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-15 Thread Nick Pentreath
Multi-column support for StringIndexer didn’t make it into Spark 2.3.0.

The PR is still in progress I think - should be available in 2.4.0
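In the meantime, the usual workaround is one StringIndexer per column, chained in a Pipeline. Below is a plain-Python sketch of what each per-column indexer computes under the default frequency-descending ordering (the alphabetical tie-break is illustrative, not Spark's exact behaviour):

```python
from collections import Counter

def string_index(column):
    # Most frequent label gets index 0.0, next 1.0, ... (ties broken alphabetically)
    freq = Counter(column)
    labels = sorted(freq, key=lambda label: (-freq[label], label))
    mapping = {label: float(i) for i, label in enumerate(labels)}
    return [mapping[v] for v in column], mapping

rows = [("a", "x"), ("b", "x"), ("a", "y")]
# One "indexer" per column, as a Pipeline of StringIndexers would do:
indexed_cols = [string_index([r[i] for r in rows]) for i in range(2)]
print(indexed_cols[0][1])  # column 0: {'a': 0.0, 'b': 1.0}
print(indexed_cols[1][1])  # column 1: {'x': 0.0, 'y': 1.0}
```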

On Mon, 14 May 2018 at 22:32, Mina Aslani  wrote:

> Please take a look at the api doc:
> https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/ml/feature/StringIndexer.html
>
> On Mon, May 14, 2018 at 4:30 PM, Mina Aslani  wrote:
>
>> Hi,
>>
>> There is no setInputCols/setOutputCols for StringIndexer in Spark Java.
>> How can multiple input/output columns be specified then?
>>
>> Regards,
>> Mina
>>
>
>


Re: A naive ML question

2018-04-29 Thread Nick Pentreath
One potential approach could be to construct a transition matrix showing
the probability of moving from each state to another state. This can be
visualized with a “heat map” encoding (I think matshow in matplotlib
does this).
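The transition-matrix idea can be sketched in a few lines of NumPy. The state names are taken from the question quoted below, and the two transaction histories are made up purely for illustration:

```python
import numpy as np

STATES = ["STARTED", "PENDING", "CANCELLED", "COMPLETED", "SETTLED"]
IDX = {s: i for i, s in enumerate(STATES)}

def transition_matrix(sequences):
    """Row-normalized counts: P[i, j] = Pr(next state j | current state i)."""
    counts = np.zeros((len(STATES), len(STATES)))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[IDX[a], IDX[b]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)

seqs = [
    ["STARTED", "PENDING", "COMPLETED", "SETTLED"],
    ["STARTED", "PENDING", "CANCELLED"],
]
P = transition_matrix(seqs)
print(P[IDX["PENDING"]])  # CANCELLED and COMPLETED each get probability 0.5
# plt.matshow(P) would then render the heat map with matplotlib.
```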

On Sat, 28 Apr 2018 at 21:34, kant kodali  wrote:

> Hi,
>
> I mean a transaction typically goes through different states like
> STARTED, PENDING, CANCELLED, COMPLETED, SETTLED etc...
>
> Thanks,
> kant
>
> On Sat, Apr 28, 2018 at 4:11 AM, Jörn Franke  wrote:
>
>> What do you mean by “how it evolved over time” ? A transaction describes
>> basically an action at a certain point of time. Do you mean how a financial
>> product evolved over time given a set of transactions?
>>
>> > On 28. Apr 2018, at 12:46, kant kodali  wrote:
>> >
>> > Hi All,
>> >
>> > I have a bunch of financial transactional data and I was wondering if
>> there is any ML model that can give me a graph structure for this data? In
>> other words, show how a transaction has evolved over time?
>> >
>> > Any suggestions or references would help.
>> >
>> > Thanks!
>> >
>>
>
>


Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Nick Pentreath
Also check out FeatureHasher in Spark 2.3.0 which is designed to handle
this use case in a more natural way than HashingTF (and handles multiple
columns at once).
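The reason hashing sidesteps the driver-memory problem raised below is that it is stateless: there is no fitted label-to-index dictionary anywhere. A hedged plain-Python sketch of the idea (CRC32 is an illustrative stand-in for the MurmurHash3 Spark uses, and the column names are hypothetical):

```python
import zlib

def hash_row(row, num_features):
    # Each "column=value" string is hashed straight to an index; nothing is
    # fitted or stored, so column cardinality never affects driver memory.
    vec = {}
    for col, value in row.items():
        idx = zlib.crc32(f"{col}={value}".encode()) % num_features
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec  # sparse representation: index -> value

row = {"user_id": "u123456", "city": "Cape Town"}  # hypothetical columns
sparse = hash_row(row, 2 ** 18)
print(len(sparse))  # at most 2 non-zero slots, whatever the cardinality
```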



On Tue, 10 Apr 2018 at 16:00, Filipp Zhinkin 
wrote:

> Hi Shahab,
>
> do you actually need to have a few columns with such a huge number of
> categories whose values depend on the original value's frequency?
>
> If not, then you could use a value's hash code as the category, or combine all
> columns into a single vector using HashingTF.
>
> Regards,
> Filipp.
>
> On Tue, Apr 10, 2018 at 4:01 PM, Shahab Yunus 
> wrote:
> > Does StringIndexer keep the entire label-to-index mapping in the memory
> of
> > the driver machine? It seems to, unless I am missing something.
> >
> > What if our data that needs to be indexed is huge and columns to be
> indexed
> > are high cardinality (or with lots of categories) and more than one such
> > column needs to be indexed? Meaning it wouldn't fit in memory.
> >
> > Thanks.
> >
> > Regards,
> > Shahab
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Welcome Zhenhua Wang as a Spark committer

2018-04-03 Thread Nick Pentreath
Congratulations!

On Tue, 3 Apr 2018 at 05:34 wangzhenhua (G)  wrote:

>
>
> Thanks everyone! It’s my great pleasure to be part of such a professional
> and innovative community!
>
>
>
>
>
> best regards,
>
> -Zhenhua(Xander)
>
>
>


Re: Spark MLlib: Should I call .cache before fitting a model?

2018-02-27 Thread Nick Pentreath
Currently, fit for many (most I think) models will cache the input data.
For LogisticRegression this is definitely the case, so you won't get any
benefit from caching it yourself.
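The guard Spark uses internally looks roughly like the following. This is a simplified Python sketch of the pattern (the `Dataset` class is a toy stand-in, not Spark's API): fit persists the input only when the caller has not already done so, and unpersists it afterwards in that case.

```python
class Dataset:
    """Toy stand-in for a Spark DataFrame that only tracks a storage level."""
    def __init__(self):
        self.storage_level = "NONE"
    def persist(self):
        self.storage_level = "MEMORY_AND_DISK"
    def unpersist(self):
        self.storage_level = "NONE"

def fit(dataset):
    # Cache the input only if the caller has not cached it already,
    # and clean up afterwards in that case.
    handle_persistence = dataset.storage_level == "NONE"
    if handle_persistence:
        dataset.persist()
    model = "trained-model"  # the actual training iterations would run here
    if handle_persistence:
        dataset.unpersist()
    return model

ds = Dataset()
fit(ds)  # fit() caches and then unpersists: the caller sees no change
print(ds.storage_level)  # -> NONE
```

Either way the input ends up cached during training, which is why an explicit `.cache()` by the caller brings no extra benefit here.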

On Tue, 27 Feb 2018 at 21:25 Gevorg Hari  wrote:

> Imagine that I am training a Spark MLlib model as follows:
>
> val trainingData = loadTrainingData(...)
> val logisticRegression = new LogisticRegression()
>
> trainingData.cache
> val logisticRegressionModel = logisticRegression.fit(trainingData)
>
> Does the call trainingData.cache improve performance at training time, or
> is it not needed?
>
> Does the .fit(...) method for a ML algorithm call cache/unpersist
> internally?
>
>


Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-27 Thread Nick Pentreath
+1 (binding)

Built and ran Scala tests with "-Phadoop-2.6 -Pyarn -Phive", all passed.

Python tests passed (also including pyspark-streaming w/kafka-0.8 and flume
packages built)

On Tue, 27 Feb 2018 at 10:09 Felix Cheung  wrote:

> +1
>
> Tested R:
>
> install from package, CRAN tests, manual tests, help check, vignettes check
>
> Filed this https://issues.apache.org/jira/browse/SPARK-23461
> This is not a regression so not a blocker of the release.
>
> Tested this on win-builder and r-hub. On r-hub on multiple platforms
> everything passed. For win-builder tests failed on x86 but passed x64 -
> perhaps due to an intermittent download issue causing a gzip error,
> re-testing now but won’t hold the release on this.
>
> --
> *From:* Nan Zhu 
> *Sent:* Monday, February 26, 2018 4:03:22 PM
> *To:* Michael Armbrust
> *Cc:* dev
> *Subject:* Re: [VOTE] Spark 2.3.0 (RC5)
>
> +1  (non-binding), tested with internal workloads and benchmarks
>
> On Mon, Feb 26, 2018 at 12:09 PM, Michael Armbrust  > wrote:
>
>> +1 all our pipelines have been running the RC for several days now.
>>
>> On Mon, Feb 26, 2018 at 10:33 AM, Dongjoon Hyun 
>> wrote:
>>
>>> +1 (non-binding).
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Mon, Feb 26, 2018 at 9:14 AM, Ryan Blue 
>>> wrote:
>>>
 +1 (non-binding)

 On Sat, Feb 24, 2018 at 4:17 PM, Xiao Li  wrote:

> +1 (binding) in Spark SQL, Core and PySpark.
>
> Xiao
>
> 2018-02-24 14:49 GMT-08:00 Ricardo Almeida <
> ricardo.alme...@actnowib.com>:
>
>> +1 (non-binding)
>>
>> same as previous RC
>>
>> On 24 February 2018 at 11:10, Hyukjin Kwon 
>> wrote:
>>
>>> +1
>>>
>>> 2018-02-24 16:57 GMT+09:00 Bryan Cutler :
>>>
 +1
 Tests passed and additionally ran Arrow related tests and did some
 perf checks with python 2.7.14

 On Fri, Feb 23, 2018 at 6:18 PM, Holden Karau  wrote:

> Note: given the state of Jenkins I'd love to see Bryan Cutler or
> someone with Arrow experience sign off on this release.
>
> On Fri, Feb 23, 2018 at 6:13 PM, Cheng Lian  > wrote:
>
>> +1 (binding)
>>
>> Passed all the tests, looks good.
>>
>> Cheng
>>
>> On 2/23/18 15:00, Holden Karau wrote:
>>
>> +1 (binding)
>> PySpark artifacts install in a fresh Py3 virtual env
>>
>> On Feb 23, 2018 7:55 AM, "Denny Lee" 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Fri, Feb 23, 2018 at 07:08 Josh Goldsborough <
>>> joshgoldsboroughs...@gmail.com> wrote:
>>>
 New to testing out Spark RCs for the community but I was able
 to run some of the basic unit tests without error so for what it's 
 worth,
 I'm a +1.

 On Thu, Feb 22, 2018 at 4:23 PM, Sameer Agarwal <
 samee...@apache.org> wrote:

> Please vote on releasing the following candidate as Apache
> Spark version 2.3.0. The vote is open until Tuesday February 27, 
> 2018 at
> 8:00:00 am UTC and passes if a majority of at least 3 PMC +1 
> votes are cast.
>
>
> [ ] +1 Release this package as Apache Spark 2.3.0
>
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see
> https://spark.apache.org/
>
> The tag to be voted on is v2.3.0-rc5:
> https://github.com/apache/spark/tree/v2.3.0-rc5
> (992447fb30ee9ebb3cf794f2d06f4d63a2d792db)
>
> List of JIRA tickets resolved in this release can be found
> here:
> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>
> The release files, including signatures, digests, etc. can be
> found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-bin/
>
> Release artifacts are signed with the following key:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1266/
>
> The documentation corresponding to this release can be found
> at:
>
> 

[jira] [Commented] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-02-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366809#comment-16366809
 ] 

Nick Pentreath commented on SPARK-23265:


Thanks for the ping - yes it adds more detailed checking of the exclusive 
params and would introduce an error being thrown in certain additional 
situations (specifically {{numBucketsArray}} set for single-column transform, 
{{numBuckets}} and {{numBucketsArray}} set for multi-column transform, 
mismatched length of {{numBucketsArray}} with input/output columns for 
multi-column transform).

I reviewed the PR and LGTM so as I said there we can merge this now before RC4 
gets cut.
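The exclusive-param rules described above can be summarized as follows. This is a hypothetical Python rendering of the checks for illustration, not the actual `ParamValidators` Scala code:

```python
def validate_params(input_col=None, input_cols=None,
                    num_buckets=None, num_buckets_array=None):
    # inputCol / inputCols are mutually exclusive.
    if input_col is not None and input_cols is not None:
        raise ValueError("Only one of inputCol and inputCols may be set")
    if input_cols is None:
        # Single-column transform: the array-valued param is not applicable.
        if num_buckets_array is not None:
            raise ValueError("numBucketsArray set for single-column transform")
    else:
        # Multi-column transform: numBuckets (applied to all columns) is
        # allowed, but not together with numBucketsArray.
        if num_buckets is not None and num_buckets_array is not None:
            raise ValueError("Both numBuckets and numBucketsArray are set")
        if (num_buckets_array is not None
                and len(num_buckets_array) != len(input_cols)):
            raise ValueError("numBucketsArray length must match inputCols")

validate_params(input_cols=["a", "b"], num_buckets=10)  # OK: applied to all
```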

> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}) an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
> for this transformer, it is acceptable to set the single-column param for 
> \{{numBuckets}} when transforming multiple columns, since that is then 
> applied to all columns.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-02-16 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23265:
---
Description: 
SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
both single- and multi-column params are set (specifically {{inputCol}} / 
{{inputCols}}) an error is thrown.

However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
for this transformer, it is acceptable to set the single-column param for 
\{{numBuckets}} when transforming multiple columns, since that is then applied 
to all columns.

  was:
SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
both single- and multi-column params are set (specifically {{inputCol}} / 
{{inputCols}}) an error is thrown.

However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
for this transformer, it is acceptable to set the single-column param for 
{{numBuckets }}when transforming multiple columns, since that is then applied 
to all columns.


> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}) an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
> for this transformer, it is acceptable to set the single-column param for 
> \{{numBuckets}} when transforming multiple columns, since that is then 
> applied to all columns.






[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib

2018-02-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366744#comment-16366744
 ] 

Nick Pentreath commented on SPARK-23437:


It sounds interesting - however the standard practice is that new algorithms 
should probably be released as a 3rd party Spark package. If they become 
widely-used then there is a stronger argument for integration into MLlib.

See [http://spark.apache.org/contributing.html] under the MLlib section for 
more details. 

> [ML] Distributed Gaussian Process Regression for MLlib
> --
>
> Key: SPARK-23437
> URL: https://issues.apache.org/jira/browse/SPARK-23437
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.2.1
>Reporter: Valeriy Avanesov
>Priority: Major
>
> Gaussian Process Regression (GP) is a well known black box non-linear 
> regression approach [1]. For years the approach remained inapplicable to 
> large samples due to its cubic computational complexity, however, more recent 
> techniques (Sparse GP) allowed for only linear complexity. The field 
> continues to attract the interest of researchers – several papers devoted to 
> GP were presented at NIPS 2017. 
> Unfortunately, non-parametric regression techniques coming with mllib are 
> restricted to tree-based approaches.
> I propose to create and include an implementation (which I am going to work 
> on) of so-called robust Bayesian Committee Machine proposed and investigated 
> in [2].
> [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian 
> Processes for Machine Learning (Adaptive Computation and Machine Learning)_. 
> The MIT Press.
> [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian 
> processes. In _Proceedings of the 32nd International Conference on 
> International Conference on Machine Learning - Volume 37_ (ICML'15), Francis 
> Bach and David Blei (Eds.), Vol. 37. JMLR.org 1481-1490.
>  






Re: [VOTE] Spark 2.3.0 (RC3)

2018-02-14 Thread Nick Pentreath
-1 for me as we elevated https://issues.apache.org/jira/browse/SPARK-23377 to
a Blocker. It should be fixed before release.

On Thu, 15 Feb 2018 at 07:25 Holden Karau  wrote:

> If this is a blocker in your view then the vote thread is an important
> place to mention it. I'm not super sure all of the places these methods are
> used so I'll defer to srowen and folks, but for the ML related implications
> in the past we've allowed people to set the hashing function when we've
> introduced changes.
>
> On Feb 15, 2018 2:08 PM, "mrkm4ntr"  wrote:
>
>> I was advised to post here in the discussion at GitHub. I do not know what
>> to do about the problem of discussions being dispersed across two places.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


[jira] [Commented] (SPARK-23377) Bucketizer with multiple columns persistence bug

2018-02-13 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362182#comment-16362182
 ] 

Nick Pentreath commented on SPARK-23377:


Should this be a blocker for 2.3? I think so since it should really be fixed 
before release.

> Bucketizer with multiple columns persistence bug
> 
>
> Key: SPARK-23377
> URL: https://issues.apache.org/jira/browse/SPARK-23377
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> A Bucketizer with multiple input/output columns get "inputCol" set to the 
> default value on write -> read which causes it to throw an error on 
> transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has 
> the inputCols Param set for multi-column transform. The following Params are 
> not applicable and should not be set: outputCol.
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>   at 
> line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17)
> {code}






Re: redundant decision tree model

2018-02-13 Thread Nick Pentreath
There is a long outstanding JIRA issue about it:
https://issues.apache.org/jira/browse/SPARK-3155.

It is probably still a useful feature to have for trees but the priority is
not that high since it may not be that useful for the tree ensemble models.

On Tue, 13 Feb 2018 at 11:52 Alessandro Solimando <
alessandro.solima...@gmail.com> wrote:

> Hello community,
> I have recently manually inspected some decision trees computed with Spark
> (2.2.1, but the behavior is the same with the latest code on the repo).
>
> I have observed that the trees are always complete, even if an entire
> subtree leads to the same prediction in its different leaves.
>
> In such case, the root of the subtree, instead of being an InternalNode,
> could simply be a LeafNode with the (shared) prediction.
>
> I know that decision trees computed by scikit-learn share the same
> behavior; I understand that this is needed by construction, because you
> realize this redundancy only at the end.
>
> So my question is, why is this "post-pruning" missing?
>
> Three hypothesis:
>
> 1) It is not suitable (for a reason I fail to see)
> 2) Such an addition to the code is considered not worth it (in terms of code
> complexity, maybe)
> 3) It has been overlooked, but could be a favorable addition
>
> For clarity, I have managed to isolate a small case to reproduce this, in
> what follows.
>
> This is the dataset:
>
>> +-+-+
>> |label|features |
>> +-+-+
>> |1.0  |[1.0,0.0,1.0]|
>> |1.0  |[0.0,1.0,0.0]|
>> |1.0  |[1.0,1.0,0.0]|
>> |0.0  |[0.0,0.0,0.0]|
>> |1.0  |[1.0,1.0,0.0]|
>> |0.0  |[0.0,1.0,1.0]|
>> |1.0  |[0.0,0.0,0.0]|
>> |0.0  |[0.0,1.0,1.0]|
>> |1.0  |[0.0,1.0,1.0]|
>> |0.0  |[1.0,0.0,0.0]|
>> |0.0  |[1.0,0.0,1.0]|
>> |1.0  |[0.0,1.0,1.0]|
>> |0.0  |[0.0,0.0,1.0]|
>> |0.0  |[1.0,0.0,1.0]|
>> |0.0  |[0.0,0.0,1.0]|
>> |0.0  |[1.0,1.0,1.0]|
>> |0.0  |[1.0,1.0,0.0]|
>> |1.0  |[1.0,1.0,1.0]|
>> |0.0  |[1.0,0.0,1.0]|
>> +-+-+
>
>
> Which generates the following model:
>
> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15
>> nodes
>>   If (feature 1 <= 0.5)
>>If (feature 2 <= 0.5)
>> If (feature 0 <= 0.5)
>>  Predict: 0.0
>> Else (feature 0 > 0.5)
>>  Predict: 0.0
>>Else (feature 2 > 0.5)
>> If (feature 0 <= 0.5)
>>  Predict: 0.0
>> Else (feature 0 > 0.5)
>>  Predict: 0.0
>>   Else (feature 1 > 0.5)
>>If (feature 2 <= 0.5)
>> If (feature 0 <= 0.5)
>>  Predict: 1.0
>> Else (feature 0 > 0.5)
>>  Predict: 1.0
>>Else (feature 2 > 0.5)
>> If (feature 0 <= 0.5)
>>  Predict: 0.0
>> Else (feature 0 > 0.5)
>>  Predict: 0.0
>
>
> As you can see, the following model would be equivalent, but smaller:
>
> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 2 with 5
>> nodes
>>   If (feature 1 <= 0.5)
>>Predict: 0.0
>>   Else (feature 1 > 0.5)
>>If (feature 2 <= 0.5)
>> Predict: 1.0
>>Else (feature 2 > 0.5)
>> Predict: 0.0
>
>
> This happens pretty often in real cases, and while the gain per single
> model invocation of the "optimized" version is small, it can become
> non-negligible when the number of calls is massive, as one can expect in a
> Big Data context.
>
> I would appreciate your opinion on this matter (if relevant for a PR or
> not, pros/cons etc).
>
> Best regards,
> Alessandro
>
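The "post-pruning" described in the thread is straightforward to express bottom-up. A minimal sketch, using plain-Python tuples as hypothetical stand-ins for Spark's InternalNode/LeafNode (this is an illustration of the idea, not Spark code), applied to the example tree above:

```python
# A node is either ("leaf", prediction) or
# ("split", feature_index, threshold, left_child, right_child).

def prune(node):
    # Bottom-up: prune children first, then collapse a split whose two
    # children are leaves with the same prediction into a single leaf.
    if node[0] == "leaf":
        return node
    _, feat, thr, left, right = node
    left, right = prune(left), prune(right)
    if left[0] == "leaf" and right[0] == "leaf" and left[1] == right[1]:
        return ("leaf", left[1])  # redundant split: collapse it
    return ("split", feat, thr, left, right)

# The example tree from the thread: the entire "feature 1 <= 0.5" subtree
# predicts 0.0 in every leaf, so it collapses to one leaf.
tree = ("split", 1, 0.5,
        ("split", 2, 0.5,
         ("split", 0, 0.5, ("leaf", 0.0), ("leaf", 0.0)),
         ("split", 0, 0.5, ("leaf", 0.0), ("leaf", 0.0))),
        ("split", 2, 0.5,
         ("split", 0, 0.5, ("leaf", 1.0), ("leaf", 1.0)),
         ("split", 0, 0.5, ("leaf", 0.0), ("leaf", 0.0))))

print(prune(tree))
# -> ('split', 1, 0.5, ('leaf', 0.0), ('split', 2, 0.5, ('leaf', 1.0), ('leaf', 0.0)))
```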


[jira] [Commented] (SPARK-14047) GBT improvement umbrella

2018-02-07 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355216#comment-16355216
 ] 

Nick Pentreath commented on SPARK-14047:


SPARK-12375 should fix that? Can you check it against the 2.3 RC (or 
branch-2.3)? If not could you provide some code to reproduce the error?

> GBT improvement umbrella
> 
>
> Key: SPARK-14047
> URL: https://issues.apache.org/jira/browse/SPARK-14047
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is an umbrella for improvements to learning Gradient Boosted Trees: 
> GBTClassifier, GBTRegressor.
> Note: Aspects of GBTs which are related to individual trees should be listed 
> under [SPARK-14045].






Re: [VOTE] Spark 2.3.0 (RC2)

2018-02-01 Thread Nick Pentreath
All MLlib QA JIRAs resolved. Looks like SparkR too, so from the ML side
that should be everything outstanding.

On Thu, 1 Feb 2018 at 06:21 Yin Huai  wrote:

> seems we are not running tests related to pandas in pyspark tests (see my
> email "python tests related to pandas are skipped in jenkins"). I think we
> should fix this test issue and make sure all tests are good before cutting
> RC3.
>
> On Wed, Jan 31, 2018 at 10:12 AM, Sameer Agarwal 
> wrote:
>
>> Just a quick status update on RC3 -- SPARK-23274
>>  was resolved
>> yesterday and tests have been quite healthy throughout this week and the
>> last. I'll cut the new RC as soon as the remaining blocker (SPARK-23202
>> ) is resolved.
>>
>>
>> On 30 January 2018 at 10:12, Andrew Ash  wrote:
>>
>>> I'd like to nominate SPARK-23274
>>>  as a potential
>>> blocker for the 2.3.0 release as well, due to being a regression from
>>> 2.2.0.  The ticket has a simple repro included, showing a query that works
>>> in prior releases but now fails with an exception in the catalyst optimizer.
>>>
>>> On Fri, Jan 26, 2018 at 10:41 AM, Sameer Agarwal 
>>> wrote:
>>>
 This vote has failed due to a number of aforementioned blockers. I'll
 follow up with RC3 as soon as the 2 remaining (non-QA) blockers are
 resolved: https://s.apache.org/oXKi


 On 25 January 2018 at 12:59, Sameer Agarwal 
 wrote:

>
> Most tests pass on RC2, except I'm still seeing the timeout caused by
>> https://issues.apache.org/jira/browse/SPARK-23055 ; the tests never
>> finish. I followed the thread a bit further and wasn't clear whether it 
>> was
>> subsequently re-fixed for 2.3.0 or not. It says it's resolved along with
>> https://issues.apache.org/jira/browse/SPARK-22908 for 2.3.0 though I
>> am still seeing these tests fail or hang:
>>
>> - subscribing topic by name from earliest offsets (failOnDataLoss:
>> false)
>> - subscribing topic by name from earliest offsets (failOnDataLoss:
>> true)
>>
>
> Sean, while some of these tests were timing out on RC1, we're not
> aware of any known issues in RC2. Both maven (
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.6/146/testReport/org.apache.spark.sql.kafka010/history/)
> and sbt (
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/123/testReport/org.apache.spark.sql.kafka010/history/)
> historical builds on jenkins for org.apache.spark.sql.kafka010 look fairly
> healthy. If you're still seeing timeouts in RC2, can you create a JIRA 
> with
> any applicable build/env info?
>
>
>
>> On Tue, Jan 23, 2018 at 9:01 AM Sean Owen  wrote:
>>
>>> I'm not seeing that same problem on OS X and /usr/bin/tar. I tried
>>> unpacking it with 'xvzf' and also unzipping it first, and it untarred
>>> without warnings in either case.
>>>
>>> I am encountering errors while running the tests, different ones
>>> each time, so am still figuring out whether there is a real problem or 
>>> just
>>> flaky tests.
>>>
>>> These issues look like blockers, as they are inherently to be
>>> completed before the 2.3 release. They are mostly not done. I suppose 
>>> I'd
>>> -1 on behalf of those who say this needs to be done first, though, we 
>>> can
>>> keep testing.
>>>
>>> SPARK-23105 Spark MLlib, GraphX 2.3 QA umbrella
>>> SPARK-23114 Spark R 2.3 QA umbrella
>>>
>>> Here are the remaining items targeted for 2.3:
>>>
>>> SPARK-15689 Data source API v2
>>> SPARK-20928 SPIP: Continuous Processing Mode for Structured Streaming
>>> SPARK-21646 Add new type coercion rules to compatible with Hive
>>> SPARK-22386 Data Source V2 improvements
>>> SPARK-22731 Add a test for ROWID type to OracleIntegrationSuite
>>> SPARK-22735 Add VectorSizeHint to ML features documentation
>>> SPARK-22739 Additional Expression Support for Objects
>>> SPARK-22809 pyspark is sensitive to imports with dots
>>> SPARK-22820 Spark 2.3 SQL API audit
>>>
>>>
>>> On Mon, Jan 22, 2018 at 7:09 PM Marcelo Vanzin 
>>> wrote:
>>>
 +0

 Signatures check out. Code compiles, although I see the errors in
 [1]
 when untarring the source archive; perhaps we should add "use GNU
 tar"
 to the RM checklist?

 Also ran our internal tests and they seem happy.

 My concern is the list of open bugs targeted at 2.3.0 (ignoring 

[jira] [Resolved] (SPARK-23105) Spark MLlib, GraphX 2.3 QA umbrella

2018-02-01 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23105.

   Resolution: Resolved
Fix Version/s: 2.3.0

> Spark MLlib, GraphX 2.3 QA umbrella
> ---
>
> Key: SPARK-23105
> URL: https://issues.apache.org/jira/browse/SPARK-23105
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.3.0
>
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX. *SparkR is separate: SPARK-23114.*
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
>  * Check binary API compatibility for Scala/Java
>  * Audit new public APIs (from the generated html doc)
>  ** Scala
>  ** Java compatibility
>  ** Python coverage
>  * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
>  * Performance tests
> h2. Documentation and example code
>  * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
>  * Update Programming Guide
>  * Update website






[jira] [Resolved] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs

2018-02-01 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23110.

   Resolution: Resolved
Fix Version/s: 2.3.0

> ML 2.3 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-23110
> URL: https://issues.apache.org/jira/browse/SPARK-23110
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Blocker
> Fix For: 2.3.0
>
> Attachments: 1_process_script.sh, added_ml_class, 
> different_methods_in_ML.diff
>
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so we can make this 
> task easier in the future!






[jira] [Resolved] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-02-01 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23107.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20459
[https://github.com/apache/spark/pull/20459]

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.3.0
>
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA 
> issue (SPARK-23111 for {{2.3}})






[jira] [Commented] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe

2018-02-01 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348223#comment-16348223
 ] 

Nick Pentreath commented on SPARK-23290:


cc [~bryanc]

> inadvertent change in handling of DateType when converting to pandas dataframe
> --
>
> Key: SPARK-23290
> URL: https://issues.apache.org/jira/browse/SPARK-23290
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Andre Menck
>Priority: Major
>
> In [this 
> PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968]
>  there was a change in how `DateType` is being returned to users (line 1968 
> in dataframe.py). This can cause client code to fail, as in the following 
> example from a python terminal:
> {code:python}
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> 02015-01-01
> Name: date, dtype: object
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'] = pd.to_datetime(pdf['date'])
> >>> pdf.dtypes
> datedatetime64[ns]
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2355, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/_libs/src/inference.pyx", line 1574, in 
> pandas._libs.lib.map_infer
>   File "", line 1, in 
> TypeError: strptime() argument 1 must be string, not Timestamp
> >>> 
> {code}
> Above we show both the old behavior (returning an "object" col) and the new 
> behavior (returning a datetime column). Since there may be user code relying 
> on the old behavior, I'd suggest reverting this specific part of this change. 
> Also note that the NOTE in the docstring for "_to_corrected_pandas_type" 
> seems to be off, referring to the old behavior rather than the current one.
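The incompatibility above can be reproduced with the standard library alone; the `parse_date` helper below is an illustrative stand-in for the kind of client code the report describes, not code from Spark itself:

```python
import datetime as dt

def parse_date(d):
    # Client code written against the old behavior: expects a string value.
    return dt.datetime.strptime(d, "%Y-%m-%d").date()

# Old behavior: dates came back as plain strings ("object" dtype), so
# strptime-based client code worked.
assert parse_date("2015-01-01") == dt.date(2015, 1, 1)

# New behavior: the column holds datetime-like values, and strptime
# rejects anything that is not a string with a TypeError.
try:
    parse_date(dt.datetime(2015, 1, 1))
    raised = False
except TypeError:
    raised = True
assert raised
```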



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs

2018-01-31 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346645#comment-16346645
 ] 

Nick Pentreath edited comment on SPARK-23110 at 1/31/18 11:34 AM:
--

Took a quick look through the diff. Apart from one issue all looks ok.

I did pick up that [PR 19020|https://github.com/apache/spark/pull/19020] made 
the existing constructor for {{LinearRegressionModel}} public - I assume this 
was not intended cc [~yanboliang]? 

 
{code:java}
def this(uid: String, coefficients: Vector, intercept: Double) =
  this(uid, coefficients, intercept, 1.0){code}


was (Author: mlnick):
Took a quick look through the diff. Apart from one issue all looks ok.

I did pick up that PR X made the existing constructor for 
{{LinearRegressionModel}} public - I assume this was not intended cc 
[~yanboliang]? 

 
{code:java}
def this(uid: String, coefficients: Vector, intercept: Double) =
  this(uid, coefficients, intercept, 1.0){code}

> ML 2.3 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-23110
> URL: https://issues.apache.org/jira/browse/SPARK-23110
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Blocker
> Attachments: 1_process_script.sh, added_ml_class, 
> different_methods_in_ML.diff
>
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so, so we can make this 
> task easier in the future!






[jira] [Comment Edited] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs

2018-01-31 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346645#comment-16346645
 ] 

Nick Pentreath edited comment on SPARK-23110 at 1/31/18 11:32 AM:
--

Took a quick look through the diff. Apart from one issue all looks ok.

I did pick up that PR X made the existing constructor for 
{{LinearRegressionModel}} public - I assume this was not intended cc 
[~yanboliang]? 

 
{code:java}
def this(uid: String, coefficients: Vector, intercept: Double) =
  this(uid, coefficients, intercept, 1.0){code}


was (Author: mlnick):
Took a quick look through the diff. 

I did pick up that PR X made the existing constructor for 
{{LinearRegressionModel}} public - I assume this was not intended cc 
[~yanboliang]? 

 
{code:java}
def this(uid: String, coefficients: Vector, intercept: Double) =
  this(uid, coefficients, intercept, 1.0){code}







[jira] [Commented] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs

2018-01-31 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346645#comment-16346645
 ] 

Nick Pentreath commented on SPARK-23110:


Took a quick look through the diff. 

I did pick up that PR X made the existing constructor for 
{{LinearRegressionModel}} public - I assume this was not intended cc 
[~yanboliang]? 

 
{code:java}
def this(uid: String, coefficients: Vector, intercept: Double) =
  this(uid, coefficients, intercept, 1.0){code}







[jira] [Commented] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs

2018-01-31 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346573#comment-16346573
 ] 

Nick Pentreath commented on SPARK-23110:


I checked added classes from {{added_ml_class}}, all seem fine:
 * logistic summaries have related Java examples that were tested in 
[PR20332|https://github.com/apache/spark/pull/20332]
 * clustering evaluator has related Java example (the other class is private)
 * feature hasher has related Java example
 * new OHE has Java example
 * vector size hint has Java example
 * image schema public method sigs seem fine (but no Java example as yet)
 * new params fine
 * summarizer public methods seem fine (the varargs {{metrics}} generates a 
Java-friendly forwarder) - though no Java example as yet

The rest are private.







[jira] [Resolved] (SPARK-23111) ML, Graph 2.3 QA: Update user guide for new features & APIs

2018-01-31 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23111.

   Resolution: Resolved
Fix Version/s: 2.3.0

> ML, Graph 2.3 QA: Update user guide for new features & APIs
> ---
>
> Key: SPARK-23111
> URL: https://issues.apache.org/jira/browse/SPARK-23111
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>    Assignee: Nick Pentreath
>Priority: Critical
> Fix For: 2.3.0
>
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> For MLlib:
> * This task does not include major reorganizations for the programming guide.
> * We should now begin copying algorithm details from the spark.mllib guide to 
> spark.ml as needed, rather than just linking back to the corresponding 
> algorithms in the spark.mllib user guide.
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.






[jira] [Commented] (SPARK-23111) ML, Graph 2.3 QA: Update user guide for new features & APIs

2018-01-31 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346442#comment-16346442
 ] 

Nick Pentreath commented on SPARK-23111:


Went through all the new features and listed the Jira tickets here. I think I 
got everything, but of course let me know if I missed any items. Resolving 
this. 







[jira] [Assigned] (SPARK-23111) ML, Graph 2.3 QA: Update user guide for new features & APIs

2018-01-31 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23111:
--

Assignee: Nick Pentreath







[jira] [Resolved] (SPARK-23112) ML, Graph 2.3 QA: Programming guide update and migration guide

2018-01-31 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23112.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20421
[https://github.com/apache/spark/pull/20421]

> ML, Graph 2.3 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-23112
> URL: https://issues.apache.org/jira/browse/SPARK-23112
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>    Assignee: Nick Pentreath
>Priority: Critical
> Fix For: 2.3.0
>
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")






[jira] [Commented] (SPARK-23154) Document backwards compatibility guarantees for ML persistence

2018-01-30 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344885#comment-16344885
 ] 

Nick Pentreath commented on SPARK-23154:


Where do we intend to put this note? In 
[http://spark.apache.org/docs/latest/ml-pipeline.html#saving-and-loading-pipelines?]
 Or as a new section in [http://spark.apache.org/docs/latest/ml-guide.html]?

> Document backwards compatibility guarantees for ML persistence
> --
>
> Key: SPARK-23154
> URL: https://issues.apache.org/jira/browse/SPARK-23154
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> We have (as far as I know) maintained backwards compatibility for ML 
> persistence, but this is not documented anywhere.  I'd like us to document it 
> (for spark.ml, not for spark.mllib).
> I'd recommend something like:
> {quote}
> In general, MLlib maintains backwards compatibility for ML persistence.  
> I.e., if you save an ML model or Pipeline in one version of Spark, then you 
> should be able to load it back and use it in a future version of Spark.  
> However, there are rare exceptions, described below.
> Model persistence: Is a model or Pipeline saved using Apache Spark ML 
> persistence in Spark version X loadable by Spark version Y?
> * Major versions: No guarantees, but best-effort.
> * Minor and patch versions: Yes; these are backwards compatible.
> * Note about the format: There are no guarantees for a stable persistence 
> format, but model loading itself is designed to be backwards compatible.
> Model behavior: Does a model or Pipeline in Spark version X behave 
> identically in Spark version Y?
> * Major versions: No guarantees, but best-effort.
> * Minor and patch versions: Identical behavior, except for bug fixes.
> For both model persistence and model behavior, any breaking changes across a 
> minor version or patch version are reported in the Spark version release 
> notes. If a breakage is not reported in release notes, then it should be 
> treated as a bug to be fixed.
> {quote}
> How does this sound?
> Note: We unfortunately don't have tests for backwards compatibility (which 
> has technical hurdles and can be discussed in [SPARK-15573]).  However, we 
> have made efforts to maintain it during PR review and Spark release QA, and 
> most users expect it.






[jira] [Updated] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23265:
---
Description: 
SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
both single- and multi-column params are set (specifically {{inputCol}} / 
{{inputCols}}) an error is thrown.

However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
for this transformer, it is acceptable to set the single-column param for 
{{numBuckets}} when transforming multiple columns, since that is then applied 
to all columns.

  was:
SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
both single- and multi-column params are set (specifically {{inputCol}} / 
{{inputCols}}) an error is thrown.

However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
for this transformer, it is acceptable to set the single-column param for 
{{numBuckets}}, since that is then applied to all columns.


> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}) an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
> for this transformer, it is acceptable to set the single-column param for 
> {{numBuckets}} when transforming multiple columns, since that is then applied 
> to all columns.
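The exclusive-params check described above can be sketched in plain Python. The function name and error messages below are hypothetical illustrations of the Bucketizer-style logic, not Spark's actual implementation:

```python
def check_exclusive_params(input_col=None, input_cols=None):
    """Hypothetical sketch: exactly one of inputCol / inputCols may be set."""
    if input_col is not None and input_cols is not None:
        raise ValueError("QuantileDiscretizer: only one of inputCol or "
                         "inputCols can be set.")
    if input_col is None and input_cols is None:
        raise ValueError("QuantileDiscretizer: one of inputCol or "
                         "inputCols must be set.")
    # numBuckets is exempt from this check: a single-column numBuckets
    # value is simply applied to every column when inputCols is used.

check_exclusive_params(input_col="age")            # single-column: ok
check_exclusive_params(input_cols=["age", "fare"])  # multi-column: ok
```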






[jira] [Commented] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-01-29 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344604#comment-16344604
 ] 

Nick Pentreath commented on SPARK-23265:


cc [~huaxing] 

> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}) an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match.






[jira] [Updated] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23265:
---
Issue Type: Improvement  (was: Documentation)

> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}) an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match.






[jira] [Created] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-01-29 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-23265:
--

 Summary: Update multi-column error handling logic in 
QuantileDiscretizer
 Key: SPARK-23265
 URL: https://issues.apache.org/jira/browse/SPARK-23265
 Project: Spark
  Issue Type: Documentation
  Components: ML
Affects Versions: 2.3.0
Reporter: Nick Pentreath


SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
both single- and multi-column params are set (specifically {{inputCol}} / 
{{inputCols}}) an error is thrown.

However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
The logic for {{QuantileDiscretizer}} should be updated to match.






[jira] [Updated] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23265:
---
Description: 
SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
both single- and multi-column params are set (specifically {{inputCol}} / 
{{inputCols}}) an error is thrown.

However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
for this transformer, it is acceptable to set the single-column param for 
{{numBuckets}}, since that is then applied to all columns.

  was:
SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
both single- and multi-column params are set (specifically {{inputCol}} / 
{{inputCols}}) an error is thrown.

However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
The logic for {{QuantileDiscretizer}} should be updated to match.


> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}) an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
> for this transformer, it is acceptable to set the single-column param for 
> {{numBuckets}}, since that is then applied to all columns.






[jira] [Resolved] (SPARK-23138) Add user guide example for multiclass logistic regression summary

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23138.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20332
[https://github.com/apache/spark/pull/20332]

> Add user guide example for multiclass logistic regression summary
> -
>
> Key: SPARK-23138
> URL: https://issues.apache.org/jira/browse/SPARK-23138
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
> Fix For: 2.3.0
>
>
> We haven't updated the user guide to reflect the multiclass logistic 
> regression summary added in SPARK-17139.






[jira] [Assigned] (SPARK-23138) Add user guide example for multiclass logistic regression summary

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23138:
--

Assignee: Seth Hendrickson

> Add user guide example for multiclass logistic regression summary
> -
>
> Key: SPARK-23138
> URL: https://issues.apache.org/jira/browse/SPARK-23138
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
> Fix For: 2.3.0
>
>
> We haven't updated the user guide to reflect the multiclass logistic 
> regression summary added in SPARK-17139.






Re: Reverse MinMaxScaler in SparkML

2018-01-29 Thread Nick Pentreath
This would be interesting and a good addition I think.

It bears some thought about the API though. One approach is to have an
"inverseTransform" method similar to sklearn.

The other approach is to "formalize" something like StringIndexerModel ->
IndexToString. Here, the inverse transformer is a standalone transformer.
It could be returned from a "getInverseTransformer" method, for example.

The former approach is simpler, but cannot be used in pipelines (which work
on "fit" / "transform"). The latter approach is more cumbersome, but fits
better into pipelines.

So it depends on the use cases - i.e. how common is it to use the inverse
transform function within a pipeline (for StringIndexer <-> IndexToString
it is quite common to get back the labels, while for other transformers it
may or may not be).
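
For illustration, the math behind such an inverse transform is a simple affine inversion using the {{originalMin}} / {{originalMax}} values mentioned below. A minimal plain-Python sketch (these function names are hypothetical, not the Spark API):

```python
def min_max_scale(x, e_min, e_max, lo=0.0, hi=1.0):
    # MinMaxScaler-style rescaling of x from [e_min, e_max] to [lo, hi]
    return (x - e_min) / (e_max - e_min) * (hi - lo) + lo

def min_max_inverse(y, e_min, e_max, lo=0.0, hi=1.0):
    # Invert the affine rescaling to recover the original value
    return (y - lo) / (hi - lo) * (e_max - e_min) + e_min

original = [10.0, 15.0, 20.0]
scaled = [min_max_scale(v, 10.0, 20.0) for v in original]
restored = [min_max_inverse(v, 10.0, 20.0) for v in scaled]
print(scaled)    # [0.0, 0.5, 1.0]
print(restored)  # [10.0, 15.0, 20.0]
```

The API question in the thread is orthogonal to this math: the same inversion could back either an `inverseTransform` method or a standalone inverse transformer.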

On Mon, 8 Jan 2018 at 11:10 Tomasz Dudek 
wrote:

> Hello,
>
> since the similar question on StackOverflow remains unanswered (
> https://stackoverflow.com/questions/46092114/is-there-no-inverse-transform-method-for-a-scaler-like-minmaxscaler-in-spark
> ) and perhaps there is a solution that I am not aware of, I'll ask:
>
> After traning MinMaxScaler(or similar scaler) is there any built-in way to
> revert the process? What I mean is to transform the scaled data back to its
> original form. SKlearn has a dedicated method inverse_transform that does
> exactly that.
>
> I can, of course, get the originalMin/originalMax Vectors from the
> MinMaxScalerModel and then map the values myself but it would be nice to
> have it built-in.
>
> Yours,
> Tomasz
>
>


[jira] [Assigned] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23108:
--

Assignee: Nick Pentreath

> ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-23108
> URL: https://issues.apache.org/jira/browse/SPARK-23108
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>    Assignee: Nick Pentreath
>Priority: Blocker
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.






[jira] [Comment Edited] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit

2018-01-29 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343278#comment-16343278
 ] 

Nick Pentreath edited comment on SPARK-23108 at 1/29/18 12:14 PM:
--

Went through {{Experimental}} APIs, there could be a case for:
 * {{Regression / Binary / Multiclass}} evaluators as they've been around for a 
long time.
 * Linear regression summary (since {{1.5.0}}).
 * {{AFTSurvivalRegression}} (since {{1.6.0}}).

I think at this late stage we should not open up anything, unless anyone feels 
very strongly? 


was (Author: mlnick):
I think at this late stage we should not open up anything, unless anyone feels 
very strongly? 

> ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-23108
> URL: https://issues.apache.org/jira/browse/SPARK-23108
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.






[jira] [Resolved] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23108.

   Resolution: Resolved
Fix Version/s: 2.3.0

> ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-23108
> URL: https://issues.apache.org/jira/browse/SPARK-23108
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>    Assignee: Nick Pentreath
>Priority: Blocker
> Fix For: 2.3.0
>
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.






[jira] [Commented] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit

2018-01-29 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343290#comment-16343290
 ] 

Nick Pentreath commented on SPARK-23108:


Also checked ml {{DeveloperAPI}}, nothing to graduate there I would say.

> ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-23108
> URL: https://issues.apache.org/jira/browse/SPARK-23108
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>    Assignee: Nick Pentreath
>Priority: Blocker
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.






[jira] [Commented] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit

2018-01-29 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343278#comment-16343278
 ] 

Nick Pentreath commented on SPARK-23108:


I think at this late stage we should not open up anything, unless anyone feels 
very strongly? 

> ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-23108
> URL: https://issues.apache.org/jira/browse/SPARK-23108
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.






[jira] [Commented] (SPARK-23109) ML 2.3 QA: API: Python API coverage

2018-01-29 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343276#comment-16343276
 ] 

Nick Pentreath commented on SPARK-23109:


Created SPARK-23256 to track {{columnSchema}} in Python API.

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Bryan Cutler
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*






[jira] [Created] (SPARK-23256) Add columnSchema method to PySpark image reader

2018-01-29 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-23256:
--

 Summary: Add columnSchema method to PySpark image reader
 Key: SPARK-23256
 URL: https://issues.apache.org/jira/browse/SPARK-23256
 Project: Spark
  Issue Type: Documentation
  Components: ML, PySpark
Affects Versions: 2.3.0
Reporter: Nick Pentreath


SPARK-21866 added support for reading image data into a DataFrame. The PySpark 
API is missing the {{columnSchema}} method that exists in the Scala API. 






[jira] [Commented] (SPARK-23109) ML 2.3 QA: API: Python API coverage

2018-01-29 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343269#comment-16343269
 ] 

Nick Pentreath commented on SPARK-23109:


So [~bryanc] I think this is done then? Can you confirm?

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Bryan Cutler
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*






[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2018-01-29 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343266#comment-16343266
 ] 

Nick Pentreath commented on SPARK-21866:


Ok, added SPARK-23255 to track user guide additions

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>Assignee: Ilya Matiach
>Priority: Major
>  Labels: SPIP
> Fix For: 2.3.0
>
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
>  * BigDL
>  * DeepLearning4J
>  * Deep Learning Pipelines
>  * MMLSpark
>  * TensorFlow (Spark connector)
>  * TensorFlowOnSpark
>  * TensorFrames
>  * Thunder
> h2. Goals:
>  * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
>  * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
>  * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
>  * the total size of an image should be restricted to less than 2GB (roughly)
>  * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
>  * specialized formats used in meteorology, the medical field, etc. are not 
> supported
>  * this format is specialized to images and does not attempt to solve the 
> more general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
>  {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
>  * StructField("mode", StringType(), False),
>  ** The exact representation of the data.
>  ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> &quo

[jira] [Created] (SPARK-23255) Add user guide and examples for DataFrame image reading functions

2018-01-29 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-23255:
--

 Summary: Add user guide and examples for DataFrame image reading 
functions
 Key: SPARK-23255
 URL: https://issues.apache.org/jira/browse/SPARK-23255
 Project: Spark
  Issue Type: Documentation
  Components: ML, PySpark
Affects Versions: 2.3.0
Reporter: Nick Pentreath


SPARK-21866 added built-in support for reading image data into a DataFrame. 
This new functionality should be documented in the user guide, with example 
usage.






[jira] [Updated] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23107:
---
Description: 
Audit new public Scala APIs added to MLlib & GraphX. Take note of:
 * Protected/public classes or methods. If access can be more private, then it 
should be.
 * Also look for non-sealed traits.
 * Documentation: Missing? Bad links or formatting?

*Make sure to check the object doc!*

As you find issues, please create JIRAs and link them to this issue. 

For *user guide issues* link the new JIRAs to the relevant user guide QA issue 
(SPARK-23111 for {{2.3}})

  was:
Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
* Protected/public classes or methods.  If access can be more private, then it 
should be.
* Also look for non-sealed traits.
* Documentation: Missing?  Bad links or formatting?

*Make sure to check the object doc!*

As you find issues, please create JIRAs and link them to this issue.


> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA 
> issue (SPARK-23111 for {{2.3}})






[jira] [Updated] (SPARK-23227) Add user guide entry for collecting sub models for cross-validation classes

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23227:
---
Priority: Minor  (was: Major)

> Add user guide entry for collecting sub models for cross-validation classes
> ---
>
> Key: SPARK-23227
> URL: https://issues.apache.org/jira/browse/SPARK-23227
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>Priority: Minor
>







[jira] [Updated] (SPARK-23254) Add user guide entry for DataFrame multivariate summary

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23254:
---
Priority: Minor  (was: Major)

> Add user guide entry for DataFrame multivariate summary
> ---
>
> Key: SPARK-23254
> URL: https://issues.apache.org/jira/browse/SPARK-23254
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>Priority: Minor
>
> SPARK-19634 added a DataFrame API for vector summary statistics. The [ML user 
> guide|http://spark.apache.org/docs/latest/ml-statistics.html] should be 
> updated, with the relevant example (to be in parity with the [MLlib user 
> guide|http://spark.apache.org/docs/latest/mllib-statistics.html#summary-statistics]).
>  






[jira] [Updated] (SPARK-23127) Update FeatureHasher user guide for catCols parameter

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23127:
---
Priority: Minor  (was: Major)

> Update FeatureHasher user guide for catCols parameter
> -
>
> Key: SPARK-23127
> URL: https://issues.apache.org/jira/browse/SPARK-23127
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
>Priority: Minor
> Fix For: 2.3.0
>
>
> SPARK-22801 added the {{categoricalCols}} parameter and updated the Scala and 
> Python doc, but did not update the user guide entry discussing feature 
> handling.






[jira] [Created] (SPARK-23254) Add user guide entry for DataFrame multivariate summary

2018-01-29 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-23254:
--

 Summary: Add user guide entry for DataFrame multivariate summary
 Key: SPARK-23254
 URL: https://issues.apache.org/jira/browse/SPARK-23254
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 2.3.0
Reporter: Nick Pentreath


SPARK-19634 added a DataFrame API for vector summary statistics. The [ML user 
guide|http://spark.apache.org/docs/latest/ml-statistics.html] should be 
updated, with the relevant example (to be in parity with the [MLlib user 
guide|http://spark.apache.org/docs/latest/mllib-statistics.html#summary-statistics]).
 






[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression

2018-01-29 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343155#comment-16343155
 ] 

Nick Pentreath commented on SPARK-17139:


Ok added a PR to update migration guide for {{2.3}}

> Add model summary for MultinomialLogisticRegression
> ---
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.3.0
>
>
> Add model summary to multinomial logistic regression using same interface as 
> in other ML models.






[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2018-01-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341040#comment-16341040
 ] 

Nick Pentreath commented on SPARK-21866:


[~hyukjin.kwon] [~imatiach] Was any doc or examples done in the user guide for 
this feature? Seems like it would be good to add something.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>Assignee: Ilya Matiach
>Priority: Major
>  Labels: SPIP
> Fix For: 2.3.0
>
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "d

[jira] [Resolved] (SPARK-23113) Update MLlib, GraphX websites for 2.3

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23113.

Resolution: Resolved

> Update MLlib, GraphX websites for 2.3
> -
>
> Key: SPARK-23113
> URL: https://issues.apache.org/jira/browse/SPARK-23113
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>    Assignee: Nick Pentreath
>Priority: Critical
>
> Update the sub-projects' websites to include new features in this release.






[jira] [Assigned] (SPARK-23113) Update MLlib, GraphX websites for 2.3

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23113:
--

Assignee: Nick Pentreath

> Update MLlib, GraphX websites for 2.3
> -
>
> Key: SPARK-23113
> URL: https://issues.apache.org/jira/browse/SPARK-23113
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>    Assignee: Nick Pentreath
>Priority: Critical
>
> Update the sub-projects' websites to include new features in this release.






[jira] [Commented] (SPARK-23113) Update MLlib, GraphX websites for 2.3

2018-01-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341030#comment-16341030
 ] 

Nick Pentreath commented on SPARK-23113:


No updates to the MLlib project website are required for the {{2.3}} release.

> Update MLlib, GraphX websites for 2.3
> -
>
> Key: SPARK-23113
> URL: https://issues.apache.org/jira/browse/SPARK-23113
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Update the sub-projects' websites to include new features in this release.






[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341022#comment-16341022
 ] 

Nick Pentreath commented on SPARK-23107:


[~felixcheung] I added SPARK-23231 (and listed it in SPARK-23111)

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.






[jira] [Created] (SPARK-23231) Add doc for string indexer ordering to user guide (also to RFormula guide)

2018-01-26 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-23231:
--

 Summary: Add doc for string indexer ordering to user guide (also 
to RFormula guide)
 Key: SPARK-23231
 URL: https://issues.apache.org/jira/browse/SPARK-23231
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 2.2.1, 2.3.0
Reporter: Nick Pentreath


SPARK-20619 and SPARK-20899 added an ordering parameter to {{StringIndexer}} 
and is also used internally in {{RFormula}}. Update the user guide for this.






[jira] [Commented] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs

2018-01-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341009#comment-16341009
 ] 

Nick Pentreath commented on SPARK-23110:


[~WeichenXu123] any update?

> ML 2.3 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-23110
> URL: https://issues.apache.org/jira/browse/SPARK-23110
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Blocker
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so we can make this 
> task easier in the future!






[jira] [Updated] (SPARK-22797) Add multiple column support to PySpark Bucketizer

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-22797:
---
Target Version/s: 2.3.0  (was: 2.4.0)

> Add multiple column support to PySpark Bucketizer
> -
>
> Key: SPARK-22797
> URL: https://issues.apache.org/jira/browse/SPARK-22797
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 2.3.0
>
>







[jira] [Assigned] (SPARK-22797) Add multiple column support to PySpark Bucketizer

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-22797:
--

Assignee: zhengruifeng

> Add multiple column support to PySpark Bucketizer
> -
>
> Key: SPARK-22797
> URL: https://issues.apache.org/jira/browse/SPARK-22797
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 2.3.0
>
>







[jira] [Resolved] (SPARK-22797) Add multiple column support to PySpark Bucketizer

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-22797.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19892
[https://github.com/apache/spark/pull/19892]

> Add multiple column support to PySpark Bucketizer
> -
>
> Key: SPARK-22797
> URL: https://issues.apache.org/jira/browse/SPARK-22797
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>Priority: Major
> Fix For: 2.3.0
>
>







[jira] [Resolved] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-22799.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19993
[https://github.com/apache/spark/pull/19993]

> Bucketizer should throw exception if single- and multi-column params are both 
> set
> -
>
> Key: SPARK-22799
> URL: https://issues.apache.org/jira/browse/SPARK-22799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>Assignee: Marco Gaido
>Priority: Blocker
> Fix For: 2.3.0
>
>
> See the related discussion: 
> https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049






[jira] [Assigned] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-22799:
--

Assignee: Marco Gaido

> Bucketizer should throw exception if single- and multi-column params are both 
> set
> -
>
> Key: SPARK-22799
> URL: https://issues.apache.org/jira/browse/SPARK-22799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>Assignee: Marco Gaido
>Priority: Blocker
>
> See the related discussion: 
> https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049






[jira] [Created] (SPARK-23227) Add user guide entry for collecting sub models for cross-validation classes

2018-01-26 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-23227:
--

 Summary: Add user guide entry for collecting sub models for 
cross-validation classes
 Key: SPARK-23227
 URL: https://issues.apache.org/jira/browse/SPARK-23227
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, PySpark
Affects Versions: 2.3.0
Reporter: Nick Pentreath









[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340786#comment-16340786
 ] 

Nick Pentreath commented on SPARK-23107:


[~felixcheung] have issues been created to track the documentation additions for the 
{{RFormula}} changes? I guess this won't block the release, but we should create those 
issues if they haven't been created already.

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.






[jira] [Updated] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23107:
---
Affects Version/s: 2.3.0
 Target Version/s: 2.3.0

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.






[jira] [Assigned] (SPARK-23109) ML 2.3 QA: API: Python API coverage

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23109:
--

Assignee: Bryan Cutler

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Bryan Cutler
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*






[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340783#comment-16340783
 ] 

Nick Pentreath commented on SPARK-23107:


[~yanboliang] any update on this one?

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.






[jira] [Reopened] (SPARK-23112) ML, Graph 2.3 QA: Programming guide update and migration guide

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reopened SPARK-23112:


Re-opening, as the breaking change in SPARK-17139 needs to be addressed.

> ML, Graph 2.3 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-23112
> URL: https://issues.apache.org/jira/browse/SPARK-23112
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>    Assignee: Nick Pentreath
>Priority: Critical
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")






[jira] [Updated] (SPARK-23112) ML, Graph 2.3 QA: Programming guide update and migration guide

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23112:
---
Affects Version/s: 2.3.0
 Target Version/s: 2.3.0
Fix Version/s: (was: 2.3.0)

> ML, Graph 2.3 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-23112
> URL: https://issues.apache.org/jira/browse/SPARK-23112
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>    Assignee: Nick Pentreath
>Priority: Critical
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")






[jira] [Commented] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes

2018-01-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340779#comment-16340779
 ] 

Nick Pentreath commented on SPARK-23106:


Will keep this as resolved, since it should be done now, but will follow up on 
SPARK-23112.

> ML, Graph 2.3 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-23106
> URL: https://issues.apache.org/jira/browse/SPARK-23106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Bago Amirbekian
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.






[jira] [Assigned] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23106:
--

Assignee: Bago Amirbekian

> ML, Graph 2.3 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-23106
> URL: https://issues.apache.org/jira/browse/SPARK-23106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Bago Amirbekian
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.






[jira] [Commented] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes

2018-01-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340778#comment-16340778
 ] 

Nick Pentreath commented on SPARK-23106:


I've audited all the other ML-related MiMa exclusions added by the following 
tickets and found them to be OK.
 * SPARK-21680 (private method)
 * SPARK-3181 (new method added to trait but trait is private)
 * SPARK-17139 (add {{toBinary}} method to sealed trait / private concrete 
classes)
 * SPARK-21087 (private class -> final class, but constructor is private)

Let me know if anyone sees something I didn't check.

> ML, Graph 2.3 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-23106
> URL: https://issues.apache.org/jira/browse/SPARK-23106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.






[jira] [Commented] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes

2018-01-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340735#comment-16340735
 ] 

Nick Pentreath commented on SPARK-23106:


SPARK-17139 breaks binary compatibility; I've commented there with the details. It is 
for an {{Experimental}} API, though, so it is probably fine; the migration guide will 
just need to be updated.

 

> ML, Graph 2.3 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-23106
> URL: https://issues.apache.org/jira/browse/SPARK-23106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.






[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression

2018-01-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340728#comment-16340728
 ] 

Nick Pentreath commented on SPARK-17139:


So, in terms of binary compatibility, the change itself here is overall OK, as the 
traits are sealed and the concrete implementations are private classes (or had private 
constructors in 2.2).

However, in 2.2 and earlier versions, the only way to access the binary summary 
is through a cast:

{{asInstanceOf[BinaryLogisticRegressionSummary]}}

(as can be seen in {{LogisticRegressionSummaryExample}}).
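For illustration, the 2.2-era access pattern looks roughly like this (a hypothetical 
sketch: {{training}} stands in for a DataFrame of labeled features; the real code lives 
in {{LogisticRegressionSummaryExample}}):

{code:java}
import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary, LogisticRegression}

// Sketch only: `training` is assumed to be a DataFrame with "label"/"features" columns.
val lrModel = new LogisticRegression().fit(training)

// In 2.2, the generic `summary` had to be downcast to reach
// binary-classification metrics such as areaUnderROC:
val binarySummary = lrModel.summary.asInstanceOf[BinaryLogisticRegressionSummary]
println(binarySummary.areaUnderROC)
{code}

In 2.3, {{BinaryLogisticRegressionSummary}} became a trait, which is why code compiled 
against 2.2 that performs this downcast fails at runtime ("Found interface ... but 
class was expected"), as shown next.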

That same code if run in Spark 2.3 will throw an error, as follows:

 
{code:java}
$ ./bin/spark-submit --class 
org.apache.spark.examples.ml.LogisticRegressionSummaryExample 
PATH_TO_SPARK_2.2.0/examples/jars/spark-examples_2.11-2.2.0.jar

...
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found 
interface org.apache.spark.ml.classification.BinaryLogisticRegressionSummary, 
but class was expected
at 
org.apache.spark.examples.ml.LogisticRegressionSummaryExample$.main(LogisticRegressionSummaryExample.scala:63)
at 
org.apache.spark.examples.ml.LogisticRegressionSummaryExample.main(LogisticRegressionSummaryExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala){code}
The above was run with Spark built from branch-2.3 @ 
{{c79e771f8952e6773c3a84cc617145216feddbcf}} 

So this does break binary compatibility. However, I don't really see a good way to 
avoid it, and the way it's been done cleans things up best. Since the API is marked 
{{Experimental}} we can live with this, but we will need to update SPARK-23112 
with the details if all are in agreement.

cc [~WeichenXu123] [~bago.amirbekian] [~sethah] [~josephkb] [~yanboliang]

 

> Add model summary for MultinomialLogisticRegression
> ---
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.3.0
>
>
> Add model summary to multinomial logistic regression using same interface as 
> in other ML models.






[jira] [Commented] (SPARK-23109) ML 2.3 QA: API: Python API coverage

2018-01-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340653#comment-16340653
 ] 

Nick Pentreath commented on SPARK-23109:


[~bryanc] can you add a Jira for adding {{columnSchema}} to Python?

Then, if there is nothing else here, I can resolve this ticket (note this is for 
auditing, not for fixing all the issues, so anything outstanding won't block the 
release).

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*






[jira] [Updated] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-22799:
---
Target Version/s: 2.3.0  (was: 2.4.0)

> Bucketizer should throw exception if single- and multi-column params are both 
> set
> -
>
> Key: SPARK-22799
> URL: https://issues.apache.org/jira/browse/SPARK-22799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>Priority: Major
>
> See the related discussion: 
> https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049






[jira] [Updated] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23106:
---
Affects Version/s: 2.3.0
 Target Version/s: 2.3.0

> ML, Graph 2.3 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-23106
> URL: https://issues.apache.org/jira/browse/SPARK-23106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.






[jira] [Commented] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes

2018-01-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340645#comment-16340645
 ] 

Nick Pentreath commented on SPARK-23106:


Thanks [~bago.amirbekian]. However, running MiMa alone is not enough for this 
task, since some of the PRs merged during the release cycle themselves add MiMa 
exclusions. So, to be safe, we typically also double-check the MiMa exclusions 
added for ML during the cycle, to ensure they are valid (i.e. false positives, 
most commonly due to changes made to private classes that MiMa picks up even 
though they are not part of the public API).
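
That manual audit step can be mechanized in part. A toy sketch (illustrative only — the exclusion strings below are made-up examples in the style of Spark's MimaExcludes, not real entries):

```python
# Hypothetical MiMa exclusion rules, written as plain strings in the style
# of project/MimaExcludes.scala. The audit step: pull out the spark.ml
# entries so a reviewer can re-check each one by hand against the public API.
exclusions = [
    'ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.ml.Foo.bar")',
    'ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.Baz")',
]

ml_entries = [e for e in exclusions if "org.apache.spark.ml" in e]
for entry in ml_entries:
    print(entry)  # candidates for manual review
```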

> ML, Graph 2.3 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-23106
> URL: https://issues.apache.org/jira/browse/SPARK-23106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.






[jira] [Updated] (SPARK-23109) ML 2.3 QA: API: Python API coverage

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23109:
---
Affects Version/s: 2.3.0
 Target Version/s: 2.3.0

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*
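
Part of the name-parity check described above can be sketched mechanically. A minimal illustration (hypothetical class and expected method set — not the real audit tooling):

```python
import inspect


def public_methods(cls):
    """Names of public (non-underscore) callables on a class."""
    return {n for n, m in inspect.getmembers(cls, callable)
            if not n.startswith("_")}


# Hypothetical audit: compare a Python wrapper against the method names
# expected from the Scala side (here just a hand-written set).
class PyEstimator:
    def fit(self, df): ...
    def setInputCol(self, v): ...


expected_from_scala = {"fit", "setInputCol", "setOutputCol"}
missing = expected_from_scala - public_methods(PyEstimator)
print(sorted(missing))  # methods to file to-do JIRAs for
```

This only catches name-level gaps; doc completeness and behavioral parity still need the manual review the checklist describes.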






[jira] [Assigned] (SPARK-23163) Sync Python ML API docs with Scala

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23163:
--

Assignee: Bryan Cutler

> Sync Python ML API docs with Scala
> --
>
> Key: SPARK-23163
> URL: https://issues.apache.org/jira/browse/SPARK-23163
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Trivial
> Fix For: 2.3.0
>
>
> Fix a few doc issues as reported in 2.3 ML QA SPARK-23109






[jira] [Resolved] (SPARK-23163) Sync Python ML API docs with Scala

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23163.

   Resolution: Fixed
Fix Version/s: 2.3.0

> Sync Python ML API docs with Scala
> --
>
> Key: SPARK-23163
> URL: https://issues.apache.org/jira/browse/SPARK-23163
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Trivial
> Fix For: 2.3.0
>
>
> Fix a few doc issues as reported in 2.3 ML QA SPARK-23109






[jira] [Updated] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-22799:
---
Priority: Blocker  (was: Major)

> Bucketizer should throw exception if single- and multi-column params are both 
> set
> -
>
> Key: SPARK-22799
> URL: https://issues.apache.org/jira/browse/SPARK-22799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>Priority: Blocker
>
> See the related discussion: 
> https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049






Re: [VOTE] Spark 2.3.0 (RC2)

2018-01-25 Thread Nick Pentreath
I think this has come up before (and Sean mentions it above), but the
sub-items on:

SPARK-23105 Spark MLlib, GraphX 2.3 QA umbrella

are actually marked as Blockers, but are not targeted to 2.3.0. I think
they should be, and I'm not comfortable with those not being resolved
before voting positively on the release.

So I'm -1 too for that reason.

I think most of those review items are close to done, and there is also
https://issues.apache.org/jira/browse/SPARK-22799 that I think should be in
for 2.3 (to avoid a behavior change later between 2.3.0 and 2.3.1,
especially since we'll have another RC now it seems).


On Thu, 25 Jan 2018 at 19:28 Marcelo Vanzin  wrote:

> Sorry, have to change my vote again. Hive guys ran into SPARK-23209
> and that's a regression we need to fix. I'll post a patch soon. So -1
> (although others have already -1'ed).
>
> On Wed, Jan 24, 2018 at 11:42 AM, Marcelo Vanzin 
> wrote:
> > Given that the bugs I was worried about have been dealt with, I'm
> > upgrading to +1.
> >
> > On Mon, Jan 22, 2018 at 5:09 PM, Marcelo Vanzin 
> wrote:
> >> +0
> >>
> >> Signatures check out. Code compiles, although I see the errors in [1]
> >> when untarring the source archive; perhaps we should add "use GNU tar"
> >> to the RM checklist?
> >>
> >> Also ran our internal tests and they seem happy.
> >>
> >> My concern is the list of open bugs targeted at 2.3.0 (ignoring the
> >> documentation ones). It is not long, but it seems some of those need
> >> to be looked at. It would be nice for the committers who are involved
> >> in those bugs to take a look.
> >>
> >> [1]
> https://superuser.com/questions/318809/linux-os-x-tar-incompatibility-tarballs-created-on-os-x-give-errors-when-unt
> >>
> >>
> >> On Mon, Jan 22, 2018 at 1:36 PM, Sameer Agarwal 
> wrote:
> >>> Please vote on releasing the following candidate as Apache Spark
> version
> >>> 2.3.0. The vote is open until Friday January 26, 2018 at 8:00:00 am
> UTC and
> >>> passes if a majority of at least 3 PMC +1 votes are cast.
> >>>
> >>>
> >>> [ ] +1 Release this package as Apache Spark 2.3.0
> >>>
> >>> [ ] -1 Do not release this package because ...
> >>>
> >>>
> >>> To learn more about Apache Spark, please see https://spark.apache.org/
> >>>
> >>> The tag to be voted on is v2.3.0-rc2:
> >>> https://github.com/apache/spark/tree/v2.3.0-rc2
> >>> (489ecb0ef23e5d9b705e5e5bae4fa3d871bdac91)
> >>>
> >>> List of JIRA tickets resolved in this release can be found here:
> >>> https://issues.apache.org/jira/projects/SPARK/versions/12339551
> >>>
> >>> The release files, including signatures, digests, etc. can be found at:
> >>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc2-bin/
> >>>
> >>> Release artifacts are signed with the following key:
> >>> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >>>
> >>> The staging repository for this release can be found at:
> >>>
> https://repository.apache.org/content/repositories/orgapachespark-1262/
> >>>
> >>> The documentation corresponding to this release can be found at:
> >>>
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc2-docs/_site/index.html
> >>>
> >>>
> >>> FAQ
> >>>
> >>> ===
> >>> What are the unresolved issues targeted for 2.3.0?
> >>> ===
> >>>
> >>> Please see https://s.apache.org/oXKi. At the time of writing, there
> are
> >>> currently no known release blockers.
> >>>
> >>> =
> >>> How can I help test this release?
> >>> =
> >>>
> >>> If you are a Spark user, you can help us test this release by taking an
> >>> existing Spark workload and running it on this release candidate, then
> >>> reporting any regressions.
> >>>
> >>> If you're working in PySpark you can set up a virtual env and install
> the
> >>> current RC and see if anything important breaks; in Java/Scala you
> can
> >>> add the staging repository to your project's resolvers and test with
> the RC
> >>> (make sure to clean up the artifact cache before/after so you don't
> end up
> >>> building with an out-of-date RC going forward).
> >>>
> >>> ===
> >>> What should happen to JIRA tickets still targeting 2.3.0?
> >>> ===
> >>>
> >>> Committers should look at those and triage. Extremely important bug
> fixes,
> >>> documentation, and API tweaks that impact compatibility should be
> worked on
> >>> immediately. Everything else please retarget to 2.3.1 or 2.4.0 as
> >>> appropriate.
> >>>
> >>> ===
> >>> Why is my bug not fixed?
> >>> ===
> >>>
> >>> In order to make timely releases, we will typically not hold the
> release
> >>> unless the bug in question is a regression from 2.2.0. That being
> said, if
> >>> there is something which is a regression from 2.2.0 and has not been
> >>> correctly targeted 

[jira] [Assigned] (SPARK-23112) ML, Graph 2.3 QA: Programming guide update and migration guide

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23112:
--

Assignee: Nick Pentreath

> ML, Graph 2.3 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-23112
> URL: https://issues.apache.org/jira/browse/SPARK-23112
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>    Assignee: Nick Pentreath
>Priority: Critical
> Fix For: 2.3.0
>
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")






[jira] [Resolved] (SPARK-23112) ML, Graph 2.3 QA: Programming guide update and migration guide

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23112.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20363
[https://github.com/apache/spark/pull/20363]

> ML, Graph 2.3 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-23112
> URL: https://issues.apache.org/jira/browse/SPARK-23112
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>    Assignee: Nick Pentreath
>Priority: Critical
> Fix For: 2.3.0
>
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")






[jira] [Resolved] (SPARK-22735) Add VectorSizeHint to ML features documentation

2018-01-24 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-22735.

   Resolution: Fixed
Fix Version/s: 2.3.0

> Add VectorSizeHint to ML features documentation
> ---
>
> Key: SPARK-22735
> URL: https://issues.apache.org/jira/browse/SPARK-22735
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Major
> Fix For: 2.3.0
>
>







[jira] [Commented] (SPARK-23112) ML, Graph 2.3 QA: Programming guide update and migration guide

2018-01-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335821#comment-16335821
 ] 

Nick Pentreath commented on SPARK-23112:


{{OneHotEncoder}} is the only deprecation I can see - but let me know if I 
missed anything.

> ML, Graph 2.3 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-23112
> URL: https://issues.apache.org/jira/browse/SPARK-23112
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")





