[jira] [Resolved] (SPARK-25412) FeatureHasher would change the value of output feature

2018-09-13 Thread Nick Pentreath (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-25412.

Resolution: Not A Bug

> FeatureHasher would change the value of output feature
> --
>
> Key: SPARK-25412
> URL: https://issues.apache.org/jira/browse/SPARK-25412
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Vincent
>Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo on
> the hashed value is used to determine the vector index, and it is suggested
> to use a large integer value for the numFeatures parameter.
> We found several issues with the current implementation:
>  # The feature name cannot be recovered from its index after the
> FeatureHasher transform, for example when getting feature importances from
> decision tree training that follows a FeatureHasher.
>  # When indices collide, which is very likely to happen especially when
> 'numFeatures' is relatively small, the feature value is changed to a new
> value (the sum of the current and old values).
>  # To avoid collisions, 'numFeatures' must be set to a large number, but a
> highly sparse vector increases the computational cost of model training.






[jira] [Commented] (SPARK-25412) FeatureHasher would change the value of output feature

2018-09-13 Thread Nick Pentreath (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613160#comment-16613160
 ] 

Nick Pentreath commented on SPARK-25412:


(1) is by design. Feature hashing does not store the exact mapping from feature
values to vector indices, so it is a one-way transform. Hashing gives you speed
and requires almost no memory, but you give up the reverse mapping and you have 
the potential for hash collisions.

(2) is again by design, for now. There are ways to also determine the sign of the
feature value as part of a hash function, so that in expectation the collisions
zero each other out. This may be added in future work.

The impact of hash collisions can be reduced by increasing the {{numFeatures}} 
parameter. The default is probably reasonable for small to medium feature 
dimensions but should probably be increased when working with very 
high-cardinality features.

 

I don't think this can be classed as a bug, as these are all design decisions and
tradeoffs of using feature hashing.
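
For illustration, a minimal sketch (not from the original report; the DataFrame
{{df}} and its column names are assumed) of reducing collisions by raising
{{numFeatures}}:

{code:scala}
import org.apache.spark.ml.feature.FeatureHasher

// Hash all input columns into a single sparse feature vector.
// Raising numFeatures above the default of 2^18 lowers the collision
// probability at the cost of a larger (but still sparse) output dimension.
val hasher = new FeatureHasher()
  .setInputCols("real", "bool", "stringNum")   // assumed column names
  .setOutputCol("features")
  .setNumFeatures(1 << 20)

val hashed = hasher.transform(df)
hashed.select("features").show(truncate = false)
{code}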

> FeatureHasher would change the value of output feature
> --
>
> Key: SPARK-25412
> URL: https://issues.apache.org/jira/browse/SPARK-25412
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Vincent
>Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo on
> the hashed value is used to determine the vector index, and it is suggested
> to use a large integer value for the numFeatures parameter.
> We found several issues with the current implementation:
>  # The feature name cannot be recovered from its index after the
> FeatureHasher transform, for example when getting feature importances from
> decision tree training that follows a FeatureHasher.
>  # When indices collide, which is very likely to happen especially when
> 'numFeatures' is relatively small, the feature value is changed to a new
> value (the sum of the current and old values).
>  # To avoid collisions, 'numFeatures' must be set to a large number, but a
> highly sparse vector increases the computational cost of model training.






[jira] [Commented] (SPARK-24467) VectorAssemblerEstimator

2018-06-19 Thread Nick Pentreath (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516861#comment-16516861
 ] 

Nick Pentreath commented on SPARK-24467:


One option is to do the same as we did for the one hot encoder: we could create a 
new Estimator/Model pair, and deprecate the old one, for 2.4.0. Then for 3.0, 
we could remove the old one.
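
For reference, a minimal sketch of the Estimator/Model pattern the one hot encoder
followed in 2.3 (the DataFrame {{df}} and its "category" column are assumptions):

{code:scala}
import org.apache.spark.ml.feature.OneHotEncoderEstimator

// The estimator learns the category sizes during fit and returns a model that
// performs the actual encoding; the old OneHotEncoder transformer was
// deprecated alongside it in 2.3.
val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("category"))
  .setOutputCols(Array("categoryVec"))

val model = encoder.fit(df)        // OneHotEncoderModel
val encoded = model.transform(df)
{code}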

> VectorAssemblerEstimator
> 
>
> Key: SPARK-24467
> URL: https://issues.apache.org/jira/browse/SPARK-24467
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> In [SPARK-22346], I believe I made a wrong API decision: I recommended added 
> `VectorSizeHint` instead of making `VectorAssembler` into an Estimator since 
> I thought the latter option would break most workflows.  However, I should 
> have proposed:
> * Add a Param to VectorAssembler for specifying the sizes of Vectors in the 
> inputCols.  This Param can be optional.  If not given, then VectorAssembler 
> will behave as it does now.  If given, then VectorAssembler can use that info 
> instead of figuring out the Vector sizes via metadata or examining Rows in 
> the data (though it could do consistency checks).
> * Add a VectorAssemblerEstimator which gets the Vector lengths from data and 
> produces a VectorAssembler with the vector lengths Param specified.
> This will not break existing workflows.  Migrating to 
> VectorAssemblerEstimator will be easier than adding VectorSizeHint since it 
> will not require users to manually input Vector lengths.
> Note: Even with this Estimator, VectorSizeHint might prove useful for other 
> things in the future which require vector length metadata, so we could 
> consider keeping it rather than deprecating it.
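
A purely hypothetical sketch of the proposal quoted above (the size Param and the
estimator do not exist in Spark; their names are illustrative only, and only the
uncommented lines use the real VectorAssembler API):

{code:scala}
import org.apache.spark.ml.feature.VectorAssembler

// Current behavior: vector sizes come from column metadata or from
// inspecting rows of the data.
val assembler = new VectorAssembler()
  .setInputCols(Array("vec1", "vec2", "scalar"))
  .setOutputCol("features")

// Proposed optional Param (hypothetical, not an actual Spark API):
//   assembler.setInputSizes(Array(10, 5, 1))
//
// Proposed companion estimator (hypothetical) that learns the sizes from the
// data and returns a VectorAssembler with that Param already set:
//   val configured = new VectorAssemblerEstimator()
//     .setInputCols(Array("vec1", "vec2", "scalar"))
//     .setOutputCol("features")
//     .fit(df)
{code}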






[jira] [Comment Edited] (SPARK-24467) VectorAssemblerEstimator

2018-06-08 Thread Nick Pentreath (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506334#comment-16506334
 ] 

Nick Pentreath edited comment on SPARK-24467 at 6/8/18 5:59 PM:


Yeah the estimator would return a {{Model}} from {{fit}} right? So I don't 
think a new estimator could return the existing {{VectorAssembler}} but would 
probably need to return a new {{VectorAssemblerModel}}. Though perhaps the
existing one can be made a Model without breaking things.


was (Author: mlnick):
Yeah the estimator would return a {{Model}} from {{fit}} right? So I don't 
think a new estimator could return the existing {{VectorAssembler}} but would 
probably need to return a new {{VectorAssemblerModel}}

> VectorAssemblerEstimator
> 
>
> Key: SPARK-24467
> URL: https://issues.apache.org/jira/browse/SPARK-24467
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> In [SPARK-22346], I believe I made a wrong API decision: I recommended added 
> `VectorSizeHint` instead of making `VectorAssembler` into an Estimator since 
> I thought the latter option would break most workflows.  However, I should 
> have proposed:
> * Add a Param to VectorAssembler for specifying the sizes of Vectors in the 
> inputCols.  This Param can be optional.  If not given, then VectorAssembler 
> will behave as it does now.  If given, then VectorAssembler can use that info 
> instead of figuring out the Vector sizes via metadata or examining Rows in 
> the data (though it could do consistency checks).
> * Add a VectorAssemblerEstimator which gets the Vector lengths from data and 
> produces a VectorAssembler with the vector lengths Param specified.
> This will not break existing workflows.  Migrating to 
> VectorAssemblerEstimator will be easier than adding VectorSizeHint since it 
> will not require users to manually input Vector lengths.
> Note: Even with this Estimator, VectorSizeHint might prove useful for other 
> things in the future which require vector length metadata, so we could 
> consider keeping it rather than deprecating it.






[jira] [Commented] (SPARK-24467) VectorAssemblerEstimator

2018-06-08 Thread Nick Pentreath (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506334#comment-16506334
 ] 

Nick Pentreath commented on SPARK-24467:


Yeah the estimator would return a {{Model}} from {{fit}} right? So I don't 
think a new estimator could return the existing {{VectorAssembler}} but would 
probably need to return a new {{VectorAssemblerModel}}

> VectorAssemblerEstimator
> 
>
> Key: SPARK-24467
> URL: https://issues.apache.org/jira/browse/SPARK-24467
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> In [SPARK-22346], I believe I made a wrong API decision: I recommended added 
> `VectorSizeHint` instead of making `VectorAssembler` into an Estimator since 
> I thought the latter option would break most workflows.  However, I should 
> have proposed:
> * Add a Param to VectorAssembler for specifying the sizes of Vectors in the 
> inputCols.  This Param can be optional.  If not given, then VectorAssembler 
> will behave as it does now.  If given, then VectorAssembler can use that info 
> instead of figuring out the Vector sizes via metadata or examining Rows in 
> the data (though it could do consistency checks).
> * Add a VectorAssemblerEstimator which gets the Vector lengths from data and 
> produces a VectorAssembler with the vector lengths Param specified.
> This will not break existing workflows.  Migrating to 
> VectorAssemblerEstimator will be easier than adding VectorSizeHint since it 
> will not require users to manually input Vector lengths.
> Note: Even with this Estimator, VectorSizeHint might prove useful for other 
> things in the future which require vector length metadata, so we could 
> consider keeping it rather than deprecating it.






[jira] [Commented] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-02-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366809#comment-16366809
 ] 

Nick Pentreath commented on SPARK-23265:


Thanks for the ping - yes it adds more detailed checking of the exclusive 
params and would introduce an error being thrown in certain additional 
situations (specifically {{numBucketsArray}} set for single-column transform, 
{{numBuckets}} and {{numBucketsArray}} set for multi-column transform, 
mismatched length of {{numBucketsArray}} with input/output columns for 
multi-column transform).

I reviewed the PR and it LGTM, so as I said there we can merge this now before
RC4 gets cut.
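
To make those combinations concrete, a minimal sketch (the DataFrame {{df}} and the
column names are assumptions, and the code is not taken from the PR):

{code:scala}
import org.apache.spark.ml.feature.QuantileDiscretizer

// Valid multi-column usage: one bucket count per input/output column pair.
val discretizer = new QuantileDiscretizer()
  .setInputCols(Array("c1", "c2"))
  .setOutputCols(Array("c1Binned", "c2Binned"))
  .setNumBucketsArray(Array(5, 10))

val bucketizer = discretizer.fit(df)   // fitting produces a Bucketizer model
val binned = bucketizer.transform(df)

// Combinations the updated checks reject:
//  - setInputCol(...) combined with setNumBucketsArray(...)   (array param on single column)
//  - setInputCols(...) with both setNumBuckets(...) and setNumBucketsArray(...)
//  - a numBucketsArray whose length differs from inputCols/outputCols
{code}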

> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}) an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
> for this transformer, it is acceptable to set the single-column param for 
> \{{numBuckets}} when transforming multiple columns, since that is then 
> applied to all columns.






[jira] [Updated] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-02-16 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23265:
---
Description: 
SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
both single- and multi-column params are set (specifically {{inputCol}} / 
{{inputCols}}) an error is thrown.

However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
for this transformer, it is acceptable to set the single-column param for 
\{{numBuckets}} when transforming multiple columns, since that is then applied 
to all columns.

  was:
SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
both single- and multi-column params are set (specifically {{inputCol}} / 
{{inputCols}}) an error is thrown.

However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
for this transformer, it is acceptable to set the single-column param for 
{{numBuckets }}when transforming multiple columns, since that is then applied 
to all columns.


> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}) an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
> for this transformer, it is acceptable to set the single-column param for 
> \{{numBuckets}} when transforming multiple columns, since that is then 
> applied to all columns.






[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib

2018-02-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366744#comment-16366744
 ] 

Nick Pentreath commented on SPARK-23437:


It sounds interesting - however the standard practice is that new algorithms 
should probably be released as a 3rd party Spark package. If they become 
widely-used then there is a stronger argument for integration into MLlib.

See [http://spark.apache.org/contributing.html] under the MLlib section for 
more details. 

> [ML] Distributed Gaussian Process Regression for MLlib
> --
>
> Key: SPARK-23437
> URL: https://issues.apache.org/jira/browse/SPARK-23437
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.2.1
>Reporter: Valeriy Avanesov
>Priority: Major
>
> Gaussian Process Regression (GP) is a well-known black-box non-linear
> regression approach [1]. For years the approach remained inapplicable to
> large samples due to its cubic computational complexity; however, more recent
> techniques (Sparse GP) reduce this to linear complexity. The field
> continues to attract the interest of researchers – several papers devoted to
> GP were presented at NIPS 2017.
> Unfortunately, the non-parametric regression techniques shipped with MLlib are
> restricted to tree-based approaches.
> I propose to create and include an implementation (which I am going to work
> on) of the so-called robust Bayesian Committee Machine proposed and
> investigated in [2].
> [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian 
> Processes for Machine Learning (Adaptive Computation and Machine Learning)_. 
> The MIT Press.
> [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian 
> processes. In _Proceedings of the 32nd International Conference on 
> International Conference on Machine Learning - Volume 37_ (ICML'15), Francis 
> Bach and David Blei (Eds.), Vol. 37. JMLR.org 1481-1490.
>  






[jira] [Commented] (SPARK-23377) Bucketizer with multiple columns persistence bug

2018-02-13 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362182#comment-16362182
 ] 

Nick Pentreath commented on SPARK-23377:


Should this be a blocker for 2.3? I think so, since it should really be fixed
before release.

> Bucketizer with multiple columns persistence bug
> 
>
> Key: SPARK-23377
> URL: https://issues.apache.org/jira/browse/SPARK-23377
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> A Bucketizer with multiple input/output columns gets "inputCol" set to the
> default value on write -> read, which causes it to throw an error on
> transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, 
> Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has 
> the inputCols Param set for multi-column transform. The following Params are 
> not applicable and should not be set: outputCol.
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>   at 
> line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17)
> {code}






[jira] [Commented] (SPARK-14047) GBT improvement umbrella

2018-02-07 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355216#comment-16355216
 ] 

Nick Pentreath commented on SPARK-14047:


SPARK-12375 should fix that? Can you check it against the 2.3 RC (or
branch-2.3)? If not, could you provide some code to reproduce the error?

> GBT improvement umbrella
> 
>
> Key: SPARK-14047
> URL: https://issues.apache.org/jira/browse/SPARK-14047
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is an umbrella for improvements to learning Gradient Boosted Trees: 
> GBTClassifier, GBTRegressor.
> Note: Aspects of GBTs which are related to individual trees should be listed 
> under [SPARK-14045].






[jira] [Resolved] (SPARK-23105) Spark MLlib, GraphX 2.3 QA umbrella

2018-02-01 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23105.

   Resolution: Resolved
Fix Version/s: 2.3.0

> Spark MLlib, GraphX 2.3 QA umbrella
> ---
>
> Key: SPARK-23105
> URL: https://issues.apache.org/jira/browse/SPARK-23105
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.3.0
>
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX. *SparkR is separate: SPARK-23114.*
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
>  * Check binary API compatibility for Scala/Java
>  * Audit new public APIs (from the generated html doc)
>  ** Scala
>  ** Java compatibility
>  ** Python coverage
>  * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
>  * Performance tests
> h2. Documentation and example code
>  * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
>  * Update Programming Guide
>  * Update website






[jira] [Resolved] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs

2018-02-01 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23110.

   Resolution: Resolved
Fix Version/s: 2.3.0

> ML 2.3 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-23110
> URL: https://issues.apache.org/jira/browse/SPARK-23110
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Blocker
> Fix For: 2.3.0
>
> Attachments: 1_process_script.sh, added_ml_class, 
> different_methods_in_ML.diff
>
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so we can make this 
> task easier in the future!






[jira] [Resolved] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-02-01 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23107.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20459
[https://github.com/apache/spark/pull/20459]

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.3.0
>
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA 
> issue (SPARK-23111 for {{2.3}})






[jira] [Commented] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe

2018-02-01 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348223#comment-16348223
 ] 

Nick Pentreath commented on SPARK-23290:


cc [~bryanc]

> inadvertent change in handling of DateType when converting to pandas dataframe
> --
>
> Key: SPARK-23290
> URL: https://issues.apache.org/jira/browse/SPARK-23290
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Andre Menck
>Priority: Major
>
> In [this 
> PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968]
>  there was a change in how `DateType` is being returned to users (line 1968 
> in dataframe.py). This can cause client code to fail, as in the following 
> example from a python terminal:
> {code:python}
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> 02015-01-01
> Name: date, dtype: object
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'] = pd.to_datetime(pdf['date'])
> >>> pdf.dtypes
> datedatetime64[ns]
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2355, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/_libs/src/inference.pyx", line 1574, in 
> pandas._libs.lib.map_infer
>   File "", line 1, in 
> TypeError: strptime() argument 1 must be string, not Timestamp
> >>> 
> {code}
> Above we show both the old behavior (returning an "object" col) and the new 
> behavior (returning a datetime column). Since there may be user code relying 
> on the old behavior, I'd suggest reverting this specific part of this change. 
> Also note that the NOTE on the docstring for the "_to_corrected_pandas_type" 
> seems to be off, referring to the old behavior and not the current one.






[jira] [Comment Edited] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs

2018-01-31 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346645#comment-16346645
 ] 

Nick Pentreath edited comment on SPARK-23110 at 1/31/18 11:34 AM:
--

Took a quick look through the diff. Apart from one issue all looks ok.

I did pick up that [PR 19020|https://github.com/apache/spark/pull/19020] made 
the existing constructor for {{LinearRegressionModel}} public - I assume this 
was not intended cc [~yanboliang]? 

 
{code:java}
def this(uid: String, coefficients: Vector, intercept: Double) =
  this(uid, coefficients, intercept, 1.0){code}


was (Author: mlnick):
Took a quick look through the diff. Apart from one issue all looks ok.

I did pick up that PR X made the existing constructor for 
{{LinearRegressionModel}} public - I assume this was not intended cc 
[~yanboliang]? 

 
{code:java}
def this(uid: String, coefficients: Vector, intercept: Double) =
  this(uid, coefficients, intercept, 1.0){code}

> ML 2.3 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-23110
> URL: https://issues.apache.org/jira/browse/SPARK-23110
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Blocker
> Attachments: 1_process_script.sh, added_ml_class, 
> different_methods_in_ML.diff
>
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so we can make this 
> task easier in the future!






[jira] [Comment Edited] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs

2018-01-31 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346645#comment-16346645
 ] 

Nick Pentreath edited comment on SPARK-23110 at 1/31/18 11:32 AM:
--

Took a quick look through the diff. Apart from one issue all looks ok.

I did pick up that PR X made the existing constructor for 
{{LinearRegressionModel}} public - I assume this was not intended cc 
[~yanboliang]? 

 
{code:java}
def this(uid: String, coefficients: Vector, intercept: Double) =
  this(uid, coefficients, intercept, 1.0){code}


was (Author: mlnick):
Took a quick look through the diff. 

I did pick up that PR X made the existing constructor for 
{{LinearRegressionModel}} public - I assume this was not intended cc 
[~yanboliang]? 

 
{code:java}
def this(uid: String, coefficients: Vector, intercept: Double) =
  this(uid, coefficients, intercept, 1.0){code}

> ML 2.3 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-23110
> URL: https://issues.apache.org/jira/browse/SPARK-23110
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Blocker
> Attachments: 1_process_script.sh, added_ml_class, 
> different_methods_in_ML.diff
>
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so we can make this 
> task easier in the future!






[jira] [Commented] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs

2018-01-31 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346645#comment-16346645
 ] 

Nick Pentreath commented on SPARK-23110:


Took a quick look through the diff. 

I did pick up that PR X made the existing constructor for 
{{LinearRegressionModel}} public - I assume this was not intended cc 
[~yanboliang]? 

 
{code:java}
def this(uid: String, coefficients: Vector, intercept: Double) =
  this(uid, coefficients, intercept, 1.0){code}

> ML 2.3 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-23110
> URL: https://issues.apache.org/jira/browse/SPARK-23110
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Blocker
> Attachments: 1_process_script.sh, added_ml_class, 
> different_methods_in_ML.diff
>
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so we can make this 
> task easier in the future!






[jira] [Commented] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs

2018-01-31 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346573#comment-16346573
 ] 

Nick Pentreath commented on SPARK-23110:


I checked added classes from {{added_ml_class}}, all seem fine:
 * logistic summaries have related Java examples that were tested in 
[PR20332|https://github.com/apache/spark/pull/20332]
 * clustering evaluator has related Java example (the other class is private)
 * feature hasher has related Java example
 * new OHE has Java example
 * vector size hint has Java example
 * image schema public method sigs seem fine (but no Java example as yet)
 * new params fine
 * summarizer public methods seem fine (the varargs {{metrics}} generates a 
Java-friendly forwarder; see the sketch after this list) - though no Java example as yet

The rest are private.
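
As an aside, a minimal sketch of the varargs {{metrics}} builder referenced in that
list (the DataFrame {{df}} and its "features" vector column are assumptions):

{code:scala}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.functions.col

// metrics(...) takes String* varargs; as noted above, a Java-friendly
// forwarder is generated for it.
val summarized = df.select(
  Summarizer.metrics("mean", "variance").summary(col("features")).as("summary"))
summarized.select("summary.mean", "summary.variance").show()
{code}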

> ML 2.3 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-23110
> URL: https://issues.apache.org/jira/browse/SPARK-23110
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Blocker
> Attachments: 1_process_script.sh, added_ml_class, 
> different_methods_in_ML.diff
>
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so we can make this 
> task easier in the future!






[jira] [Resolved] (SPARK-23111) ML, Graph 2.3 QA: Update user guide for new features & APIs

2018-01-31 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23111.

   Resolution: Resolved
Fix Version/s: 2.3.0

> ML, Graph 2.3 QA: Update user guide for new features & APIs
> ---
>
> Key: SPARK-23111
> URL: https://issues.apache.org/jira/browse/SPARK-23111
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Critical
> Fix For: 2.3.0
>
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> For MLlib:
> * This task does not include major reorganizations for the programming guide.
> * We should now begin copying algorithm details from the spark.mllib guide to 
> spark.ml as needed, rather than just linking back to the corresponding 
> algorithms in the spark.mllib user guide.
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.






[jira] [Commented] (SPARK-23111) ML, Graph 2.3 QA: Update user guide for new features & APIs

2018-01-31 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346442#comment-16346442
 ] 

Nick Pentreath commented on SPARK-23111:


Went through all the new features and listed the Jira tickets here. I think I 
got everything, but of course let me know if I missed any items. Resolving 
this. 

> ML, Graph 2.3 QA: Update user guide for new features & APIs
> ---
>
> Key: SPARK-23111
> URL: https://issues.apache.org/jira/browse/SPARK-23111
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Critical
> Fix For: 2.3.0
>
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> For MLlib:
> * This task does not include major reorganizations for the programming guide.
> * We should now begin copying algorithm details from the spark.mllib guide to 
> spark.ml as needed, rather than just linking back to the corresponding 
> algorithms in the spark.mllib user guide.
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.






[jira] [Assigned] (SPARK-23111) ML, Graph 2.3 QA: Update user guide for new features & APIs

2018-01-31 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23111:
--

Assignee: Nick Pentreath

> ML, Graph 2.3 QA: Update user guide for new features & APIs
> ---
>
> Key: SPARK-23111
> URL: https://issues.apache.org/jira/browse/SPARK-23111
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Critical
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> For MLlib:
> * This task does not include major reorganizations for the programming guide.
> * We should now begin copying algorithm details from the spark.mllib guide to 
> spark.ml as needed, rather than just linking back to the corresponding 
> algorithms in the spark.mllib user guide.
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.






[jira] [Resolved] (SPARK-23112) ML, Graph 2.3 QA: Programming guide update and migration guide

2018-01-31 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23112.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20421
[https://github.com/apache/spark/pull/20421]

> ML, Graph 2.3 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-23112
> URL: https://issues.apache.org/jira/browse/SPARK-23112
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Critical
> Fix For: 2.3.0
>
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")






[jira] [Commented] (SPARK-23154) Document backwards compatibility guarantees for ML persistence

2018-01-30 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344885#comment-16344885
 ] 

Nick Pentreath commented on SPARK-23154:


Where do we intend to put this note? In
[http://spark.apache.org/docs/latest/ml-pipeline.html#saving-and-loading-pipelines]?
Or as a new section in [http://spark.apache.org/docs/latest/ml-guide.html]?

> Document backwards compatibility guarantees for ML persistence
> --
>
> Key: SPARK-23154
> URL: https://issues.apache.org/jira/browse/SPARK-23154
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> We have (as far as I know) maintained backwards compatibility for ML 
> persistence, but this is not documented anywhere.  I'd like us to document it 
> (for spark.ml, not for spark.mllib).
> I'd recommend something like:
> {quote}
> In general, MLlib maintains backwards compatibility for ML persistence.  
> I.e., if you save an ML model or Pipeline in one version of Spark, then you 
> should be able to load it back and use it in a future version of Spark.  
> However, there are rare exceptions, described below.
> Model persistence: Is a model or Pipeline saved using Apache Spark ML 
> persistence in Spark version X loadable by Spark version Y?
> * Major versions: No guarantees, but best-effort.
> * Minor and patch versions: Yes; these are backwards compatible.
> * Note about the format: There are no guarantees for a stable persistence 
> format, but model loading itself is designed to be backwards compatible.
> Model behavior: Does a model or Pipeline in Spark version X behave 
> identically in Spark version Y?
> * Major versions: No guarantees, but best-effort.
> * Minor and patch versions: Identical behavior, except for bug fixes.
> For both model persistence and model behavior, any breaking changes across a 
> minor version or patch version are reported in the Spark version release 
> notes. If a breakage is not reported in release notes, then it should be 
> treated as a bug to be fixed.
> {quote}
> How does this sound?
> Note: We unfortunately don't have tests for backwards compatibility (which 
> has technical hurdles and can be discussed in [SPARK-15573]).  However, we 
> have made efforts to maintain it during PR review and Spark release QA, and 
> most users expect it.






[jira] [Updated] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23265:
---
Description: 
SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
both single- and multi-column params are set (specifically {{inputCol}} / 
{{inputCols}}) an error is thrown.

However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
for this transformer, it is acceptable to set the single-column param for 
{{numBuckets }}when transforming multiple columns, since that is then applied 
to all columns.

  was:
SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
both single- and multi-column params are set (specifically {{inputCol}} / 
{{inputCols}}) an error is thrown.

However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
for this transformer, it is acceptable to set the single-column param for 
{{numBuckets}}, since that is then applied to all columns.


> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}) an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
> for this transformer, it is acceptable to set the single-column param for 
> {{numBuckets }}when transforming multiple columns, since that is then applied 
> to all columns.






[jira] [Commented] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-01-29 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344604#comment-16344604
 ] 

Nick Pentreath commented on SPARK-23265:


cc [~huaxing] 

> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}) an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match.






[jira] [Updated] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23265:
---
Issue Type: Improvement  (was: Documentation)

> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}) an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match.






[jira] [Created] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-01-29 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-23265:
--

 Summary: Update multi-column error handling logic in 
QuantileDiscretizer
 Key: SPARK-23265
 URL: https://issues.apache.org/jira/browse/SPARK-23265
 Project: Spark
  Issue Type: Documentation
  Components: ML
Affects Versions: 2.3.0
Reporter: Nick Pentreath


SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
both single- and multi-column params are set (specifically {{inputCol}} / 
{{inputCols}}) an error is thrown.

However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
The logic for {{QuantileDiscretizer}} should be updated to match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23265:
---
Description: 
SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
both single- and multi-column params are set (specifically {{inputCol}} / 
{{inputCols}}) an error is thrown.

However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
for this transformer, it is acceptable to set the single-column param for 
{{numBuckets}}, since that is then applied to all columns.

  was:
SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
both single- and multi-column params are set (specifically {{inputCol}} / 
{{inputCols}}) an error is thrown.

However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
The logic for {{QuantileDiscretizer}} should be updated to match.


> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}) an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
> for this transformer, it is acceptable to set the single-column param for 
> {{numBuckets}}, since that is then applied to all columns.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23138) Add user guide example for multiclass logistic regression summary

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23138.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20332
[https://github.com/apache/spark/pull/20332]

> Add user guide example for multiclass logistic regression summary
> -
>
> Key: SPARK-23138
> URL: https://issues.apache.org/jira/browse/SPARK-23138
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
> Fix For: 2.3.0
>
>
> We haven't updated the user guide to reflect the multiclass logistic 
> regression summary added in SPARK-17139.
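For reference, the kind of snippet such a guide entry could show (a sketch against the 2.3 API, toy data, spark-shell style; the exact metric names, e.g. {{accuracy}} and {{falsePositiveRateByLabel}}, should be double-checked against the final API):

{code:scala}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

// Tiny three-class toy set (assumes the spark-shell `spark` session).
val training = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 1.1)),
  (1.0, Vectors.dense(2.0, 1.0)),
  (2.0, Vectors.dense(3.0, 0.1)),
  (0.0, Vectors.dense(0.1, 1.2)),
  (1.0, Vectors.dense(2.1, 1.1)),
  (2.0, Vectors.dense(3.1, 0.2))
)).toDF("label", "features")

val model = new LogisticRegression().setMaxIter(10).fit(training)

// The training summary now also covers the multiclass case.
val summary = model.summary
println(s"accuracy = ${summary.accuracy}")
println(s"weightedPrecision = ${summary.weightedPrecision}")
summary.falsePositiveRateByLabel.zipWithIndex.foreach { case (fpr, label) =>
  println(s"label $label: FPR = $fpr")
}
{code}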



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23138) Add user guide example for multiclass logistic regression summary

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23138:
--

Assignee: Seth Hendrickson

> Add user guide example for multiclass logistic regression summary
> -
>
> Key: SPARK-23138
> URL: https://issues.apache.org/jira/browse/SPARK-23138
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
> Fix For: 2.3.0
>
>
> We haven't updated the user guide to reflect the multiclass logistic 
> regression summary added in SPARK-17139.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23108:
--

Assignee: Nick Pentreath

> ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-23108
> URL: https://issues.apache.org/jira/browse/SPARK-23108
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Blocker
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit

2018-01-29 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343278#comment-16343278
 ] 

Nick Pentreath edited comment on SPARK-23108 at 1/29/18 12:14 PM:
--

Went through the {{Experimental}} APIs; there could be a case for:
 * {{Regression / Binary / Multiclass}} evaluators as they've been around for a 
long time.
 * Linear regression summary (since {{1.5.0}}).
 * {{AFTSurvivalRegression}} (since {{1.6.0}}).

I think at this late stage we should not open up anything, unless anyone feels 
very strongly? 


was (Author: mlnick):
I think at this late stage we should not open up anything, unless anyone feels 
very strongly? 

> ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-23108
> URL: https://issues.apache.org/jira/browse/SPARK-23108
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23108.

   Resolution: Resolved
Fix Version/s: 2.3.0

> ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-23108
> URL: https://issues.apache.org/jira/browse/SPARK-23108
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Blocker
> Fix For: 2.3.0
>
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit

2018-01-29 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343290#comment-16343290
 ] 

Nick Pentreath commented on SPARK-23108:


Also checked the ml {{DeveloperApi}} annotations; nothing to graduate there, I would say.

> ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-23108
> URL: https://issues.apache.org/jira/browse/SPARK-23108
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Blocker
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit

2018-01-29 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343278#comment-16343278
 ] 

Nick Pentreath commented on SPARK-23108:


I think at this late stage we should not open up anything, unless anyone feels 
very strongly? 

> ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-23108
> URL: https://issues.apache.org/jira/browse/SPARK-23108
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23109) ML 2.3 QA: API: Python API coverage

2018-01-29 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343276#comment-16343276
 ] 

Nick Pentreath commented on SPARK-23109:


Created SPARK-23256 to track {{columnSchema}} in Python API.

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Bryan Cutler
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23256) Add columnSchema method to PySpark image reader

2018-01-29 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-23256:
--

 Summary: Add columnSchema method to PySpark image reader
 Key: SPARK-23256
 URL: https://issues.apache.org/jira/browse/SPARK-23256
 Project: Spark
  Issue Type: Documentation
  Components: ML, PySpark
Affects Versions: 2.3.0
Reporter: Nick Pentreath


SPARK-21866 added support for reading image data into a DataFrame. The PySpark 
API is missing the {{columnSchema}} method that exists in the Scala API.
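
For reference, the Scala side looks roughly like this (a sketch, assuming the 2.3 {{ImageSchema}} API):

{code:scala}
import org.apache.spark.ml.image.ImageSchema

// Schema of the image struct itself (origin, height, width, nChannels, mode, data).
println(ImageSchema.columnSchema.treeString)

// Schema of a DataFrame of images: a single "image" column of that struct type.
println(ImageSchema.imageSchema.treeString)
{code}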



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23109) ML 2.3 QA: API: Python API coverage

2018-01-29 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343269#comment-16343269
 ] 

Nick Pentreath commented on SPARK-23109:


So [~bryanc] I think this is done then? Can you confirm?

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Bryan Cutler
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2018-01-29 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343266#comment-16343266
 ] 

Nick Pentreath commented on SPARK-21866:


Ok, added SPARK-23255 to track user guide additions

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>Assignee: Ilya Matiach
>Priority: Major
>  Labels: SPIP
> Fix For: 2.3.0
>
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
>  * BigDL
>  * DeepLearning4J
>  * Deep Learning Pipelines
>  * MMLSpark
>  * TensorFlow (Spark connector)
>  * TensorFlowOnSpark
>  * TensorFrames
>  * Thunder
> h2. Goals:
>  * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
>  * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
>  * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
>  * the total size of an image should be restricted to less than 2GB (roughly)
>  * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
>  * specialized formats used in meteorology, the medical field, etc. are not 
> supported
>  * this format is specialized to images and does not attempt to solve the 
> more general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
>  {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
>  * StructField("mode", StringType(), False),
>  ** The exact representation of the data.
>  ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
>  ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order is RGB (3 channels) and BGRA (4 

[jira] [Created] (SPARK-23255) Add user guide and examples for DataFrame image reading functions

2018-01-29 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-23255:
--

 Summary: Add user guide and examples for DataFrame image reading 
functions
 Key: SPARK-23255
 URL: https://issues.apache.org/jira/browse/SPARK-23255
 Project: Spark
  Issue Type: Documentation
  Components: ML, PySpark
Affects Versions: 2.3.0
Reporter: Nick Pentreath


SPARK-21866 added built-in support for reading image data into a DataFrame. 
This new functionality should be documented in the user guide, with example 
usage.
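
A minimal sketch of the kind of example the guide could include (Scala, 2.3 API, spark-shell style; the sample-image path is assumed from the Spark source tree):

{code:scala}
import org.apache.spark.ml.image.ImageSchema

// Read a directory of images into a DataFrame with a single "image" struct column.
val images = ImageSchema.readImages("data/mllib/images")

images.printSchema()
images.select("image.origin", "image.height", "image.width", "image.nChannels", "image.mode")
  .show(truncate = false)
{code}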



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23107:
---
Description: 
Audit new public Scala APIs added to MLlib & GraphX. Take note of:
 * Protected/public classes or methods. If access can be more private, then it 
should be.
 * Also look for non-sealed traits.
 * Documentation: Missing? Bad links or formatting?

*Make sure to check the object doc!*

As you find issues, please create JIRAs and link them to this issue. 

For *user guide issues* link the new JIRAs to the relevant user guide QA issue 
(SPARK-23111 for {{2.3}})

  was:
Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
* Protected/public classes or methods.  If access can be more private, then it 
should be.
* Also look for non-sealed traits.
* Documentation: Missing?  Bad links or formatting?

*Make sure to check the object doc!*

As you find issues, please create JIRAs and link them to this issue.


> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA 
> issue (SPARK-23111 for {{2.3}})



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23227) Add user guide entry for collecting sub models for cross-validation classes

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23227:
---
Priority: Minor  (was: Major)

> Add user guide entry for collecting sub models for cross-validation classes
> ---
>
> Key: SPARK-23227
> URL: https://issues.apache.org/jira/browse/SPARK-23227
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23254) Add user guide entry for DataFrame multivariate summary

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23254:
---
Priority: Minor  (was: Major)

> Add user guide entry for DataFrame multivariate summary
> ---
>
> Key: SPARK-23254
> URL: https://issues.apache.org/jira/browse/SPARK-23254
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Minor
>
> SPARK-19634 added a DataFrame API for vector summary statistics. The [ML user 
> guide|http://spark.apache.org/docs/latest/ml-statistics.html] should be 
> updated, with the relevant example (to be in parity with the [MLlib user 
> guide|http://spark.apache.org/docs/latest/mllib-statistics.html#summary-statistics]).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23127) Update FeatureHasher user guide for catCols parameter

2018-01-29 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23127:
---
Priority: Minor  (was: Major)

> Update FeatureHasher user guide for catCols parameter
> -
>
> Key: SPARK-23127
> URL: https://issues.apache.org/jira/browse/SPARK-23127
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>Priority: Minor
> Fix For: 2.3.0
>
>
> SPARK-22801 added the {{categoricalCols}} parameter and updated the Scala and 
> Python doc, but did not update the user guide entry discussing feature 
> handling.
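
For the guide update, a sketch along these lines might do (2.3 API, toy data, spark-shell style; treating the numeric "real" column as categorical purely for illustration):

{code:scala}
import org.apache.spark.ml.feature.FeatureHasher

val df = spark.createDataFrame(Seq(
  (2.2, true, "1", "foo"),
  (3.3, false, "2", "bar"),
  (4.4, false, "3", "baz"),
  (5.5, false, "4", "foo")
)).toDF("real", "bool", "stringNum", "string")

val hasher = new FeatureHasher()
  .setInputCols(Array("real", "bool", "stringNum", "string"))
  .setCategoricalCols(Array("real")) // hash this numeric column as categorical
  .setOutputCol("features")

hasher.transform(df).show(truncate = false)
{code}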



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23254) Add user guide entry for DataFrame multivariate summary

2018-01-29 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-23254:
--

 Summary: Add user guide entry for DataFrame multivariate summary
 Key: SPARK-23254
 URL: https://issues.apache.org/jira/browse/SPARK-23254
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 2.3.0
Reporter: Nick Pentreath


SPARK-19634 added a DataFrame API for vector summary statistics. The [ML user 
guide|http://spark.apache.org/docs/latest/ml-statistics.html] should be 
updated, with the relevant example (to be in parity with the [MLlib user 
guide|http://spark.apache.org/docs/latest/mllib-statistics.html#summary-statistics]).
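
A rough sketch of the kind of example that could go into the guide (2.3 {{Summarizer}} API, toy data, spark-shell style):

{code:scala}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.stat.Summarizer
import spark.implicits._

val df = Seq(
  (Vectors.dense(2.0, 3.0, 5.0), 1.0),
  (Vectors.dense(4.0, 6.0, 7.0), 2.0)
).toDF("features", "weight")

// Several metrics computed in a single pass over the data.
df.select(Summarizer.metrics("mean", "variance").summary($"features").as("summary"))
  .select("summary.mean", "summary.variance")
  .show(truncate = false)

// Shortcut for a single metric.
df.select(Summarizer.mean($"features")).show(truncate = false)
{code}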
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression

2018-01-29 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343155#comment-16343155
 ] 

Nick Pentreath commented on SPARK-17139:


OK, added a PR to update the migration guide for {{2.3}}.

> Add model summary for MultinomialLogisticRegression
> ---
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.3.0
>
>
> Add model summary to multinomial logistic regression using same interface as 
> in other ML models.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2018-01-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341040#comment-16341040
 ] 

Nick Pentreath commented on SPARK-21866:


[~hyukjin.kwon] [~imatiach] Were any docs or examples added to the user guide for 
this feature? Seems like it would be good to add something.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>Assignee: Ilya Matiach
>Priority: Major
>  Labels: SPIP
> Fix For: 2.3.0
>
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> 

[jira] [Resolved] (SPARK-23113) Update MLlib, GraphX websites for 2.3

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23113.

Resolution: Resolved

> Update MLlib, GraphX websites for 2.3
> -
>
> Key: SPARK-23113
> URL: https://issues.apache.org/jira/browse/SPARK-23113
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Critical
>
> Update the sub-projects' websites to include new features in this release.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23113) Update MLlib, GraphX websites for 2.3

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23113:
--

Assignee: Nick Pentreath

> Update MLlib, GraphX websites for 2.3
> -
>
> Key: SPARK-23113
> URL: https://issues.apache.org/jira/browse/SPARK-23113
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Critical
>
> Update the sub-projects' websites to include new features in this release.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23113) Update MLlib, GraphX websites for 2.3

2018-01-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341030#comment-16341030
 ] 

Nick Pentreath commented on SPARK-23113:


No updates to MLlib project website required for {{2.3}} release.

> Update MLlib, GraphX websites for 2.3
> -
>
> Key: SPARK-23113
> URL: https://issues.apache.org/jira/browse/SPARK-23113
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Update the sub-projects' websites to include new features in this release.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341022#comment-16341022
 ] 

Nick Pentreath commented on SPARK-23107:


[~felixcheung] I added SPARK-23231 (and listed it in SPARK-23111)

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23231) Add doc for string indexer ordering to user guide (also to RFormula guide)

2018-01-26 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-23231:
--

 Summary: Add doc for string indexer ordering to user guide (also 
to RFormula guide)
 Key: SPARK-23231
 URL: https://issues.apache.org/jira/browse/SPARK-23231
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 2.2.1, 2.3.0
Reporter: Nick Pentreath


SPARK-20619 and SPARK-20899 added an ordering parameter to {{StringIndexer}}, 
which is also used internally in {{RFormula}}. Update the user guide for this.
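
A small sketch of the new param for the guide (2.3 API, toy data, spark-shell style; valid values are, as far as I recall, "frequencyDesc" (the default), "frequencyAsc", "alphabetDesc" and "alphabetAsc"):

{code:scala}
import org.apache.spark.ml.feature.StringIndexer

val df = spark.createDataFrame(Seq(
  (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")
)).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setStringOrderType("alphabetAsc") // instead of the default frequency-based ordering

indexer.fit(df).transform(df).show()
{code}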



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs

2018-01-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341009#comment-16341009
 ] 

Nick Pentreath commented on SPARK-23110:


[~WeichenXu123] any update?

> ML 2.3 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-23110
> URL: https://issues.apache.org/jira/browse/SPARK-23110
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Blocker
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so we can make this 
> task easier in the future!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22797) Add multiple column support to PySpark Bucketizer

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-22797:
---
Target Version/s: 2.3.0  (was: 2.4.0)

> Add multiple column support to PySpark Bucketizer
> -
>
> Key: SPARK-22797
> URL: https://issues.apache.org/jira/browse/SPARK-22797
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22797) Add multiple column support to PySpark Bucketizer

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-22797:
--

Assignee: zhengruifeng

> Add multiple column support to PySpark Bucketizer
> -
>
> Key: SPARK-22797
> URL: https://issues.apache.org/jira/browse/SPARK-22797
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22797) Add multiple column support to PySpark Bucketizer

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-22797.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19892
[https://github.com/apache/spark/pull/19892]

> Add multiple column support to PySpark Bucketizer
> -
>
> Key: SPARK-22797
> URL: https://issues.apache.org/jira/browse/SPARK-22797
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-22799.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19993
[https://github.com/apache/spark/pull/19993]

> Bucketizer should throw exception if single- and multi-column params are both 
> set
> -
>
> Key: SPARK-22799
> URL: https://issues.apache.org/jira/browse/SPARK-22799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Marco Gaido
>Priority: Blocker
> Fix For: 2.3.0
>
>
> See the related discussion: 
> https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049
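
For context, a toy sketch of the combination that should be rejected (spark-shell style; the exact exception type and message are implementation details not pinned down here):

{code:scala}
import org.apache.spark.ml.feature.Bucketizer
import scala.util.Try

val df = spark.createDataFrame(Seq(Tuple1(-0.5), Tuple1(0.3), Tuple1(1.5))).toDF("x")

val bucketizer = new Bucketizer()
  .setInputCol("x") // single-column params...
  .setOutputCol("xBucket")
  .setSplits(Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
  .setInputCols(Array("x")) // ...and multi-column params set on the same instance
  .setOutputCols(Array("xBucket2"))
  .setSplitsArray(Array(Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)))

// Expected to fail because both the single- and multi-column param groups are set.
println(Try(bucketizer.transform(df).count()))
{code}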



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-22799:
--

Assignee: Marco Gaido

> Bucketizer should throw exception if single- and multi-column params are both 
> set
> -
>
> Key: SPARK-22799
> URL: https://issues.apache.org/jira/browse/SPARK-22799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Marco Gaido
>Priority: Blocker
>
> See the related discussion: 
> https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23227) Add user guide entry for collecting sub models for cross-validation classes

2018-01-26 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-23227:
--

 Summary: Add user guide entry for collecting sub models for 
cross-validation classes
 Key: SPARK-23227
 URL: https://issues.apache.org/jira/browse/SPARK-23227
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, PySpark
Affects Versions: 2.3.0
Reporter: Nick Pentreath






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340786#comment-16340786
 ] 

Nick Pentreath commented on SPARK-23107:


[~felixcheung] have issues been created to track the addition of doc for 
{{RFormula}} changes? I guess it won't block release but we should create those 
issues if they haven't been done already.

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23107:
---
Affects Version/s: 2.3.0
 Target Version/s: 2.3.0

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23109) ML 2.3 QA: API: Python API coverage

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23109:
--

Assignee: Bryan Cutler

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Bryan Cutler
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340783#comment-16340783
 ] 

Nick Pentreath commented on SPARK-23107:


[~yanboliang] any update on this one?

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-23112) ML, Graph 2.3 QA: Programming guide update and migration guide

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reopened SPARK-23112:


Re-opening, as the breaking change in SPARK-17139 needs to be addressed.

> ML, Graph 2.3 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-23112
> URL: https://issues.apache.org/jira/browse/SPARK-23112
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Critical
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23112) ML, Graph 2.3 QA: Programming guide update and migration guide

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23112:
---
Affects Version/s: 2.3.0
 Target Version/s: 2.3.0
Fix Version/s: (was: 2.3.0)

> ML, Graph 2.3 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-23112
> URL: https://issues.apache.org/jira/browse/SPARK-23112
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Critical
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes

2018-01-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340779#comment-16340779
 ] 

Nick Pentreath commented on SPARK-23106:


Will keep this as resolved, since it should be done now, but will follow up on 
SPARK-23112.

> ML, Graph 2.3 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-23106
> URL: https://issues.apache.org/jira/browse/SPARK-23106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Bago Amirbekian
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes

2018-01-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23106:
--

Assignee: Bago Amirbekian

> ML, Graph 2.3 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-23106
> URL: https://issues.apache.org/jira/browse/SPARK-23106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Bago Amirbekian
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes

2018-01-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340778#comment-16340778
 ] 

Nick Pentreath commented on SPARK-23106:


I've audited all the other ML-related MiMa exclusions added from the following 
tickets and found them to be ok.
 * SPARK-21680 (private method)
 * SPARK-3181 (new method added to trait but trait is private)
 * SPARK-17139 (add {{toBinary}} method to sealed trait / private concrete 
classes)
 * SPARK-21087 (private class -> final class, but constructor is private)

Let me know if anyone sees something I didn't check.

> ML, Graph 2.3 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-23106
> URL: https://issues.apache.org/jira/browse/SPARK-23106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes

2018-01-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340735#comment-16340735
 ] 

Nick Pentreath commented on SPARK-23106:


SPARK-17139 breaks binary compat; I've commented there with details. It is for an 
{{Experimental}} API though, so it is probably fine; the migration guide will just 
need to be updated.

 

> ML, Graph 2.3 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-23106
> URL: https://issues.apache.org/jira/browse/SPARK-23106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression

2018-01-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340728#comment-16340728
 ] 

Nick Pentreath commented on SPARK-17139:


So, in terms of binary compat, the change itself here is overall OK, as the 
traits are sealed and the concrete implementations are private classes (or had 
private constructors in 2.2).

However, in 2.2 and earlier versions, the only way to access the binary summary 
was through a cast:

{{asInstanceOf[BinaryLogisticRegressionSummary]}}

(as can be seen in {{LogisticRegressionSummaryExample}}).
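
For reference, a minimal sketch of that 2.2-style access pattern (the {{training}} 
DataFrame and its columns are placeholders, not taken from the example):
{code:scala}
// Sketch only: assumes a DataFrame `training` with label/features columns.
import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary, LogisticRegression}

val lrModel = new LogisticRegression().fit(training)

// Compiles and links against 2.2, where BinaryLogisticRegressionSummary is a class:
val binarySummary = lrModel.summary.asInstanceOf[BinaryLogisticRegressionSummary]
println(binarySummary.areaUnderROC)
{code}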

That same code, if run against Spark 2.3, throws the following error:

 
{code:java}
$ ./bin/spark-submit --class 
org.apache.spark.examples.ml.LogisticRegressionSummaryExample 
PATH_TO_SPARK_2.2.0/examples/jars/spark-examples_2.11-2.2.0.jar

...
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found 
interface org.apache.spark.ml.classification.BinaryLogisticRegressionSummary, 
but class was expected
at 
org.apache.spark.examples.ml.LogisticRegressionSummaryExample$.main(LogisticRegressionSummaryExample.scala:63)
at 
org.apache.spark.examples.ml.LogisticRegressionSummaryExample.main(LogisticRegressionSummaryExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala){code}
The above was run with Spark built from branch-2.3 @ 
{{c79e771f8952e6773c3a84cc617145216feddbcf}} 

So this does break binary compat. However, I don't really see a good way to 
avoid it, and the way it's been done cleans things up best. Since it's marked 
{{Experimental}} we can live with this, but we will need to update SPARK-23112 
with the details if all are in agreement.

cc [~WeichenXu123] [~bago.amirbekian] [~sethah] [~josephkb] [~yanboliang]

 

> Add model summary for MultinomialLogisticRegression
> ---
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.3.0
>
>
> Add model summary to multinomial logistic regression using same interface as 
> in other ML models.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23109) ML 2.3 QA: API: Python API coverage

2018-01-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340653#comment-16340653
 ] 

Nick Pentreath commented on SPARK-23109:


[~bryanc] can you add a JIRA for adding {{columnSchema}} to Python?

Then, if there is nothing else here, I can resolve this ticket (note this is for 
auditing, not for fixing all the issues, so anything outstanding won't block the 
release).

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-22799:
---
Target Version/s: 2.3.0  (was: 2.4.0)

> Bucketizer should throw exception if single- and multi-column params are both 
> set
> -
>
> Key: SPARK-22799
> URL: https://issues.apache.org/jira/browse/SPARK-22799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>
> See the related discussion: 
> https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049
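
For context, a hypothetical misuse that the proposed check would reject; the 
column names and splits below are illustrative, not taken from the linked discussion:
{code:scala}
// Both the single-column and the multi-column params are set on one instance.
import org.apache.spark.ml.feature.Bucketizer

val bucketizer = new Bucketizer()
  .setInputCol("hour")                             // single-column params
  .setSplits(Array(Double.NegativeInfinity, 12.0, Double.PositiveInfinity))
  .setInputCols(Array("hour", "temperature"))      // multi-column params
  .setSplitsArray(Array(
    Array(Double.NegativeInfinity, 12.0, Double.PositiveInfinity),
    Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)))
  .setOutputCols(Array("hourBucket", "tempBucket"))

// With the proposed validation, transform()/transformSchema() should throw here
// rather than silently picking one set of params.
{code}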



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23106:
---
Affects Version/s: 2.3.0
 Target Version/s: 2.3.0

> ML, Graph 2.3 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-23106
> URL: https://issues.apache.org/jira/browse/SPARK-23106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes

2018-01-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340645#comment-16340645
 ] 

Nick Pentreath commented on SPARK-23106:


Thanks [~bago.amirbekian]. However, running MiMa is not enough for this task, 
since some PRs are merged that add MiMa exclusions. So, to be safe, we typically 
also double-check the MiMa exclusions added for ML during the release cycle, to 
ensure each exclusion is valid (i.e. a genuine false positive, most commonly a 
change to a private class that MiMa flags even though it is not part of the 
public API).
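
For reference, a minimal sketch of what such an exclusion looks like in 
{{project/MimaExcludes.scala}} (the class and method names below are hypothetical, 
not actual 2.3 exclusions):
{code:scala}
import com.typesafe.tools.mima.core._
import com.typesafe.tools.mima.core.ProblemFilters._

lazy val v23excludes = Seq(
  // e.g. a method removed from a private[ml] class that MiMa still flags
  exclude[DirectMissingMethodProblem](
    "org.apache.spark.ml.classification.SomePrivateClass.someRemovedMethod")
)
{code}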

> ML, Graph 2.3 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-23106
> URL: https://issues.apache.org/jira/browse/SPARK-23106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23109) ML 2.3 QA: API: Python API coverage

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23109:
---
Affects Version/s: 2.3.0
 Target Version/s: 2.3.0

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23163) Sync Python ML API docs with Scala

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23163:
--

Assignee: Bryan Cutler

> Sync Python ML API docs with Scala
> --
>
> Key: SPARK-23163
> URL: https://issues.apache.org/jira/browse/SPARK-23163
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Trivial
> Fix For: 2.3.0
>
>
> Fix a few doc issues as reported in 2.3 ML QA SPARK-23109



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23163) Sync Python ML API docs with Scala

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23163.

   Resolution: Fixed
Fix Version/s: 2.3.0

> Sync Python ML API docs with Scala
> --
>
> Key: SPARK-23163
> URL: https://issues.apache.org/jira/browse/SPARK-23163
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Trivial
> Fix For: 2.3.0
>
>
> Fix a few doc issues as reported in 2.3 ML QA SPARK-23109



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-22799:
---
Priority: Blocker  (was: Major)

> Bucketizer should throw exception if single- and multi-column params are both 
> set
> -
>
> Key: SPARK-22799
> URL: https://issues.apache.org/jira/browse/SPARK-22799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Blocker
>
> See the related discussion: 
> https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23112) ML, Graph 2.3 QA: Programming guide update and migration guide

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23112:
--

Assignee: Nick Pentreath

> ML, Graph 2.3 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-23112
> URL: https://issues.apache.org/jira/browse/SPARK-23112
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Critical
> Fix For: 2.3.0
>
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23112) ML, Graph 2.3 QA: Programming guide update and migration guide

2018-01-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23112.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20363
[https://github.com/apache/spark/pull/20363]

> ML, Graph 2.3 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-23112
> URL: https://issues.apache.org/jira/browse/SPARK-23112
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Critical
> Fix For: 2.3.0
>
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22735) Add VectorSizeHint to ML features documentation

2018-01-24 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-22735.

   Resolution: Fixed
Fix Version/s: 2.3.0

> Add VectorSizeHint to ML features documentation
> ---
>
> Key: SPARK-22735
> URL: https://issues.apache.org/jira/browse/SPARK-22735
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Major
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23112) ML, Graph 2.3 QA: Programming guide update and migration guide

2018-01-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335821#comment-16335821
 ] 

Nick Pentreath commented on SPARK-23112:


{{OneHotEncoder}} is the only deprecation I can see, but let me know if I 
missed anything.

> ML, Graph 2.3 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-23112
> URL: https://issues.apache.org/jira/browse/SPARK-23112
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23105) Spark MLlib, GraphX 2.3 QA umbrella

2018-01-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335526#comment-16335526
 ] 

Nick Pentreath commented on SPARK-23105:


Some of the ML QA sub-tasks (SPARK-23106, SPARK-23108, SPARK-23109, SPARK-23110) 
are marked {{Blocker}}, but they are not targeted for {{2.3.0}}. Surely they 
should be?

> Spark MLlib, GraphX 2.3 QA umbrella
> ---
>
> Key: SPARK-23105
> URL: https://issues.apache.org/jira/browse/SPARK-23105
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX. *SparkR is separate: SPARK-23114.*
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
>  * Check binary API compatibility for Scala/Java
>  * Audit new public APIs (from the generated html doc)
>  ** Scala
>  ** Java compatibility
>  ** Python coverage
>  * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
>  * Performance tests
> h2. Documentation and example code
>  * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
>  * Update Programming Guide
>  * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13964) Feature hashing improvements

2018-01-22 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334599#comment-16334599
 ] 

Nick Pentreath commented on SPARK-13964:


Yes, that's certainly something I'd like to see added to the {{FeatureHasher}}

> Feature hashing improvements
> 
>
> Key: SPARK-13964
> URL: https://issues.apache.org/jira/browse/SPARK-13964
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Priority: Minor
>
> Investigate improvements to Spark ML feature hashing (see e.g. 
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher).
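
For reference, a minimal sketch of the current {{FeatureHasher}} API that such 
improvements would build on (the column names and the {{df}} DataFrame are illustrative):
{code:scala}
import org.apache.spark.ml.feature.FeatureHasher

val hasher = new FeatureHasher()
  .setInputCols("city", "rooms", "price")
  .setOutputCol("features")
  .setNumFeatures(1 << 18)   // dimension of the hashed output vector

val hashed = hasher.transform(df)
{code}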



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23154) Document backwards compatibility guarantees for ML persistence

2018-01-19 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332252#comment-16332252
 ] 

Nick Pentreath commented on SPARK-23154:


SGTM

> Document backwards compatibility guarantees for ML persistence
> --
>
> Key: SPARK-23154
> URL: https://issues.apache.org/jira/browse/SPARK-23154
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> We have (as far as I know) maintained backwards compatibility for ML 
> persistence, but this is not documented anywhere.  I'd like us to document it 
> (for spark.ml, not for spark.mllib).
> I'd recommend something like:
> {quote}
> In general, MLlib maintains backwards compatibility for ML persistence.  
> I.e., if you save an ML model or Pipeline in one version of Spark, then you 
> should be able to load it back and use it in a future version of Spark.  
> However, there are rare exceptions, described below.
> Model persistence: Is a model or Pipeline saved using Apache Spark ML 
> persistence in Spark version X loadable by Spark version Y?
> * Major versions: No guarantees, but best-effort.
> * Minor and patch versions: Yes; these are backwards compatible.
> * Note about the format: There are no guarantees for a stable persistence 
> format, but model loading itself is designed to be backwards compatible.
> Model behavior: Does a model or Pipeline in Spark version X behave 
> identically in Spark version Y?
> * Major versions: No guarantees, but best-effort.
> * Minor and patch versions: Identical behavior, except for bug fixes.
> For both model persistence and model behavior, any breaking changes across a 
> minor version or patch version are reported in the Spark version release 
> notes. If a breakage is not reported in release notes, then it should be 
> treated as a bug to be fixed.
> {quote}
> How does this sound?
> Note: We unfortunately don't have tests for backwards compatibility (which 
> has technical hurdles and can be discussed in [SPARK-15573]).  However, we 
> have made efforts to maintain it during PR review and Spark release QA, and 
> most users expect it.
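
For illustration, a minimal sketch of the save/load round trip these guarantees 
cover ({{pipelineModel}} is an assumed fitted {{PipelineModel}} and {{path}} an 
assumed location):
{code:scala}
import org.apache.spark.ml.PipelineModel

// Saved with Spark version X...
pipelineModel.write.overwrite().save(path)

// ...should load and behave identically in a later minor/patch version Y:
val reloaded = PipelineModel.load(path)
{code}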



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23048) Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator

2018-01-19 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23048:
--

Assignee: Liang-Chi Hsieh

> Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator 
> ---
>
> Key: SPARK-23048
> URL: https://issues.apache.org/jira/browse/SPARK-23048
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.3.0
>
>
> Since we're deprecating OneHotEncoder, we should update the docs to reference 
> its replacement, OneHotEncoderEstimator.
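
For reference, a minimal sketch of the replacement API (the {{indexed}} DataFrame 
and column names are illustrative):
{code:scala}
import org.apache.spark.ml.feature.OneHotEncoderEstimator

val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("categoryIndex"))
  .setOutputCols(Array("categoryVec"))

// Unlike the deprecated OneHotEncoder (a plain Transformer), this is an
// Estimator: fit() returns a OneHotEncoderModel that is then applied.
val encoded = encoder.fit(indexed).transform(indexed)
{code}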



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23048) Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator

2018-01-19 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23048.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20257
[https://github.com/apache/spark/pull/20257]

> Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator 
> ---
>
> Key: SPARK-23048
> URL: https://issues.apache.org/jira/browse/SPARK-23048
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.3.0
>
>
> Since we're deprecating OneHotEncoder, we should update the docs to reference 
> its replacement, OneHotEncoderEstimator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23127) Update FeatureHasher user guide for catCols parameter

2018-01-19 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23127.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20293
[https://github.com/apache/spark/pull/20293]

> Update FeatureHasher user guide for catCols parameter
> -
>
> Key: SPARK-23127
> URL: https://issues.apache.org/jira/browse/SPARK-23127
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>Priority: Major
> Fix For: 2.3.0
>
>
> SPARK-22801 added the {{categoricalCols}} parameter and updated the Scala and 
> Python doc, but did not update the user guide entry discussing feature 
> handling.
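
For reference, a minimal sketch of the parameter the user guide entry should cover 
(the column names and the {{df}} DataFrame are made up for the sketch):
{code:scala}
import org.apache.spark.ml.feature.FeatureHasher

val hasher = new FeatureHasher()
  .setInputCols("zipCode", "clicks")
  .setCategoricalCols(Array("clicks"))  // hash this numeric column as categorical
  .setOutputCol("features")

val features = hasher.transform(df)
{code}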



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23127) Update FeatureHasher user guide for catCols parameter

2018-01-19 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23127:
--

Assignee: Nick Pentreath

> Update FeatureHasher user guide for catCols parameter
> -
>
> Key: SPARK-23127
> URL: https://issues.apache.org/jira/browse/SPARK-23127
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>Priority: Major
>
> SPARK-22801 added the {{categoricalCols}} parameter and updated the Scala and 
> Python doc, but did not update the user guide entry discussing feature 
> handling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23127) Update FeatureHasher user guide for catCols parameter

2018-01-17 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23127:
---
Description: SPARK-22801 added the {{categoricalCols}} parameter and 
updated the Scala and Python doc, but did not update the user guide entry 
discussing feature handling.

> Update FeatureHasher user guide for catCols parameter
> -
>
> Key: SPARK-23127
> URL: https://issues.apache.org/jira/browse/SPARK-23127
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22801 added the {{categoricalCols}} parameter and updated the Scala and 
> Python doc, but did not update the user guide entry discussing feature 
> handling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23127) Update FeatureHasher user guide for catCols parameter

2018-01-17 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-23127:
--

 Summary: Update FeatureHasher user guide for catCols parameter
 Key: SPARK-23127
 URL: https://issues.apache.org/jira/browse/SPARK-23127
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 2.3.0
Reporter: Nick Pentreath






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23060) RDD's apply function

2018-01-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326866#comment-16326866
 ] 

Nick Pentreath commented on SPARK-23060:


I agree; I don't see a compelling enough case for adding this to the public 
API.

> RDD's apply function
> 
>
> Key: SPARK-23060
> URL: https://issues.apache.org/jira/browse/SPARK-23060
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Gianmarco Donetti
>Priority: Minor
>  Labels: features, newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> New function for RDDs -> apply
> >>> def foo(rdd):
> ...     return rdd.map(lambda x: x.split('|')).filter(lambda x: x[0] == 'ERROR')
> >>> rdd = sc.parallelize(['ERROR|10', 'ERROR|12', 'WARNING|10', 'INFO|2'])
> >>> result = rdd.apply(foo)
> >>> result.collect()
> [('ERROR', '10'), ('ERROR', '12')]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21108) convert LinearSVC to aggregator framework

2018-01-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-21108.

Resolution: Fixed

> convert LinearSVC to aggregator framework
> -
>
> Key: SPARK-21108
> URL: https://issues.apache.org/jira/browse/SPARK-21108
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2018-01-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-21856:
--

Assignee: Chunsheng Ji

> Update Python API for MultilayerPerceptronClassifierModel
> -
>
> Key: SPARK-21856
> URL: https://issues.apache.org/jira/browse/SPARK-21856
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Chunsheng Ji
>Priority: Minor
>
> SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so the 
> Python API also needs updating.
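
For context, a minimal sketch of the Scala-side behaviour the Python API needs to 
mirror (the {{train}}/{{test}} DataFrames and layer sizes are illustrative):
{code:scala}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 3))            // input, hidden, and output layer sizes
  .setProbabilityCol("probability")     // exposed on the Scala side by SPARK-12664

val model = mlp.fit(train)
model.transform(test).select("prediction", "probability").show()
{code}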



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2018-01-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-21856:
--

Assignee: (was: Weichen Xu)

> Update Python API for MultilayerPerceptronClassifierModel
> -
>
> Key: SPARK-21856
> URL: https://issues.apache.org/jira/browse/SPARK-21856
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Priority: Minor
>
> SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so the 
> Python API also needs updating.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2018-01-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-21856:
--

Assignee: Weichen Xu

> Update Python API for MultilayerPerceptronClassifierModel
> -
>
> Key: SPARK-21856
> URL: https://issues.apache.org/jira/browse/SPARK-21856
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Minor
>
> SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so the 
> Python API also needs updating.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2018-01-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-21856.

Resolution: Fixed

> Update Python API for MultilayerPerceptronClassifierModel
> -
>
> Key: SPARK-21856
> URL: https://issues.apache.org/jira/browse/SPARK-21856
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Minor
>
> SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so the 
> Python API also needs updating.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22943) OneHotEncoder supports manual specification of categorySizes

2018-01-15 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326151#comment-16326151
 ] 

Nick Pentreath commented on SPARK-22943:


Does the new estimator & model version of OHE solve this underlying issue? 

> OneHotEncoder supports manual specification of categorySizes
> 
>
> Key: SPARK-22943
> URL: https://issues.apache.org/jira/browse/SPARK-22943
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> OHE should support configurable categorySizes, like n_values in 
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html,
> which allows consistent and foreseeable conversion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22993) checkpointInterval param doc should be clearer

2018-01-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-22993.

Resolution: Fixed

> checkpointInterval param doc should be clearer
> --
>
> Key: SPARK-22993
> URL: https://issues.apache.org/jira/browse/SPARK-22993
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Trivial
>
> Several algorithms use the shared parameter {{HasCheckpointInterval}} (ALS, 
> LDA, GBT), and each silently ignores the parameter when the checkpoint 
> directory is not set on the SparkContext. This should be documented in the 
> param doc.
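
For illustration, a minimal sketch of the behaviour being documented (the path 
and columns are illustrative; {{spark}} is an assumed SparkSession and {{ratings}} 
an assumed DataFrame):
{code:scala}
import org.apache.spark.ml.recommendation.ALS

// Without this, the checkpointInterval below is silently ignored:
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
  .setCheckpointInterval(10)   // checkpoint the ALS factors every 10 iterations

val model = als.fit(ratings)
{code}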



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22993) checkpointInterval param doc should be clearer

2018-01-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-22993:
--

Assignee: Seth Hendrickson

> checkpointInterval param doc should be clearer
> --
>
> Key: SPARK-22993
> URL: https://issues.apache.org/jira/browse/SPARK-22993
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Trivial
>
> Several algorithms use the shared parameter {{HasCheckpointInterval}} (ALS, 
> LDA, GBT), and each silently ignores the parameter when the checkpoint 
> directory is not set on the SparkContext. This should be documented in the 
> param doc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


