[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.

2016-07-07 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366502#comment-15366502
 ] 

Vladimir Feinberg commented on SPARK-4240:
--

Pending some dramatic response from \[~sethah\] telling me to back off, I'll 
take over this one. \[~josephkb\], mind reviewing the below outline?

I propose that this JIRA be resolved in the following manner:
API Change: Since a true "TreeBoost" splits on reduction of the loss itself 
(using the loss as the impurity), the impurity calculator should be derived 
from the loss function.
 * Set a new default for the impurity param in GBTs, 'auto', which uses the 
loss-based impurity by default but can be overridden to use the standard 
random-forest impurities if desired.
 * Create a generic loss-reduction calculator which works by reducing a 
parametrizable loss criterion (or, rather, a Taylor approximation of it, as 
recommended by Friedman \[1\] and implemented to the second order by XGBoost 
\[2\] \[code: 5\]); a sketch of the resulting split gain follows this outline.
 * Instantiate the generic loss-reduction calculator (that supports different 
orders of losses) for regression:
 ** Add squared and absolute losses
 **  'auto' induces a second-order approximation for squared loss, and only a 
first-order approximation for absolute loss
 ** The former should perform better than LS_Boost from \[1\] (which only uses 
the first-order approximation) and the latter is equivalent to LAD_TreeBoost 
from \[1\]. It may be worthwhile to add an LS_Boost impurity and check it 
performs worse. Both these "generic loss" instantiations become new impurities 
that the user could set, just like 'gini' or 'entropy'. This calculator will 
implement corresponding terminal-leaf predictions, either the mean or median of 
the leaf's sample. Computing the median may require modifications to the 
internal developer API so that at some point the calculator can access the 
entire set of training samples a terminal node's partition corresponds to.
 * On the classifier side we need to do the same thing, with a logistic loss 
inducing a new impurity. Second order here is again feasible. First order 
corresponds to L2_TreeBoost from \[1\].
 * Because the new impurities apply only to GBTs, they'll only be available for 
them.
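
To make the proposed split criterion concrete, here is a minimal sketch (my own 
illustration, not an existing Spark API) of the second-order, XGBoost-style 
loss reduction with no regularization term; {{g}} and {{h}} are sums of 
per-point gradients and Hessians of the loss over a node, and the sketch 
assumes {{h > 0}}:

{code}
// Approximate loss contribution of a node whose points have gradient sum g and
// Hessian sum h (second-order Taylor expansion of the loss, no regularization).
def approxLoss(g: Double, h: Double): Double = -0.5 * g * g / h

// Impurity-style gain of a split: parent's loss minus the two children's losses.
def splitGain(gLeft: Double, hLeft: Double, gRight: Double, hRight: Double): Double =
  approxLoss(gLeft + gRight, hLeft + hRight) -
    (approxLoss(gLeft, hLeft) + approxLoss(gRight, hRight))

// Under the same approximation, the loss-optimal terminal-leaf prediction;
// for squared loss h is just the point count, so this is the mean residual.
def leafPrediction(g: Double, h: Double): Double = -g / h
{code}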

Questions for \[~josephkb\]:
1. Should I ditch the second-order approximation that \[2\] makes? It won't 
make the code any simpler, but it might make the theory behind the new 
approach easier to grasp. This would add another task, "try out second-order 
Taylor approx", to the list below, and it also means we won't perform as well 
as xgb until the second-order approximation lands.

Note that the L2 loss corresponds to Gaussian regression, L1 to Laplace, and 
logistic to Bernoulli. I'll add the aliases to the loss param.

Differences between this and \[2\]:
* No leaf weight regularization, besides the default constant shrinkage, is 
implemented.

Differences between this and \[3\]:
* \[3\] uses variance impurity for split selection \[code: 6\]. I don't think 
this is even technically TreeBoost. Such behavior should be emulatable in the 
new code by overriding impurity='variance' (would be nice to see if we have 
comparable perf here).
* \[3\] implements GBTs for weighted input data. We don't support data 
weights, so for both L1 and L2 losses the terminal-node computations don't need 
Newton-Raphson optimization.

Probably not for this JIRA:
1. Implementing leaf weights (and leaf weight regularization) - probably 
involves adding a regularization param to GBTs, creating new 
regularization-aware impurity calculators.
2. In {{RandomForest.scala}} the line {{val requiredSamples = 
math.max(metadata.maxBins * metadata.maxBins, 1)}} performs row subsampling 
on our data. I don't know if it's sound from a statistical learning 
perspective, but this is something that we should take a look at (i.e., does 
performing a precise sample complexity calculation in the PAC sense lead to 
better perf)?
3. Add different "losses" corresponding to residual distributions - see all the 
ones supported here \[3\] \[4\] \[7\]. Depending what we add, we may need to 
implement NR optimization. Huber loss is the only one mentioned in \[1\] that 
we don't yet have.

\[1\] Friedman paper: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf
\[2\] xgboost paper: http://www.kdd.org/kdd2016/papers/files/Paper_697.pdf
\[3\] gbm impl paper: http://www.saedsayad.com/docs/gbm2.pdf
\[4\] xgboost docs: 
https://xgboost.readthedocs.io/en/latest//parameter.html#general-parameters
\[5\] 
https://github.com/dmlc/xgboost/blob/1625dab1cbc416d9d9a79dde141e3d236060387a/src/objective/regression_obj.cc
\[6\] https://github.com/gbm-developers/gbm/blob/master/src/node_parameters.h
\[7\] gbm api: https://cran.r-project.org/web/packages/gbm/gbm.pdf


> Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
> 
>
> Key: SPARK-4240
>

[jira] [Comment Edited] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.

2016-07-07 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366502#comment-15366502
 ] 

Vladimir Feinberg edited comment on SPARK-4240 at 7/7/16 6:03 PM:
--

Pending some dramatic response from [~sethah] telling me to back off, I'll take 
over this one. [~josephkb], mind reviewing the below outline?

I propose that this JIRA be resolved in the following manner:
API Change: Since a true "TreeBoost" splits on reduction of the loss itself 
(using the loss as the impurity), the impurity calculator should be derived 
from the loss function.
 * Set a new default for the impurity param in GBTs, 'auto', which uses the 
loss-based impurity by default but can be overridden to use the standard 
random-forest impurities if desired (a sketch of this resolution follows the 
outline).
 * Create a generic loss-reduction calculator which works by reducing a 
parametrizable loss criterion (or, rather, a Taylor approximation of it as 
recommended by Friedman \[1\] and implemented to the second order by XGBoost 
\[2\] \[code: 5\]).
 * Instantiate the generic loss-reduction calculator (that supports different 
orders of losses) for regression:
 ** Add squared and absolute losses
 **  'auto' induces a second-order approximation for squared loss, and only a 
first-order approximation for absolute loss
 ** The former should perform better than LS_Boost from \[1\] (which only uses 
the first-order approximation) and the latter is equivalent to LAD_TreeBoost 
from \[1\]. It may be worthwhile to add an LS_Boost impurity and check it 
performs worse. Both these "generic loss" instantiations become new impurities 
that the user could set, just like 'gini' or 'entropy'. This calculator will 
implement corresponding terminal-leaf predictions, either the mean or median of 
the leaf's sample. Computing the median may require modifications to the 
internal developer API so that at some point the calculator can access the 
entire set of training samples a terminal node's partition corresponds to.
 * On the classifier side we need to do the same thing, with a logistic loss 
inducing a new impurity. Second order here is again feasible. First order 
corresponds to L2_TreeBoost from \[1\].
 * Because the new impurities apply only to GBTs, they'll only be available for 
them.
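
Purely as a hypothetical illustration of the 'auto' default described above 
(the strings and names are placeholders, not Spark's API), the impurity 
resolution could look like:

{code}
// Hypothetical resolution of the proposed 'auto' impurity default: derive the
// impurity from the loss unless the user explicitly overrides it (e.g. with
// 'variance' to recover today's behavior).
def resolveImpurity(impurity: String, loss: String): String = (impurity, loss) match {
  case ("auto", "squared")  => "loss-based-squared"   // second-order approximation
  case ("auto", "absolute") => "loss-based-absolute"  // first-order only
  case ("auto", "logistic") => "loss-based-logistic"  // second-order approximation
  case (explicit, _)        => explicit               // e.g. "variance", as today
}
{code}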

Questions for [~josephkb]:
1. Should I ditch the second-order approximation that \[2\] makes? It won't 
make the code any simpler, but it might make the theory behind the new 
approach easier to grasp. This would add another task, "try out second-order 
Taylor approx", to the list below, and it also means we won't perform as well 
as xgb until the second-order approximation lands.

Note that the L2 loss corresponds to Gaussian regression, L1 to Laplace, and 
logistic to Bernoulli. I'll add the aliases to the loss param.

Differences between this and \[2\]:
* No leaf weight regularization, besides the default constant shrinkage, is 
implemented.

Differences between this and \[3\]:
* \[3\] uses variance impurity for split selection \[code: 6\]. I don't think 
this is even technically TreeBoost. Such behavior should be emulatable in the 
new code by overriding impurity='variance' (would be nice to see if we have 
comparable perf here).
* \[3\] implements GBTs for weighted input data. We don't support data 
weights, so for both L1 and L2 losses the terminal-node computations don't need 
Newton-Raphson optimization.

Probably not for this JIRA:
1. Implementing leaf weights (and leaf weight regularization) - probably 
involves adding a regularization param to GBTs, creating new 
regularization-aware impurity calculators.
2. In {{RandomForest.scala}} the line {{val requiredSamples = 
math.max(metadata.maxBins * metadata.maxBins, 1)}} performs row subsampling 
on our data. I don't know if it's sound from a statistical learning 
perspective, but this is something that we should take a look at (i.e., does 
performing a precise sample complexity calculation in the PAC sense lead to 
better perf)?
3. Add different "losses" corresponding to residual distributions - see all the 
ones supported here \[3\] \[4\] \[7\]. Depending what we add, we may need to 
implement NR optimization. Huber loss is the only one mentioned in \[1\] that 
we don't yet have.

\[1\] Friedman paper: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf
\[2\] xgboost paper: http://www.kdd.org/kdd2016/papers/files/Paper_697.pdf
\[3\] gbm impl paper: http://www.saedsayad.com/docs/gbm2.pdf
\[4\] xgboost docs: 
https://xgboost.readthedocs.io/en/latest//parameter.html#general-parameters
\[5\] 
https://github.com/dmlc/xgboost/blob/1625dab1cbc416d9d9a79dde141e3d236060387a/src/objective/regression_obj.cc
\[6\] https://github.com/gbm-developers/gbm/blob/master/src/node_parameters.h
\[7\] gbm api: https://cran.r-project.org/web/packages/gbm/gbm.pdf



was (Author: vlad.feinberg):
Pending some dramatic response from \[~sethah\] telling me to back off, I'll 
take over this one. \[~josephkb\], mind review

[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values

2016-07-11 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371420#comment-15371420
 ] 

Vladimir Feinberg commented on SPARK-10931:
---

[~josephkb] The intention of this JIRA is a bit confusing. To my understanding, 
there are three kinds of params:

1. Estimator-related params that only have to do with fitting (e.g., 
regularization)
2. Independent model- and estimator-related params to do with prediction (e.g., 
the maximum number of iterations)
3. Shared model and estimator params that are set once per fitted pipeline 
(e.g., number of components in PCA).

I'd venture that we'd want a model to have:

1. Access to an immutable version of (1) and (3).
  * In Scala, this is done by having a {{parent}} reference to the generating 
{{Estimator}}, but because it is a reference, if the estimator changes then the 
params change too, becoming inconsistent with the model. It should be 
copy-on-write (this may be SPARK-7494, I'm not sure); a toy sketch of such a 
snapshot follows this list. Also, {{parent}} is a mutable reference.
  * In Python, there is no {{parent}}.

2. Access to a mutable version of (2), where mutation should change model 
behavior
  * Both languages have this.

3. Separation of concerns. If a parameter falls into categories (1) or (3), it 
shouldn't be a parameter of the model, since changing its value has no effect 
except confusion.
  * Both Python and Scala will, as of this JIRA, copy everything - groups (1), 
(2), and (3) - to the model, each with its own version.
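
As a toy illustration of the copy-on-write/snapshot behavior item (1) asks for 
(the classes below are hypothetical, deliberately simplified, and not Spark's 
actual {{Params}} machinery):

{code}
// Hypothetical sketch: fit() snapshots the estimator's params into the model,
// so later mutation of the estimator cannot silently change the model's view.
final case class ParamSnapshot(values: Map[String, Any])

class ToyEstimator {
  private var params = Map[String, Any]("regParam" -> 0.1)
  def set(name: String, value: Any): this.type = { params += name -> value; this }
  def fit(): ToyModel = new ToyModel(ParamSnapshot(params))  // defensive copy
}

// The model holds an immutable view of groups (1) and (3).
class ToyModel(val trainingParams: ParamSnapshot)

val est = new ToyEstimator
val model = est.fit()
est.set("regParam", 10.0)  // does not affect model.trainingParams
{code}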

> PySpark ML Models should contain Param values
> -
>
> Key: SPARK-10931
> URL: https://issues.apache.org/jira/browse/SPARK-10931
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> PySpark spark.ml Models are generally wrappers around Java objects and do not 
> even contain Param values.  This JIRA is for copying the Param values from 
> the Estimator to the model.
> This can likely be solved by modifying Estimator.fit to copy Param values, 
> but should also include proper unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16504) UDAF should be typed

2016-07-12 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16504:
-

 Summary: UDAF should be typed
 Key: SPARK-16504
 URL: https://issues.apache.org/jira/browse/SPARK-16504
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Vladimir Feinberg


Currently, UDAFs can be implemented by using a generic 
{{MutableAggregationBuffer}}. This type-less class requires that the user 
specify the schema.

If the user wants to create vector output from a UDAF, this requires specifying 
an output schema with a VectorUDT(), which is only accessible through a 
DeveloperApi.

Since we would prefer not to expose VectorUDT, the only option is to resolve 
the user's inability to (legally) specify a schema containing a VectorUDT the 
same way we handle creating DataFrames: by type inference, just as 
createDataFrame does.
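
For reference, here is a minimal example of what the current untyped API forces 
on the user (a toy sum-of-squares aggregate; returning a vector instead would 
require putting a {{VectorUDT}} into {{bufferSchema}}/{{dataType}}, which is 
exactly the problem described above):

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Even a trivial aggregate must spell out every schema by hand, because the
// buffer is untyped.
class SumOfSquares extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("x", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("acc", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = 0.0 }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      val x = input.getDouble(0)
      buffer(0) = buffer.getDouble(0) + x * x
    }
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
  }
  def evaluate(buffer: Row): Any = buffer.getDouble(0)
}
{code}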



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16504) UDAF should be typed

2016-07-12 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373769#comment-15373769
 ] 

Vladimir Feinberg commented on SPARK-16504:
---

FWIW, {{merge}} has type {{(MAB, Row): Unit}} instead of {{(MAB, MAB): Unit}}, 
or even more preferably {{(MAB, MAB): MAB}}, for some reason.

> UDAF should be typed
> 
>
> Key: SPARK-16504
> URL: https://issues.apache.org/jira/browse/SPARK-16504
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Vladimir Feinberg
>
> Currently, UDAFs can be implemented by using a generic 
> {{MutableAggregationBuffer}}. This type-less class requires the user specify 
> the schema.
> If the user wants to create vector output from a UDAF, this requires 
> specifying an output schema with a VectorUDT(), which is only accessible 
> through a DeveloperApi.
> Since we would prefer not to expose VectorUDT, the only option would be to 
> resolve the user's inability to (legally) specify a schema containing a 
> VectorUDT the same way that we would do so for creating dataframes: by type 
> inference, just like createDataFrame does.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16551) Accumulator Examples should demonstrate different use case from UDAFs

2016-07-14 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16551:
-

 Summary: Accumulator Examples should demonstrate different use 
case from UDAFs
 Key: SPARK-16551
 URL: https://issues.apache.org/jira/browse/SPARK-16551
 Project: Spark
  Issue Type: Documentation
Reporter: Vladimir Feinberg


Currently, the Spark programming guide demonstrates Accumulators 
(http://spark.apache.org/docs/latest/programming-guide.html#accumulators) by 
taking the sum of an RDD.

This example makes new users think that Accumulators serve the role that UDAFs 
do, which they don't. They're meant to be small, out-of-band values that don't 
break pipelining. Documentation examples and notes should reflect this (and 
warn that accumulators may cause driver bottlenecks).
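
For example, a rough sketch of the intended contrast (assuming an existing 
{{SparkContext}} {{sc}}, a placeholder input path, and the Spark 2.0 
{{longAccumulator}} API):

{code}
// The accumulator only carries a small, out-of-band counter back to the
// driver; the actual aggregation stays in the data path.
val badRecords = sc.longAccumulator("badRecords")

val values = sc.textFile("hdfs:///path/to/events").flatMap { line =>
  val fields = line.split(",")
  if (fields.length == 3) Some(fields(2).toDouble)
  else { badRecords.add(1); None }  // bookkeeping, not the result
}

val total = values.sum()  // the real aggregation, done in-band
println(s"sum = $total, skipped ${badRecords.value} malformed lines")
{code}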



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16572) DStream Kinesis Connector Doc formatting

2016-07-15 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16572:
-

 Summary: DStream Kinesis Connector Doc formatting
 Key: SPARK-16572
 URL: https://issues.apache.org/jira/browse/SPARK-16572
 Project: Spark
  Issue Type: Documentation
Reporter: Vladimir Feinberg
Priority: Minor


Formatting is off for the Kinesis doc for the old streaming API:
https://github.com/apache/spark/blob/05d7151ccbccdd977ec2f2301d5b12566018c988/docs/streaming-kinesis-integration.md

The code blocks aren't formatted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16572) DStream Kinesis Connector Doc formatting

2016-07-15 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg closed SPARK-16572.
-
Resolution: Fixed

The layout is just not GitHub-compatible.

> DStream Kinesis Connector Doc formatting
> 
>
> Key: SPARK-16572
> URL: https://issues.apache.org/jira/browse/SPARK-16572
> Project: Spark
>  Issue Type: Documentation
>Reporter: Vladimir Feinberg
>Priority: Minor
>
> Formatting is off for the Kinesis doc for the old streaming API:
> https://github.com/apache/spark/blob/05d7151ccbccdd977ec2f2301d5b12566018c988/docs/streaming-kinesis-integration.md
> The code blocks aren't formatted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.

2016-07-22 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366502#comment-15366502
 ] 

Vladimir Feinberg edited comment on SPARK-4240 at 7/22/16 4:47 PM:
---

Pending some dramatic response from [~sethah] telling me to back off, I'll take 
over this one. [~josephkb], mind reviewing the below outline?

I propose that this JIRA be resolved in the following manner:
API Change: Since a true "TreeBoost" splits on reduction of the loss itself 
(using the loss as the impurity), the impurity calculator should be derived 
from the loss function.
 * Set a new default for the impurity param in GBTs, 'auto', which uses the 
loss-based impurity by default but can be overridden to use the standard 
random-forest impurities if desired.
 * Create a generic loss-reduction calculator which works by reducing a 
parametrizable loss criterion (or, rather, a Taylor approximation of it as 
recommended by Friedman \[1\] and implemented to the second order by XGBoost 
\[2\] \[code: 5\]).
 * Make loss-reduction calculator for regression:
 ** Add squared and absolute losses
 **  'loss-based' induces a second-order approximation for squared loss, and 
only a first-order approximation for absolute loss
 ** The former should perform like LS_Boost from \[1\] and the latter is 
sort-of (*) equivalent to LAD_TreeBoost from \[1\]. Both these "generic loss" 
instantiations become new impurities that the user could set, just like 'gini' 
or 'entropy'. This calculator will implement corresponding terminal-leaf 
predictions, either the mean or median of the leaf's sample. Computing the 
median may require modifications to the internal developer API so that at some 
point the calculator can access the entire set of training samples a terminal 
node's partition corresponds to.
 * On the classifier side we need to do the same thing, with a logistic loss 
inducing a new impurity. Second order here is again feasible. First order 
corresponds to sort-of (*) L2_TreeBoost from \[1\].
 * Because the new impurities apply only to GBTs, they'll only be available for 
them.

(*) A note regarding the sort-of equivalence with Friedman: in his 1999 paper, 
Friedman admits that he's not doing "true" TreeBoost because he builds the tree 
based on variance reduction of the residuals. This is exactly what \[3\] does. 
\[2\] instead builds the tree by optimizing a _Taylor approximation_ of the 
losses, which makes it feasible to efficiently consider many splits in a leaf 
(because of the additive nature of the approximate loss function).

* For logistic, this works really well for XGBoost.
* For squared error, both approaches are equivalent
* For absolute error, the Taylor approximation can only be first-order (but 
locally it's a perfect approximation). I don't think anyone has done even this 
approximate version of "true" L1 TreeBoost before. It may be necessary to go 
the way gbm does and use the variance impurity, but we'll try it out anyway 
(per-loss gradients and Hessians are sketched below).
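
For concreteness, a rough sketch (my own notation, not Spark's internal API) of 
the per-loss first- and second-order terms the bullets above refer to, for a 
current prediction {{f}} and label {{y}}:

{code}
sealed trait TaylorLoss {
  def gradient(f: Double, y: Double): Double
  def hessian(f: Double, y: Double): Double
}
// Squared error L = (f - y)^2 / 2: the second-order expansion is exact.
object SquaredLoss extends TaylorLoss {
  def gradient(f: Double, y: Double): Double = f - y
  def hessian(f: Double, y: Double): Double = 1.0
}
// Absolute error L = |f - y|: the Hessian is 0 almost everywhere, so only a
// first-order expansion is available (the (*) caveat above).
object AbsoluteLoss extends TaylorLoss {
  def gradient(f: Double, y: Double): Double = math.signum(f - y)
  def hessian(f: Double, y: Double): Double = 0.0
}
// Logistic loss with labels y in {0, 1}: a genuine second-order expansion.
object LogisticLoss extends TaylorLoss {
  private def p(f: Double): Double = 1.0 / (1.0 + math.exp(-f))
  def gradient(f: Double, y: Double): Double = p(f) - y
  def hessian(f: Double, y: Double): Double = p(f) * (1.0 - p(f))
}
{code}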

Note that the L2 loss corresponds to Gaussian regression, L1 to Laplace, and 
logistic to Bernoulli. I'll add the aliases to the loss param.

Differences between this and \[2\]:
* No leaf weight regularization, besides the default constant shrinkage, is 
implemented.

Differences between this and \[3\]:
* \[3\] uses variance impurity for split selection \[code: 6\]. I don't think 
this is even technically TreeBoost. Such behavior should be emulatable in the 
new code by overriding impurity='variance' (would be nice to see if we have 
comparable perf here).
* \[3\] implements GBTs for weighted input data. We don't support data 
weights, so for both L1 and L2 losses the terminal-node computations don't need 
Newton-Raphson optimization.

Probably not for this JIRA:
1. Implementing leaf weights (and leaf weight regularization) - probably 
involves adding a regularization param to GBTs, creating new 
regularization-aware impurity calculators.
2. In {{RandomForest.scala}} the line {{val requiredSamples = 
math.max(metadata.maxBins * metadata.maxBins, 1)}} performs row subsampling 
on our data. I don't know if it's sound from a statistical learning 
perspective, but this is something that we should take a look at (i.e., does 
performing a precise sample complexity calculation in the PAC sense lead to 
better perf)?
3. Add different "losses" corresponding to residual distributions - see all the 
ones supported here \[3\] \[4\] \[7\]. Depending what we add, we may need to 
implement NR optimization. Huber loss is the only one mentioned in \[1\] that 
we don't yet have.

\[1\] Friedman paper: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf
\[2\] xgboost paper: http://www.kdd.org/kdd2016/papers/files/Paper_697.pdf
\[3\] gbm impl paper: http://www.saedsayad.com/docs/gbm2.pdf
\[4\] xgboost docs: 
https://xgboost.readthedocs.io/en/latest//parameter.html#general-parameters
\[5\] 
https://github.com/dmlc/xgboost/blob/1625dab1cbc416d9d9a79dde141e3

[jira] [Created] (SPARK-16718) gbm-style treeboost

2016-07-25 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16718:
-

 Summary: gbm-style treeboost
 Key: SPARK-16718
 URL: https://issues.apache.org/jira/browse/SPARK-16718
 Project: Spark
  Issue Type: Sub-task
Reporter: Vladimir Feinberg


As an initial minimal change, we should provide TreeBoost as implemented in GBM 
for both L1 and L2 losses: by introducing a new "loss-based" impurity, tree 
leaves in GBTs can have loss-optimal predictions for their partition of the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16718) gbm-style treeboost

2016-07-25 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-16718:
--
Description: 
As an initial minimal change, we should provide TreeBoost as implemented in GBM 
for both L1 and L2 losses: by introducing a new "loss-based" impurity, tree 
leaves in GBTs can have loss-optimal predictions for their partition of the data.

Commit should have evidence of accuracy improvement


  was:As an initial minimal change, we should provide TreeBoost as implemented 
in GBM for both L1 and L2 losses: by introducing a new "loss-based" impurity, 
tree leaves in GBTs can have loss-optimal predictions for their partition of the 
data.


> gbm-style treeboost
> ---
>
> Key: SPARK-16718
> URL: https://issues.apache.org/jira/browse/SPARK-16718
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Vladimir Feinberg
>
> As an initial minimal change, we should provide TreeBoost as implemented in 
> GBM for both L1 and L2 losses: by introducing a new "loss-based" impurity, 
> tree leaves in GBTs can have loss-optimal predictions for their partition of 
> the data.
> Commit should have evidence of accuracy improvement



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16718) gbm-style treeboost

2016-07-25 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-16718:
--
Description: 
As an initial minimal change, we should provide TreeBoost as implemented in GBM 
for L1, L2, and logistic losses: by introducing a new "loss-based" impurity, 
tree leaves in GBTs can have loss-optimal predictions for their partition of the 
data.

Commit should have evidence of accuracy improvement


  was:
As an initial minimal change, we should provide TreeBoost as implemented in GBM 
for both L1 and L2 losses: by introducing a new "loss-based" impurity, tree 
leaves in GBTs can have loss-optimal predictions for their partition of the data.

Commit should have evidence of accuracy improvement



> gbm-style treeboost
> ---
>
> Key: SPARK-16718
> URL: https://issues.apache.org/jira/browse/SPARK-16718
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Vladimir Feinberg
>Assignee: Vladimir Feinberg
>
> As an initial minimal change, we should provide TreeBoost as implemented in 
> GBM for L1, L2, and logistic losses: by introducing a new "loss-based" 
> impurity, tree leaves in GBTs can have loss-optimal predictions for their 
> partition of the data.
> Commit should have evidence of accuracy improvement



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16718) gbm-style treeboost

2016-07-25 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392914#comment-15392914
 ] 

Vladimir Feinberg commented on SPARK-16718:
---

L1 support for loss-based impurity will be delayed until there's a new internal 
API for GBTs in spark.ml

> gbm-style treeboost
> ---
>
> Key: SPARK-16718
> URL: https://issues.apache.org/jira/browse/SPARK-16718
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Vladimir Feinberg
>Assignee: Vladimir Feinberg
>
> As an initial minimal change, we should provide TreeBoost as implemented in 
> GBM for L1, L2, and logistic losses: by introducing a new "loss-based" 
> impurity, tree leaves in GBTs can have loss-optimal predictions for their 
> partition of the data.
> Commit should have evidence of accuracy improvement



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16728) migrate internal API for MLlib trees from spark.mllib to spark.ml

2016-07-25 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16728:
-

 Summary: migrate internal API for MLlib trees from spark.mllib to 
spark.ml
 Key: SPARK-16728
 URL: https://issues.apache.org/jira/browse/SPARK-16728
 Project: Spark
  Issue Type: Sub-task
Reporter: Vladimir Feinberg


Currently, spark.ml trees rely on spark.mllib implementations. There are two 
issues with this:

1. spark.ml's GBT TreeBoost algorithm requires storing additional information 
(the previous ensemble's prediction, for instance) inside the TreePoints (this 
is necessary to have loss-based splits for complex loss functions).
2. The old impurity API only lets you use summary statistics up to the 2nd 
order. These are useless for several impurity measures and inadequate for 
others (e.g., absolute loss or Huber loss). It needs some renovation; see the 
sketch below.
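
To illustrate both points with hypothetical types (not the existing internal 
classes):

{code}
// (2) Moment-only statistics are fine for variance/squared loss, but no median
// can be recovered from (count, sum, sumSq) alone, so absolute/Huber-style
// leaf values need access to the samples themselves.
final case class MomentStats(count: Long, sum: Double, sumSq: Double) {
  def mean: Double = sum / count  // leaf value for squared loss
}

// (1) A TreeBoost-aware training point would carry the previous ensemble's
// prediction alongside the binned features (illustrative names only).
final case class BoostedTreePoint(
    binnedFeatures: Array[Int],
    label: Double,
    prevPrediction: Double)
{code}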



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16739) GBTClassifier should be a Classifier, not Predictor

2016-07-26 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16739:
-

 Summary: GBTClassifier should be a Classifier, not Predictor
 Key: SPARK-16739
 URL: https://issues.apache.org/jira/browse/SPARK-16739
 Project: Spark
  Issue Type: Improvement
Reporter: Vladimir Feinberg
Priority: Minor


Should probably wait for SPARK-4240 to be resolved first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16718) gbm-style treeboost

2016-07-26 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-16718:
--
Description: 
.As an initial minimal change, we should provide TreeBoost as implemented in 
GBM for L1, L2, and logistic losses: by introducing a new "loss-based" 
impurity, tree leaves in GBTs can have loss-optimal predictions for their 
partition of the data.

Commit should have evidence of accuracy improvement


  was:
As an initial minimal change, we should provide TreeBoost as implemented in GBM 
for L1, L2, and logistic losses: by introducing a new "loss-based" impurity, 
tree leaves in GBTs can have loss-optimal predictions for their partition of the 
data.

Commit should have evidence of accuracy improvement



> gbm-style treeboost
> ---
>
> Key: SPARK-16718
> URL: https://issues.apache.org/jira/browse/SPARK-16718
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Vladimir Feinberg
>Assignee: Vladimir Feinberg
>
> .As an initial minimal change, we should provide TreeBoost as implemented in 
> GBM for L1, L2, and logistic losses: by introducing a new "loss-based" 
> impurity, tree leaves in GBTs can have loss-optimal predictions for their 
> partition of the data.
> Commit should have evidence of accuracy improvement



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16860) UDT Stringification Incorrect in PySpark

2016-08-02 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16860:
-

 Summary: UDT Stringification Incorrect in PySpark
 Key: SPARK-16860
 URL: https://issues.apache.org/jira/browse/SPARK-16860
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Vladimir Feinberg
Priority: Minor


When using `show()` on a `DataFrame` containing a UDT, Spark doesn't call the 
appropriate `__str__` method for display.

Example: https://gist.github.com/vlad17/baa8e18ed724c4d88436a92ca159dd5b



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16899) Structured Streaming Checkpointing Example invalid

2016-08-04 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16899:
-

 Summary: Structured Streaming Checkpointing Example invalid
 Key: SPARK-16899
 URL: https://issues.apache.org/jira/browse/SPARK-16899
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Vladimir Feinberg
Priority: Critical


The structured streaming checkpointing example at the bottom of the page 
(https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing)
 has the following excerpt:
```
aggDF
   .writeStream
   .outputMode("complete")
   .option(“checkpointLocation”, “path/to/HDFS/dir”)
   .format("memory")
   .start()
```

But memory sinks are not fault-tolerant. Indeed, trying this out, I get the 
following error: 
```
This query does not support recovering from checkpoint location. Delete 
/tmp/streaming.metadata-625631e5-baee-41da-acd1-f16c82f68a40/offsets to start 
over.;
```

The documentation should be changed to demonstrate checkpointing for a 
non-aggregation streaming task, and explicitly mention there is no way to 
checkpoint aggregates.
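
For instance, a checkpoint-friendly variant might look like the sketch below 
(no aggregation, fault-tolerant file sink; {{spark}} is an existing 
{{SparkSession}} and the paths are placeholders):

{code}
val lines = spark.readStream
  .format("text")
  .load("/path/to/input/dir")

// Append-mode output to a file sink can be recovered from the checkpoint.
lines.writeStream
  .outputMode("append")
  .option("checkpointLocation", "/path/to/HDFS/checkpoint")
  .format("parquet")
  .option("path", "/path/to/output/dir")
  .start()
{code}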



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16900) Complete-mode output for file sinks

2016-08-04 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16900:
-

 Summary: Complete-mode output for file sinks
 Key: SPARK-16900
 URL: https://issues.apache.org/jira/browse/SPARK-16900
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Vladimir Feinberg


Currently there is no way to checkpoint aggregations (see SPARK-16899), except 
by using a custom foreach-based sink, which is pretty difficult and requires 
that the user deal with ensuring idempotency, versioning, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16899) Structured Streaming Checkpointing Example invalid

2016-08-04 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-16899:
--
Description: 
The structured streaming checkpointing example at the bottom of the page 
(https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing)
 has the following excerpt:

{code}
aggDF
   .writeStream
   .outputMode("complete")
   .option(“checkpointLocation”, “path/to/HDFS/dir”)
   .format("memory")
   .start()
{code}

But memory sinks are not fault-tolerant. Indeed, trying this out, I get the 
following error: 

{{This query does not support recovering from checkpoint location. Delete 
/tmp/streaming.metadata-625631e5-baee-41da-acd1-f16c82f68a40/offsets to start 
over.;}}

The documentation should be changed to demonstrate checkpointing for a 
non-aggregation streaming task, and explicitly mention there is no way to 
checkpoint aggregates.

  was:
The structured streaming checkpointing example at the bottom of the page 
(https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing)
 has the following excerpt:
```
aggDF
   .writeStream
   .outputMode("complete")
   .option(“checkpointLocation”, “path/to/HDFS/dir”)
   .format("memory")
   .start()
```

But memory sinks are not fault-tolerant. Indeed, trying this out, I get the 
following error: 
```
This query does not support recovering from checkpoint location. Delete 
/tmp/streaming.metadata-625631e5-baee-41da-acd1-f16c82f68a40/offsets to start 
over.;
```

The documentation should be changed to demonstrate checkpointing for a 
non-aggregation streaming task, and explicitly mention there is no way to 
checkpoint aggregates.


> Structured Streaming Checkpointing Example invalid
> --
>
> Key: SPARK-16899
> URL: https://issues.apache.org/jira/browse/SPARK-16899
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Vladimir Feinberg
>Priority: Critical
>
> The structured streaming checkpointing example at the bottom of the page 
> (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing)
>  has the following excerpt:
> {code}
> aggDF
>.writeStream
>.outputMode("complete")
>.option(“checkpointLocation”, “path/to/HDFS/dir”)
>.format("memory")
>.start()
> {code}
> But memory sinks are not fault-tolerant. Indeed, trying this out, I get the 
> following error: 
> {{This query does not support recovering from checkpoint location. Delete 
> /tmp/streaming.metadata-625631e5-baee-41da-acd1-f16c82f68a40/offsets to start 
> over.;}}
> The documentation should be changed to demonstrate checkpointing for a 
> non-aggregation streaming task, and explicitly mention there is no way to 
> checkpoint aggregates.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16920) Investigate and fix issues introduced in SPARK-15858

2016-08-05 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16920:
-

 Summary: Investigate and fix issues introduced in SPARK-15858
 Key: SPARK-16920
 URL: https://issues.apache.org/jira/browse/SPARK-16920
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Vladimir Feinberg


There were several issues with the PR resolving SPARK-15858; my comments are 
available here:

https://github.com/apache/spark/commit/393db655c3c43155305fbba1b2f8c48a95f18d93

The two most important issues are:

1. The PR did not add a stress test proving it resolved the issue it was 
supposed to (though I have no doubt the optimization made is indeed correct).
2. The PR made prediction time quadratic in the number of trees, where it was 
previously linear. This needs to be investigated to see whether it causes 
problems for large numbers of trees (say, 1000); an appropriate test should be 
added, and then the regression fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12381) Copy public decision tree helper classes from spark.mllib to spark.ml and make private

2016-08-08 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412438#comment-15412438
 ] 

Vladimir Feinberg commented on SPARK-12381:
---

[~sethah] Just so we don't clash, I think these two JIRAs are overlapping: 
SPARK-16728

> Copy public decision tree helper classes from spark.mllib to spark.ml and 
> make private
> --
>
> Key: SPARK-12381
> URL: https://issues.apache.org/jira/browse/SPARK-12381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> The helper classes for decision trees and decision tree ensembles (e.g. 
> Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...) 
> currently reside in spark.mllib, but as the algorithm implementations are 
> moved to spark.ml, so should these helper classes.
> We should take this opportunity to make some of those helper classes private 
> when possible (especially if they are only needed during training) and maybe 
> change the APIs (especially if we can eliminate duplicate data stored in the 
> final model).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16957) Use weighted midpoints for split values.

2016-08-08 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16957:
-

 Summary: Use weighted midpoints for split values.
 Key: SPARK-16957
 URL: https://issues.apache.org/jira/browse/SPARK-16957
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Vladimir Feinberg


Just like R's gbm, we should be using weighted split points rather than the 
actual continuous binned feature values. For instance, in a dataset containing 
binary features (that are fed in as continuous ones), our splits are selected 
as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some smoothness 
qualities, this is asymptotically bad compared to GBM's approach. The split 
point should be a weighted split point of the two values of the "innermost" 
feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, the above split 
should be at {{0.75}}.

Example:
{code}
+--------+--------+-----+-----+
|feature0|feature1|label|count|
+--------+--------+-----+-----+
|     0.0|     0.0|  0.0|   23|
|     1.0|     0.0|  0.0|    2|
|     0.0|     0.0|  1.0|    2|
|     0.0|     1.0|  0.0|    7|
|     1.0|     0.0|  1.0|   23|
|     0.0|     1.0|  1.0|   18|
|     1.0|     1.0|  1.0|    7|
|     1.0|     1.0|  0.0|   18|
+--------+--------+-----+-----+

DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
  If (feature 0 <= 0.0)
   If (feature 1 <= 0.0)
    Predict: -0.56
   Else (feature 1 > 0.0)
    Predict: 0.29333
  Else (feature 0 > 0.0)
   If (feature 1 <= 0.0)
    Predict: 0.56
   Else (feature 1 > 0.0)
    Predict: -0.29333
{code}
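
A minimal sketch of one weighting rule that reproduces the {{0.75}} in the 
example above (the rule and names are my assumption of the intent, not gbm's 
actual code): the threshold is pushed away from the more heavily populated bin 
in proportion to its count.

{code}
// Weighted midpoint between two adjacent bin values, assuming the threshold
// should sit farther from the bin holding more samples.
def weightedSplit(vLeft: Double, nLeft: Long, vRight: Double, nRight: Long): Double =
  vLeft + (vRight - vLeft) * nLeft.toDouble / (nLeft + nRight)

weightedSplit(0.0, 30, 1.0, 10)  // = 0.75, versus the current split at x <= 0.0
{code}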



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16957) Use weighted midpoints for split values.

2016-08-08 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-16957:
--
Issue Type: Improvement  (was: Sub-task)
Parent: (was: SPARK-14045)

> Use weighted midpoints for split values.
> 
>
> Key: SPARK-16957
> URL: https://issues.apache.org/jira/browse/SPARK-16957
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Vladimir Feinberg
>
> Just like R's gbm, we should be using weighted split points rather than the 
> actual continuous binned feature values. For instance, in a dataset 
> containing binary features (that are fed in as continuous ones), our splits 
> are selected as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some 
> smoothness qualities, this is asymptotically bad compared to GBM's approach. 
> The split point should be a weighted split point of the two values of the 
> "innermost" feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, 
> the above split should be at {{0.75}}.
> Example:
> {code}
> +--------+--------+-----+-----+
> |feature0|feature1|label|count|
> +--------+--------+-----+-----+
> |     0.0|     0.0|  0.0|   23|
> |     1.0|     0.0|  0.0|    2|
> |     0.0|     0.0|  1.0|    2|
> |     0.0|     1.0|  0.0|    7|
> |     1.0|     0.0|  1.0|   23|
> |     0.0|     1.0|  1.0|   18|
> |     1.0|     1.0|  1.0|    7|
> |     1.0|     1.0|  0.0|   18|
> +--------+--------+-----+-----+
> DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
>   If (feature 0 <= 0.0)
>    If (feature 1 <= 0.0)
>     Predict: -0.56
>    Else (feature 1 > 0.0)
>     Predict: 0.29333
>   Else (feature 0 > 0.0)
>    If (feature 1 <= 0.0)
>     Predict: 0.56
>    Else (feature 1 > 0.0)
>     Predict: -0.29333
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12381) Copy public decision tree helper classes from spark.mllib to spark.ml and make private

2016-08-08 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412459#comment-15412459
 ] 

Vladimir Feinberg commented on SPARK-12381:
---

Yeah, that'd be a good idea.

> Copy public decision tree helper classes from spark.mllib to spark.ml and 
> make private
> --
>
> Key: SPARK-12381
> URL: https://issues.apache.org/jira/browse/SPARK-12381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> The helper classes for decision trees and decision tree ensembles (e.g. 
> Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...) 
> currently reside in spark.mllib, but as the algorithm implementations are 
> moved to spark.ml, so should these helper classes.
> We should take this opportunity to make some of those helper classes private 
> when possible (especially if they are only needed during training) and maybe 
> change the APIs (especially if we can eliminate duplicate data stored in the 
> final model).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16969) GBTClassifier needs a raw prediction column

2016-08-09 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16969:
-

 Summary: GBTClassifier needs a raw prediction column
 Key: SPARK-16969
 URL: https://issues.apache.org/jira/browse/SPARK-16969
 Project: Spark
  Issue Type: Bug
Reporter: Vladimir Feinberg


When working with a skewed-label dataset I found the GBTClassifier pretty 
unusable, because it performs automatic thresholding at the halfway point 
without exposing the raw prediction.

This prevents use of different thresholds and any of the 
BinaryClassificationEvaluator tools.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16900) Complete-mode output for file sinks

2016-08-09 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414455#comment-15414455
 ] 

Vladimir Feinberg commented on SPARK-16900:
---

Alternatively, having some way of altering the aggregation sinks to produce 
versioned append-mode output would be just as good.


> Complete-mode output for file sinks
> ---
>
> Key: SPARK-16900
> URL: https://issues.apache.org/jira/browse/SPARK-16900
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Vladimir Feinberg
>
> Currently there is no way to checkpoint aggregations (see SPARK-16899), 
> except by using a custom foreach-based sink, which is pretty difficult and 
> requires that the user deal with ensuring idempotency, versioning, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15809) PySpark SQL UDF default returnType

2016-06-07 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-15809:
-

 Summary: PySpark SQL UDF default returnType
 Key: SPARK-15809
 URL: https://issues.apache.org/jira/browse/SPARK-15809
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Vladimir Feinberg
Priority: Minor


The current signature for the pyspark UDF creation function is:

{code:python}
pyspark.sql.functions.udf(f, returnType=StringType)
{code}

Is there a reason that there's a default value for {{returnType}}? Returning a 
string by default doesn't strike me as so much more frequent a use case than, 
say, returning an integer that it merits a default.

In fact, it seems the only reason that the default was chosen is that if we 
*had to choose* a default type, it would be a {{StringType}} because that's 
what we can implicitly convert everything to.

But this only seems to do two things to me: (1) cause unintentional, annoying 
conversions to strings for new users and (2) make call sites less consistent 
(if people drop the type specification to actually use the default).
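
For contrast, a minimal sketch of the Scala API, which has no default at all 
because the return type is inferred from the function's type (this assumes an 
existing {{DataFrame}} {{df}} with an integer column {{value}}):

{code}
import org.apache.spark.sql.functions.udf

// The return type (IntegerType) is inferred from the closure's signature,
// so there is no default to fall back on at the call site.
val plusOne = udf((x: Int) => x + 1)
df.select(plusOne(df("value")))
{code}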




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15888) UDF fails in Python

2016-06-10 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-15888:
-

 Summary: UDF fails in Python
 Key: SPARK-15888
 URL: https://issues.apache.org/jira/browse/SPARK-15888
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0
Reporter: Vladimir Feinberg


This looks like a regression from 1.6.1.

The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
in 2.0.0:

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15971) GroupedData's member incorrectly named

2016-06-15 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-15971:
-

 Summary: GroupedData's member incorrectly named
 Key: SPARK-15971
 URL: https://issues.apache.org/jira/browse/SPARK-15971
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.0.0, 2.1.0
Reporter: Vladimir Feinberg
Priority: Trivial


The [[pyspark.sql.GroupedData]] object refers to the Java object it wraps as 
the member variable [[self._jdf]], which is exactly the same name that 
[[pyspark.sql.DataFrame]] uses when referring to its own Java object.

The acronym is incorrect, standing for "Java DataFrame" instead of what should 
be "Java GroupedData". As such, the name should be changed to [[self._jgd]] - 
in fact, in the [[DataFrame.groupBy]] implementation, the Java object is 
referred to as exactly [[jgd]].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15972) GroupedData varargs arguments misnamed

2016-06-15 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-15972:
-

 Summary: GroupedData varargs arguments misnamed
 Key: SPARK-15972
 URL: https://issues.apache.org/jira/browse/SPARK-15972
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.0.0, 2.1.0
Reporter: Vladimir Feinberg
Priority: Trivial


Simple aggregation functions which take column names [[cols]] as varargs 
arguments show up in documentation with the argument [[args]], but their 
documentation refers to [[cols]].

The discrepancy is caused by an annotation, [[df_varargs_api]], which produces 
a temporary function with arguments [[args]] instead of [[cols]], creating the 
confusing documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15973) GroupedData.pivot documentation off

2016-06-15 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-15973:
-

 Summary: GroupedData.pivot documentation off
 Key: SPARK-15973
 URL: https://issues.apache.org/jira/browse/SPARK-15973
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.0.0, 2.1.0
Reporter: Vladimir Feinberg
Priority: Trivial


{{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for doctest 
Python comments, which messes up formatting in the documentation as well as the 
doctests themselves.

A PR resolving this should probably resolve the other places this happens in 
pyspark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15971) GroupedData's member incorrectly named

2016-06-15 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-15971:
--
Description: 
The {{pyspark.sql.GroupedData}} object refers to the Java object it wraps as 
the member variable {{self._jdf}}, which is exactly the same name that 
{{pyspark.sql.DataFrame}} uses when referring to its own Java object.

The acronym is incorrect, standing for "Java DataFrame" instead of what should 
be "Java GroupedData". As such, the name should be changed to {{self._jgd}} - 
in fact, in the {{DataFrame.groupBy}} implementation, the Java object is 
referred to as exactly {{jgd}}.

  was:
The [[pyspark.sql.GroupedData]] object refers to the Java object it wraps as 
the member variable [[self._jdf]], which is exactly the same name that 
[[pyspark.sql.DataFrame]] uses when referring to its own Java object.

The acronym is incorrect, standing for "Java DataFrame" instead of what should 
be "Java GroupedData". As such, the name should be changed to [[self._jgd]] - 
in fact, in the [[DataFrame.groupBy]] implementation, the Java object is 
referred to as exactly [[jgd]].


> GroupedData's member incorrectly named
> --
>
> Key: SPARK-15971
> URL: https://issues.apache.org/jira/browse/SPARK-15971
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Vladimir Feinberg
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The {{pyspark.sql.GroupedData}} object refers to the Java object it wraps as 
> the member variable {{self._jdf}}, which is exactly the same name that 
> {{pyspark.sql.DataFrame}} uses when referring to its own Java object.
> The acronym is incorrect, standing for "Java DataFrame" instead of what 
> should be "Java GroupedData". As such, the name should be changed to 
> {{self._jgd}} - in fact, in the {{DataFrame.groupBy}} implementation, the 
> Java object is referred to as exactly {{jgd}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15972) GroupedData varargs arguments misnamed

2016-06-15 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-15972:
--
Description: 
Simple aggregation functions which take column names {{cols}} as varargs 
arguments show up in documentation with the argument {{args}}, but their 
documentation refers to {{cols}}.

The discrepancy is caused by an annotation, {{df_varargs_api}}, which produces 
a temporary function with arguments {{args}} instead of {{cols}}, creating the 
confusing documentation.


  was:
Simple aggregation functions which take column names [[cols]] as varargs 
arguments show up in documentation with the argument [[args]], but their 
documentation refers to [[cols]].

The discrepancy is caused by an annotation, [[df_varargs_api]], which produces 
a temporary function with arguments [[args]] instead of [[cols]], creating the 
confusing documentation.


> GroupedData varargs arguments misnamed
> --
>
> Key: SPARK-15972
> URL: https://issues.apache.org/jira/browse/SPARK-15972
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Vladimir Feinberg
>Priority: Trivial
>
> Simple aggregation functions which take column names {{cols}} as varargs 
> arguments show up in documentation with the argument {{args}}, but their 
> documentation refers to {{cols}}.
> The discrepancy is caused by an annotation, {{df_varargs_api}}, which 
> produces a temporary function with arguments {{args}} instead of {{cols}}, 
> creating the confusing documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15973) Fix GroupedData Documentation

2016-06-15 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-15973:
--
Summary: Fix GroupedData Documentation  (was: GroupedData.pivot 
documentation off)

> Fix GroupedData Documentation
> -
>
> Key: SPARK-15973
> URL: https://issues.apache.org/jira/browse/SPARK-15973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Vladimir Feinberg
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> {{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for doctest 
> Python comments, which messes up formatting in the documentation as well as 
> the doctests themselves.
> A PR resolving this should probably resolve the other places this happens in 
> pyspark.
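
A small illustration of the fixed comment style (the docstring content below is 
illustrative, not the exact text in pyspark/sql/group.py):

{code}
def pivot(self, pivot_col, values=None):
    """Pivots a column of the current DataFrame and performs the specified aggregation.

    >>> # Python doctest comments must use '#', not '//'; a '//' line is parsed
    >>> # as an expression and breaks both the rendered docs and the doctest run.
    >>> df.groupBy("year").pivot("course", ["dotNET", "Java"]).sum("earnings").collect()  # doctest: +SKIP
    """
{code}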



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15973) Fix GroupedData Documentation

2016-06-15 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-15973:
--
Description: 
(1)

{{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for doctest 
Python comments, which messes up formatting in the documentation as well as the 
doctests themselves.

A PR resolving this should probably resolve the other places this happens in 
pyspark.

(2)

Simple aggregation functions which take column names {{cols}} as varargs 
arguments show up in documentation with the argument {{args}}, but their 
documentation refers to {{cols}}.

The discrepancy is caused by an annotation, {{df_varargs_api}}, which produces 
a temporary function with arguments {{args}} instead of {{cols}}, creating the 
confusing documentation.

(3)

The {{pyspark.sql.GroupedData}} object stores the Java object it wraps in the 
member variable {{self._jdf}}, exactly the name that {{pyspark.sql.DataFrame}} 
uses when referring to its own wrapped object.

The acronym is incorrect, standing for "Java DataFrame" rather than "Java 
GroupedData". As such, the name should be changed to {{self._jgd}}; in fact, in 
the {{DataFrame.groupBy}} implementation, the Java object is already referred 
to as {{jgd}}.

  was:
{{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for doctest 
Python comments, which messes up formatting in the documentation as well as the 
doctests themselves.

A PR resolving this should probably resolve the other places this happens in 
pyspark.


> Fix GroupedData Documentation
> -
>
> Key: SPARK-15973
> URL: https://issues.apache.org/jira/browse/SPARK-15973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Vladimir Feinberg
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> (1)
> {{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for doctest 
> Python comments, which messes up formatting in the documentation as well as 
> the doctests themselves.
> A PR resolving this should probably resolve the other places this happens in 
> pyspark.
> (2)
> Simple aggregation functions which take column names {{cols}} as varargs 
> arguments show up in documentation with the argument {{args}}, but their 
> documentation refers to {{cols}}.
> The discrepancy is caused by an annotation, {{df_varargs_api}}, which 
> produces a temporary function with arguments {{args}} instead of {{cols}}, 
> creating the confusing documentation.
> (3)
> The {{pyspark.sql.GroupedData}} object stores the Java object it wraps in the 
> member variable {{self._jdf}}, exactly the name that {{pyspark.sql.DataFrame}} 
> uses when referring to its own wrapped object.
> The acronym is incorrect, standing for "Java DataFrame" rather than "Java 
> GroupedData". As such, the name should be changed to {{self._jgd}}; in fact, in 
> the {{DataFrame.groupBy}} implementation, the Java object is already referred 
> to as {{jgd}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15972) GroupedData varargs arguments misnamed

2016-06-15 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg resolved SPARK-15972.
---
Resolution: Duplicate

> GroupedData varargs arguments misnamed
> --
>
> Key: SPARK-15972
> URL: https://issues.apache.org/jira/browse/SPARK-15972
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Vladimir Feinberg
>Priority: Trivial
>
> Simple aggregation functions which take column names {{cols}} as varargs 
> arguments show up in documentation with the argument {{args}}, but their 
> documentation refers to {{cols}}.
> The discrepancy is caused by an annotation, {{df_varargs_api}}, which 
> produces a temporary function with arguments {{args}} instead of {{cols}}, 
> creating the confusing documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15972) GroupedData varargs arguments misnamed

2016-06-15 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg closed SPARK-15972.
-

> GroupedData varargs arguments misnamed
> --
>
> Key: SPARK-15972
> URL: https://issues.apache.org/jira/browse/SPARK-15972
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Vladimir Feinberg
>Priority: Trivial
>
> Simple aggregation functions which take column names {{cols}} as varargs 
> arguments show up in documentation with the argument {{args}}, but their 
> documentation refers to {{cols}}.
> The discrepancy is caused by an annotation, {{df_varargs_api}}, which 
> produces a temporary function with arguments {{args}} instead of {{cols}}, 
> creating the confusing documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15971) GroupedData's member incorrectly named

2016-06-15 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg closed SPARK-15971.
-

> GroupedData's member incorrectly named
> --
>
> Key: SPARK-15971
> URL: https://issues.apache.org/jira/browse/SPARK-15971
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Vladimir Feinberg
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The {{pyspark.sql.GroupedData}} object stores the Java object it wraps in the 
> member variable {{self._jdf}}, exactly the name that {{pyspark.sql.DataFrame}} 
> uses when referring to its own wrapped object.
> The acronym is incorrect, standing for "Java DataFrame" rather than "Java 
> GroupedData". As such, the name should be changed to {{self._jgd}}; in fact, in 
> the {{DataFrame.groupBy}} implementation, the Java object is already referred 
> to as {{jgd}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15971) GroupedData's member incorrectly named

2016-06-15 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg resolved SPARK-15971.
---
Resolution: Duplicate

> GroupedData's member incorrectly named
> --
>
> Key: SPARK-15971
> URL: https://issues.apache.org/jira/browse/SPARK-15971
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Vladimir Feinberg
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The {{pyspark.sql.GroupedData}} object stores the Java object it wraps in the 
> member variable {{self._jdf}}, exactly the name that {{pyspark.sql.DataFrame}} 
> uses when referring to its own wrapped object.
> The acronym is incorrect, standing for "Java DataFrame" rather than "Java 
> GroupedData". As such, the name should be changed to {{self._jgd}}; in fact, in 
> the {{DataFrame.groupBy}} implementation, the Java object is already referred 
> to as {{jgd}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15973) Fix GroupedData Documentation

2016-06-15 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332374#comment-15332374
 ] 

Vladimir Feinberg commented on SPARK-15973:
---

Done

> Fix GroupedData Documentation
> -
>
> Key: SPARK-15973
> URL: https://issues.apache.org/jira/browse/SPARK-15973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Vladimir Feinberg
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> (1)
> {{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for doctest 
> Python comments, which messes up formatting in the documentation as well as 
> the doctests themselves.
> A PR resolving this should probably resolve the other places this happens in 
> pyspark.
> (2)
> Simple aggregation functions which take column names {{cols}} as varargs 
> arguments show up in documentation with the argument {{args}}, but their 
> documentation refers to {{cols}}.
> The discrepancy is caused by an annotation, {{df_varargs_api}}, which 
> produces a temporary function with arguments {{args}} instead of {{cols}}, 
> creating the confusing documentation.
> (3)
> The {{pyspark.sql.GroupedData}} object stores the Java object it wraps in the 
> member variable {{self._jdf}}, exactly the name that {{pyspark.sql.DataFrame}} 
> uses when referring to its own wrapped object.
> The acronym is incorrect, standing for "Java DataFrame" rather than "Java 
> GroupedData". As such, the name should be changed to {{self._jgd}}; in fact, in 
> the {{DataFrame.groupBy}} implementation, the Java object is already referred 
> to as {{jgd}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15989) PySpark SQL python-only UDTs don't support nested types

2016-06-16 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-15989:
-

 Summary: PySpark SQL python-only UDTs don't support nested types
 Key: SPARK-15989
 URL: https://issues.apache.org/jira/browse/SPARK-15989
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0
Reporter: Vladimir Feinberg
Priority: Blocker


[This 
notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/611202526513296/1653464426712019/latest.html]
 demonstrates the bug.

The obvious issue is that nested UDTs are not supported if the UDT is 
Python-only. Looking at the exception thrown, this seems to be because the 
encoder on the Java end tries to encode the UDT as a Java class, which doesn't 
exist for the {{PythonOnlyUDT}}.
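
A hedged repro sketch of the same shape as the notebook's example: a Python-only 
UDT (no {{scalaUDT}}) nested inside an {{ArrayType}}. The {{Point}}/{{PointUDT}} 
classes are hypothetical stand-ins, not the notebook's exact code.

{code}
from pyspark.sql.types import (ArrayType, DoubleType, StructField, StructType,
                               UserDefinedType)


class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y


class PointUDT(UserDefinedType):
    """Python-only UDT: note the absence of a scalaUDT() classmethod."""

    @classmethod
    def sqlType(cls):
        return StructType([StructField("x", DoubleType()), StructField("y", DoubleType())])

    @classmethod
    def module(cls):
        return "__main__"

    def serialize(self, obj):
        return (obj.x, obj.y)

    def deserialize(self, datum):
        return Point(datum[0], datum[1])


# A flat PointUDT column round-trips fine; nesting it is what blows up on the
# Java side, e.g. (assuming a running SparkSession named `spark`):
#   schema = StructType([StructField("points", ArrayType(PointUDT()))])
#   spark.createDataFrame([([Point(1.0, 2.0)],)], schema)   # raises
{code}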



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15993) PySpark RuntimeConfig should be immutable

2016-06-16 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-15993:
-

 Summary: PySpark RuntimeConfig should be immutable
 Key: SPARK-15993
 URL: https://issues.apache.org/jira/browse/SPARK-15993
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.0.0
Reporter: Vladimir Feinberg
Priority: Trivial


{{pyspark.sql.RuntimeConfig}} should be immutable because setting values on it 
has no effect, which only leads to a confusing API (I tried to change a config 
param by pulling the config off a running {{SparkSession}}, not realizing I 
needed to relaunch the session).

Furthermore, {{RuntimeConfig}} is unlike {{SparkConf}} in that it can't ever be 
used to specify a configuration when building a {{SparkSession}} anyway; its 
only purpose is to inspect an existing session's params.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15989) PySpark SQL python-only UDTs don't support nested types

2016-06-17 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-15989:
--
Component/s: SQL

> PySpark SQL python-only UDTs don't support nested types
> ---
>
> Key: SPARK-15989
> URL: https://issues.apache.org/jira/browse/SPARK-15989
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>
> [This 
> notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/611202526513296/1653464426712019/latest.html]
>  demonstrates the bug.
> The obvious issue is that nested UDTs are not supported if the UDT is 
> Python-only. Looking at the exception thrown, this seems to be because the 
> encoder on the Java end tries to encode the UDT as a Java class, which 
> doesn't exist for the {{PythonOnlyUDT}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15993) PySpark RuntimeConfig should be immutable

2016-06-17 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15336381#comment-15336381
 ] 

Vladimir Feinberg commented on SPARK-15993:
---

So the intent is that changing {{RuntimeConfig}} via its {{set}} should change 
the current {{SparkSession}}'s actual settings? Right now that class is just 
some dictionary, with no connection to the spark context at all - it wouldn't 
be a bug, but rather just something completely unimplemented.
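
A minimal sketch of the behavior being argued for, assuming the plain-dictionary 
implementation described above (the class and method bodies are illustrative, 
not the actual pyspark internals):

{code}
class RuntimeConfig(object):
    """Read-only view of an existing SparkSession's configuration (sketch)."""

    def __init__(self, conf_dict):
        self._conf = dict(conf_dict)   # snapshot of the session's params

    def get(self, key, default=None):
        return self._conf.get(key, default)

    def set(self, key, value):
        raise NotImplementedError(
            "RuntimeConfig is immutable; rebuild the SparkSession with "
            "SparkSession.builder.config(...) to change settings")
{code}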

> PySpark RuntimeConfig should be immutable
> -
>
> Key: SPARK-15993
> URL: https://issues.apache.org/jira/browse/SPARK-15993
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>Priority: Trivial
>
> {{pyspark.sql.RuntimeConfig}} should be immutable because setting values on it 
> has no effect, which only leads to a confusing API (I tried to change a config 
> param by pulling the config off a running {{SparkSession}}, not realizing I 
> needed to relaunch the session).
> Furthermore, {{RuntimeConfig}} is unlike {{SparkConf}} in that it can't ever 
> be used to specify a configuration when building a {{SparkSession}} anyway; 
> its only purpose is to inspect an existing session's params.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16175) Handle None for all Python UDT

2016-06-23 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-16175:
--
Attachment: nullvector.dbc

Databricks notebook demonstrating the issue

> Handle None for all Python UDT
> --
>
> Key: SPARK-16175
> URL: https://issues.apache.org/jira/browse/SPARK-16175
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Davies Liu
> Attachments: nullvector.dbc
>
>
> For Scala UDT, we will not call serialize()/deserialize() for all null, we 
> should also do that in Python. 
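
A hedged sketch of the guard being asked for, shaped after the 
{{toInternal}}/{{fromInternal}} hooks on PySpark's {{UserDefinedType}}; treat 
the exact hook points and the {{_cachedSqlType}} helper as assumptions:

{code}
class NullSafeUDTMixin(object):
    """Hypothetical mixin showing the null guard; the real fix would live in
    pyspark.sql.types.UserDefinedType itself."""

    def toInternal(self, obj):
        if obj is None:
            return None        # mirror the Scala UDTs: never serialize null
        return self._cachedSqlType().toInternal(self.serialize(obj))

    def fromInternal(self, obj):
        if obj is None:
            return None        # and never deserialize it either
        return self.deserialize(self._cachedSqlType().fromInternal(obj))
{code}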



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16179) UDF explosion yielding empty dataframe fails

2016-06-23 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16179:
-

 Summary: UDF explosion yielding empty dataframe fails
 Key: SPARK-16179
 URL: https://issues.apache.org/jira/browse/SPARK-16179
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Reporter: Vladimir Feinberg


Command to replicate 
https://gist.github.com/vlad17/cff2bab81929f44556a364ee90981ac0

Resulting failure
https://gist.github.com/vlad17/964c0a93510d79cb130c33700f6139b7



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16237) PySpark gapply

2016-06-27 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16237:
-

 Summary: PySpark gapply
 Key: SPARK-16237
 URL: https://issues.apache.org/jira/browse/SPARK-16237
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, SQL
Reporter: Vladimir Feinberg


To maintain feature parity, `gapply` functionality should be added to 
`pyspark`'s  `GroupedData` with an interface.

The implementation already exists because it fulfilled a need in another 
package: 
https://github.com/vlad17/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py

It needs to be migrated (to become a GroupedData method, the first argument now 
to be called self).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16237) PySpark gapply

2016-06-28 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-16237:
--
Description: 
To maintain feature parity, {{gapply}} functionality should be added to 
PySpark's  {{GroupedData}} with an interface.

The implementation already exists because it fulfilled a need in another 
package: 
https://github.com/vlad17/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py

It needs to be migrated (to become a {{GroupedData}} method, the first argument 
now to be called self).

  was:
To maintain feature parity, `gapply` functionality should be added to PySpark's 
 {{GroupedData}} with an interface.

The implementation already exists because it fulfilled a need in another 
package: 
https://github.com/vlad17/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py

It needs to be migrated (to become a {{GroupedData}} method, the first argument 
now to be called self).


> PySpark gapply
> --
>
> Key: SPARK-16237
> URL: https://issues.apache.org/jira/browse/SPARK-16237
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Reporter: Vladimir Feinberg
>
> To maintain feature parity, {{gapply}} functionality should be added to 
> PySpark's  {{GroupedData}} with an interface.
> The implementation already exists because it fulfilled a need in another 
> package: 
> https://github.com/vlad17/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py
> It needs to be migrated (to become a {{GroupedData}} method, the first 
> argument now to be called self).
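
A hedged sketch of what the migrated method might look like. The argument order 
follows spark-sklearn's {{gapply(grouped_data, func, schema, *cols)}}; everything 
here is an assumption about the eventual API rather than its final form.

{code}
class GroupedData(object):  # illustrative excerpt only
    def gapply(self, func, schema, *cols):
        """Apply ``func`` to each group and return a new DataFrame.

        func   -- callable taking (grouping key, pandas.DataFrame) and
                  returning a pandas.DataFrame
        schema -- StructType describing the rows func emits
        cols   -- the columns handed to func for each group
        """
        # Delegate to the existing implementation until it is folded in here.
        from spark_sklearn.group_apply import gapply as _gapply
        return _gapply(self, func, schema, *cols)
{code}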



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16237) PySpark gapply

2016-06-28 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-16237:
--
Description: 
To maintain feature parity, `gapply` functionality should be added to PySpark's 
 {{GroupedData}} with an interface.

The implementation already exists because it fulfilled a need in another 
package: 
https://github.com/vlad17/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py

It needs to be migrated (to become a {{GroupedData}} method, the first argument 
now to be called self).

  was:
To maintain feature parity, `gapply` functionality should be added to 
`pyspark`'s  `GroupedData` with an interface.

The implementation already exists because it fulfilled a need in another 
package: 
https://github.com/vlad17/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py

It needs to be migrated (to become a GroupedData method, the first argument now 
to be called self).


> PySpark gapply
> --
>
> Key: SPARK-16237
> URL: https://issues.apache.org/jira/browse/SPARK-16237
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Reporter: Vladimir Feinberg
>
> To maintain feature parity, `gapply` functionality should be added to 
> PySpark's  {{GroupedData}} with an interface.
> The implementation already exists because it fulfilled a need in another 
> package: 
> https://github.com/vlad17/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py
> It needs to be migrated (to become a {{GroupedData}} method, the first 
> argument now to be called self).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16237) PySpark gapply

2016-06-28 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353495#comment-15353495
 ] 

Vladimir Feinberg commented on SPARK-16237:
---

cc [~mengxr] [~thunterdb] [~josephkb] Comments re exposing {{gapply()}}?

> PySpark gapply
> --
>
> Key: SPARK-16237
> URL: https://issues.apache.org/jira/browse/SPARK-16237
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Reporter: Vladimir Feinberg
>
> To maintain feature parity, `gapply` functionality should be added to 
> `pyspark`'s  `GroupedData` with an interface.
> The implementation already exists because it fulfilled a need in another 
> package: 
> https://github.com/vlad17/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py
> It needs to be migrated (to become a GroupedData method, the first argument 
> now to be called self).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16262) Impossible to remake new SparkContext using SparkSession API in Pyspark

2016-06-28 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16262:
-

 Summary: Impossible to remake new SparkContext using SparkSession 
API in Pyspark
 Key: SPARK-16262
 URL: https://issues.apache.org/jira/browse/SPARK-16262
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Vladimir Feinberg


There are multiple use cases where one might like to be able to stop and 
re-start a {{SparkSession}}: configuration changes or modular testing. The 
following code demonstrates that without clearing a hidden global 
{{SparkSession._instantiatedContext = None}} it is impossible to re-create a 
new Spark session after stopping one in the same process:

{code}
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/06/28 11:28:10 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
16/06/28 11:28:10 WARN Utils: Your hostname, vlad-databricks resolves to a 
loopback address: 127.0.1.1; using 192.168.3.166 instead (on interface 
enp0s31f6)
16/06/28 11:28:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
>>> spark.stop()
>>> spark = SparkSession.builder.getOrCreate()
>>> spark.createDataFrame([(1,)])
Traceback (most recent call last):
  File "", line 1, in 
  File "pyspark/sql/session.py", line 514, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "pyspark/sql/session.py", line 394, in _createFromLocal
return self._sc.parallelize(data), schema
  File "pyspark/context.py", line 410, in parallelize
numSlices = int(numSlices) if numSlices is not None else 
self.defaultParallelism
  File "pyspark/context.py", line 346, in defaultParallelism
return self._jsc.sc().defaultParallelism()
AttributeError: 'NoneType' object has no attribute 'sc'
{code}
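
For reference, the manual workaround implied above - resetting the hidden global 
by hand before rebuilding the session (the cleaner fix would be to fold this 
reset into {{SparkSession.stop()}} itself):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.stop()
SparkSession._instantiatedContext = None       # the hidden global named above
spark = SparkSession.builder.getOrCreate()     # now succeeds in the same process
spark.createDataFrame([(1,)]).count()
{code}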



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16263) SparkSession caches configuration in an unintuitive global way

2016-06-28 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16263:
-

 Summary: SparkSession caches configuration in an unintuitive global 
way
 Key: SPARK-16263
 URL: https://issues.apache.org/jira/browse/SPARK-16263
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Vladimir Feinberg






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16263) SparkSession caches configuration in an unintuitive global way

2016-06-28 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-16263:
--
Description: 
The following use case demonstrates the issue.

{code}
cls.spark = SparkSession.builder \
    .config("spark.sql.retainGroupColumns", "false") \
    .getOrCreate()
{code}

  was:The following use case demonstrates the issue. 


> SparkSession caches configuration in an unintuitive global way
> -
>
> Key: SPARK-16263
> URL: https://issues.apache.org/jira/browse/SPARK-16263
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Vladimir Feinberg
>
> The following use case demonstrates the issue.
> {code}
> cls.spark = SparkSession.builder \
>     .config("spark.sql.retainGroupColumns", "false") \
>     .getOrCreate()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16263) SparkSession caches configuration in an unintuitive global way

2016-06-28 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-16263:
--
Description: The following use case demonstrates the issue. 

> SparkSession caches configuration in an unintuitive global way
> -
>
> Key: SPARK-16263
> URL: https://issues.apache.org/jira/browse/SPARK-16263
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Vladimir Feinberg
>
> The following use case demonstrates the issue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16263) SparkSession caches configuration in an unintuitive global way

2016-06-28 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-16263:
--
Description: 
The following use case demonstrates the issue. Note that as a workaround to 
SPARK-16262 I use {{reset_spark()}} to stop the current {{SparkSession}}.

{code} 
>>> from pyspark.sql import SparkSession
>>> def reset_spark(): global spark; spark.stop(); 
>>> SparkSession._instantiatedContext = None
... 
>>> spark = SparkSession.builder.getOrCreate()
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/06/28 11:41:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
16/06/28 11:41:36 WARN Utils: Your hostname, vlad-databricks resolves to a 
loopback address: 127.0.1.1; using 192.168.3.166 instead (on interface 
enp0s31f6)
16/06/28 11:41:36 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
>>> spark.conf.get("spark.sql.retainGroupColumns")
u'true'
>>> reset_spark()
>>> spark = SparkSession.builder.config("spark.sql.retainGroupColumns", 
>>> "false").getOrCreate()
>>> spark.conf.get("spark.sql.retainGroupColumns")
u'false'
>>> reset_spark()
>>> spark = SparkSession.builder.getOrCreate()
>>> spark.conf.get("spark.sql.retainGroupColumns")
u'false'
>>> 
{code}

The last line should output {{u'true'}} instead - there is absolutely no 
expectation for global config state to persist across sessions; each new 
session should use the default configuration unless its own builder explicitly 
deviates from it.

  was:
The following use case demonstrates the issue. 
cls.spark = SparkSession.builder \
.config("spark.sql.retainGroupColumns", 
"false") \
.getOrCreate()


> SparkSession caches configuration in an unintuitive global way
> -
>
> Key: SPARK-16263
> URL: https://issues.apache.org/jira/browse/SPARK-16263
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Vladimir Feinberg
>
> The following use case demonstrates the issue. Note that as a workaround to 
> SPARK-16262 I use {{reset_spark()}} to stop the current {{SparkSession}}.
> {code} 
> >>> from pyspark.sql import SparkSession
> >>> def reset_spark(): global spark; spark.stop(); 
> >>> SparkSession._instantiatedContext = None
> ... 
> >>> spark = SparkSession.builder.getOrCreate()
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/06/28 11:41:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/06/28 11:41:36 WARN Utils: Your hostname, vlad-databricks resolves to a 
> loopback address: 127.0.1.1; using 192.168.3.166 instead (on interface 
> enp0s31f6)
> 16/06/28 11:41:36 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> >>> spark.conf.get("spark.sql.retainGroupColumns")
> u'true'
> >>> reset_spark()
> >>> spark = SparkSession.builder.config("spark.sql.retainGroupColumns", 
> >>> "false").getOrCreate()
> >>> spark.conf.get("spark.sql.retainGroupColumns")
> u'false'
> >>> reset_spark()
> >>> spark = SparkSession.builder.getOrCreate()
> >>> spark.conf.get("spark.sql.retainGroupColumns")
> u'false'
> >>> 
> {code}
> The last line should output {{u'true'}} instead - there is absolutely no 
> expectation for global config state to persist across sessions; each new 
> session should use the default configuration unless its own builder explicitly 
> deviates from it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16262) Impossible to remake new SparkContext using SparkSession API in Pyspark

2016-06-28 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353630#comment-15353630
 ] 

Vladimir Feinberg commented on SPARK-16262:
---

What do you mean by "clearing that variable"? Are you referring to setting 
{{SparkSession._instantiatedContext = None}}? The issue with that is that I 
definitely see a user wanting to change the configuration of a spark session 
within a single process, but I don't think it's reasonable to expect them to 
set a hidden variable to {{None}}. Based on another bug I opened (SPARK-16263), 
this seems to be a general issue with global context in the API.

> Impossible to remake new SparkContext using SparkSession API in Pyspark
> ---
>
> Key: SPARK-16262
> URL: https://issues.apache.org/jira/browse/SPARK-16262
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Vladimir Feinberg
>Priority: Minor
>
> There are multiple use cases where one might like to be able to stop and 
> re-start a {{SparkSession}}: configuration changes or modular testing. The 
> following code demonstrates that without clearing a hidden global 
> {{SparkSession._instantiatedContext = None}} it is impossible to re-create a 
> new Spark session after stopping one in the same process:
> {code}
> >>> from pyspark.sql import SparkSession
> >>> spark = SparkSession.builder.getOrCreate()
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/06/28 11:28:10 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/06/28 11:28:10 WARN Utils: Your hostname, vlad-databricks resolves to a 
> loopback address: 127.0.1.1; using 192.168.3.166 instead (on interface 
> enp0s31f6)
> 16/06/28 11:28:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> >>> spark.stop()
> >>> spark = SparkSession.builder.getOrCreate()
> >>> spark.createDataFrame([(1,)])
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyspark/sql/session.py", line 514, in createDataFrame
> rdd, schema = self._createFromLocal(map(prepare, data), schema)
>   File "pyspark/sql/session.py", line 394, in _createFromLocal
> return self._sc.parallelize(data), schema
>   File "pyspark/context.py", line 410, in parallelize
> numSlices = int(numSlices) if numSlices is not None else 
> self.defaultParallelism
>   File "pyspark/context.py", line 346, in defaultParallelism
> return self._jsc.sc().defaultParallelism()
> AttributeError: 'NoneType' object has no attribute 'sc'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16262) Impossible to remake new SparkContext using SparkSession API in Pyspark

2016-06-28 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353642#comment-15353642
 ] 

Vladimir Feinberg commented on SPARK-16262:
---

Ah, are you suggesting that line should be inside of {{SparkSession.stop()}}? 
I'm totally OK with that, but then that's a fix for this bug, right? As in, 
your comment wasn't a contention you had with the JIRA itself?

> Impossible to remake new SparkContext using SparkSession API in Pyspark
> ---
>
> Key: SPARK-16262
> URL: https://issues.apache.org/jira/browse/SPARK-16262
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Vladimir Feinberg
>Priority: Minor
>
> There are multiple use cases where one might like to be able to stop and 
> re-start a {{SparkSession}}: configuration changes or modular testing. The 
> following code demonstrates that without clearing a hidden global 
> {{SparkSession._instantiatedContext = None}} it is impossible to re-create a 
> new Spark session after stopping one in the same process:
> {code}
> >>> from pyspark.sql import SparkSession
> >>> spark = SparkSession.builder.getOrCreate()
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/06/28 11:28:10 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/06/28 11:28:10 WARN Utils: Your hostname, vlad-databricks resolves to a 
> loopback address: 127.0.1.1; using 192.168.3.166 instead (on interface 
> enp0s31f6)
> 16/06/28 11:28:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> >>> spark.stop()
> >>> spark = SparkSession.builder.getOrCreate()
> >>> spark.createDataFrame([(1,)])
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyspark/sql/session.py", line 514, in createDataFrame
> rdd, schema = self._createFromLocal(map(prepare, data), schema)
>   File "pyspark/sql/session.py", line 394, in _createFromLocal
> return self._sc.parallelize(data), schema
>   File "pyspark/context.py", line 410, in parallelize
> numSlices = int(numSlices) if numSlices is not None else 
> self.defaultParallelism
>   File "pyspark/context.py", line 346, in defaultParallelism
> return self._jsc.sc().defaultParallelism()
> AttributeError: 'NoneType' object has no attribute 'sc'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16263) SparkSession caches configuration in an unintuitive global way

2016-06-28 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353658#comment-15353658
 ] 

Vladimir Feinberg commented on SPARK-16263:
---

Right, I'm not arguing for the need for multiple sessions at once, but I think 
it's reasonable to expect this global state to behave idempotently. Whatever we 
do, the restrictions on the use case must be enforced by the API: if I'm really 
only ever allowed to create a SparkSession once, then the builder should raise 
on the second attempt (and building a session should be a step independent of 
getOrCreate()-ing it).

On the other hand, if we're OK with one Spark session at a time (which the code 
is mostly in line with already), then it's just a matter of clearing the global 
variables on shutdown.
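
A hedged sketch of that second option (the attribute names 
{{_instantiatedContext}} and {{builder._options}} are assumptions about pyspark 
internals, not the exact fields):

{code}
from pyspark.sql import SparkSession

def stop(self):
    """Hypothetical SparkSession.stop(): drop all session-level globals."""
    self._sc.stop()
    SparkSession._instantiatedContext = None   # allow a fresh getOrCreate()
    SparkSession.builder._options.clear()      # do not leak config into the next session
{code}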

> SparkSession caches configuration in an unintuitive global way
> -
>
> Key: SPARK-16263
> URL: https://issues.apache.org/jira/browse/SPARK-16263
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Vladimir Feinberg
>Priority: Minor
>
> The following use case demonstrates the issue. Note that as a workaround to 
> SPARK-16262 I use {{reset_spark()}} to stop the current {{SparkSession}}.
> {code} 
> >>> from pyspark.sql import SparkSession
> >>> def reset_spark(): global spark; spark.stop(); 
> >>> SparkSession._instantiatedContext = None
> ... 
> >>> spark = SparkSession.builder.getOrCreate()
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/06/28 11:41:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/06/28 11:41:36 WARN Utils: Your hostname, vlad-databricks resolves to a 
> loopback address: 127.0.1.1; using 192.168.3.166 instead (on interface 
> enp0s31f6)
> 16/06/28 11:41:36 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> >>> spark.conf.get("spark.sql.retainGroupColumns")
> u'true'
> >>> reset_spark()
> >>> spark = SparkSession.builder.config("spark.sql.retainGroupColumns", 
> >>> "false").getOrCreate()
> >>> spark.conf.get("spark.sql.retainGroupColumns")
> u'false'
> >>> reset_spark()
> >>> spark = SparkSession.builder.getOrCreate()
> >>> spark.conf.get("spark.sql.retainGroupColumns")
> u'false'
> >>> 
> {code}
> The last line should output {{u'true'}} instead - there is absolutely no 
> expectation for global config state to persist across sessions; each new 
> session should use the default configuration unless its own builder explicitly 
> deviates from it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16262) Impossible to remake new SparkContext using SparkSession API in Pyspark

2016-06-28 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353659#comment-15353659
 ] 

Vladimir Feinberg commented on SPARK-16262:
---

Sure, I think we're agreeing.

> Impossible to remake new SparkContext using SparkSession API in Pyspark
> ---
>
> Key: SPARK-16262
> URL: https://issues.apache.org/jira/browse/SPARK-16262
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Vladimir Feinberg
>Priority: Minor
>
> There are multiple use cases where one might like to be able to stop and 
> re-start a {{SparkSession}}: configuration changes or modular testing. The 
> following code demonstrates that without clearing a hidden global 
> {{SparkSession._instantiatedContext = None}} it is impossible to re-create a 
> new Spark session after stopping one in the same process:
> {code}
> >>> from pyspark.sql import SparkSession
> >>> spark = SparkSession.builder.getOrCreate()
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/06/28 11:28:10 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/06/28 11:28:10 WARN Utils: Your hostname, vlad-databricks resolves to a 
> loopback address: 127.0.1.1; using 192.168.3.166 instead (on interface 
> enp0s31f6)
> 16/06/28 11:28:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> >>> spark.stop()
> >>> spark = SparkSession.builder.getOrCreate()
> >>> spark.createDataFrame([(1,)])
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyspark/sql/session.py", line 514, in createDataFrame
> rdd, schema = self._createFromLocal(map(prepare, data), schema)
>   File "pyspark/sql/session.py", line 394, in _createFromLocal
> return self._sc.parallelize(data), schema
>   File "pyspark/context.py", line 410, in parallelize
> numSlices = int(numSlices) if numSlices is not None else 
> self.defaultParallelism
>   File "pyspark/context.py", line 346, in defaultParallelism
> return self._jsc.sc().defaultParallelism()
> AttributeError: 'NoneType' object has no attribute 'sc'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.

2016-06-30 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357868#comment-15357868
 ] 

Vladimir Feinberg commented on SPARK-4240:
--

[~sethah] Hi Seth, it seems like your comment is outdated now that GBT is 
indeed in ML. Are you currently working on this?


> Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
> 
>
> Key: SPARK-4240
> URL: https://issues.apache.org/jira/browse/SPARK-4240
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Sung Chung
>
> The gradient boosting as currently implemented estimates the loss-gradient in 
> each iteration using regression trees. At every iteration, the regression 
> trees are trained/split to minimize predicted gradient variance. 
> Additionally, the terminal node predictions are computed to minimize the 
> prediction variance.
> However, such predictions won't be optimal for loss functions other than the 
> mean-squared error. The TreeBoosting refinement can help mitigate this issue 
> by modifying terminal node prediction values so that those predictions would 
> directly minimize the actual loss function. Although this still doesn't 
> change the fact that the tree splits were done through variance reduction, it 
> should still lead to improvement in gradient estimations, and thus better 
> performance.
> The details of this can be found in the R vignette. This paper also shows how 
> to refine the terminal node predictions.
> http://www.saedsayad.com/docs/gbm2.pdf
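
As a concrete illustration of the refinement step (a pure-Python sketch of the 
idea, not Spark's internal API): after the variance-based tree is grown, each 
terminal node's prediction is recomputed as the constant that minimizes the 
actual loss over the rows in that node - the mean of the residuals for squared 
loss, the median for absolute loss.

{code}
import numpy as np

def refine_leaf_predictions(leaf_ids, residuals, loss="absolute"):
    """leaf_ids[i]: terminal node of row i; residuals[i]: its current residual."""
    refined = {}
    for leaf in np.unique(leaf_ids):
        r = residuals[leaf_ids == leaf]
        # The loss-minimizing constant: median for absolute loss, mean for squared loss.
        refined[leaf] = float(np.median(r) if loss == "absolute" else np.mean(r))
    return refined
{code}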



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.

2016-07-05 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362721#comment-15362721
 ] 

Vladimir Feinberg commented on SPARK-4240:
--

Sorry for the delay in responding - I was on vacation for the long weekend. 
Would you mind pushing or linking what you have done so far? I'll get back to 
you tomorrow on whether I have the bandwidth to tackle this right now.

> Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
> 
>
> Key: SPARK-4240
> URL: https://issues.apache.org/jira/browse/SPARK-4240
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Sung Chung
>
> The gradient boosting as currently implemented estimates the loss-gradient in 
> each iteration using regression trees. At every iteration, the regression 
> trees are trained/split to minimize predicted gradient variance. 
> Additionally, the terminal node predictions are computed to minimize the 
> prediction variance.
> However, such predictions won't be optimal for loss functions other than the 
> mean-squared error. The TreeBoosting refinement can help mitigate this issue 
> by modifying terminal node prediction values so that those predictions would 
> directly minimize the actual loss function. Although this still doesn't 
> change the fact that the tree splits were done through variance reduction, it 
> should still lead to improvement in gradient estimations, and thus better 
> performance.
> The details of this can be found in the R vignette. This paper also shows how 
> to refine the terminal node predictions.
> http://www.saedsayad.com/docs/gbm2.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15575) Remove breeze from dependencies?

2016-08-30 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15450836#comment-15450836
 ] 

Vladimir Feinberg commented on SPARK-15575:
---

Some of the biggest issues with Breeze performance I've experienced are that a 
lot of operations you'd expect to be fast are not, and that its pretty syntax 
and heavy use of implicits make it easy to hit those slow paths accidentally.

For instance:
1. Mixed dense/sparse operations frequently fall back to a generic implementation 
in Breeze that uses its Scala iterators.
2. Creating vectors under certain operations results in unnecessary boxing of 
doubles (and of integers, for sparse vectors).
3. Slice vectors have no support for efficient operations. They are implemented 
in Breeze in a way that makes them no better than an Array[Double], which again 
forces us through Scala iterators whenever we want a fast, vectorized dot 
product, for instance.

Usability is tough sometimes. Even though a Vector[Double] interface seems 
flexible, a lot of implementations require explicit knowledge of the vector 
type (sparse/dense), or else Breeze silently uses the slow Scala interface. 
Heavy use of implicits is also a problem here, since they're not implemented 
for all permutations of vector types.

It's also easy to write something like `vec1 += vec2 * a * b`, which creates 
two intermediate vectors.

I think the biggest issue is that `ml.linalg.Vector` is Breeze-backed. We 
should use our own linear algebra (we do have `BLAS`, though to support slicing 
this interface would have to be expanded) and move around `ArrayView[Double]` 
inside the vector instead.

Breeze as a dependency, as mentioned below, is still very useful for 
optimization. I think we can keep it around for that, as long as it's used only 
for that.

> Remove breeze from dependencies?
> 
>
> Key: SPARK-15575
> URL: https://issues.apache.org/jira/browse/SPARK-15575
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA is for discussing whether we should remove Breeze from the 
> dependencies of MLlib.  The main issues with Breeze are Scala 2.12 support 
> and performance issues.
> There are a few paths:
> # Keep dependency.  This could be OK, especially if the Scala version issues 
> are fixed within Breeze.
> # Remove dependency
> ## Implement our own linear algebra operators as needed
> ## Design a way to build Spark using custom linalg libraries of the user's 
> choice.  E.g., you could build MLlib using Breeze, or any other library 
> supporting the required operations.  This might require significant work.  
> See [SPARK-6442] for related discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16728) migrate internal API for MLlib trees from spark.mllib to spark.ml

2016-09-12 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-16728:
--
Description: 
Currently, spark.ml trees rely on spark.mllib implementations. There are a few 
issues with this:

1. spark.ml's GBT TreeBoost algorithm requires storing additional information 
(the previous ensemble's prediction, for instance) inside the TreePoints (this 
is necessary to have loss-based splits for complex loss functions).
2. The old impurity API only lets you use summary statistics up to the 2nd 
order. These are useless for several impurity measures and inadequate for 
others (e.g., absolute loss or Huber loss). It needs some renovation.
3. We should probably coalesce the ImpurityAggregator, ImpurityCalculator, and 
Impurity into a single class (and use virtual calls rather than case statements 
when toggling over impurity types).


  was:
Currently, spark.ml trees rely on spark.mllib implementations. There are two 
issues with this:

1. Spark.ML's GBT TreeBoost algorithm requires storing additional information 
(the previous ensemble's prediction, for instance) inside the TreePoints (this 
is necessary to have loss-based splits for complex loss functions).
2. The old impurity API only lets you use summary statistics up to the 2nd 
order. These are useless for several impurity measures and inadequate for 
others (e.g., absolute loss or huber loss). It needs some renovation.


> migrate internal API for MLlib trees from spark.mllib to spark.ml
> -
>
> Key: SPARK-16728
> URL: https://issues.apache.org/jira/browse/SPARK-16728
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Vladimir Feinberg
>
> Currently, spark.ml trees rely on spark.mllib implementations. There are a few 
> issues with this:
> 1. spark.ml's GBT TreeBoost algorithm requires storing additional information 
> (the previous ensemble's prediction, for instance) inside the TreePoints 
> (this is necessary to have loss-based splits for complex loss functions).
> 2. The old impurity API only lets you use summary statistics up to the 2nd 
> order. These are useless for several impurity measures and inadequate for 
> others (e.g., absolute loss or Huber loss). It needs some renovation.
> 3. We should probably coalesce the ImpurityAggregator, ImpurityCalculator, 
> and Impurity into a single class (and use virtual calls rather than case 
> statements when toggling over impurity types).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org