[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
[ https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366502#comment-15366502 ] Vladimir Feinberg commented on SPARK-4240: -- Pending some dramatic response from \[~sethah\] telling me to back off, I'll take over this one. \[~josephkb\], mind reviewing the below outline? I propose that this JIRA be resolved in the following manner: API Change: Since a true "TreeBoost" splits on the impurity of loss reduction, the impurity calculator should be derived from the loss function itself. * Set a new default for impurity param in GBTs as 'auto', which uses the loss-based impurity by default, but can be overridden to use standard RFs if desired. * Create a generic loss-reduction calculator which works by reducing a parametrizable loss criterion (or, rather, a Taylor approximation of it as recommended by Friedman \[1\] and implemented to the second order by XGBoost \[2\] \[code: 5\]). * Instantiate the generic loss-reduction calculator (that supports different orders of losses) for regression: ** Add squared and absolute losses ** 'auto' induces a second-order approximation for squared loss, and only a first-order approximation for absolute loss ** The former should perform better than LS_Boost from \[1\] (which only uses the first-order approximation) and the latter is equivalent to LAD_TreeBoost from \[1\]. It may be worthwhile to add an LS_Boost impurity and check it performs worse. Both these "generic loss" instantiations become new impurities that the user could set, just like 'gini' or 'entropy'. This calculator will implement corresponding terminal-leaf predictions, either the mean or median of the leaf's sample. Computing the median may require modifications to the internal developer API so that at some point the calculator can access the entire set of training samples a terminal node's partition corresponds to. 
* On the classifier side we need to do the same thing, with a logistic loss inducing a new impurity. Second order here is again feasible. First order corresponds to L2_TreeBoost from \[1\]. * Because the new impurities apply only to GBTs, they'll only be available for them. Questions for \[~josephkb\]: 1. Should I ditch making the second order approximation that \[2\] does? It won't make the code any simpler, but might make the theoretical offerings of the new code easier to grasp. This would add another task "try out second order Taylor approx" to the below, and also means we won't perform as well as xgb until the second order thing happens. Note that the L2 loss corresponds to Gaussian regression, L1 to Laplace, logistic to Bernoulli. I'll add the aliases to loss. Differences between this and \[2\]: * No leaf weight regularization, besides the default constant shrinkage, is implemented. Differences between this and \[3\]: * \[3\] uses variance impurity for split selection \[code: 6\]. I don't think this is even technically TreeBoost. Such behavior should be emulatable in the new code by overriding impurity='variance' (would be nice to see if we have comparable perf here). * \[3\] implements GBTs for weighted input data. We don't support data weights, so for both L1 and L2 losses terminal node computations don't need Newton-Raphson optimization. Probably not for this JIRA: 1. Implementing leaf weights (and leaf weight regularization) - probably involves adding a regularization param to GBTs, creating new regularization-aware impurity calculators. 2. In {{RandomForest.scala}} the line {{val requiredSamples = math.max(metadata.maxBins * metadata.maxBins, 1)}} performs row subsampling on our data. I don't know if it's sound from a statistical learning perspective, but this is something that we should take a look at (i.e., does performing a precise sample complexity calculation in the PAC sense lead to better perf)? 3. 
Add different "losses" corresponding to residual distributions - see all the ones supported here \[3\] \[4\] \[7\]. Depending on what we add, we may need to implement NR optimization. Huber loss is the only one mentioned in \[1\] that we don't yet have. \[1\] Friedman paper: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf \[2\] xgboost paper: http://www.kdd.org/kdd2016/papers/files/Paper_697.pdf \[3\] gbm impl paper: http://www.saedsayad.com/docs/gbm2.pdf \[4\] xgboost docs: https://xgboost.readthedocs.io/en/latest//parameter.html#general-parameters \[5\] https://github.com/dmlc/xgboost/blob/1625dab1cbc416d9d9a79dde141e3d236060387a/src/objective/regression_obj.cc \[6\] https://github.com/gbm-developers/gbm/blob/master/src/node_parameters.h \[7\] gbm api: https://cran.r-project.org/web/packages/gbm/gbm.pdf
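The loss-based impurity proposed above can be sketched in a few lines of Python. This is a hypothetical, simplified illustration (not Spark or XGBoost code, and it omits the regularization terms XGBoost \[2\] adds): each sample contributes a gradient and hessian of the loss at the current prediction, and the gain of a split is the reduction in the second-order Taylor approximation of the loss. For squared loss (g = pred - y, h = 1) this reduces to a variance-style criterion.

```python
# Illustrative sketch of second-order loss-based split gain (TreeBoost-style).
# A leaf with gradient sum G and hessian sum H has optimal weight w* = -G/H,
# which reduces the approximate loss G*w + 0.5*H*w^2 by G^2/(2H).

def leaf_score(grads, hess):
    G, H = sum(grads), sum(hess)
    return G * G / (2.0 * H) if H > 0 else 0.0

def split_gain(g, h, left_idx):
    # Gain = (left score + right score) - parent score.
    gl = [g[i] for i in left_idx]; hl = [h[i] for i in left_idx]
    right = [i for i in range(len(g)) if i not in left_idx]
    gr = [g[i] for i in right]; hr = [h[i] for i in right]
    return leaf_score(gl, hl) + leaf_score(gr, hr) - leaf_score(g, h)

# Squared loss at prediction 0: g_i = -y_i, h_i = 1.
y = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]
g = [-v for v in y]
h = [1.0] * len(y)
# Separating the two clusters of labels should give a large positive gain.
print(split_gain(g, h, left_idx={0, 1, 2}))
```

With constant hessians the criterion coincides with variance reduction, which is why gbm's variance impurity \[code: 6\] and the squared-loss second-order approach agree for L2.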
[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values
[ https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371420#comment-15371420 ] Vladimir Feinberg commented on SPARK-10931: --- [~josephkb] The intention of this JIRA is a bit confusing. To my understanding, there are three kinds of params: 1. Estimator-related params that only have to do with fitting (e.g., regularization) 2. Independent model and estimator-related params to do with prediction (e.g., number of maximum iterations) 3. Shared model and estimator params that are set once per fitted pipeline (e.g., number of components in PCA). I'd venture that we'd want a model to have: 1. Access to an immutable version of (1) and (3). * In Scala, this is done by having a {{parent}} reference to the generating {{Estimator}}, but this is a reference, so if the estimator changes then the params will, too, inconsistent with the model. It should be copy-on-write (this may be SPARK-7494, I'm not sure). Also, {{parent}} is a mutable reference. * In Python, there is no {{parent}} 2. Access to a mutable version of (2), where mutation should change model behavior * Both languages have this. 3. Separation of concerns. If a parameter falls into categories (1) or (3), it shouldn't be a parameter for the model, since changing its value has no effect except confusion * Both Python and Scala will, as of this JIRA, copy everything - groups (1), (2), (3) - to the model, each with its own version. > PySpark ML Models should contain Param values > - > > Key: SPARK-10931 > URL: https://issues.apache.org/jira/browse/SPARK-10931 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley > > PySpark spark.ml Models are generally wrappers around Java objects and do not > even contain Param values. This JIRA is for copying the Param values from > the Estimator to the model. 
> This can likely be solved by modifying Estimator.fit to copy Param values, > but should also include proper unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
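The copy-on-fit semantics discussed above can be sketched in pure Python (hypothetical names, not the actual spark.ml API): the model snapshots the estimator's params at {{fit()}} time instead of holding a live reference, so later mutation of the estimator cannot silently change the model.

```python
# Illustrative sketch of copy-on-fit param semantics (not actual spark.ml code).

class Estimator:
    def __init__(self, **params):
        self.params = dict(params)       # mutable on the estimator

    def fit(self):
        # Copy the params into the model rather than keeping a live
        # reference, so the model stays consistent even if the estimator
        # is mutated afterwards.
        return Model(dict(self.params))

class Model:
    def __init__(self, params):
        self.params = params             # the model's own immutable snapshot

est = Estimator(regParam=0.1, maxIter=10)
model = est.fit()
est.params["regParam"] = 1.0             # mutate the estimator afterwards
print(model.params["regParam"])          # the model still sees 0.1
```

This is the behavior a {{parent}} reference alone cannot guarantee, since the parent may change after fitting.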
[jira] [Created] (SPARK-16504) UDAF should be typed
Vladimir Feinberg created SPARK-16504: - Summary: UDAF should be typed Key: SPARK-16504 URL: https://issues.apache.org/jira/browse/SPARK-16504 Project: Spark Issue Type: Improvement Components: SQL Reporter: Vladimir Feinberg Currently, UDAFs can be implemented by using a generic {{MutableAggregationBuffer}}. This type-less class requires that the user specify the schema. If the user wants to create vector output from a UDAF, this requires specifying an output schema with a VectorUDT(), which is only accessible through a DeveloperApi. Since we would prefer not to expose VectorUDT, the only option would be to resolve the user's inability to (legally) specify a schema containing a VectorUDT the same way that we would do so for creating dataframes: by type inference, just like createDataFrame does.
[jira] [Commented] (SPARK-16504) UDAF should be typed
[ https://issues.apache.org/jira/browse/SPARK-16504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373769#comment-15373769 ] Vladimir Feinberg commented on SPARK-16504: --- fwiw {{merge}} has type {{(MAB, Row): Unit}} instead of {{(MAB, MAB): Unit}} or even more preferably {{(MAB, MAB): MAB}} for some reason. > UDAF should be typed > > > Key: SPARK-16504 > URL: https://issues.apache.org/jira/browse/SPARK-16504 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Vladimir Feinberg > > Currently, UDAFs can be implemented by using a generic > {{MutableAggregationBuffer}}. This type-less class requires the user specify > the schema. > If the user wants to create vector output from a UDAF, this requires > specifying an output schema with a VectorUDT(), which is only accessible > through a DeveloperApi. > Since we would prefer not to expose VectorUDT, the only option would be to > resolve the user's inability to (legally) specify a schema containing a > VectorUDT the same way that we would do so for creating dataframes: by type > inference, just like createDataFrame does.
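The preferred {{(MAB, MAB): MAB}} shape can be illustrated with a small pure-Python sketch (hypothetical class, not the Spark API): a typed buffer whose merge takes and returns buffers of its own type, so no schema or Row round-trip is needed.

```python
# Illustrative sketch (not Spark code): a typed aggregation buffer whose
# merge has the signature (Buffer, Buffer) -> Buffer.

class SumBuffer:
    def __init__(self, total=0.0, count=0):
        self.total, self.count = total, count

    def update(self, value):                 # fold one input value in
        return SumBuffer(self.total + value, self.count + 1)

    def merge(self, other):                  # (Buffer, Buffer) -> Buffer
        return SumBuffer(self.total + other.total, self.count + other.count)

    def result(self):
        return self.total / self.count if self.count else 0.0

# Partial aggregation on two "partitions", then a typed merge:
left = SumBuffer()
for v in [1.0, 2.0]:
    left = left.update(v)
right = SumBuffer()
for v in [3.0, 4.0]:
    right = right.update(v)
print(left.merge(right).result())  # mean of 1..4
```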
[jira] [Created] (SPARK-16551) Accumulator Examples should demonstrate different use case from UDAFs
Vladimir Feinberg created SPARK-16551: - Summary: Accumulator Examples should demonstrate different use case from UDAFs Key: SPARK-16551 URL: https://issues.apache.org/jira/browse/SPARK-16551 Project: Spark Issue Type: Documentation Reporter: Vladimir Feinberg Currently, the Spark programming guide demonstrates Accumulators (http://spark.apache.org/docs/latest/programming-guide.html#accumulators) by taking the sum of an RDD. This example makes new users think that Accumulators serve the role that UDAFs do, which they don't. They're meant to be out-of-band, small values that don't break pipelining. Documentation examples and notes should reflect this (and warn that they may cause driver bottlenecks).
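The distinction argued for above can be shown in a pure-Python sketch (no Spark; the {{Accumulator}} class here is a stand-in, not the PySpark API): the job's result is an aggregation over the data, while an accumulator carries a small, out-of-band value on the side, such as a malformed-record count.

```python
# Pure-Python sketch: aggregation (the job's result) vs. an accumulator
# (small out-of-band metadata that should not be the job's result).

class Accumulator:
    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n

bad_records = Accumulator()

def parse(line):
    try:
        return float(line)
    except ValueError:
        bad_records.add(1)   # out-of-band: a small side counter
        return 0.0

lines = ["1.5", "2.5", "oops", "4.0"]
total = sum(parse(l) for l in lines)   # the actual aggregation
print(total, bad_records.value)
```

Using an accumulator for {{total}} itself would be the anti-pattern the guide's current example encourages, and funneling large values through one is exactly what creates driver bottlenecks.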
[jira] [Created] (SPARK-16572) DStream Kinesis Connector Doc formatting
Vladimir Feinberg created SPARK-16572: - Summary: DStream Kinesis Connector Doc formatting Key: SPARK-16572 URL: https://issues.apache.org/jira/browse/SPARK-16572 Project: Spark Issue Type: Documentation Reporter: Vladimir Feinberg Priority: Minor Formatting is off for the Kinesis doc for the old streaming API: https://github.com/apache/spark/blob/05d7151ccbccdd977ec2f2301d5b12566018c988/docs/streaming-kinesis-integration.md The code blocks aren't formatted.
[jira] [Closed] (SPARK-16572) DStream Kinesis Connector Doc formatting
[ https://issues.apache.org/jira/browse/SPARK-16572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg closed SPARK-16572. - Resolution: Fixed The layout is just not GitHub-compatible. > DStream Kinesis Connector Doc formatting > > > Key: SPARK-16572 > URL: https://issues.apache.org/jira/browse/SPARK-16572 > Project: Spark > Issue Type: Documentation >Reporter: Vladimir Feinberg >Priority: Minor > > Formatting is off for the Kinesis doc for the old streaming API: > https://github.com/apache/spark/blob/05d7151ccbccdd977ec2f2301d5b12566018c988/docs/streaming-kinesis-integration.md > The code blocks aren't formatted.
[jira] [Comment Edited] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
[ https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366502#comment-15366502 ] Vladimir Feinberg edited comment on SPARK-4240 at 7/22/16 4:47 PM: --- Pending some dramatic response from [~sethah] telling me to back off, I'll take over this one. [~josephkb], mind reviewing the below outline? I propose that this JIRA be resolved in the following manner: API Change: Since a true "TreeBoost" splits on the impurity of loss reduction, the impurity calculator should be derived from the loss function itself. * Set a new default for impurity param in GBTs as 'auto', which uses the loss-based impurity by default, but can be overridden to use standard RFs if desired. * Create a generic loss-reduction calculator which works by reducing a parametrizable loss criterion (or, rather, a Taylor approximation of it as recommended by Friedman \[1\] and implemented to the second order by XGBoost \[2\] \[code: 5\]). * Make loss-reduction calculator for regression: ** Add squared and absolute losses ** 'loss-based' induces a second-order approximation for squared loss, and only a first-order approximation for absolute loss ** The former should perform like LS_Boost from \[1\] and the latter is sort-of (*) equivalent to LAD_TreeBoost from \[1\]. Both these "generic loss" instantiations become new impurities that the user could set, just like 'gini' or 'entropy'. This calculator will implement corresponding terminal-leaf predictions, either the mean or median of the leaf's sample. Computing the median may require modifications to the internal developer API so that at some point the calculator can access the entire set of training samples a terminal node's partition corresponds to. * On the classifier side we need to do the same thing, with a logistic loss inducing a new impurity. Second order here is again feasible. First order corresponds to sort-of (*) L2_TreeBoost from \[1\]. 
* Because the new impurities apply only to GBTs, they'll only be available for them. (*) A note regarding the sort-of equivalence with Friedman: in his 1999 paper, Friedman admits that he's not doing "true" TreeBoost because he builds the tree based on variance reduction of the residuals. This is exactly what \[3\] does. \[2\] instead builds the tree by optimizing a _Taylor approximation_ for the losses, which makes it feasible to efficiently consider many splits in a leaf (because of the additive nature of the approximate loss function). * For logistic, this works really well for XGBoost. * For squared error, both approaches are equivalent * For absolute error, the Taylor approximation can be first-order only (but locally, it's a perfect approximation). I don't think anyone has done even this approximate version of "true" L1 TreeBoost before. It may be necessary to go the way gbm does and use variance impurity, but we'll try it out anyway. Note that the L2 loss corresponds to Gaussian regression, L1 to Laplace, logistic to Bernoulli. I'll add the aliases to loss. Differences between this and \[2\]: * No leaf weight regularization, besides the default constant shrinkage, is implemented. Differences between this and \[3\]: * \[3\] uses variance impurity for split selection \[code: 6\]. I don't think this is even technically TreeBoost. Such behavior should be emulatable in the new code by overriding impurity='variance' (would be nice to see if we have comparable perf here). * \[3\] implements GBTs for weighted input data. We don't support data weights, so for both L1 and L2 losses terminal node computations don't need Newton-Raphson optimization. Probably not for this JIRA: 1. Implementing leaf weights (and leaf weight regularization) - probably involves adding a regularization param to GBTs, creating new regularization-aware impurity calculators. 2. 
In {{RandomForest.scala}} the line {{val requiredSamples = math.max(metadata.maxBins * metadata.maxBins, 1)}} performs row subsampling on our data. I don't know if it's sound from a statistical learning perspective, but this is something that we should take a look at (i.e., does performing a precise sample complexity calculation in the PAC sense lead to better perf)? 3. Add different "losses" corresponding to residual distributions - see all the ones supported here \[3\] \[4\] \[7\]. Depending what we add, we may need to implement NR optimization. Huber loss is the only one mentioned in \[1\] that we don't yet have. \[1\] Friedman paper: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf \[2\] xgboost paper: http://www.kdd.org/kdd2016/papers/files/Paper_697.pdf \[3\] gbm impl paper: http://www.saedsayad.com/docs/gbm2.pdf \[4\] xgboost docs: https://xgboost.readthedocs.io/en/latest//parameter.html#general-parameters \[5\] https://github.com/dmlc/xgboost/blob/1625dab1cbc416d9d9a79dde141e3
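The mean-vs-median terminal-leaf predictions mentioned in the outline follow directly from the losses, and a short pure-Python check makes the point (illustrative only, not Spark code): the mean minimizes squared (L2) loss over a leaf's samples, while the median minimizes absolute (L1) loss, which is why LAD_TreeBoost predicts medians.

```python
# Illustrative sketch: loss-optimal terminal-leaf predictions differ by loss.

def l1_loss(samples, c):
    return sum(abs(y - c) for y in samples)

def l2_loss(samples, c):
    return sum((y - c) ** 2 for y in samples)

def mean(samples):
    return sum(samples) / len(samples)

def median(samples):
    s = sorted(samples)
    return s[len(s) // 2]  # odd-length leaf for simplicity

leaf = [1.0, 2.0, 3.0, 4.0, 100.0]   # skewed leaf with one outlier
m, med = mean(leaf), median(leaf)
# The median is robust to the outlier under absolute loss; the mean is
# optimal under squared loss.
print(l1_loss(leaf, med) < l1_loss(leaf, m))   # True
print(l2_loss(leaf, m) <= l2_loss(leaf, med))  # True
```

Computing the median, unlike the mean, cannot be done from low-order summary statistics alone, which is the reason the internal API may need access to a terminal node's full sample set.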
[jira] [Created] (SPARK-16718) gbm-style treeboost
Vladimir Feinberg created SPARK-16718: - Summary: gbm-style treeboost Key: SPARK-16718 URL: https://issues.apache.org/jira/browse/SPARK-16718 Project: Spark Issue Type: Sub-task Reporter: Vladimir Feinberg As an initial minimal change, we should provide TreeBoost as implemented in GBM for both L1 and L2 losses: by introducing a new "loss-based" impurity, tree leaves in GBTs can have loss-optimal predictions for their partition of the data.
[jira] [Updated] (SPARK-16718) gbm-style treeboost
[ https://issues.apache.org/jira/browse/SPARK-16718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-16718: -- Description: As an initial minimal change, we should provide TreeBoost as implemented in GBM for both L1 and L2 losses: by introducing a new "loss-based" impurity, tree leaves in GBTs can have loss-optimal predictions for their partition of the data. Commit should have evidence of accuracy improvement was: As an initial minimal change, we should provide TreeBoost as implemented in GBM for both L1 and L2 losses: by introducing a new "loss-based" impurity, tree leaves in GBTs can have loss-optimal predictions for their partition of the data. > gbm-style treeboost > --- > > Key: SPARK-16718 > URL: https://issues.apache.org/jira/browse/SPARK-16718 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Vladimir Feinberg > > As an initial minimal change, we should provide TreeBoost as implemented in > GBM for both L1 and L2 losses: by introducing a new "loss-based" impurity, > tree leaves in GBTs can have loss-optimal predictions for their partition of > the data. > Commit should have evidence of accuracy improvement
[jira] [Updated] (SPARK-16718) gbm-style treeboost
[ https://issues.apache.org/jira/browse/SPARK-16718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-16718: -- Description: As an initial minimal change, we should provide TreeBoost as implemented in GBM for L1, L2, and logistic losses: by introducing a new "loss-based" impurity, tree leaves in GBTs can have loss-optimal predictions for their partition of the data. Commit should have evidence of accuracy improvement was: As an initial minimal change, we should provide TreeBoost as implemented in GBM for both L1 and L2 losses: by introducing a new "loss-based" impurity, tree leaves in GBTs can have loss-optimal predictions for their partition of the data. Commit should have evidence of accuracy improvement > gbm-style treeboost > --- > > Key: SPARK-16718 > URL: https://issues.apache.org/jira/browse/SPARK-16718 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Vladimir Feinberg >Assignee: Vladimir Feinberg > > As an initial minimal change, we should provide TreeBoost as implemented in > GBM for L1, L2, and logistic losses: by introducing a new "loss-based" > impurity, tree leaves in GBTs can have loss-optimal predictions for their > partition of the data. > Commit should have evidence of accuracy improvement
[jira] [Commented] (SPARK-16718) gbm-style treeboost
[ https://issues.apache.org/jira/browse/SPARK-16718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392914#comment-15392914 ] Vladimir Feinberg commented on SPARK-16718: --- L1 support for loss-based impurity will be delayed until there's a new internal API for GBTs in spark.ml. > gbm-style treeboost > --- > > Key: SPARK-16718 > URL: https://issues.apache.org/jira/browse/SPARK-16718 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Vladimir Feinberg >Assignee: Vladimir Feinberg > > As an initial minimal change, we should provide TreeBoost as implemented in > GBM for L1, L2, and logistic losses: by introducing a new "loss-based" > impurity, tree leaves in GBTs can have loss-optimal predictions for their > partition of the data. > Commit should have evidence of accuracy improvement
[jira] [Created] (SPARK-16728) migrate internal API for MLlib trees from spark.mllib to spark.ml
Vladimir Feinberg created SPARK-16728: - Summary: migrate internal API for MLlib trees from spark.mllib to spark.ml Key: SPARK-16728 URL: https://issues.apache.org/jira/browse/SPARK-16728 Project: Spark Issue Type: Sub-task Reporter: Vladimir Feinberg Currently, spark.ml trees rely on spark.mllib implementations. There are two issues with this: 1. spark.ml's GBT TreeBoost algorithm requires storing additional information (the previous ensemble's prediction, for instance) inside the TreePoints (this is necessary to have loss-based splits for complex loss functions). 2. The old impurity API only lets you use summary statistics up to the 2nd order. These are useless for several impurity measures and inadequate for others (e.g., absolute loss or Huber loss). It needs some renovation.
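Point 1 above can be sketched in pure Python (hypothetical names, not the spark.ml internals): caching the running ensemble prediction on each training point makes per-iteration gradients for loss-based splits a purely local computation.

```python
# Illustrative sketch: a training point that caches the current ensemble's
# prediction, so the gradient needed for loss-based splits is local.

class TreePoint:
    def __init__(self, label):
        self.label = label
        self.pred = 0.0      # running ensemble prediction, updated per tree

    def gradient(self):
        # Squared-loss gradient wrt the current prediction:
        # d/dpred [0.5 * (pred - y)^2] = pred - y
        return self.pred - self.label

points = [TreePoint(y) for y in [1.0, 3.0]]
# Suppose the first boosting iteration's tree predicts 1.5 everywhere:
for p in points:
    p.pred += 1.5
print([p.gradient() for p in points])   # residual-style gradients
```

Without the cached {{pred}}, each iteration would have to re-score the whole ensemble over the training set just to form the split statistics.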
[jira] [Created] (SPARK-16739) GBTClassifier should be a Classifier, not Predictor
Vladimir Feinberg created SPARK-16739: - Summary: GBTClassifier should be a Classifier, not Predictor Key: SPARK-16739 URL: https://issues.apache.org/jira/browse/SPARK-16739 Project: Spark Issue Type: Improvement Reporter: Vladimir Feinberg Priority: Minor Should probably wait for SPARK-4240 to be resolved first.
[jira] [Updated] (SPARK-16718) gbm-style treeboost
[ https://issues.apache.org/jira/browse/SPARK-16718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-16718: -- Description: As an initial minimal change, we should provide TreeBoost as implemented in GBM for L1, L2, and logistic losses: by introducing a new "loss-based" impurity, tree leaves in GBTs can have loss-optimal predictions for their partition of the data. Commit should have evidence of accuracy improvement was: As an initial minimal change, we should provide TreeBoost as implemented in GBM for L1, L2, and logistic losses: by introducing a new "loss-based" impurity, tree leaves in GBTs can have loss-optimal predictions for their partition of the data. Commit should have evidence of accuracy improvement > gbm-style treeboost > --- > > Key: SPARK-16718 > URL: https://issues.apache.org/jira/browse/SPARK-16718 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Vladimir Feinberg >Assignee: Vladimir Feinberg > > As an initial minimal change, we should provide TreeBoost as implemented in > GBM for L1, L2, and logistic losses: by introducing a new "loss-based" > impurity, tree leaves in GBTs can have loss-optimal predictions for their > partition of the data. > Commit should have evidence of accuracy improvement
[jira] [Created] (SPARK-16860) UDT Stringification Incorrect in PySpark
Vladimir Feinberg created SPARK-16860: - Summary: UDT Stringification Incorrect in PySpark Key: SPARK-16860 URL: https://issues.apache.org/jira/browse/SPARK-16860 Project: Spark Issue Type: Bug Components: PySpark Reporter: Vladimir Feinberg Priority: Minor When using `show()` on a `DataFrame` containing a UDT, Spark doesn't call the appropriate `__str__` method for display. Example: https://gist.github.com/vlad17/baa8e18ed724c4d88436a92ca159dd5b
[jira] [Created] (SPARK-16899) Structured Streaming Checkpointing Example invalid
Vladimir Feinberg created SPARK-16899: - Summary: Structured Streaming Checkpointing Example invalid Key: SPARK-16899 URL: https://issues.apache.org/jira/browse/SPARK-16899 Project: Spark Issue Type: Bug Components: Documentation Reporter: Vladimir Feinberg Priority: Critical The structured streaming checkpointing example at the bottom of the page (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing) has the following excerpt: ``` aggDF .writeStream .outputMode("complete") .option(“checkpointLocation”, “path/to/HDFS/dir”) .format("memory") .start() ``` But memory sinks are not fault-tolerant. Indeed, trying this out, I get the following error: ``` This query does not support recovering from checkpoint location. Delete /tmp/streaming.metadata-625631e5-baee-41da-acd1-f16c82f68a40/offsets to start over.; ``` The documentation should be changed to demonstrate checkpointing for a non-aggregation streaming task, and explicitly mention there is no way to checkpoint aggregates.
[jira] [Created] (SPARK-16900) Complete-mode output for file sinks
Vladimir Feinberg created SPARK-16900: - Summary: Complete-mode output for file sinks Key: SPARK-16900 URL: https://issues.apache.org/jira/browse/SPARK-16900 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Vladimir Feinberg Currently there is no way to checkpoint aggregations (see SPARK-16899), except by using a custom foreach-based sink, which is pretty difficult and requires that the user deal with ensuring idempotency, versioning, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16899) Structured Streaming Checkpointing Example invalid
[ https://issues.apache.org/jira/browse/SPARK-16899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-16899: -- Description: The structured streaming checkpointing example at the bottom of the page (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing) has the following excerpt:
{code}
aggDF
  .writeStream
  .outputMode("complete")
  .option("checkpointLocation", "path/to/HDFS/dir")
  .format("memory")
  .start()
{code}
But memory sinks are not fault-tolerant. Indeed, trying this out, I get the following error: {{This query does not support recovering from checkpoint location. Delete /tmp/streaming.metadata-625631e5-baee-41da-acd1-f16c82f68a40/offsets to start over.;}} The documentation should be changed to demonstrate checkpointing for a non-aggregation streaming task, and explicitly mention there is no way to checkpoint aggregates. was: The structured streaming checkpointing example at the bottom of the page (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing) has the following excerpt: ``` aggDF .writeStream .outputMode("complete") .option(“checkpointLocation”, “path/to/HDFS/dir”) .format("memory") .start() ``` But memory sinks are not fault-tolerant. Indeed, trying this out, I get the following error: ``` This query does not support recovering from checkpoint location. Delete /tmp/streaming.metadata-625631e5-baee-41da-acd1-f16c82f68a40/offsets to start over.; ``` The documentation should be changed to demonstrate checkpointing for a non-aggregation streaming task, and explicitly mention there is no way to checkpoint aggregates. 
> Structured Streaming Checkpointing Example invalid > -- > > Key: SPARK-16899 > URL: https://issues.apache.org/jira/browse/SPARK-16899 > Project: Spark > Issue Type: Bug > Components: Documentation > Reporter: Vladimir Feinberg > Priority: Critical > > The structured streaming checkpointing example at the bottom of the page > (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing) > has the following excerpt:
> {code}
> aggDF
>   .writeStream
>   .outputMode("complete")
>   .option("checkpointLocation", "path/to/HDFS/dir")
>   .format("memory")
>   .start()
> {code}
> But memory sinks are not fault-tolerant. Indeed, trying this out, I get the > following error: > {{This query does not support recovering from checkpoint location. Delete > /tmp/streaming.metadata-625631e5-baee-41da-acd1-f16c82f68a40/offsets to start > over.;}} > The documentation should be changed to demonstrate checkpointing for a > non-aggregation streaming task, and explicitly mention there is no way to > checkpoint aggregates. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16920) Investigate and fix issues introduced in SPARK-15858
Vladimir Feinberg created SPARK-16920: - Summary: Investigate and fix issues introduced in SPARK-15858 Key: SPARK-16920 URL: https://issues.apache.org/jira/browse/SPARK-16920 Project: Spark Issue Type: Bug Components: MLlib Reporter: Vladimir Feinberg There were several issues regarding the PR resolving SPARK-15858; my comments are available here: https://github.com/apache/spark/commit/393db655c3c43155305fbba1b2f8c48a95f18d93 The two most important issues are: 1. The PR did not add a stress test proving it resolved the issue it was supposed to (though I have no doubt the optimization made is indeed correct). 2. The PR introduced quadratic prediction time in terms of the number of trees, which was previously linear. This needs to be investigated to determine whether it causes problems for large numbers of trees (say, 1000); an appropriate test should be added, and the regression fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
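The linear-vs-quadratic distinction in point 2 can be made concrete with a toy count (a hypothetical sketch, not Spark's actual prediction code): if each boosting stage re-evaluates all prior trees from scratch, the total number of tree evaluations grows as 1 + 2 + ... + n, while maintaining a running sum of tree outputs keeps it at n.

```python
# Hypothetical illustration of the complexity concern: total tree
# evaluations needed to score every prefix of an n-tree GBT ensemble.

def tree_evals_resum(num_trees):
    """Evaluations if each stage re-predicts with all prior trees: O(n^2)."""
    return sum(range(1, num_trees + 1))  # 1 + 2 + ... + n

def tree_evals_running(num_trees):
    """Evaluations if each stage adds one tree to a running sum: O(n)."""
    return num_trees

print(tree_evals_resum(1000))    # 500500 tree evaluations
print(tree_evals_running(1000))  # 1000 tree evaluations
```

At the 1000 trees mentioned above, the re-summing approach does roughly 500x the work, which is why a stress test at that scale seems warranted.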
[jira] [Commented] (SPARK-12381) Copy public decision tree helper classes from spark.mllib to spark.ml and make private
[ https://issues.apache.org/jira/browse/SPARK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412438#comment-15412438 ] Vladimir Feinberg commented on SPARK-12381: --- [~sethah] Just so we don't clash, I think these two JIRAs are overlapping: SPARK-16728 > Copy public decision tree helper classes from spark.mllib to spark.ml and > make private > -- > > Key: SPARK-12381 > URL: https://issues.apache.org/jira/browse/SPARK-12381 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > The helper classes for decision trees and decision tree ensembles (e.g. > Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...) > currently reside in spark.mllib, but as the algorithm implementations are > moved to spark.ml, so should these helper classes. > We should take this opportunity to make some of those helper classes private > when possible (especially if they are only needed during training) and maybe > change the APIs (especially if we can eliminate duplicate data stored in the > final model). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16957) Use weighted midpoints for split values.
Vladimir Feinberg created SPARK-16957: - Summary: Use weighted midpoints for split values. Key: SPARK-16957 URL: https://issues.apache.org/jira/browse/SPARK-16957 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Vladimir Feinberg Just like R's gbm, we should be using weighted split points rather than the actual continuous binned feature values. For instance, in a dataset containing binary features (that are fed in as continuous ones), our splits are selected as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some smoothness qualities, this is asymptotically bad compared to GBM's approach. The split point should be a weighted midpoint of the two values of the "innermost" feature bins; e.g., if there are 30 samples with {{x = 0}} and 10 with {{x = 1}}, the above split should be at {{0.75}}. Example:
{code}
+--------+--------+-----+-----+
|feature0|feature1|label|count|
+--------+--------+-----+-----+
|     0.0|     0.0|  0.0|   23|
|     1.0|     0.0|  0.0|    2|
|     0.0|     0.0|  1.0|    2|
|     0.0|     1.0|  0.0|    7|
|     1.0|     0.0|  1.0|   23|
|     0.0|     1.0|  1.0|   18|
|     1.0|     1.0|  1.0|    7|
|     1.0|     1.0|  0.0|   18|
+--------+--------+-----+-----+

DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
  If (feature 0 <= 0.0)
   If (feature 1 <= 0.0)
    Predict: -0.56
   Else (feature 1 > 0.0)
    Predict: 0.29333
  Else (feature 0 > 0.0)
   If (feature 1 <= 0.0)
    Predict: 0.56
   Else (feature 1 > 0.0)
    Predict: -0.29333
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
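For concreteness, here is a minimal pure-Python sketch of one weighted-split-point formula that reproduces the 0.75 example above (30 samples at {{x = 0}}, 10 at {{x = 1}}). The formula is a hypothetical reconstruction consistent with the numbers in this issue; the exact weighting R's gbm applies may differ.

```python
# Hypothetical weighted split point between two adjacent bin values.
# Each bin value is weighted by the count of the *other* bin, pulling
# the split away from the heavier bin (matching the 30/10 -> 0.75 example).

def weighted_split(x_lo, n_lo, x_hi, n_hi):
    return (n_lo * x_hi + n_hi * x_lo) / (n_lo + n_hi)

print(weighted_split(0.0, 30, 1.0, 10))  # 0.75, vs. the current split at 0.0
print(weighted_split(0.0, 10, 1.0, 10))  # 0.5: equal counts give the plain midpoint
```

With equal counts this degenerates to the unweighted midpoint, so it only changes behavior where the bin populations are skewed.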
[jira] [Updated] (SPARK-16957) Use weighted midpoints for split values.
[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-16957: -- Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-14045) > Use weighted midpoints for split values. > > > Key: SPARK-16957 > URL: https://issues.apache.org/jira/browse/SPARK-16957 > Project: Spark > Issue Type: Improvement > Components: MLlib > Reporter: Vladimir Feinberg > > Just like R's gbm, we should be using weighted split points rather than the > actual continuous binned feature values. For instance, in a dataset > containing binary features (that are fed in as continuous ones), our splits > are selected as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some > smoothness qualities, this is asymptotically bad compared to GBM's approach. > The split point should be a weighted split point of the two values of the > "innermost" feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, > the above split should be at {{0.75}}. > Example:
> {code}
> +--------+--------+-----+-----+
> |feature0|feature1|label|count|
> +--------+--------+-----+-----+
> |     0.0|     0.0|  0.0|   23|
> |     1.0|     0.0|  0.0|    2|
> |     0.0|     0.0|  1.0|    2|
> |     0.0|     1.0|  0.0|    7|
> |     1.0|     0.0|  1.0|   23|
> |     0.0|     1.0|  1.0|   18|
> |     1.0|     1.0|  1.0|    7|
> |     1.0|     1.0|  0.0|   18|
> +--------+--------+-----+-----+
>
> DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
>   If (feature 0 <= 0.0)
>    If (feature 1 <= 0.0)
>     Predict: -0.56
>    Else (feature 1 > 0.0)
>     Predict: 0.29333
>   Else (feature 0 > 0.0)
>    If (feature 1 <= 0.0)
>     Predict: 0.56
>    Else (feature 1 > 0.0)
>     Predict: -0.29333
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12381) Copy public decision tree helper classes from spark.mllib to spark.ml and make private
[ https://issues.apache.org/jira/browse/SPARK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412459#comment-15412459 ] Vladimir Feinberg commented on SPARK-12381: --- Yeah, that'd be a good idea. > Copy public decision tree helper classes from spark.mllib to spark.ml and > make private > -- > > Key: SPARK-12381 > URL: https://issues.apache.org/jira/browse/SPARK-12381 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > The helper classes for decision trees and decision tree ensembles (e.g. > Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...) > currently reside in spark.mllib, but as the algorithm implementations are > moved to spark.ml, so should these helper classes. > We should take this opportunity to make some of those helper classes private > when possible (especially if they are only needed during training) and maybe > change the APIs (especially if we can eliminate duplicate data stored in the > final model). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16969) GBTClassifier needs a raw prediction column
Vladimir Feinberg created SPARK-16969: - Summary: GBTClassifier needs a raw prediction column Key: SPARK-16969 URL: https://issues.apache.org/jira/browse/SPARK-16969 Project: Spark Issue Type: Bug Reporter: Vladimir Feinberg When working with a skewed-label dataset I found the GBTClassifier pretty unusable because it performs an automatic thresholding at the halfway point, without exposing the raw prediction. This prevents use of different thresholds and any of the BinaryClassificationEvaluator tools. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16900) Complete-mode output for file sinks
[ https://issues.apache.org/jira/browse/SPARK-16900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414455#comment-15414455 ] Vladimir Feinberg commented on SPARK-16900: --- Alternatively, if we could have some way of altering the aggregation sinks to produce versioned append-mode output, that'd be just as good. > Complete-mode output for file sinks > --- > > Key: SPARK-16900 > URL: https://issues.apache.org/jira/browse/SPARK-16900 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Vladimir Feinberg > > Currently there is no way to checkpoint aggregations (see SPARK-16899), > except by using a custom foreach-based sink, which is pretty difficult and > requires that the user deal with ensuring idempotency, versioning, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15809) PySpark SQL UDF default returnType
Vladimir Feinberg created SPARK-15809: - Summary: PySpark SQL UDF default returnType Key: SPARK-15809 URL: https://issues.apache.org/jira/browse/SPARK-15809 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Vladimir Feinberg Priority: Minor The current signature for the pyspark UDF creation function is: {code:python} pyspark.sql.functions.udf(f, returnType=StringType) {code} Is there a reason that there's a default parameter for {{returnType}}? Returning a string by default doesn't strike me as so much more frequent a use case than, say, returning an integer as to merit a default. In fact, it seems the only reason the default was chosen is that if we *had to choose* a default type, it would be a {{StringType}} because that's what we can implicitly convert everything to. But this only seems to do two things to me: (1) cause unintentional, annoying conversions to strings for new users and (2) make call sites less consistent (if people drop the type specification to actually use the default). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
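The "unintentional conversion" concern in (1) can be sketched with a simplified pure-Python mimic of a UDF factory whose return type defaults to string (hypothetical code, not PySpark's implementation; {{make_udf}} stands in for {{pyspark.sql.functions.udf}}):

```python
# Hypothetical mimic of a udf() factory with a string default returnType,
# illustrating the silent coercion the default can cause.

def make_udf(f, return_type=str):
    def wrapped(*args):
        # Coerce the result to the declared return type, as the SQL
        # engine would when serializing the UDF's output.
        return return_type(f(*args))
    return wrapped

add = make_udf(lambda a, b: a + b)           # default: result becomes a string
add_int = make_udf(lambda a, b: a + b, int)  # explicit integer return type

print(repr(add(1, 1)))      # '2' -- a string, likely not what the user meant
print(repr(add_int(1, 1)))  # 2
```

A new user calling the default-typed version gets string arithmetic results without any error, which is exactly the trap described above.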
[jira] [Created] (SPARK-15888) UDF fails in Python
Vladimir Feinberg created SPARK-15888: - Summary: UDF fails in Python Key: SPARK-15888 URL: https://issues.apache.org/jira/browse/SPARK-15888 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Reporter: Vladimir Feinberg This looks like a regression from 1.6.1. The following notebook runs without error in a Spark 1.6.1 cluster, but fails in 2.0.0: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15971) GroupedData's member incorrectly named
Vladimir Feinberg created SPARK-15971: - Summary: GroupedData's member incorrectly named Key: SPARK-15971 URL: https://issues.apache.org/jira/browse/SPARK-15971 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.0.0, 2.1.0 Reporter: Vladimir Feinberg Priority: Trivial The [[pyspark.sql.GroupedData]] object refers to the Java object it wraps via the member variable [[self._jdf]] - exactly the same name [[pyspark.sql.DataFrame]] uses for its own wrapped object. The acronym is incorrect, standing for "Java DataFrame" instead of what should be "Java GroupedData". As such, the name should be changed to [[self._jgd]] - in fact, in the [[DataFrame.groupBy]] implementation, the java object is referred to as exactly [[jgd]]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15972) GroupedData varargs arguments misnamed
Vladimir Feinberg created SPARK-15972: - Summary: GroupedData varargs arguments misnamed Key: SPARK-15972 URL: https://issues.apache.org/jira/browse/SPARK-15972 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.0.0, 2.1.0 Reporter: Vladimir Feinberg Priority: Trivial Simple aggregation functions which take column names [[cols]] as varargs arguments show up in documentation with the argument [[args]], but their documentation refers to [[cols]]. The discrepancy is caused by an annotation, [[df_varargs_api]], which produces a temporary function with arguments [[args]] instead of [[cols]], creating the confusing documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
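The mechanism behind this discrepancy can be sketched in a few lines of plain Python (hypothetical code, not the actual {{df_varargs_api}} implementation): a decorator that builds its wrapper with a varargs parameter named {{args}} makes the generated signature disagree with a docstring that talks about {{cols}}.

```python
import inspect

def df_varargs_api_like(f):
    # Hypothetical stand-in for the df_varargs_api annotation: the
    # temporary wrapper's varargs parameter is named 'args', not 'cols'.
    def wrapper(self, *args):
        return f(self, *args)
    wrapper.__doc__ = f.__doc__  # docstring is copied, signature is not
    return wrapper

class Grouped:
    @df_varargs_api_like
    def mean(self, *cols):
        """Computes the mean for each numeric column in cols."""

# Documentation tools see the wrapper's signature, which contradicts
# the docstring's mention of 'cols':
print(inspect.signature(Grouped.mean))  # (self, *args)
```

Naming the wrapper's parameter {{cols}} (or preserving the wrapped signature, e.g. via {{functools.wraps}}) would make the generated docs match the docstring.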
[jira] [Created] (SPARK-15973) GroupedData.pivot documentation off
Vladimir Feinberg created SPARK-15973: - Summary: GroupedData.pivot documentation off Key: SPARK-15973 URL: https://issues.apache.org/jira/browse/SPARK-15973 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.0.0, 2.1.0 Reporter: Vladimir Feinberg Priority: Trivial {{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for doctest Python comments, which messes up formatting in the documentation as well as the doctests themselves. A PR resolving this should probably resolve the other places this happens in pyspark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15971) GroupedData's member incorrectly named
[ https://issues.apache.org/jira/browse/SPARK-15971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-15971: -- Description: The {{pyspark.sql.GroupedData}} object calls the Java object it wraps around as the member variable {{self._jdf}}, which is exactly the same as {{pyspark.sql.DataFrame}}, when referring its object. The acronym is incorrect, standing for "Java DataFrame" instead of what should be "Java GroupedData". As such, the name should be changed to {{self._jgd}} - in fact, in the {{DataFrame.groupBy}} implementation, the java object is referred to as exactly {{jgd}} was: The [[pyspark.sql.GroupedData]] object calls the Java object it wraps around as the member variable [[self._jdf]], which is exactly the same as [[pyspark.sql.DataFrame]], when referring its object. The acronym is incorrect, standing for "Java DataFrame" instead of what should be "Java GroupedData". As such, the name should be changed to [[self._jgd]] - in fact, in the [[DataFrame.groupBy]] implementation, the java object is referred to as exactly [[jgd]] > GroupedData's member incorrectly named > -- > > Key: SPARK-15971 > URL: https://issues.apache.org/jira/browse/SPARK-15971 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > The {{pyspark.sql.GroupedData}} object calls the Java object it wraps around > as the member variable {{self._jdf}}, which is exactly the same as > {{pyspark.sql.DataFrame}}, when referring its object. > The acronym is incorrect, standing for "Java DataFrame" instead of what > should be "Java GroupedData". 
As such, the name should be changed to > {{self._jgd}} - in fact, in the {{DataFrame.groupBy}} implementation, the > java object is referred to as exactly {{jgd}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15972) GroupedData varargs arguments misnamed
[ https://issues.apache.org/jira/browse/SPARK-15972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-15972: -- Description: Simple aggregation functions which take column names {{cols}} as varargs arguments show up in documentation with the argument {{args}}, but their documentation refers to {{cols}}. The discrepancy is caused by an annotation, {{df_varargs_api}}, which produces a temporary function with arguments {{args}} instead of {{cols}}, creating the confusing documentation. was: Simple aggregation functions which take column names [[cols]] as varargs arguments show up in documentation with the argument [[args]], but their documentation refers to [[cols]]. The discrepancy is caused by an annotation, [[df_varargs_api]], which produces a temporary function with arguments [[args]] instead of [[cols]], creating the confusing documentation. > GroupedData varargs arguments misnamed > -- > > Key: SPARK-15972 > URL: https://issues.apache.org/jira/browse/SPARK-15972 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > > Simple aggregation functions which take column names {{cols}} as varargs > arguments show up in documentation with the argument {{args}}, but their > documentation refers to {{cols}}. > The discrepancy is caused by an annotation, {{df_varargs_api}}, which > produces a temporary function with arguments {{args}} instead of {{cols}}, > creating the confusing documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15973) Fix GroupedData Documentation
[ https://issues.apache.org/jira/browse/SPARK-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-15973: -- Summary: Fix GroupedData Documentation (was: GroupedData.pivot documentation off) > Fix GroupedData Documentation > - > > Key: SPARK-15973 > URL: https://issues.apache.org/jira/browse/SPARK-15973 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > {{GroupedData.pivot}} documenation uses {{//}} instead of {{#}} for doctest > python comments, which messes up formatting in the documentation as well as > the doctests themselves. > A PR resolving this should probably resolve the other places this happens in > pyspark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15973) Fix GroupedData Documentation
[ https://issues.apache.org/jira/browse/SPARK-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-15973: -- Description: (1) {{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for doctest Python comments, which messes up formatting in the documentation as well as the doctests themselves. A PR resolving this should probably resolve the other places this happens in pyspark. (2) Simple aggregation functions which take column names {{cols}} as varargs arguments show up in documentation with the argument {{args}}, but their documentation refers to {{cols}}. The discrepancy is caused by an annotation, {{df_varargs_api}}, which produces a temporary function with arguments {{args}} instead of {{cols}}, creating the confusing documentation. (3) The {{pyspark.sql.GroupedData}} object refers to the Java object it wraps via the member variable {{self._jdf}} - exactly the same name {{pyspark.sql.DataFrame}} uses for its own wrapped object. The acronym is incorrect, standing for "Java DataFrame" instead of what should be "Java GroupedData". As such, the name should be changed to {{self._jgd}} - in fact, in the {{DataFrame.groupBy}} implementation, the java object is referred to as exactly {{jgd}} was: {{GroupedData.pivot}} documenation uses {{//}} instead of {{#}} for doctest python comments, which messes up formatting in the documentation as well as the doctests themselves. A PR resolving this should probably resolve the other places this happens in pyspark. 
> Fix GroupedData Documentation > - > > Key: SPARK-15973 > URL: https://issues.apache.org/jira/browse/SPARK-15973 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > (1) > {{GroupedData.pivot}} documenation uses {{//}} instead of {{#}} for doctest > python comments, which messes up formatting in the documentation as well as > the doctests themselves. > A PR resolving this should probably resolve the other places this happens in > pyspark. > (2) > Simple aggregation functions which take column names {{cols}} as varargs > arguments show up in documentation with the argument {{args}}, but their > documentation refers to {{cols}}. > The discrepancy is caused by an annotation, {{df_varargs_api}}, which > produces a temporary function with arguments {{args}} instead of {{cols}}, > creating the confusing documentation. > (3) > The {{pyspark.sql.GroupedData}} object calls the Java object it wraps around > as the member variable {{self._jdf}}, which is exactly the same as > {{pyspark.sql.DataFrame}}, when referring its object. > The acronym is incorrect, standing for "Java DataFrame" instead of what > should be "Java GroupedData". As such, the name should be changed to > {{self._jgd}} - in fact, in the {{DataFrame.groupBy}} implementation, the > java object is referred to as exactly {{jgd}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15972) GroupedData varargs arguments misnamed
[ https://issues.apache.org/jira/browse/SPARK-15972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg resolved SPARK-15972. --- Resolution: Duplicate > GroupedData varargs arguments misnamed > -- > > Key: SPARK-15972 > URL: https://issues.apache.org/jira/browse/SPARK-15972 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > > Simple aggregation functions which take column names {{cols}} as varargs > arguments show up in documentation with the argument {{args}}, but their > documentation refers to {{cols}}. > The discrepancy is caused by an annotation, {{df_varargs_api}}, which > produces a temporary function with arguments {{args}} instead of {{cols}}, > creating the confusing documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15972) GroupedData varargs arguments misnamed
[ https://issues.apache.org/jira/browse/SPARK-15972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg closed SPARK-15972. - > GroupedData varargs arguments misnamed > -- > > Key: SPARK-15972 > URL: https://issues.apache.org/jira/browse/SPARK-15972 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > > Simple aggregation functions which take column names {{cols}} as varargs > arguments show up in documentation with the argument {{args}}, but their > documentation refers to {{cols}}. > The discrepancy is caused by an annotation, {{df_varargs_api}}, which > produces a temporary function with arguments {{args}} instead of {{cols}}, > creating the confusing documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15971) GroupedData's member incorrectly named
[ https://issues.apache.org/jira/browse/SPARK-15971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg closed SPARK-15971. - > GroupedData's member incorrectly named > -- > > Key: SPARK-15971 > URL: https://issues.apache.org/jira/browse/SPARK-15971 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > The {{pyspark.sql.GroupedData}} object calls the Java object it wraps around > as the member variable {{self._jdf}}, which is exactly the same as > {{pyspark.sql.DataFrame}}, when referring its object. > The acronym is incorrect, standing for "Java DataFrame" instead of what > should be "Java GroupedData". As such, the name should be changed to > {{self._jgd}} - in fact, in the {{DataFrame.groupBy}} implementation, the > java object is referred to as exactly {{jgd}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15971) GroupedData's member incorrectly named
[ https://issues.apache.org/jira/browse/SPARK-15971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg resolved SPARK-15971. --- Resolution: Duplicate > GroupedData's member incorrectly named > -- > > Key: SPARK-15971 > URL: https://issues.apache.org/jira/browse/SPARK-15971 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > The {{pyspark.sql.GroupedData}} object calls the Java object it wraps around > as the member variable {{self._jdf}}, which is exactly the same as > {{pyspark.sql.DataFrame}}, when referring its object. > The acronym is incorrect, standing for "Java DataFrame" instead of what > should be "Java GroupedData". As such, the name should be changed to > {{self._jgd}} - in fact, in the {{DataFrame.groupBy}} implementation, the > java object is referred to as exactly {{jgd}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15973) Fix GroupedData Documentation
[ https://issues.apache.org/jira/browse/SPARK-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332374#comment-15332374 ] Vladimir Feinberg commented on SPARK-15973: --- Done > Fix GroupedData Documentation > - > > Key: SPARK-15973 > URL: https://issues.apache.org/jira/browse/SPARK-15973 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > (1) > {{GroupedData.pivot}} documenation uses {{//}} instead of {{#}} for doctest > python comments, which messes up formatting in the documentation as well as > the doctests themselves. > A PR resolving this should probably resolve the other places this happens in > pyspark. > (2) > Simple aggregation functions which take column names {{cols}} as varargs > arguments show up in documentation with the argument {{args}}, but their > documentation refers to {{cols}}. > The discrepancy is caused by an annotation, {{df_varargs_api}}, which > produces a temporary function with arguments {{args}} instead of {{cols}}, > creating the confusing documentation. > (3) > The {{pyspark.sql.GroupedData}} object calls the Java object it wraps around > as the member variable {{self._jdf}}, which is exactly the same as > {{pyspark.sql.DataFrame}}, when referring its object. > The acronym is incorrect, standing for "Java DataFrame" instead of what > should be "Java GroupedData". As such, the name should be changed to > {{self._jgd}} - in fact, in the {{DataFrame.groupBy}} implementation, the > java object is referred to as exactly {{jgd}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15989) PySpark SQL python-only UDTs don't support nested types
Vladimir Feinberg created SPARK-15989: - Summary: PySpark SQL python-only UDTs don't support nested types Key: SPARK-15989 URL: https://issues.apache.org/jira/browse/SPARK-15989 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Reporter: Vladimir Feinberg Priority: Blocker [This notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/611202526513296/1653464426712019/latest.html] demonstrates the bug. The obvious issue is that nested UDTs are not supported if the UDT is Python-only. Looking at the exception thrown, this seems to be because the encoder on the Java end tries to encode the UDT as a Java class, which doesn't exist for the [[PythonOnlyUDT]]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15993) PySpark RuntimeConfig should be immutable
Vladimir Feinberg created SPARK-15993: - Summary: PySpark RuntimeConfig should be immutable Key: SPARK-15993 URL: https://issues.apache.org/jira/browse/SPARK-15993 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.0.0 Reporter: Vladimir Feinberg Priority: Trivial {{pyspark.sql.RuntimeConfig}} should be immutable because changing its value does nothing, which only leads to a confusing API (I tried to change a config param by extracting the config of a running {{SparkSession}}, but failed to realize I need to relaunch it). Furthermore, {{RuntimeConfig}} is unlike {{SparkConf}} in that it can't ever be used to specify a configuration when building a {{SparkSession}} anyway - its only purpose is to figure out an existing session's params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15989) PySpark SQL python-only UDTs don't support nested types
[ https://issues.apache.org/jira/browse/SPARK-15989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-15989: -- Component/s: SQL > PySpark SQL python-only UDTs don't support nested types > --- > > Key: SPARK-15989 > URL: https://issues.apache.org/jira/browse/SPARK-15989 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Vladimir Feinberg > > [This > notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/611202526513296/1653464426712019/latest.html] > demonstrates the bug. > The obvious issue is that nested UDTs are not supported if the UDT is > Python-only. Looking at the exception thrown, this seems to be because the > encoder on the Java end tries to encode the UDT as a Java class, which > doesn't exist for the [[PythonOnlyUDT]]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15993) PySpark RuntimeConfig should be immutable
[ https://issues.apache.org/jira/browse/SPARK-15993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15336381#comment-15336381 ] Vladimir Feinberg commented on SPARK-15993: --- So the intent is that changing {{RuntimeConfig}} via its {{set}} should change the current {{SparkSession}}'s actual settings? Right now that class is just some dictionary, with no connection to the Spark context at all - it wouldn't be a bug, but rather just something completely unimplemented. > PySpark RuntimeConfig should be immutable > - > > Key: SPARK-15993 > URL: https://issues.apache.org/jira/browse/SPARK-15993 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Vladimir Feinberg >Priority: Trivial > > {{pyspark.sql.RuntimeConfig}} should be immutable because changing its value > does nothing, which only leads to a confusing API (I tried to change a config > param by extracting the config of a running {{SparkSession}}, but failed to > realize I need to relaunch it). > Furthermore, {{RuntimeConfig}} is unlike {{SparkConf}} in that it can't ever > be used to specify a configuration when building a {{SparkSession}} anyway - > its only purpose is to figure out an existing session's params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16175) Handle None for all Python UDT
[ https://issues.apache.org/jira/browse/SPARK-16175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-16175: -- Attachment: nullvector.dbc Databricks notebook demonstrating the issue > Handle None for all Python UDT > -- > > Key: SPARK-16175 > URL: https://issues.apache.org/jira/browse/SPARK-16175 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Davies Liu > Attachments: nullvector.dbc > > > For a Scala UDT, we do not call serialize()/deserialize() when the value is > null; we should do the same in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16179) UDF explosion yielding empty dataframe fails
Vladimir Feinberg created SPARK-16179: - Summary: UDF explosion yielding empty dataframe fails Key: SPARK-16179 URL: https://issues.apache.org/jira/browse/SPARK-16179 Project: Spark Issue Type: Bug Components: PySpark, SQL Reporter: Vladimir Feinberg Command to replicate: https://gist.github.com/vlad17/cff2bab81929f44556a364ee90981ac0 Resulting failure: https://gist.github.com/vlad17/964c0a93510d79cb130c33700f6139b7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16237) PySpark gapply
Vladimir Feinberg created SPARK-16237: - Summary: PySpark gapply Key: SPARK-16237 URL: https://issues.apache.org/jira/browse/SPARK-16237 Project: Spark Issue Type: New Feature Components: PySpark, SQL Reporter: Vladimir Feinberg To maintain feature parity, `gapply` functionality should be added to `pyspark`'s `GroupedData` with an interface. The implementation already exists because it fulfilled a need in another package: https://github.com/vlad17/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py It needs to be migrated (to become a GroupedData method, the first argument now to be called self). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
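For context, {{gapply}} applies a user-supplied function to each group of a grouped DataFrame and concatenates the per-group results. The following stdlib-only sketch shows only the grouping contract on lists of dicts; all names here are illustrative, and the real implementation (as in the linked {{spark-sklearn}} code) operates on Spark DataFrames and must additionally declare an output schema:

```python
from itertools import groupby
from operator import itemgetter

def gapply(rows, key, func):
    """Group `rows` (dicts) by the `key` column, apply `func` to each
    group's list of rows, and concatenate the per-group results."""
    rows = sorted(rows, key=itemgetter(key))  # groupby needs sorted input
    out = []
    for k, group in groupby(rows, key=itemgetter(key)):
        out.extend(func(k, list(group)))
    return out

rows = [{"id": 1, "v": 2}, {"id": 2, "v": 5}, {"id": 1, "v": 4}]
# Per-group mean of column "v":
result = gapply(rows, "id",
                lambda k, g: [{"id": k, "mean": sum(r["v"] for r in g) / len(g)}])
# result == [{"id": 1, "mean": 3.0}, {"id": 2, "mean": 5.0}]
```

The toy deliberately returns a list per group rather than a single row, since a group-apply function may emit zero, one, or many output rows per group.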
[jira] [Updated] (SPARK-16237) PySpark gapply
[ https://issues.apache.org/jira/browse/SPARK-16237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-16237: -- Description: To maintain feature parity, {{gapply}} functionality should be added to PySpark's {{GroupedData}} with an interface. The implementation already exists because it fulfilled a need in another package: https://github.com/vlad17/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py It needs to be migrated (to become a {{GroupedData}} method, the first argument now to be called self). was: To maintain feature parity, `gapply` functionality should be added to PySpark's {{GroupedData}} with an interface. The implementation already exists because it fulfilled a need in another package: https://github.com/vlad17/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py It needs to be migrated (to become a {{GroupedData}} method, the first argument now to be called self). > PySpark gapply > -- > > Key: SPARK-16237 > URL: https://issues.apache.org/jira/browse/SPARK-16237 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Reporter: Vladimir Feinberg > > To maintain feature parity, {{gapply}} functionality should be added to > PySpark's {{GroupedData}} with an interface. > The implementation already exists because it fulfilled a need in another > package: > https://github.com/vlad17/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py > It needs to be migrated (to become a {{GroupedData}} method, the first > argument now to be called self). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16237) PySpark gapply
[ https://issues.apache.org/jira/browse/SPARK-16237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-16237: -- Description: To maintain feature parity, `gapply` functionality should be added to PySpark's {{GroupedData}} with an interface. The implementation already exists because it fulfilled a need in another package: https://github.com/vlad17/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py It needs to be migrated (to become a {{GroupedData}} method, the first argument now to be called self). was: To maintain feature parity, `gapply` functionality should be added to `pyspark`'s `GroupedData` with an interface. The implementation already exists because it fulfilled a need in another package: https://github.com/vlad17/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py It needs to be migrated (to become a GroupedData method, the first argument now to be called self). > PySpark gapply > -- > > Key: SPARK-16237 > URL: https://issues.apache.org/jira/browse/SPARK-16237 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Reporter: Vladimir Feinberg > > To maintain feature parity, `gapply` functionality should be added to > PySpark's {{GroupedData}} with an interface. > The implementation already exists because it fulfilled a need in another > package: > https://github.com/vlad17/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py > It needs to be migrated (to become a {{GroupedData}} method, the first > argument now to be called self). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16237) PySpark gapply
[ https://issues.apache.org/jira/browse/SPARK-16237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353495#comment-15353495 ] Vladimir Feinberg commented on SPARK-16237: --- cc [~mengxr] [~thunterdb] [~josephkb] Comments re exposing {{gapply()}}? > PySpark gapply > -- > > Key: SPARK-16237 > URL: https://issues.apache.org/jira/browse/SPARK-16237 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Reporter: Vladimir Feinberg > > To maintain feature parity, `gapply` functionality should be added to > `pyspark`'s `GroupedData` with an interface. > The implementation already exists because it fulfilled a need in another > package: > https://github.com/vlad17/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py > It needs to be migrated (to become a GroupedData method, the first argument > now to be called self). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16262) Impossible to remake new SparkContext using SparkSession API in PySpark
Vladimir Feinberg created SPARK-16262: - Summary: Impossible to remake new SparkContext using SparkSession API in Pyspark Key: SPARK-16262 URL: https://issues.apache.org/jira/browse/SPARK-16262 Project: Spark Issue Type: Bug Components: PySpark Reporter: Vladimir Feinberg There are multiple use cases where one might like to be able to stop and re-start a {{SparkSession}}: configuration changes or modular testing. The following code demonstrates that without clearing a hidden global {{SparkSession._instantiatedContext = None}} it is impossible to re-create a new Spark session after stopping one in the same process: {code} >>> from pyspark.sql import SparkSession >>> spark = SparkSession.builder.getOrCreate() Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). 16/06/28 11:28:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/06/28 11:28:10 WARN Utils: Your hostname, vlad-databricks resolves to a loopback address: 127.0.1.1; using 192.168.3.166 instead (on interface enp0s31f6) 16/06/28 11:28:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address >>> spark.stop() >>> spark = SparkSession.builder.getOrCreate() >>> spark.createDataFrame([(1,)]) Traceback (most recent call last): File "", line 1, in File "pyspark/sql/session.py", line 514, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, data), schema) File "pyspark/sql/session.py", line 394, in _createFromLocal return self._sc.parallelize(data), schema File "pyspark/context.py", line 410, in parallelize numSlices = int(numSlices) if numSlices is not None else self.defaultParallelism File "pyspark/context.py", line 346, in defaultParallelism return self._jsc.sc().defaultParallelism() AttributeError: 'NoneType' object has no attribute 'sc' {code} -- This message was sent by Atlassian JIRA 
(v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
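The traceback above is the symptom of a class-level cache that outlives {{stop()}}. The anti-pattern can be reproduced in miniature without Spark; this is a toy {{Session}} class, not the actual PySpark code:

```python
class Session:
    _instance = None  # class-level cache, analogous to SparkSession._instantiatedContext

    def __init__(self):
        self.ctx = object()  # stands in for the underlying SparkContext

    @classmethod
    def get_or_create(cls):
        # Returns the cached instance even if it was stopped earlier.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def stop(self):
        self.ctx = None  # tears down the context but leaves the cache in place

s1 = Session.get_or_create()
s1.stop()
s2 = Session.get_or_create()
assert s2 is s1 and s2.ctx is None  # stale session returned; any use of ctx fails

# What clearing the cache (the workaround discussed in this thread) buys us:
Session._instance = None
assert Session.get_or_create().ctx is not None  # a fresh, usable session
```

Folding that final cache-clearing line into {{stop()}} itself is exactly the fix debated in the comments below this issue.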
[jira] [Created] (SPARK-16263) SparkSession caches configuration in an unintuitive global way
Vladimir Feinberg created SPARK-16263: - Summary: SparkSession caches configuration in an unintuitive global way Key: SPARK-16263 URL: https://issues.apache.org/jira/browse/SPARK-16263 Project: Spark Issue Type: Bug Components: PySpark Reporter: Vladimir Feinberg -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16263) SparkSession caches configuration in an unintuitive global way
[ https://issues.apache.org/jira/browse/SPARK-16263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-16263: -- Description: The following use case demonstrates the issue. cls.spark = SparkSession.builder \ .config("spark.sql.retainGroupColumns", "false") \ .getOrCreate() was:The following use case demonstrates the issue. > SparkSession caches configuration in an unintuitive global way > - > > Key: SPARK-16263 > URL: https://issues.apache.org/jira/browse/SPARK-16263 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Vladimir Feinberg > > The following use case demonstrates the issue. > cls.spark = SparkSession.builder \ > .config("spark.sql.retainGroupColumns", > "false") \ > .getOrCreate() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16263) SparkSession caches configuration in an unintuitive global way
[ https://issues.apache.org/jira/browse/SPARK-16263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-16263: -- Description: The following use case demonstrates the issue. > SparkSession caches configuration in an unintuitive global way > - > > Key: SPARK-16263 > URL: https://issues.apache.org/jira/browse/SPARK-16263 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Vladimir Feinberg > > The following use case demonstrates the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16263) SparkSession caches configuration in an unintuitive global way
[ https://issues.apache.org/jira/browse/SPARK-16263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-16263: -- Description: The following use case demonstrates the issue. Note that as a workaround to SPARK-16262 I use {{reset_spark()}} to stop the current {{SparkSession}}. {code} >>> from pyspark.sql import SparkSession >>> def reset_spark(): global spark; spark.stop(); >>> SparkSession._instantiatedContext = None ... >>> spark = SparkSession.builder.getOrCreate() Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). 16/06/28 11:41:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/06/28 11:41:36 WARN Utils: Your hostname, vlad-databricks resolves to a loopback address: 127.0.1.1; using 192.168.3.166 instead (on interface enp0s31f6) 16/06/28 11:41:36 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address >>> spark.conf.get("spark.sql.retainGroupColumns") u'true' >>> reset_spark() >>> spark = SparkSession.builder.config("spark.sql.retainGroupColumns", >>> "false").getOrCreate() >>> spark.conf.get("spark.sql.retainGroupColumns") u'false' >>> reset_spark() >>> spark = SparkSession.builder.getOrCreate() >>> spark.conf.get("spark.sql.retainGroupColumns") u'false' >>> {code} The last line should output {{u'true'}} instead - there is absolutely no expectation for global config state to persist across sessions, which should use default configuration unless deviated from in each session's specific builder. was: The following use case demonstrates the issue. 
cls.spark = SparkSession.builder \ .config("spark.sql.retainGroupColumns", "false") \ .getOrCreate() > SparkSession caches configuration in an unintuitive global way > - > > Key: SPARK-16263 > URL: https://issues.apache.org/jira/browse/SPARK-16263 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Vladimir Feinberg > > The following use case demonstrates the issue. Note that as a workaround to > SPARK-16262 I use {{reset_spark()}} to stop the current {{SparkSession}}. > {code} > >>> from pyspark.sql import SparkSession > >>> def reset_spark(): global spark; spark.stop(); > >>> SparkSession._instantiatedContext = None > ... > >>> spark = SparkSession.builder.getOrCreate() > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/06/28 11:41:36 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/06/28 11:41:36 WARN Utils: Your hostname, vlad-databricks resolves to a > loopback address: 127.0.1.1; using 192.168.3.166 instead (on interface > enp0s31f6) > 16/06/28 11:41:36 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > >>> spark.conf.get("spark.sql.retainGroupColumns") > u'true' > >>> reset_spark() > >>> spark = SparkSession.builder.config("spark.sql.retainGroupColumns", > >>> "false").getOrCreate() > >>> spark.conf.get("spark.sql.retainGroupColumns") > u'false' > >>> reset_spark() > >>> spark = SparkSession.builder.getOrCreate() > >>> spark.conf.get("spark.sql.retainGroupColumns") > u'false' > >>> > {code} > The last line should output {{u'true'}} instead - there is absolutely no > expectation for global config state to persist across sessions, which should > use default configuration unless deviated from in each session's specific > builder. 
[jira] [Commented] (SPARK-16262) Impossible to remake new SparkContext using SparkSession API in PySpark
[ https://issues.apache.org/jira/browse/SPARK-16262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353630#comment-15353630 ] Vladimir Feinberg commented on SPARK-16262: --- What do you mean by "clearing that variable"? Are you referring to setting {{SparkSession._instantiatedContext = None}}? The issue with that is that I definitely see a user wanting to change the configuration of a spark session within a single process, but I don't think it's reasonable to expect them to set a hidden variable to {{None}}. Based on another bug I opened (SPARK-16263), this seems to be a general issue with global context in the API. > Impossible to remake new SparkContext using SparkSession API in Pyspark > --- > > Key: SPARK-16262 > URL: https://issues.apache.org/jira/browse/SPARK-16262 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Vladimir Feinberg >Priority: Minor > > There are multiple use cases where one might like to be able to stop and > re-start a {{SparkSession}}: configuration changes or modular testing. The > following code demonstrates that without clearing a hidden global > {{SparkSession._instantiatedContext = None}} it is impossible to re-create a > new Spark session after stopping one in the same process: > {code} > >>> from pyspark.sql import SparkSession > >>> spark = SparkSession.builder.getOrCreate() > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/06/28 11:28:10 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > 16/06/28 11:28:10 WARN Utils: Your hostname, vlad-databricks resolves to a > loopback address: 127.0.1.1; using 192.168.3.166 instead (on interface > enp0s31f6) > 16/06/28 11:28:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > >>> spark.stop() > >>> spark = SparkSession.builder.getOrCreate() > >>> spark.createDataFrame([(1,)]) > Traceback (most recent call last): > File "", line 1, in > File "pyspark/sql/session.py", line 514, in createDataFrame > rdd, schema = self._createFromLocal(map(prepare, data), schema) > File "pyspark/sql/session.py", line 394, in _createFromLocal > return self._sc.parallelize(data), schema > File "pyspark/context.py", line 410, in parallelize > numSlices = int(numSlices) if numSlices is not None else > self.defaultParallelism > File "pyspark/context.py", line 346, in defaultParallelism > return self._jsc.sc().defaultParallelism() > AttributeError: 'NoneType' object has no attribute 'sc' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16262) Impossible to remake new SparkContext using SparkSession API in PySpark
[ https://issues.apache.org/jira/browse/SPARK-16262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353642#comment-15353642 ] Vladimir Feinberg commented on SPARK-16262: --- Ah, are you suggesting that line should be inside of {{SparkSession.stop()}}? I'm totally OK with that, but then that's a fix for this bug, right? As in, your comment wasn't a contention you had with the JIRA itself? > Impossible to remake new SparkContext using SparkSession API in Pyspark > --- > > Key: SPARK-16262 > URL: https://issues.apache.org/jira/browse/SPARK-16262 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Vladimir Feinberg >Priority: Minor > > There are multiple use cases where one might like to be able to stop and > re-start a {{SparkSession}}: configuration changes or modular testing. The > following code demonstrates that without clearing a hidden global > {{SparkSession._instantiatedContext = None}} it is impossible to re-create a > new Spark session after stopping one in the same process: > {code} > >>> from pyspark.sql import SparkSession > >>> spark = SparkSession.builder.getOrCreate() > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/06/28 11:28:10 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > 16/06/28 11:28:10 WARN Utils: Your hostname, vlad-databricks resolves to a > loopback address: 127.0.1.1; using 192.168.3.166 instead (on interface > enp0s31f6) > 16/06/28 11:28:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > >>> spark.stop() > >>> spark = SparkSession.builder.getOrCreate() > >>> spark.createDataFrame([(1,)]) > Traceback (most recent call last): > File "", line 1, in > File "pyspark/sql/session.py", line 514, in createDataFrame > rdd, schema = self._createFromLocal(map(prepare, data), schema) > File "pyspark/sql/session.py", line 394, in _createFromLocal > return self._sc.parallelize(data), schema > File "pyspark/context.py", line 410, in parallelize > numSlices = int(numSlices) if numSlices is not None else > self.defaultParallelism > File "pyspark/context.py", line 346, in defaultParallelism > return self._jsc.sc().defaultParallelism() > AttributeError: 'NoneType' object has no attribute 'sc' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16263) SparkSession caches configuration in an unintuitive global way
[ https://issues.apache.org/jira/browse/SPARK-16263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353658#comment-15353658 ] Vladimir Feinberg commented on SPARK-16263: --- Right, I'm not arguing for the need for multiple sessions at once, but I think it's reasonable to expect this global state to have some notion of idempotency. I think whatever we do, the restrictions on the use case must be enforced by the API. If I'm really only ever allowed to invoke SparkSession creation once, then the builder should raise on the second time (and building a session should be a process independent of getOrCreate()-ing it). On the other hand, if we're OK with one-spark-session-at-a-time (which the code is mostly in line with already), then it's just a matter of clearing the global variables on shutdown. > SparkSession caches configuration in an unintuitive global way > - > > Key: SPARK-16263 > URL: https://issues.apache.org/jira/browse/SPARK-16263 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Vladimir Feinberg >Priority: Minor > > The following use case demonstrates the issue. Note that as a workaround to > SPARK-16262 I use {{reset_spark()}} to stop the current {{SparkSession}}. > {code} > >>> from pyspark.sql import SparkSession > >>> def reset_spark(): global spark; spark.stop(); > >>> SparkSession._instantiatedContext = None > ... > >>> spark = SparkSession.builder.getOrCreate() > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/06/28 11:41:36 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > 16/06/28 11:41:36 WARN Utils: Your hostname, vlad-databricks resolves to a > loopback address: 127.0.1.1; using 192.168.3.166 instead (on interface > enp0s31f6) > 16/06/28 11:41:36 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > >>> spark.conf.get("spark.sql.retainGroupColumns") > u'true' > >>> reset_spark() > >>> spark = SparkSession.builder.config("spark.sql.retainGroupColumns", > >>> "false").getOrCreate() > >>> spark.conf.get("spark.sql.retainGroupColumns") > u'false' > >>> reset_spark() > >>> spark = SparkSession.builder.getOrCreate() > >>> spark.conf.get("spark.sql.retainGroupColumns") > u'false' > >>> > {code} > The last line should output {{u'true'}} instead - there is absolutely no > expectation for global config state to persist across sessions, which should > use default configuration unless deviated from in each session's specific > builder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
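The "clearing the global variables on shutdown" option can be illustrated without Spark: if the builder writes options into module-global state, they leak into every later session unless something resets that state. A hedged sketch of the leak and the reset, with all names hypothetical:

```python
_global_conf = {}  # module-level state shared by every builder, as in the bug

class Builder:
    def config(self, key, value):
        _global_conf[key] = value  # mutates shared state, not this builder
        return self

    def get_or_create(self):
        # Defaults merged with whatever any earlier builder set.
        conf = {"spark.sql.retainGroupColumns": "true"}
        conf.update(_global_conf)
        return conf

first = Builder().config("spark.sql.retainGroupColumns", "false").get_or_create()
second = Builder().get_or_create()  # a fresh builder, but the old option persists
assert second["spark.sql.retainGroupColumns"] == "false"

# The remedy under discussion: reset the shared state when a session stops.
_global_conf.clear()
third = Builder().get_or_create()
assert third["spark.sql.retainGroupColumns"] == "true"  # defaults restored
```

Making {{stop()}} perform that final {{clear()}} is what gives each new session the expected default configuration.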
[jira] [Commented] (SPARK-16262) Impossible to remake new SparkContext using SparkSession API in PySpark
[ https://issues.apache.org/jira/browse/SPARK-16262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353659#comment-15353659 ] Vladimir Feinberg commented on SPARK-16262: --- Sure, I think we're agreeing. > Impossible to remake new SparkContext using SparkSession API in Pyspark > --- > > Key: SPARK-16262 > URL: https://issues.apache.org/jira/browse/SPARK-16262 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Vladimir Feinberg >Priority: Minor > > There are multiple use cases where one might like to be able to stop and > re-start a {{SparkSession}}: configuration changes or modular testing. The > following code demonstrates that without clearing a hidden global > {{SparkSession._instantiatedContext = None}} it is impossible to re-create a > new Spark session after stopping one in the same process: > {code} > >>> from pyspark.sql import SparkSession > >>> spark = SparkSession.builder.getOrCreate() > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/06/28 11:28:10 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > 16/06/28 11:28:10 WARN Utils: Your hostname, vlad-databricks resolves to a > loopback address: 127.0.1.1; using 192.168.3.166 instead (on interface > enp0s31f6) > 16/06/28 11:28:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > >>> spark.stop() > >>> spark = SparkSession.builder.getOrCreate() > >>> spark.createDataFrame([(1,)]) > Traceback (most recent call last): > File "", line 1, in > File "pyspark/sql/session.py", line 514, in createDataFrame > rdd, schema = self._createFromLocal(map(prepare, data), schema) > File "pyspark/sql/session.py", line 394, in _createFromLocal > return self._sc.parallelize(data), schema > File "pyspark/context.py", line 410, in parallelize > numSlices = int(numSlices) if numSlices is not None else > self.defaultParallelism > File "pyspark/context.py", line 346, in defaultParallelism > return self._jsc.sc().defaultParallelism() > AttributeError: 'NoneType' object has no attribute 'sc' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
[ https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357868#comment-15357868 ] Vladimir Feinberg commented on SPARK-4240: -- [~sethah] Hi Seth, it seems like your comment is outdated now that GBT is indeed in ML. Are you currently working on this? > Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy. > > > Key: SPARK-4240 > URL: https://issues.apache.org/jira/browse/SPARK-4240 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Sung Chung > > The gradient boosting as currently implemented estimates the loss-gradient in > each iteration using regression trees. At every iteration, the regression > trees are trained/split to minimize predicted gradient variance. > Additionally, the terminal node predictions are computed to minimize the > prediction variance. > However, such predictions won't be optimal for loss functions other than the > mean-squared error. The TreeBoosting refinement can help mitigate this issue > by modifying terminal node prediction values so that those predictions would > directly minimize the actual loss function. Although this still doesn't > change the fact that the tree splits were done through variance reduction, it > should still lead to improvement in gradient estimations, and thus better > performance. > The details of this can be found in the R vignette. This paper also shows how > to refine the terminal node predictions. > http://www.saedsayad.com/docs/gbm2.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
[ https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362721#comment-15362721 ]

Vladimir Feinberg commented on SPARK-4240:
--
Sorry for the delay in responding; I was on vacation for the long weekend. Would you mind pushing or linking what you have done so far? I'll get back to you tomorrow on whether I have the bandwidth to tackle this right now.
[jira] [Commented] (SPARK-15575) Remove breeze from dependencies?
[ https://issues.apache.org/jira/browse/SPARK-15575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15450836#comment-15450836 ]

Vladimir Feinberg commented on SPARK-15575:
---
Some of the biggest issues I've experienced with Breeze performance are that many operations you'd expect to be fast are not, and its pretty syntax and heavy use of implicits make it easy to hit those slow paths by accident. For instance:
1. Mixed dense/sparse operations frequently fall back to a generic implementation in Breeze that uses its Scala iterators.
2. Creation of vectors, under certain operations, results in unnecessary boxing of doubles (and of integers, for sparse vectors).
3. Slice vectors have no support for efficient operations. They are implemented in Breeze in a way that makes them no better than Array[Double], which again forces us onto Scala iterators whenever we want a fast, vectorized dot product, for instance.

Usability is tough sometimes. Even though the Vector[Double] interface seems flexible, a lot of implementations require explicit knowledge of the concrete vector type (sparse/dense), else Breeze silently uses the slow Scala interface. Heavy use of implicits is also a problem here, since they're not implemented for all permutations of vector types. It's also easy to write, e.g., `vec1 += vec2 * a * b`, which creates two intermediate vectors.

I think the biggest issue is that `ml.linalg.Vector` is Breeze-backed. We should use our own linear algebra (we do have `BLAS`, though to support slicing this interface would have to be expanded) and pass around `ArrayView[Double]` inside the vector instead. Breeze as a dependency, as mentioned below, is still very useful for optimization. I think we can keep it around for that, as long as it's only for that.

> Remove breeze from dependencies?
>
> Key: SPARK-15575
> URL: https://issues.apache.org/jira/browse/SPARK-15575
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
>
> This JIRA is for discussing whether we should remove Breeze from the dependencies of MLlib. The main issues with Breeze are Scala 2.12 support and performance.
> There are a few paths:
> # Keep the dependency. This could be OK, especially if the Scala version issues are fixed within Breeze.
> # Remove the dependency:
> ## Implement our own linear algebra operators as needed.
> ## Design a way to build Spark using custom linalg libraries of the user's choice. E.g., you could build MLlib using Breeze, or any other library supporting the required operations. This might require significant work.
> See [SPARK-6442] for related discussion.
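The intermediate-vector complaint above (`vec1 += vec2 * a * b` allocating two temporaries) is not Breeze-specific; the same pattern can be sketched in plain Python lists, contrasting the naive form with a scalar-folded, in-place axpy-style update (illustrative code, not Spark or Breeze internals):

```python
a, b = 2.0, 3.0
vec1 = [1.0] * 4
vec2 = [0.0, 1.0, 2.0, 3.0]

# Naive translation of `vec1 += vec2 * a * b`: each scalar product builds
# a fresh list, so two full-length temporaries are allocated.
tmp1 = [x * a for x in vec2]              # first intermediate vector
tmp2 = [x * b for x in tmp1]              # second intermediate vector
vec1 = [x + y for x, y in zip(vec1, tmp2)]

# Allocation-conscious form: fold the scalars first and update in place
# (a single fused pass, no intermediate vectors) -- the kind of axpy-style
# update a BLAS-backed implementation would issue.
vec1b = [1.0] * 4
s = a * b
for i, x in enumerate(vec2):
    vec1b[i] += s * x

assert vec1 == vec1b == [1.0, 7.0, 13.0, 19.0]
```

Both forms compute the same result; the difference is purely in allocation and traversal count, which is exactly where the Breeze implicit-operator syntax hides costs.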
[jira] [Updated] (SPARK-16728) migrate internal API for MLlib trees from spark.mllib to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-16728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vladimir Feinberg updated SPARK-16728:
--
Description:
Currently, spark.ml trees rely on spark.mllib implementations. There are three issues with this:
1. Spark ML's GBT TreeBoost algorithm requires storing additional information (the previous ensemble's prediction, for instance) inside the TreePoints (this is necessary to have loss-based splits for complex loss functions).
2. The old impurity API only lets you use summary statistics up to the second order. These are useless for several impurity measures and inadequate for others (e.g., absolute loss or Huber loss). It needs some renovation.
3. We should probably coalesce ImpurityAggregator, ImpurityCalculator, and Impurity into a single class (and use virtual calls rather than case statements when toggling over impurity types).

was:
Currently, spark.ml trees rely on spark.mllib implementations. There are two issues with this:
1. Spark ML's GBT TreeBoost algorithm requires storing additional information (the previous ensemble's prediction, for instance) inside the TreePoints (this is necessary to have loss-based splits for complex loss functions).
2. The old impurity API only lets you use summary statistics up to the second order. These are useless for several impurity measures and inadequate for others (e.g., absolute loss or Huber loss). It needs some renovation.

> migrate internal API for MLlib trees from spark.mllib to spark.ml
>
> Key: SPARK-16728
> URL: https://issues.apache.org/jira/browse/SPARK-16728
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Reporter: Vladimir Feinberg
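Point 2 above can be made concrete: statistics up to the second order (count, mean, variance) cannot recover the median that an absolute-loss leaf prediction needs. A small stdlib-only sketch (illustrative values, not Spark code) exhibits two leaf samples with identical count, mean, and variance but different LAD-optimal predictions:

```python
import math
import statistics

# Two leaf samples constructed to share count, mean (4.0), and variance,
# i.e. they are indistinguishable to a second-order summary-statistics API.
A = [1.0, 2.0, 9.0]
B = [4.0 - math.sqrt(19), 4.0, 4.0 + math.sqrt(19)]

assert abs(statistics.mean(A) - statistics.mean(B)) < 1e-9
assert abs(statistics.pvariance(A) - statistics.pvariance(B)) < 1e-9

# Yet their LAD-optimal leaf predictions (medians) differ, so the old
# impurity API's sufficient statistics are inadequate for absolute loss.
statistics.median(A)  # 2.0
statistics.median(B)  # 4.0
```

This is why computing the median (or a Huber location estimate) requires the calculator to see the leaf's full sample, motivating the internal API changes described above.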