[jira] [Updated] (SPARK-9719) spark.ml NaiveBayes doc cleanups
[ https://issues.apache.org/jira/browse/SPARK-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9719: - Shepherd: Joseph K. Bradley spark.ml NaiveBayes doc cleanups Key: SPARK-9719 URL: https://issues.apache.org/jira/browse/SPARK-9719 Project: Spark Issue Type: Documentation Components: ML, PySpark Reporter: Joseph K. Bradley Assignee: Feynman Liang Priority: Minor spark.ml NaiveBayesModel: Add Scala and Python doc for pi, theta Add setParam tag to NaiveBayes setModelType -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9719) spark.ml NaiveBayes doc cleanups
[ https://issues.apache.org/jira/browse/SPARK-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9719: - Assignee: Feynman Liang spark.ml NaiveBayes doc cleanups Key: SPARK-9719 URL: https://issues.apache.org/jira/browse/SPARK-9719 Project: Spark Issue Type: Documentation Components: ML, PySpark Reporter: Joseph K. Bradley Assignee: Feynman Liang Priority: Minor spark.ml NaiveBayesModel: Add Scala and Python doc for pi, theta Add setParam tag to NaiveBayes setModelType -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8890) Reduce memory consumption for dynamic partition insert
[ https://issues.apache.org/jira/browse/SPARK-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-8890. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8010 [https://github.com/apache/spark/pull/8010] Reduce memory consumption for dynamic partition insert -- Key: SPARK-8890 URL: https://issues.apache.org/jira/browse/SPARK-8890 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Michael Armbrust Priority: Critical Fix For: 1.5.0 Currently, InsertIntoHadoopFsRelation can run out of memory if the number of table partitions is large. The problem is that we open one output writer for each partition, and when data are randomized and when the number of partitions is large, we open a large number of output writers, leading to OOM. The solution here is to inject a sorting operation once the number of active partitions is beyond a certain point (e.g. 50?) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
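For context, here is a minimal sketch of the sort-based strategy described above (hypothetical code, not the actual InsertIntoHadoopFsRelation implementation): once the number of active partitions crosses the threshold, sorting rows by partition key means only one output writer has to be open at a time.

{code}
// Hypothetical sketch: maxOpenWriters mirrors the "certain point (e.g. 50?)"
// from the description; the writer open/close/write calls are stubbed out.
def writePartitioned(rows: Seq[(String, String)], maxOpenWriters: Int = 50): Unit = {
  val numPartitions = rows.map(_._1).distinct.size
  val ordered =
    if (numPartitions > maxOpenWriters) rows.sortBy(_._1) // one writer open at a time
    else rows                                             // a writer per partition is fine
  var currentKey: Option[String] = None
  for ((key, value) <- ordered) {
    if (!currentKey.contains(key)) {
      // close the previous writer (if any) and open a writer for `key`
      currentKey = Some(key)
    }
    // writer.write(value)
  }
}
{code}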
[jira] [Resolved] (SPARK-8160) Tungsten style external aggregation
[ https://issues.apache.org/jira/browse/SPARK-8160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8160. Resolution: Fixed Tungsten style external aggregation --- Key: SPARK-8160 URL: https://issues.apache.org/jira/browse/SPARK-8160 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Yin Huai Fix For: 1.5.0 Support using external sorting to run aggregate so we can easily process aggregates where each partition is much larger than memory size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
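The general idea is easier to see on a plain collection. A toy sketch of sort-based aggregation (not Spark's Tungsten code): after sorting by key, the aggregator streams over runs of equal keys, so it never needs a hash map of all groups in memory.

{code}
// Toy sort-based aggregation; in Spark the sort is external (spills to disk),
// which is what allows partitions much larger than memory to be aggregated.
def sortAggregate[K: Ordering, V](rows: Seq[(K, V)])(merge: (V, V) => V): Seq[(K, V)] = {
  val sorted = rows.sortBy(_._1)
  val out = scala.collection.mutable.ArrayBuffer.empty[(K, V)]
  for ((k, v) <- sorted) {
    if (out.nonEmpty && out.last._1 == k) out(out.size - 1) = (k, merge(out.last._2, v))
    else out += ((k, v))
  }
  out.toSeq
}

sortAggregate(Seq(("a", 1), ("b", 2), ("a", 3)))(_ + _) // Seq((a,4), (b,2))
{code}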
[jira] [Updated] (SPARK-9670) ML 1.5 QA: Examples: Check for new APIs requiring example code
[ https://issues.apache.org/jira/browse/SPARK-9670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9670: - Assignee: Ram Sriharsha ML 1.5 QA: Examples: Check for new APIs requiring example code -- Key: SPARK-9670 URL: https://issues.apache.org/jira/browse/SPARK-9670 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Joseph K. Bradley Assignee: Ram Sriharsha Priority: Minor Audit list of new features added to MLlib, and see which major items are missing example code (in the examples folder). We do not need examples for everything, only for major items such as new ML algorithms. For any such items: * Create a JIRA for that feature, and assign it to the author of the feature (or yourself if interested). * Link it to (a) the original JIRA which introduced that feature (related to) and (b) to this JIRA (requires). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9756) Make auxillary constructors for ML decision trees private
[ https://issues.apache.org/jira/browse/SPARK-9756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9756: - Fix Version/s: (was: 1.5.0) Make auxillary constructors for ML decision trees private - Key: SPARK-9756 URL: https://issues.apache.org/jira/browse/SPARK-9756 Project: Spark Issue Type: Improvement Components: ML Reporter: Feynman Liang Assignee: Feynman Liang Priority: Minor These classes should not (and actually can not) be instantiated directly because there is currently no public constructor for {{Node}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9756) Make auxillary constructors for ML decision trees private
[ https://issues.apache.org/jira/browse/SPARK-9756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9756: - Shepherd: Joseph K. Bradley Assignee: Feynman Liang Target Version/s: 1.5.0 Make auxillary constructors for ML decision trees private - Key: SPARK-9756 URL: https://issues.apache.org/jira/browse/SPARK-9756 Project: Spark Issue Type: Improvement Components: ML Reporter: Feynman Liang Assignee: Feynman Liang Priority: Minor These classes should not (and actually can not) be instantiated directly because there is currently no public constructor for {{Node}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9756) Make auxillary constructors for ML decision trees private
[ https://issues.apache.org/jira/browse/SPARK-9756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-9756. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8046 [https://github.com/apache/spark/pull/8046] Make auxillary constructors for ML decision trees private - Key: SPARK-9756 URL: https://issues.apache.org/jira/browse/SPARK-9756 Project: Spark Issue Type: Improvement Components: ML Reporter: Feynman Liang Assignee: Feynman Liang Priority: Minor Fix For: 1.5.0 These classes should not (and actually can not) be instantiated directly because there is currently no public constructor for {{Node}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9719) spark.ml NaiveBayes doc cleanups
[ https://issues.apache.org/jira/browse/SPARK-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-9719. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8047 [https://github.com/apache/spark/pull/8047] spark.ml NaiveBayes doc cleanups Key: SPARK-9719 URL: https://issues.apache.org/jira/browse/SPARK-9719 Project: Spark Issue Type: Documentation Components: ML, PySpark Reporter: Joseph K. Bradley Assignee: Feynman Liang Priority: Minor Fix For: 1.5.0 spark.ml NaiveBayesModel: Add Scala and Python doc for pi, theta Add setParam tag to NaiveBayes setModelType -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9066) Improve cartesian performance
[ https://issues.apache.org/jira/browse/SPARK-9066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662707#comment-14662707 ] Weizhong commented on SPARK-9066: - Yes, the root reason is the same: it is caused by scanning HDFS too many times. [PR#6454|https://github.com/apache/spark/pull/6454] uses coalesce to decrease the number of partitions, but adds two shuffles; changing the cartesian order can also decrease the number of scans, which I have done in [PR#7417|https://github.com/apache/spark/pull/7417] Improve cartesian performance -- Key: SPARK-9066 URL: https://issues.apache.org/jira/browse/SPARK-9066 Project: Spark Issue Type: Improvement Components: SQL Reporter: Weizhong Priority: Minor Currently, for CartesianProduct, if the right plan has fewer records per partition than the left plan, performance is bad because the right plan must be scanned many times. For example: {noformat} with single_value as ( select max(1) tpcds_val from date_dim ) select sum(ss_quantity * ss_sales_price) ssales, tpcds_val from store_sales, single_value group by tpcds_val {noformat} In the SQL above, the right plan has only 1 record while the left plan has 1823 partitions (in our test), each with more than 4000 records, so for each left plan partition we must scan the right plan's data from HDFS again. That is, the left plan is scanned _left_plan_partition_num_ times and the right plan is scanned _left_plan_partition_num * right_plan_partition_num_ times, for a total of _left_plan_partition_num * (1 + right_plan_partition_num)_ scans -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
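Plugging the reporter's figures into that formula shows why reordering helps (numbers assumed from the test described above):

{code}
// Scan counts per the description: the left plan is scanned once per left
// partition; the right plan once per (left partition, right partition) pair.
val leftParts = 1823
val rightParts = 1
val originalScans = leftParts * (1 + rightParts) // 3646 scans
val swappedScans = rightParts * (1 + leftParts)  // 1824 scans with the small side on the left
{code}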
[jira] [Resolved] (SPARK-9754) Remove TypeCheck in debug package
[ https://issues.apache.org/jira/browse/SPARK-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9754. Resolution: Fixed Fix Version/s: 1.5.0 Remove TypeCheck in debug package - Key: SPARK-9754 URL: https://issues.apache.org/jira/browse/SPARK-9754 Project: Spark Issue Type: Task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 TypeCheck no longer applies in the new Tungsten world. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9666) ML 1.5 QA: model save/load audit
[ https://issues.apache.org/jira/browse/SPARK-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9666: - Assignee: yuhao yang ML 1.5 QA: model save/load audit Key: SPARK-9666 URL: https://issues.apache.org/jira/browse/SPARK-9666 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Joseph K. Bradley Assignee: yuhao yang We should check to make sure no changes broke model import/export in spark.mllib. * If a model's name, data members, or constructors have changed _at all_, then we likely need to support a new save/load format version. Different versions must be tested in unit tests to ensure backwards compatibility (i.e., verify we can load old model formats). * Examples in the programming guide should include save/load when available. It's important to try running each example in the guide whenever it is modified (since there are no automated tests). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
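A round-trip check of the kind this audit calls for, sketched against spark.mllib's KMeansModel (other Saveable models follow the same pattern; `sc` is assumed to be a live SparkContext and the path is illustrative):

{code}
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vectors

val path = "/tmp/kmeans-model" // illustrative location
val model = new KMeansModel(Array(Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0)))
model.save(sc, path)
val loaded = KMeansModel.load(sc, path)
// Backwards compatibility additionally requires that older saved formats load.
assert(model.clusterCenters.sameElements(loaded.clusterCenters))
{code}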
[jira] [Updated] (SPARK-9738) remove FromUnsafe and add its codegen version to GenerateSafe
[ https://issues.apache.org/jira/browse/SPARK-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-9738: --- Description: In https://github.com/apache/spark/pull/7752 we added `FromUnsafe` to convert nested unsafe data like array/map/struct to safe versions. It's a quick solution, and we already have `GenerateSafe`, which does the conversion via codegen. So we should remove `FromUnsafe` and implement its codegen version in `GenerateSafe`. remove FromUnsafe and add its codegen version to GenerateSafe - Key: SPARK-9738 URL: https://issues.apache.org/jira/browse/SPARK-9738 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan In https://github.com/apache/spark/pull/7752 we added `FromUnsafe` to convert nested unsafe data like array/map/struct to safe versions. It's a quick solution, and we already have `GenerateSafe`, which does the conversion via codegen. So we should remove `FromUnsafe` and implement its codegen version in `GenerateSafe`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9738) remove FromUnsafe and add its codegen version to GenerateSafe
[ https://issues.apache.org/jira/browse/SPARK-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662763#comment-14662763 ] Wenchen Fan commented on SPARK-9738: [~joshrosen] sorry about the rush, I've filled in the description now :) remove FromUnsafe and add its codegen version to GenerateSafe - Key: SPARK-9738 URL: https://issues.apache.org/jira/browse/SPARK-9738 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan In https://github.com/apache/spark/pull/7752 we added `FromUnsafe` to convert nested unsafe data like array/map/struct to safe versions. It's a quick solution, and we already have `GenerateSafe`, which does the conversion via codegen. So we should remove `FromUnsafe` and implement its codegen version in `GenerateSafe`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9753) TungstenAggregate should also accept InternalRow instead of just UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9753. Resolution: Fixed Fix Version/s: 1.5.0 TungstenAggregate should also accept InternalRow instead of just UnsafeRow -- Key: SPARK-9753 URL: https://issues.apache.org/jira/browse/SPARK-9753 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.5.0 Since we need to project out the key and value anyway, there is no need to accept only UnsafeRows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9748) Centriod typo in KMeansModel
[ https://issues.apache.org/jira/browse/SPARK-9748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-9748. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8037 [https://github.com/apache/spark/pull/8037] Centriod typo in KMeansModel Key: SPARK-9748 URL: https://issues.apache.org/jira/browse/SPARK-9748 Project: Spark Issue Type: Task Components: MLlib Affects Versions: 1.4.1 Reporter: Bertrand Dechoux Assignee: Bertrand Dechoux Priority: Trivial Labels: typo Fix For: 1.6.0 A minor typo (centriod -> centroid). Readable variable names help all users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9748) Centriod typo in KMeansModel
[ https://issues.apache.org/jira/browse/SPARK-9748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9748: - Target Version/s: 1.6.0 Centriod typo in KMeansModel Key: SPARK-9748 URL: https://issues.apache.org/jira/browse/SPARK-9748 Project: Spark Issue Type: Task Components: MLlib Affects Versions: 1.4.1 Reporter: Bertrand Dechoux Assignee: Bertrand Dechoux Priority: Trivial Labels: typo Fix For: 1.6.0 A minor typo (centriod -> centroid). Readable variable names help all users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9748) Centriod typo in KMeansModel
[ https://issues.apache.org/jira/browse/SPARK-9748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9748: - Assignee: Bertrand Dechoux Centriod typo in KMeansModel Key: SPARK-9748 URL: https://issues.apache.org/jira/browse/SPARK-9748 Project: Spark Issue Type: Task Components: MLlib Affects Versions: 1.4.1 Reporter: Bertrand Dechoux Assignee: Bertrand Dechoux Priority: Trivial Labels: typo A minor typo (centriod -> centroid). Readable variable names help all users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9744) Add RDD method to map with lag and lead
[ https://issues.apache.org/jira/browse/SPARK-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662298#comment-14662298 ] Jerry Z commented on SPARK-9744: Fixed! Sorry, I didn't know you were referring to the title. I think that for performance's sake this would be a handy feature to have; it also saves me a lot of typing and keeps my code from wrapping around. On a semi-related note, why does cogroup need an iterator of the class? join() doesn't. Add RDD method to map with lag and lead --- Key: SPARK-9744 URL: https://issues.apache.org/jira/browse/SPARK-9744 Project: Spark Issue Type: Wish Reporter: Jerry Z Priority: Minor To avoid zipping with index and doing numerous mappings and joins, add a single map-like method with two additional parameters: (1) a list of offsets (negative for lag, 0 for current, positive for lead) and (2) a default value. The other difference from map is that the function takes an argument of List[T] rather than just T. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
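For context, this is roughly the zip-and-join boilerplate the request wants to fold into one call, sketched against the existing RDD API (`withLag1` and `default` are illustrative names):

{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Pair each element with its lag-1 predecessor via zipWithIndex + join:
// exactly the multi-step dance a built-in lag/lead method would replace.
def withLag1[T: ClassTag](rdd: RDD[T], default: T): RDD[(T, T)] = {
  val indexed = rdd.zipWithIndex().map(_.swap)            // (i, x_i)
  val shifted = indexed.map { case (i, x) => (i + 1, x) } // (i + 1, x_i)
  indexed.leftOuterJoin(shifted).sortByKey()
    .map { case (_, (cur, lag)) => (cur, lag.getOrElse(default)) }
}
{code}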
[jira] [Issue Comment Deleted] (SPARK-9744) Add RDD method to map with lag and lead
[ https://issues.apache.org/jira/browse/SPARK-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerry Z updated SPARK-9744: --- Comment: was deleted (was: Fixed! Sorry, I didn't know you were referring to the title. I think that for performance's sake this would be a handy feature to have; it also saves me a lot of typing and keeps my code from wrapping around. On a semi-related note, why does cogroup need an iterator of the class? join() doesn't.) Add RDD method to map with lag and lead --- Key: SPARK-9744 URL: https://issues.apache.org/jira/browse/SPARK-9744 Project: Spark Issue Type: Wish Reporter: Jerry Z Priority: Minor To avoid zipping with index and doing numerous mappings and joins, add a single map-like method with two additional parameters: (1) a list of offsets (negative for lag, 0 for current, positive for lead) and (2) a default value. The other difference from map is that the function takes an argument of List[T] rather than just T. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662303#comment-14662303 ] Feynman Liang edited comment on SPARK-9660 at 8/7/15 7:23 PM: -- Logistic regression [only supports binary classification|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L83], but various [scaladocs|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L50] assert that this is a backwards compatibility feature, suggesting that multiclass is supported. This is made more confusing by the fact that inheriting from {{ProbabilisticClassifier}} exposes a {{setThresholds(Array[Double])}} public method, potentially allowing a user to set more than two thresholds on a binary classifier... It may make sense to consider adding {{numClasses}} to {{ClassifierParams}} and explicitly check that in {{HasThresholds}} (self type annotation?) was (Author: fliang): Logistic regression [only supports binary classification|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L83], but various [scaladocs|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L50] assert that this is a backwards compatibility feature, suggesting that multiclass is supported. This is made more confusing by the fact that inheriting from {{ProbabilisticClassifier}} exposes a {{setThresholds(Array[Double])}} public method, potentially allowing a user to set more than two thresholds on a binary classifier... It may make sense to consider adding {{numClasses}} to {{ClassifierParams}} and explicitly check that when setting thresholds. ML 1.5 QA: API: New Scala APIs, docs Key: SPARK-9660 URL: https://issues.apache.org/jira/browse/SPARK-9660 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Reporter: Joseph K. Bradley Audit new public Scala APIs added to MLlib. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please comment here, or better yet create JIRAs and link them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
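To make the concern concrete, a sketch assuming the 1.5 spark.ml API: nothing stops a caller from configuring three thresholds on an estimator that only trains binary models.

{code}
import org.apache.spark.ml.classification.LogisticRegression

// Three classes' worth of thresholds on a binary-only estimator; the
// numClasses check in ClassifierParams suggested above would catch this.
val lr = new LogisticRegression().setThresholds(Array(0.2, 0.3, 0.5))
{code}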
[jira] [Issue Comment Deleted] (SPARK-9744) Add RDD method to map with lag and lead
[ https://issues.apache.org/jira/browse/SPARK-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerry Z updated SPARK-9744: --- Comment: was deleted (was: Fixed! Sorry, I didn't know you were referring to the title. I think that for performance's sake this would be a handy feature to have; it also saves me a lot of typing and keeps my code from wrapping around. On a semi-related note, why does cogroup need an iterator of the class? join() doesn't.) Add RDD method to map with lag and lead --- Key: SPARK-9744 URL: https://issues.apache.org/jira/browse/SPARK-9744 Project: Spark Issue Type: Wish Reporter: Jerry Z Priority: Minor To avoid zipping with index and doing numerous mappings and joins, add a single map-like method with two additional parameters: (1) a list of offsets (negative for lag, 0 for current, positive for lead) and (2) a default value. The other difference from map is that the function takes an argument of List[T] rather than just T. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662303#comment-14662303 ] Feynman Liang commented on SPARK-9660: -- Logistic regression [only supports binary classification|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L83], but various [scaladocs|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L50] assert that this is a backwards compatibility feature, suggesting that multiclass is supported. This is made more confusing by the fact that inheriting from {{ProbabilisticClassifier}} exposes a {{setThresholds(Array[Double])}} public method, potentially allowing a user to set more than two thresholds on a binary classifier... It may make sense to consider adding {{numClasses}} to {{ClassifierParams}} and explicitly check that when setting thresholds. ML 1.5 QA: API: New Scala APIs, docs Key: SPARK-9660 URL: https://issues.apache.org/jira/browse/SPARK-9660 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Reporter: Joseph K. Bradley Audit new public Scala APIs added to MLlib. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please comment here, or better yet create JIRAs and link them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662317#comment-14662317 ] Feynman Liang edited comment on SPARK-9660 at 8/7/15 7:58 PM: -- Should [RandomForestClassificationModel's aux constructor|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala#L139] be private? Ditto for [DecisionTreeRegressionModel|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L110] was (Author: fliang): Should [RandomForestClassificationModel's aux constructor|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala#L139] be private? ML 1.5 QA: API: New Scala APIs, docs Key: SPARK-9660 URL: https://issues.apache.org/jira/browse/SPARK-9660 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Reporter: Joseph K. Bradley Audit new public Scala APIs added to MLlib. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please comment here, or better yet create JIRAs and link them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods
[ https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662333#comment-14662333 ] Bertrand Dechoux edited comment on SPARK-9720 at 8/7/15 7:59 PM: - I could take care of it. Here is the list (only in spark.ml) : * DecisionTreeClassificationModel * DecisionTreeRegressionModel * GBTClassificationModel * GBTRegressionModel * NaiveBayesModel * RFormula * RFormulaModel * RandomForestClassificationModel * RandomForestRegressionModel The question is : do we want to enforce that identifiable types should be identifiable by their toString. It does make sense. The following question is : can we introduce potential API breaking change in the API in order to do it? If the answer is yes, the easy way would be to set Identifiable.toString as final and compose it with an overridable empty suffix {code} private[spark] trait Identifiable { /** * An immutable unique ID for the object and its derivatives. */ val uid: String def toStringSuffix: String = "" override final def toString: String = uid + toStringSuffix } {code} Is there a committer that could validate this proposal? was (Author: bdechoux): I could take care of it. Here is the list (only in spark.ml) : * DecisionTreeClassificationModel * DecisionTreeRegressionModel * GBTClassificationModel * GBTRegressionModel * NaiveBayesModel * RFormula * RFormulaModel * RandomForestClassificationModel * RandomForestRegressionModel The question is do we want to enforce that identifiable types should be identifiable by their toString. It does make sense. The following question is can we introduce potential API breaking change in the API in order to do it? If the answer is yes, the easy way would be to set Identifiable.toString as final and compose it with an overridable empty suffix private[spark] trait Identifiable { /** * An immutable unique ID for the object and its derivatives. */ val uid: String def toStringSuffix: String = "" override final def toString: String = uid + toStringSuffix } Is there a committer that could validate this proposal? spark.ml Identifiable types should have UID in toString methods --- Key: SPARK-9720 URL: https://issues.apache.org/jira/browse/SPARK-9720 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Minor Labels: starter It would be nice to print the UID (instance name) in toString methods. That's the default behavior for Identifiable, but some types override the default toString and do not print the UID. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
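A self-contained sketch of how the proposal would behave (hypothetical model class and uid; the trait is the one from the comment above): the uid always prints, and subclasses can only append to it.

{code}
trait Identifiable {
  val uid: String
  def toStringSuffix: String = ""
  override final def toString: String = uid + toStringSuffix
}

class ExampleModel(override val uid: String) extends Identifiable {
  override def toStringSuffix: String = " with 3 classes"
}

println(new ExampleModel("nb_4a2f")) // prints "nb_4a2f with 3 classes"
{code}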
[jira] [Commented] (SPARK-9746) PairRDDFunctions.countByKey: values/counts always 1
[ https://issues.apache.org/jira/browse/SPARK-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662360#comment-14662360 ] Andreas commented on SPARK-9746: Maybe I'm too dumb, but the count for each key is always '1' PairRDDFunctions.countByKey: values/counts always 1 --- Key: SPARK-9746 URL: https://issues.apache.org/jira/browse/SPARK-9746 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andreas In org.apache.spark.rdd.PairRDDFunctions: countByKey(): Map[K, Long] = self.withScope { self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap } obviously always returns count 1 for each key. If I understand the docs correctly I would expect this implementation: self.mapValues(_.size).reduceByKey(_ + _).collect().toMap -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
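For reference, countByKey counts occurrences of each key, so the {{mapValues(_ => 1L)}} is correct; the always-1 result reported above arises when countByKey is called after groupBy, which leaves exactly one pair per key. A short demonstration (`sc` assumed):

{code}
val pairs = sc.parallelize(List(("a", 1), ("a", 2), ("b", 3)))
pairs.countByKey()               // Map(a -> 2, b -> 1): duplicate keys do accumulate
pairs.groupBy(_._1).countByKey() // Map(a -> 1, b -> 1): groupBy left one pair per key
{code}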
[jira] [Commented] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662367#comment-14662367 ] Feynman Liang commented on SPARK-9660: -- {{LogisticRegressionModel.toString()}} is missing a short description. ML 1.5 QA: API: New Scala APIs, docs Key: SPARK-9660 URL: https://issues.apache.org/jira/browse/SPARK-9660 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Reporter: Joseph K. Bradley Audit new public Scala APIs added to MLlib. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please comment here, or better yet create JIRAs and link them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9749) DenseMatrix equals does not account for isTransposed
Feynman Liang created SPARK-9749: Summary: DenseMatrix equals does not account for isTransposed Key: SPARK-9749 URL: https://issues.apache.org/jira/browse/SPARK-9749 Project: Spark Issue Type: Bug Reporter: Feynman Liang Priority: Blocker A matrix is not always equal to its transpose, but the current implementation of {{equals}} in [DenseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L261] does not account for the {{isTransposed}} flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
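A short demonstration of why the flag matters (assuming the mllib linalg API): the two matrices below share one backing array but differ only in {{isTransposed}}, so they represent different matrices and must not compare equal.

{code}
import org.apache.spark.mllib.linalg.DenseMatrix

val m  = new DenseMatrix(2, 2, Array(1.0, 2.0, 3.0, 4.0))                      // column-major
val mT = new DenseMatrix(2, 2, Array(1.0, 2.0, 3.0, 4.0), isTransposed = true) // row-major
// m(0, 1) == 3.0 while mT(0, 1) == 2.0, so an equals() comparing only the
// values array (ignoring isTransposed) would wrongly call them equal.
{code}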
[jira] [Created] (SPARK-9740) first/last aggregate NULL behavior
Herman van Hovell created SPARK-9740: Summary: first/last aggregate NULL behavior Key: SPARK-9740 URL: https://issues.apache.org/jira/browse/SPARK-9740 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.6.0 Reporter: Herman van Hovell Priority: Minor The FIRST/LAST aggregates, implemented as part of the new UDAF interface, return the first or last non-null value (if any) found. This is a departure from the behavior of the old FIRST/LAST aggregates and from the FIRST_VALUE/LAST_VALUE aggregates in Hive, which would return a null value if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' this behavior for the old UDAF interface. Hive makes this behavior configurable by adding a skipNulls flag. I would suggest doing the same, and making the default behavior compatible with Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
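The two semantics are easy to contrast on a plain sequence. A sketch of the configurable behavior suggested above (the skipNulls flag is the proposal, not a committed API):

{code}
def first[T](values: Seq[Option[T]], skipNulls: Boolean): Option[T] =
  if (skipNulls) values.find(_.isDefined).flatten // current new-UDAF behavior
  else values.headOption.flatten                  // Hive-compatible default

val col = Seq(None, Some(1), Some(2))
first(col, skipNulls = false) // None: null was genuinely the first value
first(col, skipNulls = true)  // Some(1): first non-null value
{code}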
[jira] [Updated] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Damian Guy updated SPARK-9340: -- Affects Version/s: 1.3.0 ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch -- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0 Reporter: Damian Guy Attachments: ParquetTypesConverterTest.scala The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also include !parquetType.isRepetition(Repetition.REPEATED), and this case will then need to be handled in the else branch -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
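A sketch of the suggested condition (hypothetical helper; the real change belongs in ParquetTypesConverter.toDataType, and the Repetition import path varies across Parquet versions):

{code}
import org.apache.parquet.schema.Type
import org.apache.parquet.schema.Type.Repetition

// toDataType should call toPrimitiveDataType only when this holds; repeated
// primitives then fall through to the else branch, e.g. as an ArrayType.
def isNonRepeatedPrimitive(parquetType: Type): Boolean =
  parquetType.isPrimitive && !parquetType.isRepetition(Repetition.REPEATED)
{code}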
[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14661830#comment-14661830 ] Apache Spark commented on SPARK-9340: - User 'dguy' has created a pull request for this issue: https://github.com/apache/spark/pull/8032 ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch -- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0 Reporter: Damian Guy Attachments: ParquetTypesConverterTest.scala The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also include !parquetType.isRepetition(Repetition.REPEATED), and this case will then need to be handled in the else branch -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14661831#comment-14661831 ] Damian Guy commented on SPARK-9340: --- I created a pull request against the 1.3 branch (closest to what I am using): https://github.com/apache/spark/pull/8032 ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch -- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0 Reporter: Damian Guy Attachments: ParquetTypesConverterTest.scala The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also include !parquetType.isRepetition(Repetition.REPEATED), and this case will then need to be handled in the else branch -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9340: --- Assignee: (was: Apache Spark) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch -- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0 Reporter: Damian Guy Attachments: ParquetTypesConverterTest.scala The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also include !parquetType.isRepetition(Repetition.REPEATED), and this case will then need to be handled in the else branch -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9340: --- Assignee: Apache Spark ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch -- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0 Reporter: Damian Guy Assignee: Apache Spark Attachments: ParquetTypesConverterTest.scala The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also include !parquetType.isRepetition(Repetition.REPEATED), and this case will then need to be handled in the else branch -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior
[ https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14661859#comment-14661859 ] Herman van Hovell commented on SPARK-9740: -- BTW: I encountered this while doing tests for SPARK-8641. Unfortunately it is kind of a PITA to create a proper test using an Aggregate: they do not enforce sorting, so the result of FIRST/LAST is nondeterministic. first/last aggregate NULL behavior -- Key: SPARK-9740 URL: https://issues.apache.org/jira/browse/SPARK-9740 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.6.0 Reporter: Herman van Hovell Priority: Minor The FIRST/LAST aggregates, implemented as part of the new UDAF interface, return the first or last non-null value (if any) found. This is a departure from the behavior of the old FIRST/LAST aggregates and from the FIRST_VALUE/LAST_VALUE aggregates in Hive, which would return a null value if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' this behavior for the old UDAF interface. Hive makes this behavior configurable by adding a skipNulls flag. I would suggest doing the same, and making the default behavior compatible with Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662317#comment-14662317 ] Feynman Liang edited comment on SPARK-9660 at 8/7/15 8:04 PM: -- Should [RandomForestClassificationModel's aux constructor|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala#L139] be private? Ditto for [DecisionTreeRegressionModel|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L110], [RandomForestRegressionModel|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L110] was (Author: fliang): Should [RandomForestClassificationModel's aux constructor|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala#L139] be private? Ditto for [DecisionTreeRegressionModel|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L110] ML 1.5 QA: API: New Scala APIs, docs Key: SPARK-9660 URL: https://issues.apache.org/jira/browse/SPARK-9660 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Reporter: Joseph K. Bradley Audit new public Scala APIs added to MLlib. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please comment here, or better yet create JIRAs and link them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662317#comment-14662317 ] Feynman Liang edited comment on SPARK-9660 at 8/7/15 8:04 PM: -- Should [RandomForestClassificationModel's aux constructor|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala#L139] be private? Ditto for [DecisionTreeRegressionModel|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L110], [RandomForestRegressionModel|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/ml/regression/RandomForestRegressor.scala#L128] was (Author: fliang): Should [RandomForestClassificationModel's aux constructor|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala#L139] be private? Ditto for [DecisionTreeRegressionModel|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L110], [RandomForestRegressionModel|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L110] ML 1.5 QA: API: New Scala APIs, docs Key: SPARK-9660 URL: https://issues.apache.org/jira/browse/SPARK-9660 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Reporter: Joseph K. Bradley Audit new public Scala APIs added to MLlib. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please comment here, or better yet create JIRAs and link them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9746) PairRDDFunctions.countByKey: values/counts always 1
[ https://issues.apache.org/jira/browse/SPARK-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662341#comment-14662341 ] Andreas edited comment on SPARK-9746 at 8/7/15 8:04 PM: Sorry, but I don't agree. cntxt.parallelize(List(("a", 1), ("a", 2))).groupBy(_._1).countByKey() returns 'Map(a -> 1)' but should in my opinion return 'Map(a -> 2)'. If the values (counts) are irrelevant, then why is this function called *count*ByKey and why does it return a Map instead of a Set? The current implementation has no added value compared to 'pairRDD.keys.collect().toSet' was (Author: agrothe1): Sorry, but I don't agree. cntxt.parallelize(List(("a", 1), ("a", 2))).groupBy(_._1).countByKey() returns 'Map(a -> 1)' but should in my opinion return 'Map(a -> 2)'. If the values (counts) are irrelevant, then why is this function called *count*ByKey and why does it return a Map instead of a Set? The current implementation has no added value compared to 'pairRDD.keys.collect().toSet' cntxt.paralize PairRDDFunctions.countByKey: values/counts always 1 --- Key: SPARK-9746 URL: https://issues.apache.org/jira/browse/SPARK-9746 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andreas In org.apache.spark.rdd.PairRDDFunctions: countByKey(): Map[K, Long] = self.withScope { self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap } obviously always returns count 1 for each key. If I understand the docs correctly I would expect this implementation: self.mapValues(_.size).reduceByKey(_ + _).collect().toMap -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662356#comment-14662356 ] Joseph K. Bradley commented on SPARK-9660: -- Sure, sounds good. (same as for DTClassificationModel) ML 1.5 QA: API: New Scala APIs, docs Key: SPARK-9660 URL: https://issues.apache.org/jira/browse/SPARK-9660 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Reporter: Joseph K. Bradley Audit new public Scala APIs added to MLlib. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please comment here, or better yet create JIRAs and link them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662358#comment-14662358 ] Joseph K. Bradley commented on SPARK-9660: -- I want to add it as public for all PredictionModel types eventually, so I don't see harm in leaving it public. ML 1.5 QA: API: New Scala APIs, docs Key: SPARK-9660 URL: https://issues.apache.org/jira/browse/SPARK-9660 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Reporter: Joseph K. Bradley Audit new public Scala APIs added to MLlib. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please comment here, or better yet create JIRAs and link them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662378#comment-14662378 ] Feynman Liang commented on SPARK-9660: -- [~josephkb] Don't users have to provide {{thresholds}} when configuring the model, which would require knowing the number of classes before training? ML 1.5 QA: API: New Scala APIs, docs Key: SPARK-9660 URL: https://issues.apache.org/jira/browse/SPARK-9660 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Reporter: Joseph K. Bradley Audit new public Scala APIs added to MLlib. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please comment here, or better yet create JIRAs and link them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods
[ https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662333#comment-14662333 ] Bertrand Dechoux edited comment on SPARK-9720 at 8/7/15 8:41 PM: - I could take care of it. Here is the list (only in spark.ml) : * DecisionTreeClassificationModel * DecisionTreeRegressionModel * GBTClassificationModel * GBTRegressionModel * NaiveBayesModel * RFormula * RFormulaModel * RandomForestClassificationModel * RandomForestRegressionModel The question is : do we want to enforce that identifiable types should be identifiable by their toString. It does make sense. The following question is : can we introduce potential API breaking changes in order to do so? If the answer is yes, the easy way would be to set Identifiable.toString as final and compose it with an overridable empty suffix {code} private[spark] trait Identifiable { /** * An immutable unique ID for the object and its derivatives. */ val uid: String def toStringSuffix: String = "" override final def toString: String = uid + toStringSuffix } {code} Is there a committer that could validate this proposal? was (Author: bdechoux): I could take care of it. Here is the list (only in spark.ml) : * DecisionTreeClassificationModel * DecisionTreeRegressionModel * GBTClassificationModel * GBTRegressionModel * NaiveBayesModel * RFormula * RFormulaModel * RandomForestClassificationModel * RandomForestRegressionModel The question is : do we want to enforce that identifiable types should be identifiable by their toString. It does make sense. The following question is : can we introduce potential API breaking change in the API in order to do it? If the answer is yes, the easy way would be to set Identifiable.toString as final and compose it with an overridable empty suffix {code} private[spark] trait Identifiable { /** * An immutable unique ID for the object and its derivatives. */ val uid: String def toStringSuffix: String = "" override final def toString: String = uid + toStringSuffix } {code} Is there a committer that could validate this proposal? spark.ml Identifiable types should have UID in toString methods --- Key: SPARK-9720 URL: https://issues.apache.org/jira/browse/SPARK-9720 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Minor Labels: starter It would be nice to print the UID (instance name) in toString methods. That's the default behavior for Identifiable, but some types override the default toString and do not print the UID. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods
[ https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662333#comment-14662333 ] Bertrand Dechoux edited comment on SPARK-9720 at 8/7/15 8:53 PM: - I could take care of it. Here is the list (only in spark.ml) : * DecisionTreeClassificationModel * DecisionTreeRegressionModel * GBTClassificationModel * GBTRegressionModel * NaiveBayesModel * RFormula * RFormulaModel * RandomForestClassificationModel * RandomForestRegressionModel The question is : do we want to enforce that identifiable types should be identifiable by their toString. It does make sense. The following question is : can we introduce potential API breaking changes in order to do so? If the answer is yes, the easy way would be to set Identifiable.toString as final and compose it with an overridable empty suffix {code} private[spark] trait Identifiable { /** * An immutable unique ID for the object and its derivatives. */ val uid: String def toStringSuffix: String = "" override final def toString: String = uid + toStringSuffix } {code} Could you, or a committer, validate this proposal? was (Author: bdechoux): I could take care of it. Here is the list (only in spark.ml) : * DecisionTreeClassificationModel * DecisionTreeRegressionModel * GBTClassificationModel * GBTRegressionModel * NaiveBayesModel * RFormula * RFormulaModel * RandomForestClassificationModel * RandomForestRegressionModel The question is : do we want to enforce that identifiable types should be identifiable by their toString. It does make sense. The following question is : can we introduce potential API breaking changes in order to do so? If the answer is yes, the easy way would be to set Identifiable.toString as final and compose it with an overridable empty suffix {code} private[spark] trait Identifiable { /** * An immutable unique ID for the object and its derivatives. */ val uid: String def toStringSuffix: String = "" override final def toString: String = uid + toStringSuffix } {code} Is there a committer that could validate this proposal? spark.ml Identifiable types should have UID in toString methods --- Key: SPARK-9720 URL: https://issues.apache.org/jira/browse/SPARK-9720 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Minor Labels: starter It would be nice to print the UID (instance name) in toString methods. That's the default behavior for Identifiable, but some types override the default toString and do not print the UID. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9476) Kafka stream loses leader after 2h of operation
[ https://issues.apache.org/jira/browse/SPARK-9476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662446#comment-14662446 ] Ruben Ramalho commented on SPARK-9476: -- Sorry for the late reply, I promise to keep my response delay much smaller from now on. There aren't any error logs, but this problem compromises the normal operation of the analytics server. Yes, simpler jobs do run in the same environment. This same setup manages to run correctly for two hours; it's after 2h of operation that this problem arises, which is strange. Unfortunately I cannot share the relevant code, at least not in its entirety, but I can share with you what I am doing. I am consuming data from Apache Kafka, as positional updates, doing window operations over this data and extracting features. These features are then fed to machine learning algorithms, and tips are generated and fed back to Kafka (a different topic). If you want specific parts of the code I can provide you with that! I was using Apache Kafka 0.8.2.0 when this issue appeared, then I updated to 0.8.2.1 (in hopes of the problem being fixed), but the issue persists. I think Apache Spark at some point is corrupting the Apache Kafka topics, though I cannot isolate why that is happening. I have used both the Kafka direct stream and the regular stream and the problem seems to persist. Thank you, R. Ramalho Kafka stream loses leader after 2h of operation Key: SPARK-9476 URL: https://issues.apache.org/jira/browse/SPARK-9476 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.1 Environment: Docker, Centos, Spark standalone, core i7, 8Gb Reporter: Ruben Ramalho This seems to happen every 2h; it happens both with the direct stream and the regular stream. I'm doing window operations over a 1h period (if that can help). 
Here's part of the error message: 2015-07-30 13:27:23 WARN ClientUtils$:89 - Fetching topic metadata with correlation id 10 for topics [Set(updates)] from broker [id:0,host:192.168.3.23,port:3000] failed java.nio.channels.ClosedChannelException at kafka.network.BlockingChannel.send(BlockingChannel.scala:100) at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:73) at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72) at kafka.producer.SyncProducer.send(SyncProducer.scala:113) at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58) at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93) at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60) 2015-07-30 13:27:23 INFO SyncProducer:68 - Disconnecting from 192.168.3.23:3000 2015-07-30 13:27:23 WARN ConsumerFetcherManager$LeaderFinderThread:89 - [spark-group_81563e123e9f-1438259236988-fc3d82bf-leader-finder-thread], Failed to find leader for Set([updates,0]) kafka.common.KafkaException: fetching topic metadata for topics [Set(oversight-updates)] from broker [ArrayBuffer(id:0,host:192.168.3.23,port:3000)] failed at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:72) at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93) at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60) Caused by: java.nio.channels.ClosedChannelException at kafka.network.BlockingChannel.send(BlockingChannel.scala:100) at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:73) at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72) at kafka.producer.SyncProducer.send(SyncProducer.scala:113) at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58) After the crash I tried to communicate with Kafka with a simple Scala consumer and producer and had no problem at all. Spark, though, needs a Kafka container restart to resume normal operation. There are no errors in the Kafka log, apart from an improperly closed connection. I have been trying to solve this problem for days; I suspect it has something to do with Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9746) PairRDDFunctions.countByKey: values/counts always 1
[ https://issues.apache.org/jira/browse/SPARK-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662440#comment-14662440 ] Sean Owen commented on SPARK-9746: -- RDDs are not maps. An RDD of (K,V) is merely a collection of (K,V). K is not unique. Otherwise, what would countByKey mean? If K were unique, then all of the counts would be 1 and this method would make no sense. PairRDDFunctions.countByKey: values/counts always 1 --- Key: SPARK-9746 URL: https://issues.apache.org/jira/browse/SPARK-9746 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andreas In org.apache.spark.rdd.PairRDDFunctions: countByKey(): Map[K, Long] = self.withScope { self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap } obviously always returns count 1 for each key. If I understand the docs correctly I would expect this implementation: self.mapValues(_.size).reduceByKey(_ + _).collect().toMap -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
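A small illustration of those semantics, as a spark-shell sketch ({{sc}} is the shell's context; the data is made up):
{code}
// countByKey counts how many (K, V) elements carry each key; it never
// inspects the values. Keys in an RDD are not unique, unlike in a Map.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 7)))
pairs.countByKey()  // Map(a -> 2, b -> 1): "a" appears in two elements
{code}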
[jira] [Commented] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662442#comment-14662442 ] Feynman Liang commented on SPARK-9660: -- {{GradientDescent$.runMiniBatchSGD}} should either use a default argument or specify the default convergence tolerance in the [method overload|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala#L267] ML 1.5 QA: API: New Scala APIs, docs Key: SPARK-9660 URL: https://issues.apache.org/jira/browse/SPARK-9660 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Reporter: Joseph K. Bradley Audit new public Scala APIs added to MLlib. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please comment here, or better yet create JIRAs and link them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
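For reference, the default-argument option looks roughly like this. The signature is heavily simplified and hypothetical (the real overload also takes data, gradient, updater, and step-size parameters), and the 0.001 shown is illustrative only:
{code}
object GradientDescentSketch {
  // With a default value, callers that omit the tolerance get a documented
  // behavior instead of an undocumented overload.
  def runMiniBatchSGD(numIterations: Int,
                      convergenceTol: Double = 0.001): Unit = {
    // ... iterate until numIterations or relative improvement < convergenceTol
  }
}

GradientDescentSketch.runMiniBatchSGD(100)        // uses the default tolerance
GradientDescentSketch.runMiniBatchSGD(100, 1e-4)  // overrides it
{code}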
[jira] [Commented] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662455#comment-14662455 ] Feynman Liang commented on SPARK-9660: -- Most documentation in [MultivariateOnlineSummarizer|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala#L224] was lost and should be re-added. ML 1.5 QA: API: New Scala APIs, docs Key: SPARK-9660 URL: https://issues.apache.org/jira/browse/SPARK-9660 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Reporter: Joseph K. Bradley Audit new public Scala APIs added to MLlib. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please comment here, or better yet create JIRAs and link them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9677) Enable SQLQuerySuite.aggregation with codegen updates peak execution memory
[ https://issues.apache.org/jira/browse/SPARK-9677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662494#comment-14662494 ] Andrew Or commented on SPARK-9677: -- Resolved by https://github.com/apache/spark/pull/8015 Enable SQLQuerySuite.aggregation with codegen updates peak execution memory - Key: SPARK-9677 URL: https://issues.apache.org/jira/browse/SPARK-9677 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Andrew Or Priority: Blocker Fix For: 1.5.0 It was disabled in https://github.com/apache/spark/pull/7983 Looked like the test case was written against the old aggregate. We need to rewrite it to work for the new aggregate (and make sure the memory usage reporting works for the new aggregate). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-9677) Enable SQLQuerySuite.aggregation with codegen updates peak execution memory
[ https://issues.apache.org/jira/browse/SPARK-9677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-9677. Resolution: Fixed Fix Version/s: 1.5.0 Enable SQLQuerySuite.aggregation with codegen updates peak execution memory - Key: SPARK-9677 URL: https://issues.apache.org/jira/browse/SPARK-9677 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Andrew Or Priority: Blocker Fix For: 1.5.0 It was disabled in https://github.com/apache/spark/pull/7983 Looked like the test case was written against the old aggregate. We need to rewrite it to work for the new aggregate (and make sure the memory usage reporting works for the new aggregate). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-8481) GaussianMixtureModel predict accepting single vector
[ https://issues.apache.org/jira/browse/SPARK-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reopened SPARK-8481: -- Reopening before merging version fix PR GaussianMixtureModel predict accepting single vector Key: SPARK-8481 URL: https://issues.apache.org/jira/browse/SPARK-8481 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Dariusz Kobylarz Assignee: Dariusz Kobylarz Priority: Minor Labels: GaussianMixtureModel, MLlib Fix For: 1.5.0 Original Estimate: 24h Remaining Estimate: 24h GaussianMixtureModel lacks a method to predict a cluster for a single input vector where no spark context would be involved, i.e. /** Maps given point to its cluster index. */ def predict(point: Vector) : Int -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9746) PairRDDFunctions.countByKey: values/counts always 1
[ https://issues.apache.org/jira/browse/SPARK-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662522#comment-14662522 ] Andreas commented on SPARK-9746: Many thanks for your responsiveness and patience. I admire your contribution to this awesome project. BR from a very thankful user. PairRDDFunctions.countByKey: values/counts always 1 --- Key: SPARK-9746 URL: https://issues.apache.org/jira/browse/SPARK-9746 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andreas In org.apache.spark.rdd.PairRDDFunctions: countByKey(): Map[K, Long] = self.withScope { self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap } obviously always returns count 1 for each key. If I understand the docs correctly I would expect this implementation: self.mapValues(_.size).reduceByKey(_ + _).collect().toMap -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9745) Application hangs when the last executor fails with dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-9745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-9745: - Priority: Blocker (was: Critical) Application hangs when the last executor fails with dynamic allocation --- Key: SPARK-9745 URL: https://issues.apache.org/jira/browse/SPARK-9745 Project: Spark Issue Type: Bug Components: PySpark, Scheduler, YARN Affects Versions: 1.5.0 Environment: YARN + Pyspark + Dynamic Allocation Reporter: Alex Angelini Assignee: Andrew Or Priority: Blocker Attachments: am_hung_job.png, executors_hung_job.png, logs_hung_job.png, tasks_hung_job.png When a job has only a single executor remaining and that executor dies (due to something like an OOM), the application fails to notice that there are no executors left and it hangs indefinitely. This only happens when dynamic allocation is enabled. The following images were taken from a hung application with no executors: !logs_hung_job.png! ^^ *Notice how 1 executor was lost, but the application never requested it to be removed* !am_hung_job.png! !executors_hung_job.png! !tasks_hung_job.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9375) The total number of executor(s) requested by the driver may be negative
[ https://issues.apache.org/jira/browse/SPARK-9375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-9375: - Priority: Critical (was: Major) The total number of executor(s) requested by the driver may be negative - Key: SPARK-9375 URL: https://issues.apache.org/jira/browse/SPARK-9375 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: KaiXinXIaoLei Priority: Critical Attachments: The total number of executor(s) is negative in AM log.png I set "spark.dynamicAllocation.enabled = true". I run a big job. I find a problem in the ApplicationMaster log: the total number of executor(s) requested by the driver is negative. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9754) Remove TypeCheck in debug package
[ https://issues.apache.org/jira/browse/SPARK-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9754: --- Assignee: Apache Spark (was: Reynold Xin) Remove TypeCheck in debug package - Key: SPARK-9754 URL: https://issues.apache.org/jira/browse/SPARK-9754 Project: Spark Issue Type: Task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark TypeCheck no longer applies in the new Tungsten world. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9750) SparseMatrix should override equals
[ https://issues.apache.org/jira/browse/SPARK-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662538#comment-14662538 ] Joseph K. Bradley commented on SPARK-9750: -- [~fliang] Are you working on this? SparseMatrix should override equals --- Key: SPARK-9750 URL: https://issues.apache.org/jira/browse/SPARK-9750 Project: Spark Issue Type: Bug Reporter: Feynman Liang Priority: Blocker [SparseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L479] should override equals to ensure that two instances of the same matrix are equal. This implementation should take into account the {{isTransposed}} flag and the fact that {{values}} may not be in the same order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
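To make the requirement concrete: an entry-wise comparison like the hypothetical helper below ignores both the layout flag and the ordering of the backing arrays. This is a sketch only; the eventual fix in Spark may be implemented differently.
{code}
import org.apache.spark.mllib.linalg.Matrix

// Hypothetical helper: equality means same shape and same value at every
// (i, j), regardless of isTransposed or the order of the values array.
def matricesEqual(a: Matrix, b: Matrix): Boolean =
  a.numRows == b.numRows && a.numCols == b.numCols &&
    (0 until a.numRows).forall { i =>
      (0 until a.numCols).forall(j => a(i, j) == b(i, j))
    }
{code}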
[jira] [Commented] (SPARK-9750) SparseMatrix should override equals
[ https://issues.apache.org/jira/browse/SPARK-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662536#comment-14662536 ] Apache Spark commented on SPARK-9750: - User 'feynmanliang' has created a pull request for this issue: https://github.com/apache/spark/pull/8042 SparseMatrix should override equals --- Key: SPARK-9750 URL: https://issues.apache.org/jira/browse/SPARK-9750 Project: Spark Issue Type: Bug Reporter: Feynman Liang Priority: Blocker [SparseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L479] should override equals to ensure that two instances of the same matrix are equal. This implementation should take into account the {{isTransposed}} flag and the fact that {{values}} may not be in the same order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9750) SparseMatrix should override equals
[ https://issues.apache.org/jira/browse/SPARK-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9750: --- Assignee: Apache Spark SparseMatrix should override equals --- Key: SPARK-9750 URL: https://issues.apache.org/jira/browse/SPARK-9750 Project: Spark Issue Type: Bug Reporter: Feynman Liang Assignee: Apache Spark Priority: Blocker [SparseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L479] should override equals to ensure that two instances of the same matrix are equal. This implementation should take into account the {{isTransposed}} flag and the fact that {{values}} may not be in the same order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9750) SparseMatrix should override equals
[ https://issues.apache.org/jira/browse/SPARK-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9750: --- Assignee: (was: Apache Spark) SparseMatrix should override equals --- Key: SPARK-9750 URL: https://issues.apache.org/jira/browse/SPARK-9750 Project: Spark Issue Type: Bug Reporter: Feynman Liang Priority: Blocker [SparseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L479] should override equals to ensure that two instances of the same matrix are equal. This implementation should take into account the {{isTransposed}} flag and the fact that {{values}} may not be in the same order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9754) Remove TypeCheck in debug package
Reynold Xin created SPARK-9754: -- Summary: Remove TypeCheck in debug package Key: SPARK-9754 URL: https://issues.apache.org/jira/browse/SPARK-9754 Project: Spark Issue Type: Task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin TypeCheck no longer applies in the new Tungsten world. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9568) Spark MLlib 1.5.0 testing umbrella
[ https://issues.apache.org/jira/browse/SPARK-9568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9568: - Description: h2. API * Check binary API compatibility (SPARK-9658) * Audit new public APIs (from the generated html doc) ** Scala (SPARK-9660) ** Java compatibility (SPARK-9661) ** Python coverage (SPARK-9662) * Check Experimental, DeveloperApi tags (SPARK-9665) h2. Algorithms and performance *Performance* * _List any other missing performance tests from spark-perf here_ * LDA online/EM (SPARK-7455) * ElasticNet for linear regression and logistic regression (SPARK-7456) * PIC (SPARK-7454) * ALS.recommendAll (SPARK-7457) * perf-tests in Python (SPARK-7539) *Correctness* * model save/load (SPARK-9666) h2. Documentation and example code * For new algorithms, create JIRAs for updating the user guide (SPARK-9668) * For major components, create JIRAs for example code (SPARK-9670) * Update Programming Guide for 1.4 (towards end of QA) (SPARK-9671) was: h2. API * Check binary API compatibility * Audit new public APIs (from the generated html doc) ** Scala ** Java compatibility ** Python coverage * Check Experimental, DeveloperApi tags h2. Algorithms and performance *Performance* * _List any other missing performance tests from spark-perf here_ * LDA online/EM (SPARK-7455) * ElasticNet for linear regression and logistic regression (SPARK-7456) * PIC (SPARK-7454) * ALS.recommendAll (SPARK-7457) * perf-tests in Python (SPARK-7539) *Correctness* * model save/load (SPARK-9666) h2. Documentation and example code * For new algorithms, create JIRAs for updating the user guide (SPARK-9668) * For major components, create JIRAs for example code (SPARK-9670) * Update Programming Guide for 1.4 (towards end of QA) (SPARK-9671) Spark MLlib 1.5.0 testing umbrella -- Key: SPARK-9568 URL: https://issues.apache.org/jira/browse/SPARK-9568 Project: Spark Issue Type: Umbrella Components: MLlib Reporter: Reynold Xin Assignee: Xiangrui Meng h2. API * Check binary API compatibility (SPARK-9658) * Audit new public APIs (from the generated html doc) ** Scala (SPARK-9660) ** Java compatibility (SPARK-9661) ** Python coverage (SPARK-9662) * Check Experimental, DeveloperApi tags (SPARK-9665) h2. Algorithms and performance *Performance* * _List any other missing performance tests from spark-perf here_ * LDA online/EM (SPARK-7455) * ElasticNet for linear regression and logistic regression (SPARK-7456) * PIC (SPARK-7454) * ALS.recommendAll (SPARK-7457) * perf-tests in Python (SPARK-7539) *Correctness* * model save/load (SPARK-9666) h2. Documentation and example code * For new algorithms, create JIRAs for updating the user guide (SPARK-9668) * For major components, create JIRAs for example code (SPARK-9670) * Update Programming Guide for 1.4 (towards end of QA) (SPARK-9671) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9755) Add method documentation to MultivariateOnlineSummarizer
Feynman Liang created SPARK-9755: Summary: Add method documentation to MultivariateOnlineSummarizer Key: SPARK-9755 URL: https://issues.apache.org/jira/browse/SPARK-9755 Project: Spark Issue Type: Documentation Components: MLlib Reporter: Feynman Liang Priority: Minor Docs present in 1.4 are lost in current 1.5 branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9755) Add method documentation to MultivariateOnlineSummarizer
[ https://issues.apache.org/jira/browse/SPARK-9755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662569#comment-14662569 ] Feynman Liang commented on SPARK-9755: -- Working on this. Add method documentation to MultivariateOnlineSummarizer Key: SPARK-9755 URL: https://issues.apache.org/jira/browse/SPARK-9755 Project: Spark Issue Type: Documentation Components: MLlib Reporter: Feynman Liang Priority: Minor Docs present in 1.4 are lost in current 1.5 branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9714) Cannot insert into a table using pySpark
[ https://issues.apache.org/jira/browse/SPARK-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-9714: Description: This is a bug on the master branch. After creating the table (yun is the table name) with the corresponding fields, I ran the following command. from pyspark.sql import * sc.parallelize([Row(id=1, name="test", description="")]).toDF().write.mode("append").saveAsTable("yun") I get the following error: Py4JJavaError: An error occurred while calling o100.saveAsTable. : org.apache.spark.SparkException: Task not serializable Caused by: java.io.NotSerializableException: org.apache.hadoop.fs.Path Serialization stack: - object not serializable (class: org.apache.hadoop.fs.Path, value: /user/hive/warehouse/yun) - field (class: org.apache.hadoop.hive.ql.metadata.Table, name: path, type: class org.apache.hadoop.fs.Path) - object (class org.apache.hadoop.hive.ql.metadata.Table, yun) - field (class: org.apache.hadoop.hive.ql.metadata.Partition, name: table, type: class org.apache.hadoop.hive.ql.metadata.Table) - object (class org.apache.hadoop.hive.ql.metadata.Partition, yun()) - field (class: scala.collection.immutable.Stream$Cons, name: hd, type: class java.lang.Object) - object (class scala.collection.immutable.Stream$Cons, Stream(yun())) - field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: $outer, type: class scala.collection.immutable.Stream) - object (class scala.collection.immutable.Stream$$anonfun$map$1, function0) - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: interface scala.Function0) - object (class scala.collection.immutable.Stream$Cons, Stream(HivePartition(List(),HiveStorageDescriptor(/user/hive/warehouse/yun,org.apache.hadoop.mapred.TextInputFormat,org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,Map(serialization.format -> 1) - field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: $outer, type: class scala.collection.immutable.Stream) - object (class scala.collection.immutable.Stream$$anonfun$map$1, function0) - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: interface scala.Function0) - object (class scala.collection.immutable.Stream$Cons, Stream(/user/hive/warehouse/yun)) - field (class: org.apache.spark.sql.hive.MetastoreRelation, name: paths, type: interface scala.collection.Seq) - object (class org.apache.spark.sql.hive.MetastoreRelation, MetastoreRelation default, yun, None ) - field (class: org.apache.spark.sql.hive.execution.InsertIntoHiveTable, name: table, type: class org.apache.spark.sql.hive.MetastoreRelation) - object (class org.apache.spark.sql.hive.execution.InsertIntoHiveTable, InsertIntoHiveTable (MetastoreRelation default, yun, None), Map(), false, false ConvertToSafe TungstenProject [CAST(description#10, FloatType) AS description#16,CAST(id#11L, StringType) AS id#17,name#12] PhysicalRDD [description#10,id#11L,name#12], MapPartitionsRDD[17] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:-2 ) - field (class: org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3, name: $outer, type: class org.apache.spark.sql.hive.execution.InsertIntoHiveTable) - object (class org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3, function2) at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40) at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301) ... 30 more was: This is a bug on the master branch. After creating the table (yun is the table name) with the corresponding fields, I ran the following command. from pyspark.sql import * sc.parallelize([Row(id=1, name="test", description="")]).toDF().write.mode("append").saveAsTable("yun") I get the following error: Py4JJavaError: An error occurred while calling o100.saveAsTable. : org.apache.spark.SparkException: Task not serializable Caused by: java.io.NotSerializableException: org.apache.hadoop.fs.Path Serialization stack: - object not serializable (class: org.apache.hadoop.fs.Path, value: dbfs:/user/hive/warehouse/yun) - field (class: org.apache.hadoop.hive.ql.metadata.Table, name: path, type: class org.apache.hadoop.fs.Path) - object (class org.apache.hadoop.hive.ql.metadata.Table, yun) - field (class: org.apache.hadoop.hive.ql.metadata.Partition, name: table, type: class org.apache.hadoop.hive.ql.metadata.Table)
[jira] [Created] (SPARK-9756) Make auxiliary constructors for ML decision trees private
Feynman Liang created SPARK-9756: Summary: Make auxiliary constructors for ML decision trees private Key: SPARK-9756 URL: https://issues.apache.org/jira/browse/SPARK-9756 Project: Spark Issue Type: Improvement Components: ML Reporter: Feynman Liang Priority: Minor Fix For: 1.5.0 These classes should not (and actually cannot) be instantiated directly because there is currently no public constructor for {{Node}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
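The change amounts to restricting the extra constructors, roughly as below. The class and constructor are hypothetical; in Spark proper the modifier would be {{private[ml]}} rather than plain {{private}}, which is used here only so the sketch compiles standalone.
{code}
class ExampleTreeModel(val uid: String, val rootNodeCount: Int) {
  // Auxiliary constructor hidden from user code; callers must go through
  // the training API instead of instantiating the model directly.
  private def this(rootNodeCount: Int) = this("exampleTreeModel", rootNodeCount)
}
{code}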
[jira] [Created] (SPARK-9752) Sample operator should avoid row copying and support UnsafeRow
Reynold Xin created SPARK-9752: -- Summary: Sample operator should avoid row copying and support UnsafeRow Key: SPARK-9752 URL: https://issues.apache.org/jira/browse/SPARK-9752 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9751) Audit operators to make sure they can support UnsafeRows
Reynold Xin created SPARK-9751: -- Summary: Audit operators to make sure they can support UnsafeRows Key: SPARK-9751 URL: https://issues.apache.org/jira/browse/SPARK-9751 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin An umbrella ticket to track various operators that should be able to support UnsafeRow to avoid copying. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9746) PairRDDFunctions.countByKey: values/counts always 1
[ https://issues.apache.org/jira/browse/SPARK-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662510#comment-14662510 ] Andreas commented on SPARK-9746: Still, the scaladoc "Count the number of elements for each key, collecting the results to a local Map" is misleading to me. Maybe it should read "Count the number of (distinct? or whatever) keys", for whatever purpose that is needed. PairRDDFunctions.countByKey: values/counts always 1 --- Key: SPARK-9746 URL: https://issues.apache.org/jira/browse/SPARK-9746 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andreas In org.apache.spark.rdd.PairRDDFunctions: countByKey(): Map[K, Long] = self.withScope { self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap } obviously always returns count 1 for each key. If I understand the docs correctly I would expect this implementation: self.mapValues(_.size).reduceByKey(_ + _).collect().toMap -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods
[ https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662514#comment-14662514 ] Joseph K. Bradley commented on SPARK-9720: -- I like the proposal, but I don't think we should break APIs...which unfortunately means we will need to stick with encouragement instead of enforcement. Would you mind sending a PR to update those classes with issues? spark.ml Identifiable types should have UID in toString methods --- Key: SPARK-9720 URL: https://issues.apache.org/jira/browse/SPARK-9720 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Minor Labels: starter It would be nice to print the UID (instance name) in toString methods. That's the default behavior for Identifiable, but some types override the default toString and do not print the UID. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
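Under the non-breaking route, each listed class would simply lead its existing toString with the uid, along these lines (the model class and its fields are hypothetical):
{code}
class ExampleTreeClassificationModel(val uid: String, numNodes: Int) {
  // Keep the class-specific details but always print the uid first.
  override def toString: String =
    s"$uid: ExampleTreeClassificationModel with $numNodes nodes"
}

new ExampleTreeClassificationModel("dtc_a12b", 15).toString
// "dtc_a12b: ExampleTreeClassificationModel with 15 nodes"
{code}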
[jira] [Assigned] (SPARK-9753) TungstenAggregate should also accept InternalRow instead of just UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9753: --- Assignee: Yin Huai (was: Apache Spark) TungstenAggregate should also accept InternalRow instead of just UnsafeRow -- Key: SPARK-9753 URL: https://issues.apache.org/jira/browse/SPARK-9753 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Since we need to project key and value out, there is no need to accept only UnsafeRows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9753) TungstenAggregate should also accept InternalRow instead of just UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9753: --- Assignee: Apache Spark (was: Yin Huai) TungstenAggregate should also accept InternalRow instead of just UnsafeRow -- Key: SPARK-9753 URL: https://issues.apache.org/jira/browse/SPARK-9753 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Apache Spark Priority: Blocker Since we need to project key and value out, there is no need to accept only UnsafeRows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9746) PairRDDFunctions.countByKey: values/counts always 1
[ https://issues.apache.org/jira/browse/SPARK-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662523#comment-14662523 ] Sean Owen commented on SPARK-9746: -- It does not count the number of distinct keys, nor does it count distinct values for the key, so I don't think that's accurate. It counts the number of times each key appears. I suppose there are many ways of saying this; here it says it counts the number of elements that include each key, which seems like a reasonable description of the behavior. PairRDDFunctions.countByKey: values/counts always 1 --- Key: SPARK-9746 URL: https://issues.apache.org/jira/browse/SPARK-9746 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andreas In org.apache.spark.rdd.PairRDDFunctions: countByKey(): Map[K, Long] = self.withScope { self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap } obviously always returns count 1 for each key. If I understand the docs correctly I would expect this implementation: self.mapValues(_.size).reduceByKey(_ + _).collect().toMap -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7454) Perf test for power iteration clustering (PIC)
[ https://issues.apache.org/jira/browse/SPARK-7454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662534#comment-14662534 ] Joseph K. Bradley commented on SPARK-7454: -- [~javadba] I should have pinged you before, but could you please send a PR for that perf-test? Thank you! Perf test for power iteration clustering (PIC) -- Key: SPARK-7454 URL: https://issues.apache.org/jira/browse/SPARK-7454 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Stephen Boesch -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9754) Remove TypeCheck in debug package
[ https://issues.apache.org/jira/browse/SPARK-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662540#comment-14662540 ] Apache Spark commented on SPARK-9754: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/8043 Remove TypeCheck in debug package - Key: SPARK-9754 URL: https://issues.apache.org/jira/browse/SPARK-9754 Project: Spark Issue Type: Task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin TypeCheck no longer applies in the new Tungsten world. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9754) Remove TypeCheck in debug package
[ https://issues.apache.org/jira/browse/SPARK-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9754: --- Assignee: Reynold Xin (was: Apache Spark) Remove TypeCheck in debug package - Key: SPARK-9754 URL: https://issues.apache.org/jira/browse/SPARK-9754 Project: Spark Issue Type: Task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin TypeCheck no longer applies in the new Tungsten world. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9750) SparseMatrix should override equals
[ https://issues.apache.org/jira/browse/SPARK-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662566#comment-14662566 ] Feynman Liang commented on SPARK-9750: -- Yep. SparseMatrix should override equals --- Key: SPARK-9750 URL: https://issues.apache.org/jira/browse/SPARK-9750 Project: Spark Issue Type: Bug Reporter: Feynman Liang Priority: Blocker [SparseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L479] should override equals to ensure that two instances of the same matrix are equal. This implementation should take into account the {{isTransposed}} flag and {{values}} may not be in the same order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9620) generated UnsafeProjection does not support many columns or large expressions
[ https://issues.apache.org/jira/browse/SPARK-9620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9620: --- Assignee: (was: Apache Spark) generated UnsafeProjection does not support many columns or large expressions Key: SPARK-9620 URL: https://issues.apache.org/jira/browse/SPARK-9620 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Priority: Critical We put all the expressions in one function of UnsafeProjection, which could reach the 65k code-size limit in the JVM. We should split them into multiple functions, as we do for MutableProjection and SafeProjection. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9620) generated UnsafeProjection does not support many columns or large expressions
[ https://issues.apache.org/jira/browse/SPARK-9620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9620: --- Assignee: Apache Spark generated UnsafeProjection does not support many columns or large expressions Key: SPARK-9620 URL: https://issues.apache.org/jira/browse/SPARK-9620 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Apache Spark Priority: Critical We put all the expressions in one function of UnsafeProjection, which could reach the 65k code-size limit in the JVM. We should split them into multiple functions, as we do for MutableProjection and SafeProjection. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9620) generated UnsafeProjection does not support many columns or large expressions
[ https://issues.apache.org/jira/browse/SPARK-9620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662579#comment-14662579 ] Apache Spark commented on SPARK-9620: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/8044 generated UnsafeProjection does not support many columns or large expressions Key: SPARK-9620 URL: https://issues.apache.org/jira/browse/SPARK-9620 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Priority: Critical We put all the expressions in one function of UnsafeProjection, which could reach the 65k code-size limit in the JVM. We should split them into multiple functions, as we do for MutableProjection and SafeProjection. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
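The splitting idea can be sketched with plain string templates (illustrative only; the real code generator assembles Java source from expression trees, and the method and variable names below are made up):
{code}
// Emit one helper method per chunk of expressions so no single generated
// method exceeds the JVM's 65535-byte bytecode limit, then chain the
// helpers from apply().
val assignments = (1 to 500).map(i => s"out[$i] = eval$i(row);")
val helpers = assignments.grouped(100).toSeq.zipWithIndex.map { case (body, i) =>
  s"private void apply_$i(InternalRow row) { ${body.mkString(" ")} }"
}
val dispatcher = "public void apply(InternalRow row) { " +
  helpers.indices.map(i => s"apply_$i(row);").mkString(" ") + " }"
{code}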
[jira] [Created] (SPARK-9757) Can't create persistent data source tables with decimal
Michael Armbrust created SPARK-9757: --- Summary: Can't create persistent data source tables with decimal Key: SPARK-9757 URL: https://issues.apache.org/jira/browse/SPARK-9757 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Michael Armbrust Priority: Blocker {code} Caused by: java.lang.UnsupportedOperationException: Parquet does not support decimal. See HIVE-6384 at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getObjectInspector(ArrayWritableObjectInspector.java:102) at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.init(ArrayWritableObjectInspector.java:60) at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113) at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:339) at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288) at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:194) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:597) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:576) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply$mcV$sp(ClientWrapper.scala:358) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply(ClientWrapper.scala:356) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply(ClientWrapper.scala:356) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256) at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211) at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248) at org.apache.spark.sql.hive.client.ClientWrapper.createTable(ClientWrapper.scala:356) at org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:351) at org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:198) at org.apache.spark.sql.hive.execution.CreateMetastoreDataSource.run(commands.scala:152) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9719) spark.ml NaiveBayes doc cleanups
[ https://issues.apache.org/jira/browse/SPARK-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9719: --- Assignee: (was: Apache Spark) spark.ml NaiveBayes doc cleanups Key: SPARK-9719 URL: https://issues.apache.org/jira/browse/SPARK-9719 Project: Spark Issue Type: Documentation Components: ML, PySpark Reporter: Joseph K. Bradley Priority: Minor spark.ml NaiveBayesModel: Add Scala and Python doc for pi, theta Add setParam tag to NaiveBayes setModelType -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9719) spark.ml NaiveBayes doc cleanups
[ https://issues.apache.org/jira/browse/SPARK-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662616#comment-14662616 ] Apache Spark commented on SPARK-9719: - User 'feynmanliang' has created a pull request for this issue: https://github.com/apache/spark/pull/8047 spark.ml NaiveBayes doc cleanups Key: SPARK-9719 URL: https://issues.apache.org/jira/browse/SPARK-9719 Project: Spark Issue Type: Documentation Components: ML, PySpark Reporter: Joseph K. Bradley Priority: Minor spark.ml NaiveBayesModel: Add Scala and Python doc for pi, theta Add setParam tag to NaiveBayes setModelType -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9719) spark.ml NaiveBayes doc cleanups
[ https://issues.apache.org/jira/browse/SPARK-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9719: --- Assignee: Apache Spark spark.ml NaiveBayes doc cleanups Key: SPARK-9719 URL: https://issues.apache.org/jira/browse/SPARK-9719 Project: Spark Issue Type: Documentation Components: ML, PySpark Reporter: Joseph K. Bradley Assignee: Apache Spark Priority: Minor spark.ml NaiveBayesModel: Add Scala and Python doc for pi, theta Add setParam tag to NaiveBayes setModelType -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-9746) PairRDDFunctions.countByKey: values/counts always 1
[ https://issues.apache.org/jira/browse/SPARK-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas reopened SPARK-9746: Sorry, but I don't agree. cntxt.parallelize(List(("a", 1), ("a", 2))).groupBy(_._1).countByKey() returns 'Map(a -> 1)' but should in my opinion return 'Map(a -> 2)'. If the values (counts) are irrelevant, then why is this function called *count*ByKey, and why does it return a Map instead of a Set? The current implementation has no added value compared to 'pairRDD.keys.collect().toSet'. PairRDDFunctions.countByKey: values/counts always 1 --- Key: SPARK-9746 URL: https://issues.apache.org/jira/browse/SPARK-9746 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andreas In org.apache.spark.rdd.PairRDDFunctions: countByKey(): Map[K, Long] = self.withScope { self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap } obviously always returns count 1 for each key. If I understand the docs correctly I would expect this implementation: self.mapValues(_.size).reduceByKey(_ + _).collect().toMap -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
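To see why the grouped RDD yields 1, compare the key count with the size of each group (spark-shell sketch; {{sc}} stands in for the reporter's {{cntxt}}):
{code}
// groupBy collapses the pairs into one (key, Iterable) element per key, so
// countByKey correctly reports 1 element for "a"; the 2 lives inside the group.
val grouped = sc.parallelize(List(("a", 1), ("a", 2))).groupBy(_._1)
grouped.countByKey()                 // Map(a -> 1): one element with key "a"
grouped.mapValues(_.size).collect()  // Array((a,2)): the count Andreas expects
{code}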
[jira] [Commented] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662364#comment-14662364 ] Feynman Liang commented on SPARK-9660: -- {{LogisticRegressionModel$.load}} missing short description. ML 1.5 QA: API: New Scala APIs, docs Key: SPARK-9660 URL: https://issues.apache.org/jira/browse/SPARK-9660 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Reporter: Joseph K. Bradley Audit new public Scala APIs added to MLlib. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please comment here, or better yet create JIRAs and link them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662374#comment-14662374 ] Feynman Liang edited comment on SPARK-9660 at 8/7/15 8:26 PM: -- {{SVMModel}} missing short descriptions for {{save}}, {{load}}, and {{toString}} was (Author: fliang): {{SVMModel}} missing short descriptions for {{save}} and {{toString}} ML 1.5 QA: API: New Scala APIs, docs Key: SPARK-9660 URL: https://issues.apache.org/jira/browse/SPARK-9660 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Reporter: Joseph K. Bradley Audit new public Scala APIs added to MLlib. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please comment here, or better yet create JIRAs and link them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9738) remove FromUnsafe and add its codegen version to GenerateSafe
[ https://issues.apache.org/jira/browse/SPARK-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662403#comment-14662403 ] Josh Rosen commented on SPARK-9738: --- [~davies], [~cloud_fan], [~rxin], should this JIRA be converted to a subtask or targeted in a Tungsten epic? Can we add a description explaining the motivation for this change? remove FromUnsafe and add its codegen version to GenerateSafe - Key: SPARK-9738 URL: https://issues.apache.org/jira/browse/SPARK-9738 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9747) Avoid starving an unsafe operator in an aggregate
[ https://issues.apache.org/jira/browse/SPARK-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9747: --- Assignee: Andrew Or (was: Apache Spark) Avoid starving an unsafe operator in an aggregate - Key: SPARK-9747 URL: https://issues.apache.org/jira/browse/SPARK-9747 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker This mainly concerns TungstenAggregate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9747) Avoid starving an unsafe operator in an aggregate
[ https://issues.apache.org/jira/browse/SPARK-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9747: --- Assignee: Apache Spark (was: Andrew Or) Avoid starving an unsafe operator in an aggregate - Key: SPARK-9747 URL: https://issues.apache.org/jira/browse/SPARK-9747 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Andrew Or Assignee: Apache Spark Priority: Blocker This mainly concerns TungstenAggregate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9747) Avoid starving an unsafe operator in an aggregate
[ https://issues.apache.org/jira/browse/SPARK-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662413#comment-14662413 ] Apache Spark commented on SPARK-9747: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/8038 Avoid starving an unsafe operator in an aggregate - Key: SPARK-9747 URL: https://issues.apache.org/jira/browse/SPARK-9747 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker This mainly concerns TungstenAggregate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9746) PairRDDFunctions.countByKey: values/counts always 1
[ https://issues.apache.org/jira/browse/SPARK-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662430#comment-14662430 ] Andreas commented on SPARK-9746: Sorry to waste your time. But in my understanding, in a PairRDD[K,V] each key (K) should occur only once (it's like a Map[K,V]). It's by design that the keys in a map are unique (occur only once); there is no sense in counting the # of occurrences of a key in a Map (always one by design). PairRDDFunctions.countByKey: values/counts always 1 --- Key: SPARK-9746 URL: https://issues.apache.org/jira/browse/SPARK-9746 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andreas In org.apache.spark.rdd.PairRDDFunctions: countByKey(): Map[K, Long] = self.withScope { self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap } obviously always returns count 1 for each key. If I understand the docs correctly I would expect this implementation: self.mapValues(_.size).reduceByKey(_ + _).collect().toMap -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9476) Kafka stream loses leader after 2h of operation
[ https://issues.apache.org/jira/browse/SPARK-9476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662446#comment-14662446 ] Ruben Ramalho edited comment on SPARK-9476 at 8/7/15 8:58 PM: -- Sorry for the late reply, I promise to keep my response delay much smaller from now on. There aren't any error logs, but this problem compromises the normal operation of the analytics server. Yes, simpler jobs do run in the same environment. This same setup manages to run correctly for two hours; it's after 2h of operation that this problem arises, which is strange. Unfortunately I cannot share the relevant code, at least not in its entirety, but I can share with you what I am doing. I am consuming data from Apache Kafka, as positional updates, doing window operations over this data and extracting features. These features are then fed to machine learning algorithms, and tips are generated and fed back to Kafka (a different topic). If you want specific parts of the code I can provide you with that! I was using Apache Kafka 0.8.2.0 when this issue appeared, then I updated to 0.8.2.1 (in hopes of the problem being fixed), but the issue persists. I think Apache Spark at some point is corrupting the Apache Kafka topics, though I cannot isolate why that is happening. I have used both the Kafka direct stream and the regular stream and the problem seems to persist. Thank you, R. Ramalho was (Author: r.ramalho): Sorry for the late reply, I promise to keep my response delay much smaller from now on. There aren't any error logs, but this problem compromises the normal operation of analytics server. Yes, simpler jobs do run in the same environment. This same setup manages to run correctly for two hours; it's after 2h of operation that this problem arises, which is strange. Unfortunately I cannot share the relevant code, at least not in its entirety, but I can share with you what I am doing. I am consuming data from Apache Kafka, as positional updates, doing window operations over this data and extracting features. These features are then fed to machine learning algorithms, and tips are generated and fed back to Kafka (a different topic). If you want specific parts of the code I can provide you with that! I was using Apache Kafka 0.8.2.0 when this issue appeared, then I updated to 0.8.2.1 (in hopes of the problem being fixed), but the issue persists. I think Apache Spark at some point is corrupting the Apache Kafka topics, though I cannot isolate why that is happening. I have used both the Kafka direct stream and the regular stream and the problem seems to persist. Thank you, R. Ramalho Kafka stream loses leader after 2h of operation Key: SPARK-9476 URL: https://issues.apache.org/jira/browse/SPARK-9476 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.1 Environment: Docker, Centos, Spark standalone, core i7, 8Gb Reporter: Ruben Ramalho This seems to happen every 2h; it happens both with the direct stream and the regular stream. I'm doing window operations over a 1h period (if that can help). 
Here's part of the error message: 2015-07-30 13:27:23 WARN ClientUtils$:89 - Fetching topic metadata with correlation id 10 for topics [Set(updates)] from broker [id:0,host:192.168.3.23,port:3000] failed java.nio.channels.ClosedChannelException at kafka.network.BlockingChannel.send(BlockingChannel.scala:100) at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:73) at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72) at kafka.producer.SyncProducer.send(SyncProducer.scala:113) at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58) at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93) at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60) 2015-07-30 13:27:23 INFO SyncProducer:68 - Disconnecting from 192.168.3.23:3000 2015-07-30 13:27:23 WARN ConsumerFetcherManager$LeaderFinderThread:89 - [spark-group_81563e123e9f-1438259236988-fc3d82bf-leader-finder-thread], Failed to find leader for Set([updates,0]) kafka.common.KafkaException: fetching topic metadata for topics [Set(oversight-updates)] from broker [ArrayBuffer(id:0,host:192.168.3.23,port:3000)] failed at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:72) at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93) at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60) Caused by:
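Since the reporter cannot share the code, here is a rough, hypothetical sketch of the shape of such a pipeline (Spark 1.4-era spark-streaming-kafka API; the topic name, broker address, batch and window durations, and the feature/producer helpers are all assumptions, not the reporter's actual code):

{code:scala}
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object PositionalUpdatesPipeline {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("positional-updates"), Seconds(10))
    ssc.checkpoint("/tmp/checkpoint") // recommended with long windows

    // Direct stream, as in the report; broker and topic are placeholders.
    val kafkaParams = Map("metadata.broker.list" -> "192.168.3.23:3000")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("updates"))

    // 1h window over the positional updates, then feature extraction.
    val features = stream.map(_._2)
      .window(Minutes(60), Minutes(5))
      .map(extractFeatures)

    // Feed features to a model and write tips back to a different topic.
    features.foreachRDD(rdd => rdd.foreachPartition(writeTipsToKafka))

    ssc.start()
    ssc.awaitTermination()
  }

  def extractFeatures(record: String): String = record    // placeholder
  def writeTipsToKafka(tips: Iterator[String]): Unit = () // placeholder producer
}
{code}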
[jira] [Commented] (SPARK-9660) ML 1.5 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662452#comment-14662452 ] Feynman Liang commented on SPARK-9660: -- StreamingLinearRegressionWithSGD's [setConvergenceTol|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearRegressionWithSGD.scala#L88] and [setInitialWeights|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearRegressionWithSGD.scala#L83] should document default values. ML 1.5 QA: API: New Scala APIs, docs Key: SPARK-9660 URL: https://issues.apache.org/jira/browse/SPARK-9660 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Reporter: Joseph K. Bradley Audit new public Scala APIs added to MLlib. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please comment here, or better yet create JIRAs and link them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
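Until those defaults are documented, callers can set both parameters explicitly rather than rely on them; a minimal sketch (the feature dimension and tolerance below are arbitrary example values, not the library defaults):

{code:scala}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

// Explicit values avoid depending on the undocumented defaults.
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(3)) // assumed 3-dimensional features
  .setConvergenceTol(0.001)            // example tolerance
{code}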
[jira] [Commented] (SPARK-9749) DenseMatrix equals does not account for isTransposed
[ https://issues.apache.org/jira/browse/SPARK-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662475#comment-14662475 ] Feynman Liang commented on SPARK-9749: -- Working on this. DenseMatrix equals does not account for isTransposed Key: SPARK-9749 URL: https://issues.apache.org/jira/browse/SPARK-9749 Project: Spark Issue Type: Bug Reporter: Feynman Liang Priority: Blocker A matrix is not always equal to its transpose, but the current implementation of {{equals}} in [DenseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L261] does not account for the {{isTransposed}} flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
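To illustrate the concern: the two matrices below are logically equal but store their values in different orders (column-major vs. row-major), so an {{equals}} that compared raw value arrays without consulting {{isTransposed}} would wrongly report them unequal:

{code:scala}
import org.apache.spark.mllib.linalg.DenseMatrix

// Column-major storage of [[1, 3], [2, 4]].
val a = new DenseMatrix(2, 2, Array(1.0, 2.0, 3.0, 4.0))

// Row-major storage (isTransposed = true) of the same logical matrix.
val b = new DenseMatrix(2, 2, Array(1.0, 3.0, 2.0, 4.0), isTransposed = true)

// A correct equals must account for the flag, e.g. by comparing
// materialized values such as a.toArray and b.toArray.
println(a == b)
{code}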
[jira] [Created] (SPARK-9753) TungstenAggregate should also accept InternalRow instead of just UnsafeRow
Yin Huai created SPARK-9753: --- Summary: TungstenAggregate should also accept InternalRow instead of just UnsafeRow Key: SPARK-9753 URL: https://issues.apache.org/jira/browse/SPARK-9753 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Since we need to project out the key and value anyway, there is no need to accept only UnsafeRows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9752) Sample operator should avoid row copying and support UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-9752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9752: --- Assignee: Apache Spark (was: Reynold Xin) Sample operator should avoid row copying and support UnsafeRow -- Key: SPARK-9752 URL: https://issues.apache.org/jira/browse/SPARK-9752 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9752) Sample operator should avoid row copying and support UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-9752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9752: --- Assignee: Reynold Xin (was: Apache Spark) Sample operator should avoid row copying and support UnsafeRow -- Key: SPARK-9752 URL: https://issues.apache.org/jira/browse/SPARK-9752 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9752) Sample operator should avoid row copying and support UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-9752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662506#comment-14662506 ] Apache Spark commented on SPARK-9752: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/8040 Sample operator should avoid row copying and support UnsafeRow -- Key: SPARK-9752 URL: https://issues.apache.org/jira/browse/SPARK-9752 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9753) TungstenAggregate should also accept InternalRow instead of just UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662513#comment-14662513 ] Apache Spark commented on SPARK-9753: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/8041 TungstenAggregate should also accept InternalRow instead of just UnsafeRow -- Key: SPARK-9753 URL: https://issues.apache.org/jira/browse/SPARK-9753 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Since we need to project out the key and value anyway, there is no need to accept only UnsafeRows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8481) GaussianMixtureModel predict accepting single vector
[ https://issues.apache.org/jira/browse/SPARK-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-8481. -- Resolution: Fixed Issue resolved by pull request 8039 [https://github.com/apache/spark/pull/8039] GaussianMixtureModel predict accepting single vector Key: SPARK-8481 URL: https://issues.apache.org/jira/browse/SPARK-8481 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Dariusz Kobylarz Assignee: Dariusz Kobylarz Priority: Minor Labels: GaussianMixtureModel, MLlib Fix For: 1.5.0 Original Estimate: 24h Remaining Estimate: 24h GaussianMixtureModel lacks a method to predict a cluster for a single input vector where no SparkContext would be involved, i.e. /** Maps given point to its cluster index. */ def predict(point: Vector): Int -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
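With this resolved, predicting the cluster for one point needs no SparkContext; a minimal sketch (the trained model and the query point are assumed):

{code:scala}
import org.apache.spark.mllib.clustering.GaussianMixtureModel
import org.apache.spark.mllib.linalg.Vectors

// `model` would come from training (e.g. new GaussianMixture().run(data))
// or from GaussianMixtureModel.load; it is assumed here.
def clusterOf(model: GaussianMixtureModel): Int =
  model.predict(Vectors.dense(0.5, 1.2)) // single-vector predict, no RDD involved
{code}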
[jira] [Closed] (SPARK-9749) DenseMatrix equals does not account for isTransposed
[ https://issues.apache.org/jira/browse/SPARK-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang closed SPARK-9749. Resolution: Not A Problem DenseMatrix equals does not account for isTransposed Key: SPARK-9749 URL: https://issues.apache.org/jira/browse/SPARK-9749 Project: Spark Issue Type: Bug Reporter: Feynman Liang Priority: Blocker A matrix is not always equal to its transpose, but the current implementation of {{equals}} in [DenseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L261] does not account for the {{isTransposed}} flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org