[jira] [Assigned] (SPARK-14627) In TypedAggregateExpression update method we call encoder.shift many times
[ https://issues.apache.org/jira/browse/SPARK-14627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14627: Assignee: Apache Spark > In TypedAggregateExpression update method we call encoder.shift many times > -- > > Key: SPARK-14627 > URL: https://issues.apache.org/jira/browse/SPARK-14627 > Project: Spark > Issue Type: Improvement >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark >Priority: Minor > > Every time we call the TypedAggregateExpression.update method, we call the encoder's > shift method. Since the shift method copies the encoder and the underlying > BoundReference, we should prepare the shifted encoder in advance instead of > calling shift every time. > BTW, we can also improve the encoder's shift method to return itself when the shift > delta is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14627) In TypedAggregateExpression update method we call encoder.shift many times
[ https://issues.apache.org/jira/browse/SPARK-14627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240712#comment-15240712 ] Apache Spark commented on SPARK-14627: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/12387 > In TypedAggregateExpression update method we call encoder.shift many times > -- > > Key: SPARK-14627 > URL: https://issues.apache.org/jira/browse/SPARK-14627 > Project: Spark > Issue Type: Improvement >Reporter: Liang-Chi Hsieh >Priority: Minor > > Every time we call the TypedAggregateExpression.update method, we call the encoder's > shift method. Since the shift method copies the encoder and the underlying > BoundReference, we should prepare the shifted encoder in advance instead of > calling shift every time. > BTW, we can also improve the encoder's shift method to return itself when the shift > delta is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14627) In TypedAggregateExpression update method we call encoder.shift many times
[ https://issues.apache.org/jira/browse/SPARK-14627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14627: Assignee: (was: Apache Spark) > In TypedAggregateExpression update method we call encoder.shift many times > -- > > Key: SPARK-14627 > URL: https://issues.apache.org/jira/browse/SPARK-14627 > Project: Spark > Issue Type: Improvement >Reporter: Liang-Chi Hsieh >Priority: Minor > > Every time we call the TypedAggregateExpression.update method, we call the encoder's > shift method. Since the shift method copies the encoder and the underlying > BoundReference, we should prepare the shifted encoder in advance instead of > calling shift every time. > BTW, we can also improve the encoder's shift method to return itself when the shift > delta is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14627) In TypedAggregateExpression update method we call encoder.shift many times
Liang-Chi Hsieh created SPARK-14627: --- Summary: In TypedAggregateExpression update method we call encoder.shift many times Key: SPARK-14627 URL: https://issues.apache.org/jira/browse/SPARK-14627 Project: Spark Issue Type: Improvement Reporter: Liang-Chi Hsieh Priority: Minor Every time we call the TypedAggregateExpression.update method, we call the encoder's shift method. Since the shift method copies the encoder and the underlying BoundReference, we should prepare the shifted encoder in advance instead of calling shift every time. BTW, we can also improve the encoder's shift method to return itself when the shift delta is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
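For concreteness, here is a minimal sketch of the two ideas in the description. The class and method names below are illustrative stand-ins, not the real ExpressionEncoder/TypedAggregateExpression API: the shifted encoder is prepared once up front, and shift returns the same instance when the delta is zero.

{code}
// Hedged sketch only -- SimpleEncoder and TypedAggregate are illustrative,
// not the actual Spark SQL classes.
case class SimpleEncoder(offset: Int) {
  def shift(delta: Int): SimpleEncoder =
    if (delta == 0) this                      // no copy when nothing moves
    else copy(offset = offset + delta)
}

class TypedAggregate(inputEncoder: SimpleEncoder, delta: Int) {
  // Prepare the shifted encoder once, instead of shifting on every update().
  private val shiftedEncoder = inputEncoder.shift(delta)

  def update(value: Any): Unit = {
    // use shiftedEncoder here; no per-call shift/copy
  }
}
{code}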
[jira] [Created] (SPARK-14626) Simplify accumulators and task metrics
Reynold Xin created SPARK-14626: --- Summary: Simplify accumulators and task metrics Key: SPARK-14626 URL: https://issues.apache.org/jira/browse/SPARK-14626 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin The goal of the ticket is to simplify both the external interface and the internal implementation for accumulators and metrics. They are unnecessarily convoluted and we should be able to simplify them quite a bit. This is an umbrella ticket and I will iteratively create new tasks as my investigation goes on. However, at a high level, I would like to create better abstractions for internal implementations, as well as create a simplified accumulator v2 external interface that doesn't involve a complex type hierarchy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14617) Remove deprecated APIs in TaskMetrics
[ https://issues.apache.org/jira/browse/SPARK-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-14617: Parent Issue: SPARK-14626 (was: SPARK-10620) > Remove deprecated APIs in TaskMetrics > - > > Key: SPARK-14617 > URL: https://issues.apache.org/jira/browse/SPARK-14617 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14625) TaskUIData and ExecutorUIData shouldn't be case classes
[ https://issues.apache.org/jira/browse/SPARK-14625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-14625: Issue Type: Sub-task (was: Improvement) Parent: SPARK-14626 > TaskUIData and ExecutorUIData shouldn't be case classes > --- > > Key: SPARK-14625 > URL: https://issues.apache.org/jira/browse/SPARK-14625 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > I was trying to understand the accumulator and metrics update source code and > these two classes don't really need to be case classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14619) Track internal accumulators (metrics) by stage attempt rather than stage
[ https://issues.apache.org/jira/browse/SPARK-14619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-14619: Parent Issue: SPARK-14626 (was: SPARK-10620) > Track internal accumulators (metrics) by stage attempt rather than stage > > > Key: SPARK-14619 > URL: https://issues.apache.org/jira/browse/SPARK-14619 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > When there are multiple attempts for a stage, we currently only reset > internal accumulator values if all the tasks are resubmitted. It would make > more sense to reset the accumulator values for each stage attempt. This will > allow us to eventually get rid of the internal flag in the Accumulator class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14625) TaskUIData and ExecutorUIData shouldn't be case classes
[ https://issues.apache.org/jira/browse/SPARK-14625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14625: Assignee: Reynold Xin (was: Apache Spark) > TaskUIData and ExecutorUIData shouldn't be case classes > --- > > Key: SPARK-14625 > URL: https://issues.apache.org/jira/browse/SPARK-14625 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > I was trying to understand the accumulator and metrics update source code and > these two classes don't really need to be case classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14625) TaskUIData and ExecutorUIData shouldn't be case classes
[ https://issues.apache.org/jira/browse/SPARK-14625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14625: Assignee: Apache Spark (was: Reynold Xin) > TaskUIData and ExecutorUIData shouldn't be case classes > --- > > Key: SPARK-14625 > URL: https://issues.apache.org/jira/browse/SPARK-14625 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > I was trying to understand the accumulator and metrics update source code and > these two classes don't really need to be case classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14603) SessionCatalog needs to check if a metadata operation is valid
[ https://issues.apache.org/jira/browse/SPARK-14603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14603: Assignee: (was: Apache Spark) > SessionCatalog needs to check if a metadata operation is valid > -- > > Key: SPARK-14603 > URL: https://issues.apache.org/jira/browse/SPARK-14603 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Priority: Blocker > > Since we cannot really trust if the underlying external catalog can throw > exceptions when there is an invalid metadata operation, let's do it in > SessionCatalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14625) TaskUIData and ExecutorUIData shouldn't be case classes
Reynold Xin created SPARK-14625: --- Summary: TaskUIData and ExecutorUIData shouldn't be case classes Key: SPARK-14625 URL: https://issues.apache.org/jira/browse/SPARK-14625 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin I was trying to understand the accumulator and metrics update source code and these two classes don't really need to be case classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14603) SessionCatalog needs to check if a metadata operation is valid
[ https://issues.apache.org/jira/browse/SPARK-14603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14603: Assignee: Apache Spark > SessionCatalog needs to check if a metadata operation is valid > -- > > Key: SPARK-14603 > URL: https://issues.apache.org/jira/browse/SPARK-14603 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Blocker > > Since we cannot really trust if the underlying external catalog can throw > exceptions when there is an invalid metadata operation, let's do it in > SessionCatalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14603) SessionCatalog needs to check if a metadata operation is valid
[ https://issues.apache.org/jira/browse/SPARK-14603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240694#comment-15240694 ] Apache Spark commented on SPARK-14603: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/12385 > SessionCatalog needs to check if a metadata operation is valid > -- > > Key: SPARK-14603 > URL: https://issues.apache.org/jira/browse/SPARK-14603 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Priority: Blocker > > Since we cannot really trust if the underlying external catalog can throw > exceptions when there is an invalid metadata operation, let's do it in > SessionCatalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14608) transformSchema needs better documentation
[ https://issues.apache.org/jira/browse/SPARK-14608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14608: Assignee: Apache Spark > transformSchema needs better documentation > -- > > Key: SPARK-14608 > URL: https://issues.apache.org/jira/browse/SPARK-14608 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > {{PipelineStage.transformSchema}} currently has minimal documentation. It > should have more to explain it can: > * check schema > * check parameter interactions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14608) transformSchema needs better documentation
[ https://issues.apache.org/jira/browse/SPARK-14608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14608: Assignee: (was: Apache Spark) > transformSchema needs better documentation > -- > > Key: SPARK-14608 > URL: https://issues.apache.org/jira/browse/SPARK-14608 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Priority: Minor > > {{PipelineStage.transformSchema}} currently has minimal documentation. It > should have more to explain it can: > * check schema > * check parameter interactions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14608) transformSchema needs better documentation
[ https://issues.apache.org/jira/browse/SPARK-14608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240693#comment-15240693 ] Apache Spark commented on SPARK-14608: -- User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/12384 > transformSchema needs better documentation > -- > > Key: SPARK-14608 > URL: https://issues.apache.org/jira/browse/SPARK-14608 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Priority: Minor > > {{PipelineStage.transformSchema}} currently has minimal documentation. It > should have more to explain it can: > * check schema > * check parameter interactions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14559) Netty RPC didn't check channel is active before sending message
[ https://issues.apache.org/jira/browse/SPARK-14559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240676#comment-15240676 ] wei wu commented on SPARK-14559: I think the client channel may be closed when the network between the client and the server is abnormal in a long-running service. When the client sends a request, should we check whether the client is active? > Netty RPC didn't check channel is active before sending message > --- > > Key: SPARK-14559 > URL: https://issues.apache.org/jira/browse/SPARK-14559 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1 > Environment: spark1.6.1 hadoop2.2.0 jdk1.8.0_65 >Reporter: cen yuhai > > I have a long-running service. After running for several hours, it threw > these exceptions. I found that before sending an RPC request by calling the sendRpc > method in TransportClient, there is no check of whether the channel is > still open or active. > java.nio.channels.ClosedChannelException > 4865 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 5635696155204230556 to > bigdata-arch-hdp407.bh.diditaxi.com/10.234.23.107:55197: java.nio. > channels.ClosedChannelException > 4866 java.nio.channels.ClosedChannelException > 4867 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 7319486003318455703 to > bigdata-arch-hdp1235.bh.diditaxi.com/10.168.145.239:36439: java.nio. > channels.ClosedChannelException > 4868 java.nio.channels.ClosedChannelException > 4869 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 9041854451893215954 to > bigdata-arch-hdp1398.bh.diditaxi.com/10.248.117.216:26801: java.nio. > channels.ClosedChannelException > 4870 java.nio.channels.ClosedChannelException > 4871 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 6046473497871624501 to > bigdata-arch-hdp948.bh.diditaxi.com/10.118.114.81:41903: java.nio. > channels.ClosedChannelException > 4872 java.nio.channels.ClosedChannelException > 4873 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 9085605650438705047 to > bigdata-arch-hdp1126.bh.diditaxi.com/10.168.146.78:27023: java.nio. > channels.ClosedChannelException -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
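A hedged sketch of the kind of guard being discussed. Only Netty's Channel#isActive and Channel#remoteAddress are assumed from the library; the surrounding client class and method are illustrative placeholders, not Spark's actual TransportClient.

{code}
import java.io.IOException
import io.netty.channel.Channel

// Illustrative client wrapper, not the real TransportClient.
class RpcClient(channel: Channel) {
  def sendRpc(payload: Array[Byte]): Unit = {
    // Fail fast with a clear error instead of writing to a dead channel.
    if (!channel.isActive) {
      throw new IOException(
        s"Cannot send RPC: channel to ${channel.remoteAddress()} is no longer active")
    }
    // ... encode and write the payload to the channel ...
  }
}
{code}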
[jira] [Assigned] (SPARK-14374) PySpark ml GBTClassifier, Regressor support export/import
[ https://issues.apache.org/jira/browse/SPARK-14374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14374: Assignee: (was: Apache Spark) > PySpark ml GBTClassifier, Regressor support export/import > - > > Key: SPARK-14374 > URL: https://issues.apache.org/jira/browse/SPARK-14374 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14374) PySpark ml GBTClassifier, Regressor support export/import
[ https://issues.apache.org/jira/browse/SPARK-14374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14374: Assignee: Apache Spark > PySpark ml GBTClassifier, Regressor support export/import > - > > Key: SPARK-14374 > URL: https://issues.apache.org/jira/browse/SPARK-14374 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14374) PySpark ml GBTClassifier, Regressor support export/import
[ https://issues.apache.org/jira/browse/SPARK-14374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240652#comment-15240652 ] Apache Spark commented on SPARK-14374: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/12383 > PySpark ml GBTClassifier, Regressor support export/import > - > > Key: SPARK-14374 > URL: https://issues.apache.org/jira/browse/SPARK-14374 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14463) read.text broken for partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-14463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240640#comment-15240640 ] Cheng Lian commented on SPARK-14463: Should we simply throw an exception when text data source is used together with partitioning? > read.text broken for partitioned tables > --- > > Key: SPARK-14463 > URL: https://issues.apache.org/jira/browse/SPARK-14463 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Priority: Critical > > Strongly typing the return values of {{read.text}} as {{Dataset\[String]}} > breaks when trying to load a partitioned table (or any table where the path > looks partitioned) > {code} > Seq((1, "test")) > .toDF("a", "b") > .write > .format("text") > .partitionBy("a") > .save("/home/michael/text-part-bug") > sqlContext.read.text("/home/michael/text-part-bug") > {code} > {code} > org.apache.spark.sql.AnalysisException: Try to map struct > to Tuple1, but failed as the number of fields does not line up. > - Input schema: struct > - Target schema: struct; > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.org$apache$spark$sql$catalyst$encoders$ExpressionEncoder$$fail$1(ExpressionEncoder.scala:265) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.validate(ExpressionEncoder.scala:279) > at org.apache.spark.sql.Dataset.(Dataset.scala:197) > at org.apache.spark.sql.Dataset.(Dataset.scala:168) > at org.apache.spark.sql.Dataset$.apply(Dataset.scala:57) > at org.apache.spark.sql.Dataset.as(Dataset.scala:357) > at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:450) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14624) Error at the end of installing Spark 1.6.1 using Spark-ec2 script
Mohaed Alibrahim created SPARK-14624: Summary: Error at the end of installing Spark 1.6.1 using Spark-ec2 script Key: SPARK-14624 URL: https://issues.apache.org/jira/browse/SPARK-14624 Project: Spark Issue Type: Bug Affects Versions: 1.6.1 Reporter: Mohaed Alibrahim I installed Spark 1.6.1 on Amazon EC2 using the spark-ec2 script. Everything was OK, but it failed to start httpd at the end of the installation. I followed the instructions exactly and repeated the process many times, but had no luck. - [timing] rstudio setup: 00h 00m 00s Setting up ganglia RSYNC'ing /etc/ganglia to slaves...ec.us-west-2.compute.amazonaws.com Shutting down GANGLIA gmond: [FAILED] Starting GANGLIA gmond:[ OK ] Shutting down GANGLIA gmond: [FAILED] Starting GANGLIA gmond:[ OK ] Connection to ec2-.us-west-2.compute.amazonaws.com closed. Shutting down GANGLIA gmetad: [FAILED] Starting GANGLIA gmetad: [ OK ] Stopping httpd:[FAILED] Starting httpd: httpd: Syntax error on line 154 of /etc/httpd/conf/httpd.conf: Cannot load /etc/httpd/modules/mod_authz_core.so into server: /etc/httpd/modules/mod_authz_core.so: cannot open shared object file: No such file or directory [FAILED] [timing] ganglia setup: 00h 00m 01s Connection to ec2-.us-west-2.compute.amazonaws.com closed. Spark standalone cluster started at http://ec2-...us-west-2.compute.amazonaws.com:8080 Ganglia started at http://ec2-.us-west-2.compute.amazonaws.com:5080/ganglia Done! -- httpd.conf: line 154: LoadModule authz_core_module modules/mod_authz_core.so So, if I comment out this line, it shows another error on the following lines: LoadModule unixd_module modules/mod_unixd.so LoadModule access_compat_module modules/mod_access_compat.so LoadModule mpm_prefork_module modules/mod_mpm_prefork.so LoadModule php5_module modules/libphp-5.6.so --- -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14507) Decide if we should still support CREATE EXTERNAL TABLE AS SELECT
[ https://issues.apache.org/jira/browse/SPARK-14507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240616#comment-15240616 ] Yan commented on SPARK-14507: - In terms of Hive support vs. Spark SQL support, the "external table" concept in Spark SQL seems to go beyond that in Hive, and not just for CTAS. For Hive, an "external table" is only for the "schema-on-read" scenario on data stored on, say, HDFS. It has its own somewhat unique DDL semantics and security features, different from those of normal SQL databases. A Spark SQL external table, as far as I understand, could be a mapping to a data source table. I'm not sure whether this mapping would need the same special considerations regarding DDL semantics and security models as Hive external tables do. > Decide if we should still support CREATE EXTERNAL TABLE AS SELECT > - > > Key: SPARK-14507 > URL: https://issues.apache.org/jira/browse/SPARK-14507 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Priority: Critical > > Looks like we support CREATE EXTERNAL TABLE AS SELECT by accident. Should we > still support it? Hive does not seem to support it, while based on the Impala docs, > Impala seems to. Right now, the CreateTables rule in > HiveMetastoreCatalog.scala does not respect the EXTERNAL keyword when > {{hive.convertCTAS}} is true and the CTAS query does not provide any storage > format. In this case, the table becomes a MANAGED_TABLE and is stored in > the default metastore location (not the user-specified location). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14603) SessionCatalog needs to check if a metadata operation is valid
[ https://issues.apache.org/jira/browse/SPARK-14603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240608#comment-15240608 ] Xiao Li commented on SPARK-14603: - To verify the error messages issued from SessionCatalog and HiveSessionCatalog, we need to unify the error messages we issue. Thus, I have to unify the exceptions first. > SessionCatalog needs to check if a metadata operation is valid > -- > > Key: SPARK-14603 > URL: https://issues.apache.org/jira/browse/SPARK-14603 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Priority: Blocker > > Since we cannot really trust if the underlying external catalog can throw > exceptions when there is an invalid metadata operation, let's do it in > SessionCatalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240607#comment-15240607 ] Yong Tang commented on SPARK-14409: --- Thanks [~mlnick] [~josephkb]. Yes I think wrapping RankingMetrics could be the first step and reimplementing all RankingEvaluator methods in ML using DataFrames would be good after that. I will work on the reimplementation in several followup PRs. > Investigate adding a RankingEvaluator to ML > --- > > Key: SPARK-14409 > URL: https://issues.apache.org/jira/browse/SPARK-14409 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Nick Pentreath >Priority: Minor > > {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no > {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful > for recommendation evaluation (and can be useful in other settings > potentially). > Should be thought about in conjunction with adding the "recommendAll" methods > in SPARK-13857, so that top-k ranking metrics can be used in cross-validators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10574) HashingTF should use MurmurHash3
[ https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240601#comment-15240601 ] Yanbo Liang commented on SPARK-10574: - Sure, I will send a PR in a few days. Thanks! > HashingTF should use MurmurHash3 > > > Key: SPARK-10574 > URL: https://issues.apache.org/jira/browse/SPARK-10574 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.5.0 >Reporter: Simeon Simeonov >Assignee: Yanbo Liang > Labels: HashingTF, hashing, mllib > > {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are > two significant problems with this. > First, per the [Scala > documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for > {{hashCode}}, the implementation is platform specific. This means that > feature vectors created on one platform may be different than vectors created > on another platform. This can create significant problems when a model > trained offline is used in another environment for online prediction. The > problem is made harder by the fact that following a hashing transform > features lose human-tractable meaning and a problem such as this may be > extremely difficult to track down. > Second, the native Scala hashing function performs badly on longer strings, > exhibiting [200-500% higher collision > rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for > example, > [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$] > which is also included in the standard Scala libraries and is the hashing > choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If > Spark users apply {{HashingTF}} only to very short, dictionary-like strings > the hashing function choice will not be a big problem but why have an > implementation in MLlib with this limitation when there is a better > implementation readily available in the standard Scala library? > Switching to MurmurHash3 solves both problems. If there is agreement that > this is a good change, I can prepare a PR. > Note that changing the hash function would mean that models saved with a > previous version would have to be re-trained. This introduces a problem > that's orthogonal to breaking changes in APIs: breaking changes related to > artifacts, e.g., a saved model, produced by a previous version. Is there a > policy or best practice currently in effect about this? If not, perhaps we > should come up with a few simple rules about how we communicate these in > release notes, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
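A small sketch of the switch being proposed, assuming the usual "hash the term, then take a non-negative modulo of the feature dimension" scheme; nonNegativeMod below is a local helper for illustration, not a reference to Spark's internals.

{code}
import scala.util.hashing.MurmurHash3

// Local helper: map an arbitrary Int hash into [0, mod).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

val numFeatures = 1 << 20
val term = "the quick brown fox"

// Current behavior: platform-dependent native hash.
val nativeIndex = nonNegativeMod(term.##, numFeatures)

// Proposed: MurmurHash3 from the standard Scala library.
val murmurIndex = nonNegativeMod(MurmurHash3.stringHash(term), numFeatures)
{code}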
[jira] [Commented] (SPARK-14622) Retain lost executors status
[ https://issues.apache.org/jira/browse/SPARK-14622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240550#comment-15240550 ] hujiayin commented on SPARK-14622: -- I think it would also be better to show the number of lost executors; clicking the number would then reveal the detailed information. > Retain lost executors status > > > Key: SPARK-14622 > URL: https://issues.apache.org/jira/browse/SPARK-14622 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Qingyang Hong >Priority: Minor > Fix For: 1.6.0 > > > In the 'Executors' dashboard, it is necessary to maintain a list of those > executors that have been lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14623) add label binarizer
[ https://issues.apache.org/jira/browse/SPARK-14623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14623: Assignee: (was: Apache Spark) > add label binarizer > > > Key: SPARK-14623 > URL: https://issues.apache.org/jira/browse/SPARK-14623 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.6.1 >Reporter: hujiayin >Priority: Minor > Fix For: 2.0.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > It relates to https://issues.apache.org/jira/browse/SPARK-7445 > Map the labels to 0/1. > For example, > Input: > "yellow,green,red,green,0" > The labels: "0, green, red, yellow" > Output: > 0, 0, 0, 0, 1, > 0, 1, 0, 1, 0, > 0, 0, 1, 0, 0, > 1, 0, 0, 0, 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14623) add label binarizer
[ https://issues.apache.org/jira/browse/SPARK-14623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240543#comment-15240543 ] Apache Spark commented on SPARK-14623: -- User 'hujy' has created a pull request for this issue: https://github.com/apache/spark/pull/12380 > add label binarizer > > > Key: SPARK-14623 > URL: https://issues.apache.org/jira/browse/SPARK-14623 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.6.1 >Reporter: hujiayin >Priority: Minor > Fix For: 2.0.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > It relates to https://issues.apache.org/jira/browse/SPARK-7445 > Map the labels to 0/1. > For example, > Input: > "yellow,green,red,green,0" > The labels: "0, green, red, yellow" > Output: > 0, 0, 0, 0, 1, > 0, 1, 0, 1, 0, > 0, 0, 1, 0, 0, > 1, 0, 0, 0, 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14623) add label binarizer
[ https://issues.apache.org/jira/browse/SPARK-14623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14623: Assignee: Apache Spark > add label binarizer > > > Key: SPARK-14623 > URL: https://issues.apache.org/jira/browse/SPARK-14623 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.6.1 >Reporter: hujiayin >Assignee: Apache Spark >Priority: Minor > Fix For: 2.0.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > It relates to https://issues.apache.org/jira/browse/SPARK-7445 > Map the labels to 0/1. > For example, > Input: > "yellow,green,red,green,0" > The labels: "0, green, red, yellow" > Output: > 0, 0, 0, 0, 1, > 0, 1, 0, 1, 0, > 0, 0, 1, 0, 0, > 1, 0, 0, 0, 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14623) add label binarizer
hujiayin created SPARK-14623: Summary: add label binarizer Key: SPARK-14623 URL: https://issues.apache.org/jira/browse/SPARK-14623 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.6.1 Reporter: hujiayin Priority: Minor Fix For: 2.0.0 It relates to https://issues.apache.org/jira/browse/SPARK-7445 Map the labels to 0/1. For example, Input: "yellow,green,red,green,0" The labels: "0, green, red, yellow" Output: 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
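A plain-Scala sketch of the mapping described above (each distinct label becomes a 0/1 indicator column); it ignores the exact row/column layout of the example output and is not tied to the ML Transformer API.

{code}
// Hedged sketch: one 0/1 indicator vector per input label, over the
// sorted distinct label set ("0", "green", "red", "yellow").
val input  = Seq("yellow", "green", "red", "green", "0")
val labels = input.distinct.sorted

val binarized = input.map(l => labels.map(x => if (x == l) 1 else 0))
// yellow -> (0, 0, 0, 1)
// green  -> (0, 1, 0, 0)
// red    -> (0, 0, 1, 0)
// green  -> (0, 1, 0, 0)
// 0      -> (1, 0, 0, 0)
{code}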
[jira] [Created] (SPARK-14622) Retain lost executors status
Qingyang Hong created SPARK-14622: - Summary: Retain lost executors status Key: SPARK-14622 URL: https://issues.apache.org/jira/browse/SPARK-14622 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.6.0 Reporter: Qingyang Hong Priority: Minor Fix For: 1.6.0 In the 'Executors' dashboard, it is necessary to maintain a list of those executors that have been lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-7445) StringIndexer should handle binary labels properly
[ https://issues.apache.org/jira/browse/SPARK-7445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-7445: Comment: was deleted (was: If no one works on it, I'd like to submit a code for this issue.) > StringIndexer should handle binary labels properly > -- > > Key: SPARK-7445 > URL: https://issues.apache.org/jira/browse/SPARK-7445 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Priority: Minor > > StringIndexer orders labels by their counts. However, for binary labels, we > should really map negatives to 0 and positive to 1. So can put special rules > for binary labels: > 1. "+1"/"-1", "1"/"-1", "1"/"0" > 2. "yes"/"no" > 3. "true"/"false" > Another option is to allow users to provide a list or labels and we use the > ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14609) LOAD DATA
[ https://issues.apache.org/jira/browse/SPARK-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240509#comment-15240509 ] Xiao Li commented on SPARK-14609: - It is not hard, but we need to handle partitions and a few options. {noformat} LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] {noformat} I can take it. Thanks! > LOAD DATA > - > > Key: SPARK-14609 > URL: https://issues.apache.org/jira/browse/SPARK-14609 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > For load command, it should be pretty easy to implement. We already call Hive > to load data when insert into Hive tables. So, we can follow the > implementation of that. For example, we load into hive table in > InsertIntoHiveTable command at > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L221-L225. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
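A hedged usage sketch of the statement form quoted in the comment above, issued through the SQL interface; the path, table name, and partition values are placeholders, and sqlContext is assumed to be in scope.

{code}
// Placeholders only -- illustrates the LOAD DATA form quoted above.
sqlContext.sql(
  """LOAD DATA LOCAL INPATH '/tmp/bills_2016_04.txt'
    |OVERWRITE INTO TABLE bill_table
    |PARTITION (dt='2016-04-13')""".stripMargin)
{code}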
[jira] [Updated] (SPARK-14621) add oracle hint optimizer
[ https://issues.apache.org/jira/browse/SPARK-14621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qingyang Hong updated SPARK-14621: -- Flags: Patch Priority: Minor (was: Major) Description: The current SQL parser in Spark SQL can't recognize optimizer hints in a query, e.g. SELECT /*+index(o IDX_BILLORDER_SEND_UPDATE)+*/ ID, BILL_CODE, DATE FROM BILL_TABLE. It is necessary to add such a feature, which will increase query efficiency. Issue Type: Improvement (was: Wish) Summary: add oracle hint optimizer (was: add) > add oracle hint optimizer > - > > Key: SPARK-14621 > URL: https://issues.apache.org/jira/browse/SPARK-14621 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Qingyang Hong >Priority: Minor > Fix For: 1.6.0 > > > The current SQL parser in Spark SQL can't recognize optimizer hints in a query, e.g. > SELECT /*+index(o IDX_BILLORDER_SEND_UPDATE)+*/ ID, BILL_CODE, DATE FROM > BILL_TABLE. It is necessary to add such a feature, which will increase query > efficiency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12133) Support dynamic allocation in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-12133. --- Resolution: Fixed Fix Version/s: 2.0.0 > Support dynamic allocation in Spark Streaming > - > > Key: SPARK-12133 > URL: https://issues.apache.org/jira/browse/SPARK-12133 > Project: Spark > Issue Type: Bug > Components: Spark Core, Streaming >Reporter: Andrew Or >Assignee: Tathagata Das > Fix For: 2.0.0 > > Attachments: dynamic-allocation-streaming-design.pdf > > > Dynamic allocation is a feature that allows your cluster resources to scale > up and down based on the workload. Currently it doesn't work well with Spark > streaming because of several reasons: > (1) Your executors may never be idle since they run something every N seconds > (2) You should have at least one receiver running always > (3) The existing heuristics don't take into account length of batch queue > ... > The goal of this JIRA is to provide better support for using dynamic > allocation in streaming. A design doc will be posted shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14621) add
Qingyang Hong created SPARK-14621: - Summary: add Key: SPARK-14621 URL: https://issues.apache.org/jira/browse/SPARK-14621 Project: Spark Issue Type: Wish Components: SQL Affects Versions: 1.6.0 Reporter: Qingyang Hong Fix For: 1.6.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-14592) Create table like
[ https://issues.apache.org/jira/browse/SPARK-14592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-14592: Comment: was deleted (was: I am working on this...) > Create table like > - > > Key: SPARK-14592 > URL: https://issues.apache.org/jira/browse/SPARK-14592 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-14592) Create table like
[ https://issues.apache.org/jira/browse/SPARK-14592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-14592: Comment: was deleted (was: Will submit PR soon.) > Create table like > - > > Key: SPARK-14592 > URL: https://issues.apache.org/jira/browse/SPARK-14592 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12133) Support dynamic allocation in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240453#comment-15240453 ] WilliamZhu commented on SPARK-12133: Here is a new design: http://www.jianshu.com/p/ae7fdd4746f6 > Support dynamic allocation in Spark Streaming > - > > Key: SPARK-12133 > URL: https://issues.apache.org/jira/browse/SPARK-12133 > Project: Spark > Issue Type: Bug > Components: Spark Core, Streaming >Reporter: Andrew Or >Assignee: Tathagata Das > Attachments: dynamic-allocation-streaming-design.pdf > > > Dynamic allocation is a feature that allows your cluster resources to scale > up and down based on the workload. Currently it doesn't work well with Spark > streaming because of several reasons: > (1) Your executors may never be idle since they run something every N seconds > (2) You should have at least one receiver running always > (3) The existing heuristics don't take into account length of batch queue > ... > The goal of this JIRA is to provide better support for using dynamic > allocation in streaming. A design doc will be posted shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14516) Clustering evaluator
[ https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240434#comment-15240434 ] zhengruifeng edited comment on SPARK-14516 at 4/14/16 1:56 AM: --- [~akamal] In my opinion, both supervised and unsupervised metrics should be added. Silhouette should be added first. I will create an online document. Thanks. was (Author: podongfeng): [~akamal] In my opinion, both supervised and unsupervised metrics should be added. And among the unsupervised metrics, silhouette should be added first. I will create an online document. Thanks. > Clustering evaluator > > > Key: SPARK-14516 > URL: https://issues.apache.org/jira/browse/SPARK-14516 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: zhengruifeng >Priority: Minor > > MLlib does not have any general-purpose clustering metrics with a ground > truth. > In > [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics), > there are several kinds of metrics for this. > It may be meaningful to add some clustering metrics into MLlib. > This should be added as a {{ClusteringEvaluator}} class extending > {{Evaluator}} in spark.ml. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14516) Clustering evaluator
[ https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240434#comment-15240434 ] zhengruifeng commented on SPARK-14516: -- [~akamal] In my opinion, both supervised and unsupervised metrics should be added. And among the unsupervised metrics, silhouette should be added first. I will create an online document. Thanks. > Clustering evaluator > > > Key: SPARK-14516 > URL: https://issues.apache.org/jira/browse/SPARK-14516 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: zhengruifeng >Priority: Minor > > MLlib does not have any general-purpose clustering metrics with a ground > truth. > In > [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics), > there are several kinds of metrics for this. > It may be meaningful to add some clustering metrics into MLlib. > This should be added as a {{ClusteringEvaluator}} class extending > {{Evaluator}} in spark.ml. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
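For concreteness, a tiny self-contained sketch of the silhouette coefficient mentioned above, on plain Scala collections, assuming Euclidean distance, at least two clusters, and non-empty input; this is not an Evaluator implementation.

{code}
// s(i) = (b - a) / max(a, b), where a is the mean distance to the point's own
// cluster and b is the mean distance to the nearest other cluster.
def dist(p: (Double, Double), q: (Double, Double)): Double =
  math.sqrt(math.pow(p._1 - q._1, 2) + math.pow(p._2 - q._2, 2))

def silhouette(points: Seq[((Double, Double), Int)]): Double = {
  val scores = points.map { case (p, c) =>
    val own = points.filter { case (q, k) => k == c && q != p }.map(q => dist(p, q._1))
    val a = if (own.isEmpty) 0.0 else own.sum / own.size
    val b = points.map(_._2).distinct.filter(_ != c).map { other =>
      val ds = points.collect { case (q, k) if k == other => dist(p, q) }
      ds.sum / ds.size
    }.min
    if (own.isEmpty) 0.0 else (b - a) / math.max(a, b)
  }
  scores.sum / scores.size
}
{code}

As a quick check, silhouette(Seq(((0.0, 0.0), 0), ((0.1, 0.0), 0), ((5.0, 5.0), 1), ((5.1, 5.0), 1))) is close to 1, reflecting two tight, well-separated clusters.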
[jira] [Assigned] (SPARK-14620) Use/benchmark a better hash in AggregateHashMap
[ https://issues.apache.org/jira/browse/SPARK-14620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14620: Assignee: Apache Spark > Use/benchmark a better hash in AggregateHashMap > --- > > Key: SPARK-14620 > URL: https://issues.apache.org/jira/browse/SPARK-14620 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sameer Agarwal >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14620) Use/benchmark a better hash in AggregateHashMap
[ https://issues.apache.org/jira/browse/SPARK-14620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240411#comment-15240411 ] Apache Spark commented on SPARK-14620: -- User 'sameeragarwal' has created a pull request for this issue: https://github.com/apache/spark/pull/12379 > Use/benchmark a better hash in AggregateHashMap > --- > > Key: SPARK-14620 > URL: https://issues.apache.org/jira/browse/SPARK-14620 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sameer Agarwal > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14620) Use/benchmark a better hash in AggregateHashMap
[ https://issues.apache.org/jira/browse/SPARK-14620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14620: Assignee: (was: Apache Spark) > Use/benchmark a better hash in AggregateHashMap > --- > > Key: SPARK-14620 > URL: https://issues.apache.org/jira/browse/SPARK-14620 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sameer Agarwal > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14620) Use/benchmark a better hash in AggregateHashMap
Sameer Agarwal created SPARK-14620: -- Summary: Use/benchmark a better hash in AggregateHashMap Key: SPARK-14620 URL: https://issues.apache.org/jira/browse/SPARK-14620 Project: Spark Issue Type: Sub-task Reporter: Sameer Agarwal -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14582) Increase the parallelism for small tables
[ https://issues.apache.org/jira/browse/SPARK-14582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240389#comment-15240389 ] Mark Hamstra commented on SPARK-14582: -- The total absence of any description in both this JIRA and the accompanying PR is really annoying -- especially because I find that queries involving small tables frequently suffer from using too much parallelism, not too little. > Increase the parallelism for small tables > - > > Key: SPARK-14582 > URL: https://issues.apache.org/jira/browse/SPARK-14582 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14619) Track internal accumulators (metrics) by stage attempt rather than stage
[ https://issues.apache.org/jira/browse/SPARK-14619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-14619: Description: When there are multiple attempts for a stage, we currently only reset internal accumulator values if all the tasks are resubmitted. It would make more sense to reset the accumulator values for each stage attempt. This will allow us to eventually get rid of the internal flag in the Accumulator class. > Track internal accumulators (metrics) by stage attempt rather than stage > > > Key: SPARK-14619 > URL: https://issues.apache.org/jira/browse/SPARK-14619 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > When there are multiple attempts for a stage, we currently only reset > internal accumulator values if all the tasks are resubmitted. It would make > more sense to reset the accumulator values for each stage attempt. This will > allow us to eventually get rid of the internal flag in the Accumulator class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14619) Track internal accumulators (metrics) by stage attempt rather than stage
[ https://issues.apache.org/jira/browse/SPARK-14619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240377#comment-15240377 ] Apache Spark commented on SPARK-14619: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/12378 > Track internal accumulators (metrics) by stage attempt rather than stage > > > Key: SPARK-14619 > URL: https://issues.apache.org/jira/browse/SPARK-14619 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14619) Track internal accumulators (metrics) by stage attempt rather than stage
[ https://issues.apache.org/jira/browse/SPARK-14619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14619: Assignee: Reynold Xin (was: Apache Spark) > Track internal accumulators (metrics) by stage attempt rather than stage > > > Key: SPARK-14619 > URL: https://issues.apache.org/jira/browse/SPARK-14619 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14619) Track internal accumulators (metrics) by stage attempt rather than stage
[ https://issues.apache.org/jira/browse/SPARK-14619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14619: Assignee: Apache Spark (was: Reynold Xin) > Track internal accumulators (metrics) by stage attempt rather than stage > > > Key: SPARK-14619 > URL: https://issues.apache.org/jira/browse/SPARK-14619 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Apache Spark > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14618) RegressionEvaluator doc out of date
[ https://issues.apache.org/jira/browse/SPARK-14618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14618: Assignee: Apache Spark (was: Joseph K. Bradley) > RegressionEvaluator doc out of date > --- > > Key: SPARK-14618 > URL: https://issues.apache.org/jira/browse/SPARK-14618 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > In Spark 1.4, we negated some metrics from RegressionEvaluator since > CrossValidator always maximized metrics. This was fixed in 1.5, but the docs > were not updated. This issue is for updating the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14618) RegressionEvaluator doc out of date
[ https://issues.apache.org/jira/browse/SPARK-14618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14618: Assignee: Joseph K. Bradley (was: Apache Spark) > RegressionEvaluator doc out of date > --- > > Key: SPARK-14618 > URL: https://issues.apache.org/jira/browse/SPARK-14618 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > In Spark 1.4, we negated some metrics from RegressionEvaluator since > CrossValidator always maximized metrics. This was fixed in 1.5, but the docs > were not updated. This issue is for updating the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14618) RegressionEvaluator doc out of date
[ https://issues.apache.org/jira/browse/SPARK-14618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240374#comment-15240374 ] Apache Spark commented on SPARK-14618: -- User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/12377 > RegressionEvaluator doc out of date > --- > > Key: SPARK-14618 > URL: https://issues.apache.org/jira/browse/SPARK-14618 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > In Spark 1.4, we negated some metrics from RegressionEvaluator since > CrossValidator always maximized metrics. This was fixed in 1.5, but the docs > were not updated. This issue is for updating the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14618) RegressionEvaluator doc out of date
Joseph K. Bradley created SPARK-14618: - Summary: RegressionEvaluator doc out of date Key: SPARK-14618 URL: https://issues.apache.org/jira/browse/SPARK-14618 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 1.6.1, 1.5.2, 2.0.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor In Spark 1.4, we negated some metrics from RegressionEvaluator since CrossValidator always maximized metrics. This was fixed in 1.5, but the docs were not updated. This issue is for updating the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14619) Track internal accumulators (metrics) by stage attempt rather than stage
Reynold Xin created SPARK-14619: --- Summary: Track internal accumulators (metrics) by stage attempt rather than stage Key: SPARK-14619 URL: https://issues.apache.org/jira/browse/SPARK-14619 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml
[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240364#comment-15240364 ] Joseph K. Bradley commented on SPARK-14489: --- (Oh, I had not refreshed the page before commenting, but it looks like my comments mesh with Nick's.) > RegressionEvaluator returns NaN for ALS in Spark ml > --- > > Key: SPARK-14489 > URL: https://issues.apache.org/jira/browse/SPARK-14489 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0 > Environment: AWS EMR >Reporter: Boris Clémençon > Labels: patch > Original Estimate: 4h > Remaining Estimate: 4h > > When building a Spark ML pipeline containing an ALS estimator, the metrics > "rmse", "mse", "r2" and "mae" all return NaN. > The reason is in CrossValidator.scala line 109. The K-folds are randomly > generated. For large and sparse datasets, there is a significant probability > that at least one user of the validation set is missing in the training set, > hence generating a few NaN estimation with transform method and NaN > RegressionEvaluator's metrics too. > Suggestion to fix the bug: remove the NaN values while computing the rmse or > other metrics (ie, removing users or items in validation test that is missing > in the learning set). Send logs when this happen. > Issue SPARK-14153 seems to be the same pbm > {code:title=Bar.scala|borderStyle=solid} > val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0) > splits.zipWithIndex.foreach { case ((training, validation), splitIndex) => > val trainingDataset = sqlCtx.createDataFrame(training, schema).cache() > val validationDataset = sqlCtx.createDataFrame(validation, > schema).cache() > // multi-model training > logDebug(s"Train split $splitIndex with multiple sets of parameters.") > val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]] > trainingDataset.unpersist() > var i = 0 > while (i < numModels) { > // TODO: duplicate evaluator to take extra params from input > val metric = eval.evaluate(models(i).transform(validationDataset, > epm(i))) > logDebug(s"Got metric $metric for model trained with ${epm(i)}.") > metrics(i) += metric > i += 1 > } > validationDataset.unpersist() > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml
[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240360#comment-15240360 ] Joseph K. Bradley commented on SPARK-14489: --- I'd like to try to separate a few issues here based on use cases and suggest the "right thing to do" in each case: * Deploying an ALSModel to make predictions: The model should make best-effort predictions, even for new users. I'd say new users should get recommendations based on the average user, for both the explicit and implicit settings. Providing a Param which makes the model output NaN for unknown users seems reasonable as an additional feature. * Evaluating an ALSModel on a held-out dataset: This is the same as the first case; the model should behave the same way it will when deployed. * Model tuning using CrossValidator: I'm less sure about this. Both of your suggestions seem reasonable (either returning NaN for missing users and ignoring NaN in the evaluator, or making best-effort predictions for all users). I also suspect it would be worthwhile to examine the literature to find what tends to be best. E.g., should CrossValidator handle ranking specially by doing stratified sampling to divide each user or item's ratings evenly across folds of CV? If we want the evaluator to be able to ignore NaNs, then I'd prefer we keep the current behavior as the default and provide a Param which allows users to ignore NaNs. I'd be afraid of linear models not having enough regularization, getting NaNs in the coefficients, having all of their predictions ignored by the evaluator, etc. What do you think? > RegressionEvaluator returns NaN for ALS in Spark ml > --- > > Key: SPARK-14489 > URL: https://issues.apache.org/jira/browse/SPARK-14489 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0 > Environment: AWS EMR >Reporter: Boris Clémençon > Labels: patch > Original Estimate: 4h > Remaining Estimate: 4h > > When building a Spark ML pipeline containing an ALS estimator, the metrics > "rmse", "mse", "r2" and "mae" all return NaN. > The reason is in CrossValidator.scala line 109. The K-folds are randomly > generated. For large and sparse datasets, there is a significant probability > that at least one user of the validation set is missing in the training set, > hence generating a few NaN estimation with transform method and NaN > RegressionEvaluator's metrics too. > Suggestion to fix the bug: remove the NaN values while computing the rmse or > other metrics (ie, removing users or items in validation test that is missing > in the learning set). Send logs when this happen. 
> Issue SPARK-14153 seems to be the same pbm > {code:title=Bar.scala|borderStyle=solid} > val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0) > splits.zipWithIndex.foreach { case ((training, validation), splitIndex) => > val trainingDataset = sqlCtx.createDataFrame(training, schema).cache() > val validationDataset = sqlCtx.createDataFrame(validation, > schema).cache() > // multi-model training > logDebug(s"Train split $splitIndex with multiple sets of parameters.") > val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]] > trainingDataset.unpersist() > var i = 0 > while (i < numModels) { > // TODO: duplicate evaluator to take extra params from input > val metric = eval.evaluate(models(i).transform(validationDataset, > epm(i))) > logDebug(s"Got metric $metric for model trained with ${epm(i)}.") > metrics(i) += metric > i += 1 > } > validationDataset.unpersist() > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
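One way to make the "ignore NaNs" option above concrete is to filter NaN predictions out before the metric is computed. The sketch below is only an illustration of that idea, not Spark's implementation or a proposed API; the helper name evaluateIgnoringNaN and the column names "prediction" and "rating" are assumptions.
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, isnan}
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Hedged sketch: drop rows whose prediction is NaN (e.g. users or items unseen in the
// training fold) before computing RMSE, instead of letting NaN poison the metric.
def evaluateIgnoringNaN(predictions: DataFrame): Double = {
  val cleaned = predictions.filter(!isnan(col("prediction")))
  if (cleaned.count() == 0) {
    Double.NaN // nothing left to score; surface that loudly rather than returning 0
  } else {
    new RegressionEvaluator()
      .setMetricName("rmse")
      .setLabelCol("rating")
      .setPredictionCol("prediction")
      .evaluate(cleaned)
  }
}
{code}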
[jira] [Assigned] (SPARK-14614) Add `bround` function
[ https://issues.apache.org/jira/browse/SPARK-14614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14614: Assignee: (was: Apache Spark) > Add `bround` function > - > > Key: SPARK-14614 > URL: https://issues.apache.org/jira/browse/SPARK-14614 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Dongjoon Hyun > > This issue aims to add `bound` function (aka Banker's round) by extending > current `round` implementation. > Hive supports `bround` since 1.3.0. [Language > Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF]. > {code} > hive> select round(2.5), bround(2.5); > OK > 3.0 2.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14614) Add `bround` function
[ https://issues.apache.org/jira/browse/SPARK-14614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240351#comment-15240351 ] Apache Spark commented on SPARK-14614: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/12376 > Add `bround` function > - > > Key: SPARK-14614 > URL: https://issues.apache.org/jira/browse/SPARK-14614 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Dongjoon Hyun > > This issue aims to add `bound` function (aka Banker's round) by extending > current `round` implementation. > Hive supports `bround` since 1.3.0. [Language > Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF]. > {code} > hive> select round(2.5), bround(2.5); > OK > 3.0 2.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14614) Add `bround` function
[ https://issues.apache.org/jira/browse/SPARK-14614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14614: Assignee: Apache Spark > Add `bround` function > - > > Key: SPARK-14614 > URL: https://issues.apache.org/jira/browse/SPARK-14614 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > This issue aims to add `bound` function (aka Banker's round) by extending > current `round` implementation. > Hive supports `bround` since 1.3.0. [Language > Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF]. > {code} > hive> select round(2.5), bround(2.5); > OK > 3.0 2.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14617) Remove deprecated APIs in TaskMetrics
[ https://issues.apache.org/jira/browse/SPARK-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14617: Assignee: Apache Spark (was: Reynold Xin) > Remove deprecated APIs in TaskMetrics > - > > Key: SPARK-14617 > URL: https://issues.apache.org/jira/browse/SPARK-14617 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Apache Spark > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14617) Remove deprecated APIs in TaskMetrics
[ https://issues.apache.org/jira/browse/SPARK-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14617: Assignee: Reynold Xin (was: Apache Spark) > Remove deprecated APIs in TaskMetrics > - > > Key: SPARK-14617 > URL: https://issues.apache.org/jira/browse/SPARK-14617 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14617) Remove deprecated APIs in TaskMetrics
[ https://issues.apache.org/jira/browse/SPARK-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240331#comment-15240331 ] Apache Spark commented on SPARK-14617: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/12375 > Remove deprecated APIs in TaskMetrics > - > > Key: SPARK-14617 > URL: https://issues.apache.org/jira/browse/SPARK-14617 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14617) Remove deprecated APIs in TaskMetrics
[ https://issues.apache.org/jira/browse/SPARK-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-14617: Summary: Remove deprecated APIs in TaskMetrics (was: Remove deprecated APIs in accumulators) > Remove deprecated APIs in TaskMetrics > - > > Key: SPARK-14617 > URL: https://issues.apache.org/jira/browse/SPARK-14617 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14617) Remove deprecated APIs in accumulators
Reynold Xin created SPARK-14617: --- Summary: Remove deprecated APIs in accumulators Key: SPARK-14617 URL: https://issues.apache.org/jira/browse/SPARK-14617 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency
[ https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240322#comment-15240322 ] Xiangrui Meng commented on SPARK-13944: --- There are more production workflows using RDD-based APIs than DataFrame-based APIs since many users are still running Spark 1.4 or earlier. It would be nice if we could keep binary compatibility for the RDD-based APIs in Spark 2.0. Using a type alias is not a good solution because 1) it is not Java-compatible, and 2) it introduces a dependency from the RDD-based API on mllib-local, which means future development on mllib-local might cause behavior changes or breaking changes to the RDD-based API. Since we have already decided that the RDD-based API will go into maintenance mode in Spark 2.0, leaving some old code there won't increase the maintenance cost compared with the type alias. We can provide a converter that converts all `mllib.linalg` types to `ml.linalg` types in Spark 2.0 to help users migrate to `ml.linalg`. > Separate out local linear algebra as a standalone module without Spark > dependency > - > > Key: SPARK-13944 > URL: https://issues.apache.org/jira/browse/SPARK-13944 > Project: Spark > Issue Type: New Feature > Components: Build, ML >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: DB Tsai >Priority: Blocker > > Separate out linear algebra as a standalone module without Spark dependency > to simplify production deployment. We can call the new module > spark-mllib-local, which might contain local models in the future. > The major issue is to remove dependencies on user-defined types. > The package name will be changed from mllib to ml. For example, Vector will > be changed from `org.apache.spark.mllib.linalg.Vector` to > `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML > pipeline will be the one in ML package; however, the existing mllib code will > not be touched. As a result, this will potentially break the API. Also, when > the vector is loaded from mllib vector by Spark SQL, the vector will > automatically converted into the one in ml package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
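To make the converter idea above concrete, here is a minimal sketch, assuming the new `ml.linalg` classes keep the same constructors as today's `mllib.linalg` ones; the object name, the `toML` method, and the implicit helper are illustrative only, not the utilities that will actually ship.
{code}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector => OldVector}
import org.apache.spark.ml.linalg.{DenseVector => NewDenseVector, SparseVector => NewSparseVector, Vector => NewVector}

// Hypothetical migration helper: copy an mllib.linalg vector into the new ml.linalg type.
object VectorConverters {
  def toML(v: OldVector): NewVector = v match {
    case dv: DenseVector  => new NewDenseVector(dv.values)
    case sv: SparseVector => new NewSparseVector(sv.size, sv.indices, sv.values)
  }

  // Optional sugar so old vectors can be converted in place during migration.
  implicit class OldVectorOps(val v: OldVector) extends AnyVal {
    def asML: NewVector = toML(v)
  }
}
{code}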
[jira] [Assigned] (SPARK-14607) Partition pruning is case sensitive even with HiveContext
[ https://issues.apache.org/jira/browse/SPARK-14607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-14607: -- Assignee: Davies Liu > Partition pruning is case sensitive even with HiveContext > - > > Key: SPARK-14607 > URL: https://issues.apache.org/jira/browse/SPARK-14607 > Project: Spark > Issue Type: Bug >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > It should not be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14484) Fail to create parquet filter if the column name does not match exactly
[ https://issues.apache.org/jira/browse/SPARK-14484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-14484. Resolution: Fixed Assignee: Davies Liu > Fail to create parquet filter if the column name does not match exactly > --- > > Key: SPARK-14484 > URL: https://issues.apache.org/jira/browse/SPARK-14484 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > There will be exception about "no key found" from > ParquetFilters.createFilter() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14607) Partition pruning is case sensitive even with HiveContext
[ https://issues.apache.org/jira/browse/SPARK-14607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-14607. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12371 [https://github.com/apache/spark/pull/12371] > Partition pruning is case sensitive even with HiveContext > - > > Key: SPARK-14607 > URL: https://issues.apache.org/jira/browse/SPARK-14607 > Project: Spark > Issue Type: Bug >Reporter: Davies Liu > Fix For: 2.0.0 > > > It should not be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14559) Netty RPC didn't check channel is active before sending message
[ https://issues.apache.org/jira/browse/SPARK-14559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240299#comment-15240299 ] Shixiong Zhu commented on SPARK-14559: -- When this happens? When you are stopping the SparkContext? > Netty RPC didn't check channel is active before sending message > --- > > Key: SPARK-14559 > URL: https://issues.apache.org/jira/browse/SPARK-14559 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1 > Environment: spark1.6.1 hadoop2.2.0 jdk1.8.0_65 >Reporter: cen yuhai > > I have a long-running service. After running for serveral hours, It throwed > these exceptions. I found that before sending rpc request by calling sendRpc > method in TransportClient, there is no check that whether the channel is > still open or active ? > java.nio.channels.ClosedChannelException > 4865 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 5635696155204230556 to > bigdata-arch-hdp407.bh.diditaxi.com/10.234.23.107:55197: java.nio. > channels.ClosedChannelException > 4866 java.nio.channels.ClosedChannelException > 4867 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 7319486003318455703 to > bigdata-arch-hdp1235.bh.diditaxi.com/10.168.145.239:36439: java.nio. > channels.ClosedChannelException > 4868 java.nio.channels.ClosedChannelException > 4869 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 9041854451893215954 to > bigdata-arch-hdp1398.bh.diditaxi.com/10.248.117.216:26801: java.nio. > channels.ClosedChannelException > 4870 java.nio.channels.ClosedChannelException > 4871 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 6046473497871624501 to > bigdata-arch-hdp948.bh.diditaxi.com/10.118.114.81:41903: java.nio. > channels.ClosedChannelException > 4872 java.nio.channels.ClosedChannelException > 4873 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 9085605650438705047 to > bigdata-arch-hdp1126.bh.diditaxi.com/10.168.146.78:27023: java.nio. > channels.ClosedChannelException -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
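A minimal sketch of the check the reporter is asking for, assuming a plain Netty Channel; this is not Spark's actual TransportClient code, and the method name is made up for illustration.
{code}
import java.io.IOException
import io.netty.channel.Channel

// Hedged sketch: fail fast with a descriptive error when the channel is already
// closed or inactive, instead of surfacing a late ClosedChannelException.
def sendRpcIfActive(channel: Channel, message: AnyRef): Unit = {
  if (channel == null || !channel.isActive) {
    throw new IOException(s"Cannot send RPC: channel $channel is closed or inactive")
  }
  channel.writeAndFlush(message)
}
{code}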
[jira] [Commented] (SPARK-14614) Add `bround` function
[ https://issues.apache.org/jira/browse/SPARK-14614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240300#comment-15240300 ] Dongjoon Hyun commented on SPARK-14614: --- Since 1.3.0. :) > Add `bround` function > - > > Key: SPARK-14614 > URL: https://issues.apache.org/jira/browse/SPARK-14614 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Dongjoon Hyun > > This issue aims to add `bound` function (aka Banker's round) by extending > current `round` implementation. > Hive supports `bround` since 1.3.0. [Language > Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF]. > {code} > hive> select round(2.5), bround(2.5); > OK > 3.0 2.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14614) Add `bround` function
[ https://issues.apache.org/jira/browse/SPARK-14614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240303#comment-15240303 ] Dongjoon Hyun commented on SPARK-14614: --- I'll send a PR soon. Actually, I tested hive 2.0 today. > Add `bround` function > - > > Key: SPARK-14614 > URL: https://issues.apache.org/jira/browse/SPARK-14614 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Dongjoon Hyun > > This issue aims to add `bound` function (aka Banker's round) by extending > current `round` implementation. > Hive supports `bround` since 1.3.0. [Language > Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF]. > {code} > hive> select round(2.5), bround(2.5); > OK > 3.0 2.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency
[ https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240291#comment-15240291 ] DB Tsai commented on SPARK-13944: - Can you elaborate on the automatic conversion in VectorUDT? We will add some utilities for converting the vectors. Implicit conversions will be provided to help users migrate to the new vector. Thanks. > Separate out local linear algebra as a standalone module without Spark > dependency > - > > Key: SPARK-13944 > URL: https://issues.apache.org/jira/browse/SPARK-13944 > Project: Spark > Issue Type: New Feature > Components: Build, ML >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: DB Tsai >Priority: Blocker > > Separate out linear algebra as a standalone module without Spark dependency > to simplify production deployment. We can call the new module > spark-mllib-local, which might contain local models in the future. > The major issue is to remove dependencies on user-defined types. > The package name will be changed from mllib to ml. For example, Vector will > be changed from `org.apache.spark.mllib.linalg.Vector` to > `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML > pipeline will be the one in ML package; however, the existing mllib code will > not be touched. As a result, this will potentially break the API. Also, when > the vector is loaded from mllib vector by Spark SQL, the vector will > automatically converted into the one in ml package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14610) Remove superfluous split from random forest findSplitsForContinousFeature
[ https://issues.apache.org/jira/browse/SPARK-14610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14610: Assignee: Apache Spark > Remove superfluous split from random forest findSplitsForContinousFeature > - > > Key: SPARK-14610 > URL: https://issues.apache.org/jira/browse/SPARK-14610 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson >Assignee: Apache Spark > > Currently, the method findSplitsForContinuousFeature in random forest > produces an unnecessary split. For example, if a continuous feature has > unique values: (1, 2, 3), then the possible splits generated by this method > are: > * {1|2,3} > * {1,2|3} > * {1,2,3|} > The following unit test is quite clearly incorrect: > {code:title=rf.scala|borderStyle=solid} > val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble) > val splits = > RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) > assert(splits.length === 3) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14610) Remove superfluous split from random forest findSplitsForContinousFeature
[ https://issues.apache.org/jira/browse/SPARK-14610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14610: Assignee: (was: Apache Spark) > Remove superfluous split from random forest findSplitsForContinousFeature > - > > Key: SPARK-14610 > URL: https://issues.apache.org/jira/browse/SPARK-14610 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson > > Currently, the method findSplitsForContinuousFeature in random forest > produces an unnecessary split. For example, if a continuous feature has > unique values: (1, 2, 3), then the possible splits generated by this method > are: > * {1|2,3} > * {1,2|3} > * {1,2,3|} > The following unit test is quite clearly incorrect: > {code:title=rf.scala|borderStyle=solid} > val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble) > val splits = > RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) > assert(splits.length === 3) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14610) Remove superfluous split from random forest findSplitsForContinousFeature
[ https://issues.apache.org/jira/browse/SPARK-14610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240274#comment-15240274 ] Apache Spark commented on SPARK-14610: -- User 'sethah' has created a pull request for this issue: https://github.com/apache/spark/pull/12374 > Remove superfluous split from random forest findSplitsForContinousFeature > - > > Key: SPARK-14610 > URL: https://issues.apache.org/jira/browse/SPARK-14610 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson > > Currently, the method findSplitsForContinuousFeature in random forest > produces an unnecessary split. For example, if a continuous feature has > unique values: (1, 2, 3), then the possible splits generated by this method > are: > * {1|2,3} > * {1,2|3} > * {1,2,3|} > The following unit test is quite clearly incorrect: > {code:title=rf.scala|borderStyle=solid} > val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble) > val splits = > RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) > assert(splits.length === 3) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
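As an illustration of why the last split is superfluous (this is not Spark's actual findSplitsForContinuousFeature logic): with n distinct values there are at most n - 1 thresholds that separate anything, since a split that keeps every value on one side partitions nothing. The helper below is a simplified sketch that ignores the binning the real method performs on large samples.
{code}
// Simplified sketch: candidate thresholds for a continuous feature, interpreted as
// "value <= t goes left"; the largest distinct value never yields a useful split.
def candidateThresholds(featureSamples: Array[Double]): Array[Double] = {
  val distinctSorted = featureSamples.distinct.sorted
  distinctSorted.dropRight(1)
}

// candidateThresholds(Array(1, 1, 2, 2, 2, 3).map(_.toDouble)) returns Array(1.0, 2.0)
{code}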
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240264#comment-15240264 ] Stephane Maarek commented on SPARK-12741: - Hi Sean, I'm not sure what you mean... In my SO post: when I run against just a Spark shell, it works. When I run against YARN with spark-shell --master yarn --deploy-mode client, the outcome of (firstCount, secondCount) is never the same twice, although the more I run the count in the same session, the closer it converges to the right result. When I run just a spark-shell, I get the right result right away. I'm sorry, I can't share any data... the test case would literally be firstCount = 2395. It also seems that the behavior happens on smaller datasets, but I can't tell for sure. I'm sorry I can't help more... If you have detailed steps or tests you'd like me to run, let me know. Regards, Stephane > DataFrame count method return wrong size. > - > > Key: SPARK-12741 > URL: https://issues.apache.org/jira/browse/SPARK-12741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Sasi > > Hi, > I'm updating my report. > I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I > have 2 method, one for collect data and other for count. > method doQuery looks like: > {code} > dataFrame.collect() > {code} > method doQueryCount looks like: > {code} > dataFrame.count() > {code} > I have few scenarios with few results: > 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0 > 2) 3 rows exists results: count 0 and collect 3. > 3) 5 rows exists results: count 2 and collect 5. > I tried to change the count code to the below code, but got the same results > as I mentioned above. > {code} > dataFrame.sql("select count(*) from tbl").count/collect[0] > {code} > Thanks, > Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14614) Add `bround` function
[ https://issues.apache.org/jira/browse/SPARK-14614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240259#comment-15240259 ] Bo Meng commented on SPARK-14614: - I have tried it on Hive 1.2.1; this function actually seems to have been dropped. > Add `bround` function > - > > Key: SPARK-14614 > URL: https://issues.apache.org/jira/browse/SPARK-14614 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Dongjoon Hyun > > This issue aims to add `bound` function (aka Banker's round) by extending > current `round` implementation. > Hive supports `bround` since 1.3.0. [Language > Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF]. > {code} > hive> select round(2.5), bround(2.5); > OK > 3.0 2.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240252#comment-15240252 ] Joseph K. Bradley commented on SPARK-14409: --- Thanks for writing this! I just made a few comments too. Wrapping RankingMetrics seems fine to me, though later on it would be worth re-implementing it using DataFrames and testing performance changes. The initial PR should not add new metrics, but follow-up ones can. Also, we'll need to follow up this issue with one to think about how to use ALS with CrossValidator. I'll comment on the linked JIRA for that. > Investigate adding a RankingEvaluator to ML > --- > > Key: SPARK-14409 > URL: https://issues.apache.org/jira/browse/SPARK-14409 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Nick Pentreath >Priority: Minor > > {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no > {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful > for recommendation evaluation (and can be useful in other settings > potentially). > Should be thought about in conjunction with adding the "recommendAll" methods > in SPARK-13857, so that top-k ranking metrics can be used in cross-validators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
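For concreteness, a wrapper along the lines discussed above could simply delegate to the existing mllib RankingMetrics. The sketch below only illustrates that wrapping idea (the function name and the choice of NDCG@k are arbitrary); it is not a proposed ml.evaluation API.
{code}
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.rdd.RDD

// Hedged sketch: compute a top-k ranking metric from per-user pairs of
// (recommended item ids, relevant item ids) by delegating to mllib's RankingMetrics.
def ndcgAtK(perUserRankings: RDD[(Array[Int], Array[Int])], k: Int): Double = {
  val metrics = new RankingMetrics(perUserRankings)
  metrics.ndcgAt(k)
}
{code}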
[jira] [Updated] (SPARK-14616) TreeNodeException running Q44 and 58 on Parquet tables
[ https://issues.apache.org/jira/browse/SPARK-14616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14616: --- Description: {code:title=tpcds q44} select asceding.rnk, i1.i_product_name best_performing, i2.i_product_name worst_performing from(select * from (select item_sk,rank() over (order by rank_col asc) rnk from (select ss_item_sk item_sk,avg(ss_net_profit) rank_col from store_sales ss1 where ss_store_sk = 4 group by ss_item_sk having avg(ss_net_profit) > 0.9*(select avg(ss_net_profit) rank_col from store_sales where ss_store_sk = 4 and ss_addr_sk is null group by ss_store_sk))V1)V11 where rnk < 11) asceding, (select * from (select item_sk,rank() over (order by rank_col desc) rnk from (select ss_item_sk item_sk,avg(ss_net_profit) rank_col from store_sales ss1 where ss_store_sk = 4 group by ss_item_sk having avg(ss_net_profit) > 0.9*(select avg(ss_net_profit) rank_col from store_sales where ss_store_sk = 4 and ss_addr_sk is null group by ss_store_sk))V2)V21 where rnk < 11) descending, item i1, item i2 where asceding.rnk = descending.rnk and i1.i_item_sk=asceding.item_sk and i2.i_item_sk=descending.item_sk order by asceding.rnk limit 100; {code} {noformat} bin/spark-sql --driver-memory 10g --verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 --executor-memory 4g --num-executors 80 --executor-cores 2 --database hadoopds1g -f q44.sql {noformat} {noformat} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: Exchange SinglePartition, None +- WholeStageCodegen : +- Project [item_sk#0,rank_col#1] : +- Filter havingCondition#219: boolean :+- TungstenAggregate(key=[ss_item_sk#12], functions=[(avg(ss_net_profit#32),mode=Final,isDistinct=false)], output=[havingCondition#219,item_sk#0,rank_col#1]) : +- INPUT +- Exchange hashpartitioning(ss_item_sk#12,200), None +- WholeStageCodegen : +- TungstenAggregate(key=[ss_item_sk#12], functions=[(avg(ss_net_profit#32),mode=Partial,isDistinct=false)], output=[ss_item_sk#12,sum#612,count#613L]) : +- Project [ss_item_sk#12,ss_net_profit#32] :+- Filter (ss_store_sk#17 = 4) : +- INPUT +- Scan ParquetRelation: hadoopds1g.store_sales[ss_item_sk#12,ss_net_profit#32,ss_store_sk#17] InputPaths: hdfs://bigaperf116.svl.ibm.com:8020/apps/hive/warehouse/hadoopds1g.db/store_sales, PushedFilters: [EqualTo(ss_store_sk,4)] at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:105) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at org.apache.spark.sql.execution.Sort.doExecute(Sort.scala:60) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at org.apache.spark.sql.execution.Window.doExecute(Window.scala:288) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116) at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at org.apache.spark.sql.execution.InputAdapter.upstream(WholeStageCodegen.scala:176) at org.apache.spark.sql.execution.Filter.upstream(basicOperators.scala:73) at org.apache.spark.sql.execution.Project.upstream(basicOperators.scala:35) at org.apache.spark.sql.execution.WholeStageCodegen.doExecute(WholeStageCodegen.scala:279) at org.apache.spark.sql.execution.SparkPlan$$anonfun$exec
[jira] [Updated] (SPARK-14616) TreeNodeException running Q44 and 58 on Parquet tables
[ https://issues.apache.org/jira/browse/SPARK-14616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14616: --- Environment: (was: spark 1.5.1 (official binary distribution) running on hadoop yarn 2.6 with parquet 1.5.0 (both from cdh5.4.8)) > TreeNodeException running Q44 and 58 on Parquet tables > -- > > Key: SPARK-14616 > URL: https://issues.apache.org/jira/browse/SPARK-14616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN > > {code:title=/tmp/bug.py} > from pyspark import SparkContext > from pyspark.sql import SQLContext, Row > sc = SparkContext() > sqlc = SQLContext(sc) > R = Row('id', 'foo') > r = sqlc.createDataFrame(sc.parallelize([R('abc', 'foo')])) > q = sqlc.createDataFrame(sc.parallelize([R('', > 'bar')])) > q.write.parquet('/tmp/1.parq') > q = sqlc.read.parquet('/tmp/1.parq') > j = r.join(q, r.id == q.id) > print j.count() > {code} > {noformat} > [user@sandbox test]$ spark-submit --executor-memory=32g /tmp/bug.py > [user@sandbox test]$ hadoop fs -rmr /tmp/1.parq > {noformat} > {noformat} > 15/11/04 04:28:38 INFO codegen.GenerateUnsafeProjection: Code generated in > 119.90324 ms > Traceback (most recent call last): > File "/tmp/bug.py", line 13, in > print j.count() > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line > 268, in count > File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 538, in __call__ > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, > in deco > File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", > line 300, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o148.count. > : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, > tree: > TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], > output=[count#10L]) > TungstenExchange SinglePartition > TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], > output=[currentCount#13L]) >TungstenProject > BroadcastHashJoin [id#0], [id#8], BuildRight > TungstenProject [id#0] > Scan PhysicalRDD[id#0,foo#1] > ConvertToUnsafe > Scan ParquetRelation[hdfs:///tmp/1.parq][id#8] > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385) > at > org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384) > at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {noformat} > Note this happens only under following condition: > # executor memory >= 32GB (doesn't fail with up to 31 GB) > # the ID in the q dataframe has exactly 24 chars (doesn't fail with less or > more then 24 chars) > # q is read from parquet -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
[jira] [Updated] (SPARK-14616) TreeNodeException running Q44 and 58 on Parquet tables
[ https://issues.apache.org/jira/browse/SPARK-14616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14616: --- Affects Version/s: (was: 1.5.1) 2.0.0 > TreeNodeException running Q44 and 58 on Parquet tables > -- > > Key: SPARK-14616 > URL: https://issues.apache.org/jira/browse/SPARK-14616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: spark 1.5.1 (official binary distribution) running on > hadoop yarn 2.6 with parquet 1.5.0 (both from cdh5.4.8) >Reporter: JESSE CHEN > > {code:title=/tmp/bug.py} > from pyspark import SparkContext > from pyspark.sql import SQLContext, Row > sc = SparkContext() > sqlc = SQLContext(sc) > R = Row('id', 'foo') > r = sqlc.createDataFrame(sc.parallelize([R('abc', 'foo')])) > q = sqlc.createDataFrame(sc.parallelize([R('', > 'bar')])) > q.write.parquet('/tmp/1.parq') > q = sqlc.read.parquet('/tmp/1.parq') > j = r.join(q, r.id == q.id) > print j.count() > {code} > {noformat} > [user@sandbox test]$ spark-submit --executor-memory=32g /tmp/bug.py > [user@sandbox test]$ hadoop fs -rmr /tmp/1.parq > {noformat} > {noformat} > 15/11/04 04:28:38 INFO codegen.GenerateUnsafeProjection: Code generated in > 119.90324 ms > Traceback (most recent call last): > File "/tmp/bug.py", line 13, in > print j.count() > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line > 268, in count > File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 538, in __call__ > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, > in deco > File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", > line 300, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o148.count. 
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, > tree: > TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], > output=[count#10L]) > TungstenExchange SinglePartition > TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], > output=[currentCount#13L]) >TungstenProject > BroadcastHashJoin [id#0], [id#8], BuildRight > TungstenProject [id#0] > Scan PhysicalRDD[id#0,foo#1] > ConvertToUnsafe > Scan ParquetRelation[hdfs:///tmp/1.parq][id#8] > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385) > at > org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384) > at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {noformat} > Note this happens only under following condition: > # executor memory >= 32GB (doesn't fail with up to 31 GB) > # the ID in the q dataframe has exactly 24 chars (doesn't fail with less or > more then 24 chars) > # q is read from parquet -- This message was sent by Atlassian JIRA (v6.3.4#6332) ---
[jira] [Created] (SPARK-14616) TreeNodeException running Q44 and 58 on Parquet tables
JESSE CHEN created SPARK-14616: -- Summary: TreeNodeException running Q44 and 58 on Parquet tables Key: SPARK-14616 URL: https://issues.apache.org/jira/browse/SPARK-14616 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1 Environment: spark 1.5.1 (official binary distribution) running on hadoop yarn 2.6 with parquet 1.5.0 (both from cdh5.4.8) Reporter: JESSE CHEN {code:title=/tmp/bug.py} from pyspark import SparkContext from pyspark.sql import SQLContext, Row sc = SparkContext() sqlc = SQLContext(sc) R = Row('id', 'foo') r = sqlc.createDataFrame(sc.parallelize([R('abc', 'foo')])) q = sqlc.createDataFrame(sc.parallelize([R('', 'bar')])) q.write.parquet('/tmp/1.parq') q = sqlc.read.parquet('/tmp/1.parq') j = r.join(q, r.id == q.id) print j.count() {code} {noformat} [user@sandbox test]$ spark-submit --executor-memory=32g /tmp/bug.py [user@sandbox test]$ hadoop fs -rmr /tmp/1.parq {noformat} {noformat} 15/11/04 04:28:38 INFO codegen.GenerateUnsafeProjection: Code generated in 119.90324 ms Traceback (most recent call last): File "/tmp/bug.py", line 13, in print j.count() File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 268, in count File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, in deco File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o148.count. : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#10L]) TungstenExchange SinglePartition TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#13L]) TungstenProject BroadcastHashJoin [id#0], [id#8], BuildRight TungstenProject [id#0] Scan PhysicalRDD[id#0,foo#1] ConvertToUnsafe Scan ParquetRelation[hdfs:///tmp/1.parq][id#8] at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) at org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:69) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:174) at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385) at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903) at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384) at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at 
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {noformat} Note this happens only under following condition: # executor memory >= 32GB (doesn't fail with up to 31 GB) # the ID in the q dataframe has exactly 24 chars (doesn't fail with less or more then 24 chars) # q is read from parquet -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14615) Use the new ML Vector and Matrix in the ML pipeline based algorithms
[ https://issues.apache.org/jira/browse/SPARK-14615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-14615: --- Assignee: DB Tsai > Use the new ML Vector and Matrix in the ML pipeline based algorithms > - > > Key: SPARK-14615 > URL: https://issues.apache.org/jira/browse/SPARK-14615 > Project: Spark > Issue Type: Sub-task > Components: Build, ML >Reporter: DB Tsai >Assignee: DB Tsai > > Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new > vector and matrix type in the new ml pipeline based apis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14615) Use the new ML Vector and Matrix in the ML pipeline based algorithms
DB Tsai created SPARK-14615: --- Summary: Use the new ML Vector and Matrix in the ML pipeline based algorithms Key: SPARK-14615 URL: https://issues.apache.org/jira/browse/SPARK-14615 Project: Spark Issue Type: Sub-task Reporter: DB Tsai Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14541) SQL function: IFNULL, NULLIF, NVL and NVL2
[ https://issues.apache.org/jira/browse/SPARK-14541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14541: Assignee: Apache Spark > SQL function: IFNULL, NULLIF, NVL and NVL2 > -- > > Key: SPARK-14541 > URL: https://issues.apache.org/jira/browse/SPARK-14541 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > It will be great to have these SQL functions: > IFNULL, NULLIF, NVL, NVL2 > The meaning of these functions could be found in oracle docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14541) SQL function: IFNULL, NULLIF, NVL and NVL2
[ https://issues.apache.org/jira/browse/SPARK-14541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14541: Assignee: (was: Apache Spark) > SQL function: IFNULL, NULLIF, NVL and NVL2 > -- > > Key: SPARK-14541 > URL: https://issues.apache.org/jira/browse/SPARK-14541 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > It will be great to have these SQL functions: > IFNULL, NULLIF, NVL, NVL2 > The meaning of these functions could be found in oracle docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14541) SQL function: IFNULL, NULLIF, NVL and NVL2
[ https://issues.apache.org/jira/browse/SPARK-14541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240206#comment-15240206 ] Apache Spark commented on SPARK-14541: -- User 'bomeng' has created a pull request for this issue: https://github.com/apache/spark/pull/12373 > SQL function: IFNULL, NULLIF, NVL and NVL2 > -- > > Key: SPARK-14541 > URL: https://issues.apache.org/jira/browse/SPARK-14541 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > It will be great to have these SQL functions: > IFNULL, NULLIF, NVL, NVL2 > The meaning of these functions could be found in oracle docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
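For reference, the intended semantics (following the Oracle definitions) can be summarized with a small sketch in plain Scala over reference types; this is illustration only, not how the SQL expressions would be implemented.
{code}
// Hedged sketch of the semantics (illustration only):
object NullFunctions {
  def ifnull[T >: Null](a: T, b: T): T = if (a == null) b else a   // IFNULL(a, b), same as NVL(a, b)
  def nullif[T >: Null](a: T, b: T): T = if (a == b) null else a   // NULLIF(a, b)
  def nvl2[T](a: AnyRef, whenNotNull: T, whenNull: T): T =         // NVL2(a, x, y)
    if (a != null) whenNotNull else whenNull
}
{code}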
[jira] [Created] (SPARK-14614) Add `bound` function
Dongjoon Hyun created SPARK-14614: - Summary: Add `bound` function Key: SPARK-14614 URL: https://issues.apache.org/jira/browse/SPARK-14614 Project: Spark Issue Type: Improvement Components: SQL Reporter: Dongjoon Hyun This issue aims to add `bound` function (aka Banker's round) by extending current `round` implementation. Hive supports `bround` since 1.3.0. [Language Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF]. {code} hive> select round(2.5), bround(2.5); OK 3.0 2.0 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
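For anyone unfamiliar with Banker's rounding, the difference from ordinary rounding is only in how ties (.5) are handled: HALF_UP rounds ties away from zero, while HALF_EVEN rounds ties to the nearest even digit. A quick illustration in plain Scala (not the proposed Spark implementation):
{code}
import scala.math.BigDecimal.RoundingMode

// round-style behavior: ties go up
val roundHalfUp     = BigDecimal("2.5").setScale(0, RoundingMode.HALF_UP)   // 3
// bround-style behavior (Banker's rounding): ties go to the nearest even digit
val broundTwoFive   = BigDecimal("2.5").setScale(0, RoundingMode.HALF_EVEN) // 2
val broundThreeFive = BigDecimal("3.5").setScale(0, RoundingMode.HALF_EVEN) // 4
{code}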
[jira] [Updated] (SPARK-14614) Add `bround` function
[ https://issues.apache.org/jira/browse/SPARK-14614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-14614: -- Summary: Add `bround` function (was: Add `bound` function) > Add `bround` function > - > > Key: SPARK-14614 > URL: https://issues.apache.org/jira/browse/SPARK-14614 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Dongjoon Hyun > > This issue aims to add a `bround` function (aka banker's rounding) by extending > the current `round` implementation. > Hive has supported `bround` since 1.3.0. [Language > Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF]. > {code} > hive> select round(2.5), bround(2.5); > OK > 3.0 2.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
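As a quick illustration of the behavior being requested, a minimal plain-Scala sketch of round-half-to-even; this is only the rounding rule `bround` is meant to expose, not Spark's or Hive's implementation:
{code}
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Round-half-to-even ("banker's rounding"): ties go to the nearest even neighbor.
def bround(x: Double, scale: Int = 0): Double =
  JBigDecimal.valueOf(x).setScale(scale, RoundingMode.HALF_EVEN).doubleValue()

bround(2.5)  // 2.0, whereas ordinary half-up rounding of 2.5 gives 3.0
bround(3.5)  // 4.0
{code}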
[jira] [Created] (SPARK-14613) Add @Since into the matrix and vector classes in spark-mllib-local
DB Tsai created SPARK-14613: --- Summary: Add @Since into the matrix and vector classes in spark-mllib-local Key: SPARK-14613 URL: https://issues.apache.org/jira/browse/SPARK-14613 Project: Spark Issue Type: Sub-task Reporter: DB Tsai In spark-mllib-local, we're no longer able to use the @Since annotation. As a result, we will switch to the standard Javadoc style using /* @Since */. This task will mark those new APIs as @Since 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
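A hedged illustration of the doc-comment style the ticket refers to; the class and version string below are placeholders, not the actual spark-mllib-local API:
{code}
/**
 * A dense vector backed by a double array.
 *
 * @since 2.0.0
 */
class DenseVector(val values: Array[Double])
{code}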
[jira] [Created] (SPARK-14612) Consolidate the version of dependencies in mllib and mllib-local into one place
DB Tsai created SPARK-14612: --- Summary: Consolidate the version of dependencies in mllib and mllib-local into one place Key: SPARK-14612 URL: https://issues.apache.org/jira/browse/SPARK-14612 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: DB Tsai Both spark-mllib-local and spark-mllib depend on breeze, but we specify the version of breeze in both pom files. Also, org.json4s has the same issue. For maintainability, we should define the versions of these dependencies in one place. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
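To show the pattern being asked for, an illustrative sbt-style Scala sketch only; Spark's actual build declares these versions in its Maven poms, and the version strings below are placeholders:
{code}
// Declare each third-party version exactly once...
object DependencyVersions {
  val breeze = "x.y.z"
  val json4s = "x.y.z"
}
// ...and have both spark-mllib and spark-mllib-local reference it,
// so bumping a dependency is a single-line change.
{code}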
[jira] [Closed] (SPARK-14457) Write an end-to-end test for DataSet with UDT
[ https://issues.apache.org/jira/browse/SPARK-14457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joan Goyeau closed SPARK-14457. --- Resolution: Fixed > Write an end-to-end test for DataSet with UDT > > > Key: SPARK-14457 > URL: https://issues.apache.org/jira/browse/SPARK-14457 > Project: Spark > Issue Type: Task > Components: Tests >Reporter: Joan Goyeau >Priority: Minor > > I don't know if UDTs are supported by DataSets yet, but if so we should write > at least an end-to-end test for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7861) Python wrapper for OneVsRest
[ https://issues.apache.org/jira/browse/SPARK-7861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7861: - Shepherd: Joseph K. Bradley Assignee: Xusen Yin (was: Ram Sriharsha) Component/s: ML Issue Type: New Feature (was: Improvement) > Python wrapper for OneVsRest > > > Key: SPARK-7861 > URL: https://issues.apache.org/jira/browse/SPARK-7861 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Ram Sriharsha >Assignee: Xusen Yin >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
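For reference, a minimal sketch of the existing Scala API that the Python wrapper would mirror; it assumes DataFrames `training` and `test` with the usual "label"/"features" columns, which are not defined here:
{code}
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

// One-vs-rest fits one binary classifier per class and picks the most
// confident one at prediction time.
val base  = new LogisticRegression().setMaxIter(10)
val ovr   = new OneVsRest().setClassifier(base)
val model = ovr.fit(training)
val predictions = model.transform(test)
{code}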
[jira] [Created] (SPARK-14611) Second attempt observed after AM fails due to max number of executor failures in first attempt
Kshitij Badani created SPARK-14611: -- Summary: Second attempt observed after AM fails due to max number of executor failures in first attempt Key: SPARK-14611 URL: https://issues.apache.org/jira/browse/SPARK-14611 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.1 Environment: RHEL7 64 bit Reporter: Kshitij Badani I submitted a Spark application in yarn-cluster mode. My cluster has two NodeManagers. After submitting the application, I restarted the NodeManager on node1, which was actively running a few executors; this node was not running the AM. While that NodeManager was restarting, 3 of the executors running on node2 failed with 'failed to connect to external shuffle server', as follows:
{code}
java.io.IOException: Failed to connect to node1
 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
 at org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
 at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
 at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:211)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:208)
 at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:194)
 at org.apache.spark.executor.Executor.(Executor.scala:86)
 at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
 at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
 at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
 at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
 at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: node1
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
 at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
 at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
 at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
{code}
Each of the 3 executors tried to connect to the external shuffle service 2 more times, all while the NM on node1 was restarting, and eventually failed. Since 3 executors failed, the AM exited with FAILURE status, and I can see the following message in the application logs:
{code}
INFO ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures (3) reached)
{code}
After this, we saw a 2nd application attempt, which succeeded because the NM had come back up.
Should we see a 2nd attempt in such scenarios, where multiple executors failed in the 1st attempt because they could not connect to the external shuffle service? And if the 2nd attempt also fails for a similar reason, wouldn't that be a heavy penalty? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
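For readers hitting the same situation, the two settings involved, shown as a hedged sketch; the values are illustrative, not recommendations. spark.yarn.maxAppAttempts caps how many attempts YARN makes for the application, and spark.yarn.max.executor.failures is the threshold that produced the exitCode 11 above.
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.yarn.maxAppAttempts", "1")           // do not retry the whole application
  .set("spark.yarn.max.executor.failures", "10")   // tolerate more executor failures per attempt
{code}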
[jira] [Commented] (SPARK-14541) SQL function: IFNULL, NULLIF, NVL and NVL2
[ https://issues.apache.org/jira/browse/SPARK-14541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240157#comment-15240157 ] Bo Meng commented on SPARK-14541: - I will try to do them one by one. > SQL function: IFNULL, NULLIF, NVL and NVL2 > -- > > Key: SPARK-14541 > URL: https://issues.apache.org/jira/browse/SPARK-14541 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > It would be great to have these SQL functions: > IFNULL, NULLIF, NVL, NVL2 > The meaning of these functions can be found in the Oracle docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14607) Partition pruning is case sensitive even with HiveContext
[ https://issues.apache.org/jira/browse/SPARK-14607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240130#comment-15240130 ] Apache Spark commented on SPARK-14607: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/12371 > Partition pruning is case sensitive even with HiveContext > - > > Key: SPARK-14607 > URL: https://issues.apache.org/jira/browse/SPARK-14607 > Project: Spark > Issue Type: Bug >Reporter: Davies Liu > > It should not be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
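A hypothetical sketch of what the report means; the table, column and `hiveContext` names are made up and assumed to exist. With HiveContext, identifier resolution is case-insensitive, so a filter written with different casing than the declared partition column should still prune partitions rather than scan them all.
{code}
// Partition column declared as lowercase "dt"; the filter uses "DT".
// Expected: only the matching partition is read; reported: pruning is skipped.
hiveContext.sql("SELECT count(*) FROM logs WHERE DT = '2016-04-13'").show()
{code}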
[jira] [Assigned] (SPARK-14484) Fail to create parquet filter if the column name does not match exactly
[ https://issues.apache.org/jira/browse/SPARK-14484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14484: Assignee: Apache Spark > Fail to create parquet filter if the column name does not match exactly > --- > > Key: SPARK-14484 > URL: https://issues.apache.org/jira/browse/SPARK-14484 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > There will be an exception about "no key found" from > ParquetFilters.createFilter(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
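A hypothetical reproduction sketch; the path and column names are made up, and an existing `sqlContext` is assumed. The Parquet schema declares a lowercase "id" column, and referencing it with different casing trips the "no key found" lookup when building the push-down filter.
{code}
val df = sqlContext.read.parquet("/tmp/parquet_with_lowercase_id")
// Filter references "ID" while the Parquet footer records "id".
df.filter("ID > 100").count()
{code}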