[jira] [Commented] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(41,
[ https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15803628#comment-15803628 ] JESSE CHEN commented on SPARK-19068: Well, though it does not affect the correctness of the results, a query that seemingly takes only 30 minutes now taking 2.5 hours is a concern to Spark users. I used the 'spark-sql' shell, so until the shell exits, normal users will not know that the query has actually finished. Plus, Spark hogs resources (memory and cores) until SparkContext exits, so this is a usability and trust issue. I also think this always occurs at high volume and on a large cluster. As Spark is adopted by enterprise users, this issue will be at the forefront. I do think there is a fundamental timing issue here. > Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: > SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(41,WrappedArray()) > -- > > Key: SPARK-19068 > URL: https://issues.apache.org/jira/browse/SPARK-19068 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 > Environment: RHEL 7.2 >Reporter: JESSE CHEN > Attachments: sparklog.tar.gz > > > On a large cluster with 45TB RAM and 1,000 cores, we used 1008 executors in > order to use all RAM and cores for a 100TB Spark SQL workload. Long-running > queries tend to report the following ERRORs > {noformat} > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(136,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(853,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(395,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(736,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(439,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(16,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(307,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(51,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(535,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(63,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(333,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped!
Dropping event > SparkListenerExecutorMetricsUpdate(484,WrappedArray()) > (omitted) > {noformat} > The message itself may be a reasonable response to an already stopped > SparkListenerBus (so subsequent events are thrown away with that ERROR > message). The issue is that SparkContext does NOT exit until all > of these errors/events are reported -- a huge number in our setup -- and > this can take, in some cases, hours. > We tried increasing the event queue size > (Adding default property: spark.scheduler.listenerbus.eventqueue.size=13) > from the 10K default, but this still occurs.
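For reference, a minimal sketch (assumptions: a Spark 2.1-era driver-side SparkConf; only the property name and its 10K default come from this report) of how the listener-bus queue size mentioned above is raised:

{noformat}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("listener-bus-tuning") // hypothetical app name
  // Default is 10000 events; a larger queue delays drops under load but,
  // per this report, does not help once the bus itself has been stopped.
  .set("spark.scheduler.listenerbus.eventqueue.size", "100000")
val sc = new SparkContext(conf)
{noformat}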
[jira] [Updated] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(41,Wr
[ https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-19068: --- Attachment: sparklog.tar.gz This is the Spark console output, in which you can find the settings and the sequence of events. At the end you will see the "never-ending" event-dropping messages. > Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: > SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(41,WrappedArray()) > -- > > Key: SPARK-19068 > URL: https://issues.apache.org/jira/browse/SPARK-19068 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 > Environment: RHEL 7.2 >Reporter: JESSE CHEN > Attachments: sparklog.tar.gz > > > On a large cluster with 45TB RAM and 1,000 cores, we used 1008 executors in > order to use all RAM and cores for a 100TB Spark SQL workload. Long-running > queries tend to report the following ERRORs > {noformat} > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(136,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(853,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(395,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(736,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(439,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(16,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(307,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(51,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(535,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(63,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(333,WrappedArray()) > 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(484,WrappedArray()) > (omitted) > {noformat} > The message itself may be a reasonable response to an already stopped > SparkListenerBus (so subsequent events are thrown away with that ERROR > message). The issue is that SparkContext does NOT exit until all > of these errors/events are reported -- a huge number in our setup -- and > this can take, in some cases, hours. > We tried increasing the event queue size > (Adding default property: spark.scheduler.listenerbus.eventqueue.size=13) > from the 10K default, but this still occurs.
[jira] [Updated] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(41,Wr
[ https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-19068: --- Description: On a large cluster with 45TB RAM and 1,000 cores, we used 1008 executors in order to use all RAM and cores for a 100TB Spark SQL workload. Long-running queries tend to report the following ERRORs {noformat} 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(136,WrappedArray()) 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(853,WrappedArray()) 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(395,WrappedArray()) 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(736,WrappedArray()) 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(439,WrappedArray()) 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(16,WrappedArray()) 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(307,WrappedArray()) 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(51,WrappedArray()) 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(535,WrappedArray()) 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(63,WrappedArray()) 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(333,WrappedArray()) 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(484,WrappedArray()) (omitted) {noformat} The message itself may be a reasonable response to an already stopped SparkListenerBus (so subsequent events are thrown away with that ERROR message). The issue is that SparkContext does NOT exit until all of these errors/events are reported -- a huge number in our setup -- and this can take, in some cases, hours. We tried increasing the event queue size (Adding default property: spark.scheduler.listenerbus.eventqueue.size=13) from the 10K default, but this still occurs. > Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: > SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(41,WrappedArray()) > -- > > Key: SPARK-19068 > URL: https://issues.apache.org/jira/browse/SPARK-19068 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 > Environment: RHEL 7.2 >Reporter: JESSE CHEN > > On a large cluster with 45TB RAM and 1,000 cores, we used 1008 executors in > order to use all RAM and cores for a 100TB Spark SQL workload. Long-running > queries tend to report the following ERRORs > {noformat} > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped!
Dropping event > SparkListenerExecutorMetricsUpdate(136,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(853,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(395,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(736,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(439,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(16,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpdate(307,WrappedArray()) > 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has > already stopped! Dropping event > SparkListenerExecutorMetricsUpda
[jira] [Created] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(41,Wr
JESSE CHEN created SPARK-19068: -- Summary: Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(41,WrappedArray()) Key: SPARK-19068 URL: https://issues.apache.org/jira/browse/SPARK-19068 Project: Spark Issue Type: Bug Affects Versions: 2.1.0 Environment: RHEL 7.2 Reporter: JESSE CHEN
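To make the reported failure mode concrete, here is an illustrative toy sketch -- not Spark's actual LiveListenerBus -- of a bounded event queue that, once stopped, can only log and drop each late metrics event; with ~1008 executors still sending heartbeats, the driver emits one such ERROR line per dropped event before shutdown completes:

{noformat}
import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.atomic.AtomicBoolean

class ToyListenerBus(capacity: Int) {
  private val queue   = new LinkedBlockingQueue[String](capacity)
  private val stopped = new AtomicBoolean(false)

  // Mirrors the behaviour in the log: events posted after stop() are
  // dropped with an error message rather than being delivered.
  def post(event: String): Unit =
    if (stopped.get) System.err.println(s"SparkListenerBus has already stopped! Dropping event $event")
    else if (!queue.offer(event)) System.err.println(s"Dropping event $event (queue full)")

  def stop(): Unit = stopped.set(true)
}
{noformat}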
[jira] [Updated] (SPARK-18745) java.lang.IndexOutOfBoundsException running query 68 Spark SQL on (100TB)
[ https://issues.apache.org/jira/browse/SPARK-18745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-18745: --- Labels: (was: core dump) > java.lang.IndexOutOfBoundsException running query 68 Spark SQL on (100TB) > - > > Key: SPARK-18745 > URL: https://issues.apache.org/jira/browse/SPARK-18745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: JESSE CHEN >Assignee: Kazuaki Ishizaki >Priority: Critical > Fix For: 2.1.0 > > > Running query 68 with decreased executor memory (using 12GB executors instead > of 24GB) on a 100TB parquet database using the Spark master dated 11/04 gave > an IndexOutOfBoundsException. > The query is as follows: > {noformat} > [select c_last_name >,c_first_name >,ca_city >,bought_city >,ss_ticket_number >,extended_price >,extended_tax >,list_price > from (select ss_ticket_number > ,ss_customer_sk > ,ca_city bought_city > ,sum(ss_ext_sales_price) extended_price > ,sum(ss_ext_list_price) list_price > ,sum(ss_ext_tax) extended_tax >from store_sales >,date_dim >,store >,household_demographics >,customer_address >where store_sales.ss_sold_date_sk = date_dim.d_date_sk > and store_sales.ss_store_sk = store.s_store_sk > and store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk > and store_sales.ss_addr_sk = customer_address.ca_address_sk > and date_dim.d_dom between 1 and 2 > and (household_demographics.hd_dep_count = 8 or > household_demographics.hd_vehicle_count= -1) > and date_dim.d_year in (2000,2000+1,2000+2) > and store.s_city in ('Plainview','Rogers') >group by ss_ticket_number >,ss_customer_sk >,ss_addr_sk,ca_city) dn > ,customer > ,customer_address current_addr > where ss_customer_sk = c_customer_sk >and customer.c_current_addr_sk = current_addr.ca_address_sk >and current_addr.ca_city <> bought_city > order by c_last_name > ,ss_ticket_number > limit 100] > {noformat} > Spark output that showed the exception: > {noformat} > org.apache.spark.SparkException: Exception thrown in awaitResult: > at > org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:215) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123) > at > org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecuteBroadcast(Exchange.scala:61) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:231) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) > at >
org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:36) > at > org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:68) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodeg
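The failure surfaces inside BroadcastExchangeExec / BroadcastHashJoinExec in the trace above. As a hedged diagnostic sketch (not a fix proposed in this ticket), one way to test whether the broadcast-join path is implicated is to disable automatic broadcasting, assuming a Spark 2.x session:

{noformat}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("q68-diagnostic").getOrCreate()

// -1 disables automatic broadcast-hash-join selection, forcing
// sort-merge joins instead of the BroadcastHashJoinExec seen in the trace.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
{noformat}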
[jira] [Updated] (SPARK-18745) java.lang.IndexOutOfBoundsException running query 68 Spark SQL on (100TB)
[ https://issues.apache.org/jira/browse/SPARK-18745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-18745: --- Description: Running query 68 with decreased executor memory (using 12GB executors instead of 24GB) on a 100TB parquet database using the Spark master dated 11/04 gave an IndexOutOfBoundsException. The query is as follows: {noformat} [select c_last_name ,c_first_name ,ca_city ,bought_city ,ss_ticket_number ,extended_price ,extended_tax ,list_price from (select ss_ticket_number ,ss_customer_sk ,ca_city bought_city ,sum(ss_ext_sales_price) extended_price ,sum(ss_ext_list_price) list_price ,sum(ss_ext_tax) extended_tax from store_sales ,date_dim ,store ,household_demographics ,customer_address where store_sales.ss_sold_date_sk = date_dim.d_date_sk and store_sales.ss_store_sk = store.s_store_sk and store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk and store_sales.ss_addr_sk = customer_address.ca_address_sk and date_dim.d_dom between 1 and 2 and (household_demographics.hd_dep_count = 8 or household_demographics.hd_vehicle_count= -1) and date_dim.d_year in (2000,2000+1,2000+2) and store.s_city in ('Plainview','Rogers') group by ss_ticket_number ,ss_customer_sk ,ss_addr_sk,ca_city) dn ,customer ,customer_address current_addr where ss_customer_sk = c_customer_sk and customer.c_current_addr_sk = current_addr.ca_address_sk and current_addr.ca_city <> bought_city order by c_last_name ,ss_ticket_number limit 100] {noformat} Spark output that showed the exception: {noformat} org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:215) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:131) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123) at org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecuteBroadcast(Exchange.scala:61) at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:231) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82) at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) at
org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:36) at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:68) at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) at org.apache.spark.sql.execution.joins.SortMergeJoinExec.consume(SortMergeJoinExec.scala:35) at org.apache.spark.sql.execution.joins.SortMergeJoinExec.doProduce(SortMergeJoinExec.scala:560) at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org
[jira] [Created] (SPARK-18745) java.lang.IndexOutOfBoundsException running query 68 Spark SQL on (100TB)
JESSE CHEN created SPARK-18745: -- Summary: java.lang.IndexOutOfBoundsException running query 68 Spark SQL on (100TB) Key: SPARK-18745 URL: https://issues.apache.org/jira/browse/SPARK-18745 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: JESSE CHEN Assignee: Kazuaki Ishizaki Priority: Critical Fix For: 2.1.0 Running a query on a 100TB parquet database using the Spark master dated 11/04 dumps cores on Spark executors. The query is TPCDS query 82 (though this query is not the only one that can produce this core dump, just the easiest one with which to re-create the error). Spark output that showed the exception: {noformat} 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e68_1478924651089_0018_01_74 on host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from container-launch. Container id: container_e68_1478924651089_0018_01_74 Exit code: 134 Exception message: /bin/bash: line 1: 4031216 Aborted (core dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server -Xmx24576m -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 --hostname mer05x.svl.ibm.com --cores 2 --app-id application_1478924651089_0018 --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout 2> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 Aborted (core dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server -Xmx24576m -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 --hostname mer05x.svl.ibm.com --cores 2 --app-id application_1478924651089_0018 --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar --user-class-path
file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout 2> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr at org.apache.hadoop.util.Shell.runCommand(Shell.java:545) at org.apache.hadoop.util.Shell.run(Shell.java:456) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212) at org.apache.hadoop.yarn.server.nodemanager.containermanager.la
[jira] [Updated] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)
[ https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-18458: --- Fix Version/s: (was: 2.0.0) > core dumped running Spark SQL on large data volume (100TB) > -- > > Key: SPARK-18458 > URL: https://issues.apache.org/jira/browse/SPARK-18458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: JESSE CHEN > Labels: core, dump > > Running a query on a 100TB parquet database using the Spark master dated 11/04 > dumps cores on Spark executors. > The query is TPCDS query 82 (though this query is not the only one that can > produce this core dump, just the easiest one with which to re-create the error). > Spark output that showed the exception: > {noformat} > 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: > Container marked as failed: container_e68_1478924651089_0018_01_74 on > host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from > container-launch. > Container id: container_e68_1478924651089_0018_01_74 > Exit code: 134 > Exception message: /bin/bash: line 1: 4031216 Aborted (core > dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java > -server -Xmx24576m > -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' > -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74 > -XX:OnOutOfMemoryError='kill %p' > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 > --hostname mer05x.svl.ibm.com --cores 2 --app-id > application_1478924651089_0018 --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar > > > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout > 2> > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr > Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 > Aborted (core dumped) > /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server > -Xmx24576m > -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' > -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74 > -XX:OnOutOfMemoryError='kill %p' > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 > --hostname mer05x.svl.ibm.com --cores 2 --app-id > application_1478924651089_0018
--user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar > > > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout > 2> > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr > at org.apache.hadoop.util.Shell.runCommand(Shell.java:545) > at org.apache.hadoop.util.Shell.run(Shell.java:456) > at > org.apache.hadoop.util.Shel
[jira] [Updated] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)
[ https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-18458: --- Description: Running a query on a 100TB parquet database using the Spark master dated 11/04 dumps cores on Spark executors. The query is TPCDS query 82 (though this query is not the only one that can produce this core dump, just the easiest one with which to re-create the error). Spark output that showed the exception: {noformat} 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e68_1478924651089_0018_01_74 on host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from container-launch. Container id: container_e68_1478924651089_0018_01_74 Exit code: 134 Exception message: /bin/bash: line 1: 4031216 Aborted (core dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server -Xmx24576m -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 --hostname mer05x.svl.ibm.com --cores 2 --app-id application_1478924651089_0018 --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout 2> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 Aborted (core dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server -Xmx24576m -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 --hostname mer05x.svl.ibm.com --cores 2 --app-id application_1478924651089_0018 --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar --user-class-path
file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout 2> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr at org.apache.hadoop.util.Shell.runCommand(Shell.java:545) at org.apache.hadoop.util.Shell.run(Shell.java:456) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.r
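A small illustrative helper (hypothetical, not from the ticket) showing why "Exit status: 134" and "Aborted (core dumped)" describe the same event: POSIX shells report death-by-signal as 128 plus the signal number, and SIGABRT is 6:

{noformat}
// Decode a shell/YARN container exit status into the terminating signal,
// per the POSIX 128+N convention.
def signalFromExitStatus(status: Int): Option[Int] =
  if (status > 128) Some(status - 128) else None

// 134 -> Some(6), i.e. SIGABRT: the executor JVM aborted and dumped core,
// matching the container diagnostics quoted above.
println(signalFromExitStatus(134))
{noformat}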
[jira] [Updated] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)
[ https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-18458: --- Labels: core dump (was: tpcds-result-mismatch) > core dumped running Spark SQL on large data volume (100TB) > -- > > Key: SPARK-18458 > URL: https://issues.apache.org/jira/browse/SPARK-18458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: JESSE CHEN > Labels: core, dump > Fix For: 2.0.0 > > > Running a query on a 100TB parquet database using the Spark master dated 11/04 > dumps cores on Spark executors. > The query is TPCDS query 82 (though this query is not the only one that can > produce this core dump, just the easiest one with which to re-create the error). > Spark output that showed the exception: > {noformat} > 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: > Container marked as failed: container_e68_1478924651089_0018_01_74 on > host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from > container-launch. > Container id: container_e68_1478924651089_0018_01_74 > Exit code: 134 > Exception message: /bin/bash: line 1: 4031216 Aborted (core > dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java > -server -Xmx24576m > -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' > -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74 > -XX:OnOutOfMemoryError='kill %p' > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 > --hostname mer05x.svl.ibm.com --cores 2 --app-id > application_1478924651089_0018 --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar > > > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout > 2> > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr > Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 > Aborted (core dumped) > /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server > -Xmx24576m > -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' > -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74 > -XX:OnOutOfMemoryError='kill %p' > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 > --hostname mer05x.svl.ibm.com --cores 2 --app-id >
application_1478924651089_0018 --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar > > > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout > 2> > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr > at org.apache.hadoop.util.Shell.runCommand(Shell.java:545) > at org.apache.hadoop.util.Shell.run(Shell.java:4
[jira] [Updated] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)
[ https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-18458: --- Affects Version/s: (was: 1.6.0) 2.1.0 > core dumped running Spark SQL on large data volume (100TB) > -- > > Key: SPARK-18458 > URL: https://issues.apache.org/jira/browse/SPARK-18458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: JESSE CHEN > Labels: core, dump > Fix For: 2.0.0 > > > Running a query on a 100TB parquet database using the Spark master dated 11/04 > dumps cores on Spark executors. > The query is TPCDS query 82 (though this query is not the only one that can > produce this core dump, just the easiest one with which to re-create the error). > Spark output that showed the exception: > {noformat} > 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: > Container marked as failed: container_e68_1478924651089_0018_01_74 on > host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from > container-launch. > Container id: container_e68_1478924651089_0018_01_74 > Exit code: 134 > Exception message: /bin/bash: line 1: 4031216 Aborted (core > dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java > -server -Xmx24576m > -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' > -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74 > -XX:OnOutOfMemoryError='kill %p' > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 > --hostname mer05x.svl.ibm.com --cores 2 --app-id > application_1478924651089_0018 --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar > > > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout > 2> > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr > Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 > Aborted (core dumped) > /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server > -Xmx24576m > -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' > -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74 > -XX:OnOutOfMemoryError='kill %p' > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 > --hostname mer05x.svl.ibm.com --cores 2 --app-id >
application_1478924651089_0018 --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar > --user-class-path > file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar > > > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout > 2> > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr > at org.apache.hadoop.util.Shell.runCommand(Shell.java:545) > at org.apache.hadoop.util.Shell
[jira] [Created] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)
JESSE CHEN created SPARK-18458: -- Summary: core dumped running Spark SQL on large data volume (100TB) Key: SPARK-18458 URL: https://issues.apache.org/jira/browse/SPARK-18458 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: JESSE CHEN Fix For: 2.0.0 Testing Spark SQL using TPC queries. Query 49 returns wrong results compared to the official result set. This is at 1GB SF (validation run). SparkSQL has the right answers but in the wrong order (and there is an 'order by' in the query). Actual results:
{noformat}
[store,9797,0.8000,2,2]
[store,12641,0.81609195402298850575,3,3]
[store,6661,0.92207792207792207792,7,7]
[store,13013,0.94202898550724637681,8,8]
[store,9029,1.,10,10]
[web,15597,0.66197183098591549296,3,3]
[store,14925,0.96470588235294117647,9,9]
[store,4063,1.,10,10]
[catalog,8929,0.7625,7,7]
[store,11589,0.82653061224489795918,6,6]
[store,1171,0.82417582417582417582,5,5]
[store,9471,0.7750,1,1]
[catalog,12577,0.65591397849462365591,3,3]
[web,97,0.90361445783132530120,9,8]
[web,85,0.85714285714285714286,8,7]
[catalog,361,0.74647887323943661972,5,5]
[web,2915,0.69863013698630136986,4,4]
[web,117,0.9250,10,9]
[catalog,9295,0.77894736842105263158,9,9]
[web,3305,0.7375,6,16]
[catalog,16215,0.79069767441860465116,10,10]
[web,7539,0.5900,1,1]
[catalog,17543,0.57142857142857142857,1,1]
[catalog,3411,0.71641791044776119403,4,4]
[web,11933,0.71717171717171717172,5,5]
[catalog,14513,0.63541667,2,2]
[store,15839,0.81632653061224489796,4,4]
[web,3337,0.62650602409638554217,2,2]
[web,5299,0.92708333,11,10]
[catalog,8189,0.74698795180722891566,6,6]
[catalog,14869,0.77173913043478260870,8,8]
[web,483,0.8000,7,6]
{noformat}
Expected results:
{noformat}
+---------+-------+--------------------+-------------+---------------+
| CHANNEL | ITEM  | RETURN_RATIO       | RETURN_RANK | CURRENCY_RANK |
+---------+-------+--------------------+-------------+---------------+
| catalog | 17543 | .5714285714285714  |           1 |             1 |
| catalog | 14513 | .63541666          |           2 |             2 |
| catalog | 12577 | .6559139784946236  |           3 |             3 |
| catalog | 3411  | .7164179104477611  |           4 |             4 |
| catalog | 361   | .7464788732394366  |           5 |             5 |
| catalog | 8189  | .7469879518072289  |           6 |             6 |
| catalog | 8929  | .7625              |           7 |             7 |
| catalog | 14869 | .7717391304347826  |           8 |             8 |
| catalog | 9295  | .7789473684210526  |           9 |             9 |
| catalog | 16215 | .7906976744186046  |          10 |            10 |
| store   | 9471  | .7750              |           1 |             1 |
| store   | 9797  | .8000              |           2 |             2 |
| store   | 12641 | .8160919540229885  |           3 |             3 |
| store   | 15839 | .8163265306122448  |           4 |             4 |
| store   | 1171  | .8241758241758241  |           5 |             5 |
| store   | 11589 | .8265306122448979  |           6 |             6 |
| store   | 6661  | .9220779220779220  |           7 |             7 |
| store   | 13013 | .9420289855072463  |           8 |             8 |
| store   | 14925 | .9647058823529411  |           9 |             9 |
| store   | 4063  | 1.                 |          10 |            10 |
| store   | 9029  | 1.                 |          10 |            10 |
| web     | 7539  | .5900              |           1 |             1 |
| web     | 3337  | .6265060240963855  |           2 |             2 |
| web     | 15597 | .6619718309859154  |           3 |             3 |
| web     | 2915  | .6986301369863013  |           4 |             4 |
| web     | 11933 | .7171717171717171  |           5 |             5 |
| web     | 3305  | .7375              |           6 |            16 |
| web     | 483   | .8000              |           7 |             6 |
| web     | 85    | .8571428571428571  |           8 |             7 |
| web     | 97    | .9036144578313253  |           9 |             8 |
| web     | 117   | .9250              |          10 |             9 |
| web     | 5299  | .92708333          |          11 |            10 |
+---------+-------+--------------------+-------------+---------------+
{noformat}
Query used:
{noformat}
-- start query 49 in stream 0 using template query49.tpl and seed QUALIFICATION
select 'web' as channel
      ,web.item
      ,web.return_ratio
      ,web.return_rank
      ,web.currency_rank
from (
      select item
            ,return_ratio
            ,currency_ratio
            ,rank() over (order by return_ratio) as return_rank
            ,rank() over (order by currency_ratio) as currency_rank
[jira] [Commented] (SPARK-13288) [1.6.0] Memory leak in Spark streaming
[ https://issues.apache.org/jira/browse/SPARK-13288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347338#comment-15347338 ] JESSE CHEN commented on SPARK-13288: [~AlexSparkJiang] The code is:
{noformat}
val multiTweetStreams = (1 to numStreams).map { i =>
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
}
// unified stream
val tweetStream = ssc.union(multiTweetStreams)
{noformat}
> [1.6.0] Memory leak in Spark streaming > -- > > Key: SPARK-13288 > URL: https://issues.apache.org/jira/browse/SPARK-13288 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.0 > Environment: Bare metal cluster > RHEL 6.6 >Reporter: JESSE CHEN > Labels: streaming > > Streaming in 1.6 seems to have a memory leak. > Running the same streaming app in Spark 1.5.1 and 1.6, all things equal, 1.6 > showed gradually increasing processing time. > The app is simple: one Kafka receiver of a tweet stream and 20 executors > processing the tweets in 5-second batches. > Spark 1.5.1 handles this smoothly and did not show increasing processing time > in the 40-minute test; but 1.6 showed increasing time about 8 minutes into > the test. Please see the chart here: > https://ibm.box.com/s/7q4ulik70iwtvyfhoj1dcl4nc469b116 > I captured heap dumps in the two versions and did a comparison. I noticed the > byte arrays (class [B) using roughly 50X more space in 1.6.0. > Here are some top classes in the heap histogram and their references.
> Heap Histogram
>
> All Classes (excluding platform)
> 1.6.0 Streaming                                    1.5.1 Streaming
> Class                Instance Count  Total Size    Class                Instance Count  Total Size
> class [B                       8453  3,227,649,599 class [B                       5095  62,938,466
> class [C                      44682  4,255,502     class [C                     130482  12,844,182
> class java.lang.reflect.Method 9059  1,177,670     class java.lang.String       130171  1,562,052
>
> References by Type                                 References by Type
> class [B [0x640039e38]                             class [B [0x6c020bb08]
>
> Referrers by Type                                  Referrers by Type
> Class                               Count          Class                               Count
> java.nio.HeapByteBuffer              3239          sun.security.util.DerInputBuffer     1233
> sun.security.util.DerInputBuffer     1233          sun.security.util.ObjectIdentifier    620
> sun.security.util.ObjectIdentifier    620          [[B                                   397
> [Ljava.lang.Object;                   408          java.lang.reflect.Method              326
>
> The total size of class [B is 3GB in 1.6.0 and only 60MB in 1.5.1. > The java.nio.HeapByteBuffer referrer class did not show up near the top in > 1.5.1. > I have also placed jstack output for 1.5.1 and 1.6.0 online; you can get them > here: > https://ibm.box.com/sparkstreaming-jstack160 > https://ibm.box.com/sparkstreaming-jstack151 > Jesse
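For context, a self-contained sketch of the pattern in that snippet, assuming the Spark 1.6-era Kafka 0.8 direct API; numStreams, the broker list, and the topic name are placeholders, and only the map/union structure comes from the comment above:

{noformat}
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object TweetUnion {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("tweet-union")
    // 5-second batches, matching the report.
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder
    val topicsSet   = Set("tweets")                                 // placeholder
    val numStreams  = 4                                             // placeholder

    // Several direct streams, unified into one DStream as in the comment.
    val multiTweetStreams = (1 to numStreams).map { _ =>
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topicsSet)
    }
    val tweetStream = ssc.union(multiTweetStreams)

    tweetStream.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}
{noformat}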
[jira] [Commented] (SPARK-15372) TPC-DS Query 84 returns wrong results against TPC official
[ https://issues.apache.org/jira/browse/SPARK-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289356#comment-15289356 ] JESSE CHEN commented on SPARK-15372: [~freiss] I agree with you. In addition, in order to match the official TPC result, we are allowed to use a minor query modification for the treatment of null strings (e.g., coalesce), so the following query now runs and returns results matching TPC's: {noformat} select c_customer_id as customer_id ,concat(c_last_name , ', ' , coalesce(c_first_name,'')) as customername from customer ,customer_address ,customer_demographics ,household_demographics ,income_band ,store_returns where ca_city = 'Edgewood' and c_current_addr_sk = ca_address_sk and ib_lower_bound >= 38128 and ib_upper_bound <= 38128 + 5 and ib_income_band_sk = hd_income_band_sk and cd_demo_sk = c_current_cdemo_sk and hd_demo_sk = c_current_hdemo_sk and sr_cdemo_sk = cd_demo_sk order by c_customer_id limit 100; {noformat} > TPC-DS Query 84 returns wrong results against TPC official > - > > Key: SPARK-15372 > URL: https://issues.apache.org/jira/browse/SPARK-15372 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: JESSE CHEN >Assignee: Herman van Hovell >Priority: Critical > Labels: SPARK-15071 > > The official TPC-DS query 84 returns wrong results when compared to its > official answer set. > The query itself is: > {noformat} > select c_customer_id as customer_id >,concat(c_last_name , ', ' , c_first_name) as customername > from customer > ,customer_address > ,customer_demographics > ,household_demographics > ,income_band > ,store_returns > where ca_city = 'Edgewood' >and c_current_addr_sk = ca_address_sk >and ib_lower_bound >= 38128 >and ib_upper_bound <= 38128 + 5 >and ib_income_band_sk = hd_income_band_sk >and cd_demo_sk = c_current_cdemo_sk >and hd_demo_sk = c_current_hdemo_sk >and sr_cdemo_sk = cd_demo_sk > order by c_customer_id > limit 100; > {noformat} > Spark 2.0 build 0517 returned the following result: > {noformat} > AIPG Carter, Rodney > AKMBBAAA Mcarthur, Emma > CBNHBAAA Wells, Ron > DBME Vera, Tina > DBME Vera, Tina > DHKGBAAA Scott, Pamela > EIIBBAAA Atkins, Susan > FKAH Batiste, Ernest > GHMA Mitchell, Gregory > IAODBAAA Murray, Karen > IEOK Solomon, Clyde > IIBO Owens, David > IPDC Wallace, Eric > IPIM Hayward, Benjamin > JCIK Ramos, Donald > KFJE Roberts, Yvonne > KPGBBAAA NULL < ??? questionable row > LCLABAAA Whitaker, Lettie > MGME Sharp, Michael > MIGBBAAA Montgomery, Jesenia > MPDK Lopez, Isabel > NEOM Powell, Linda > NKPC Shaffer, Sergio > NOCK Vargas, James > OGJEBAAA Owens, Denice > {noformat} > Official answer set (which is correct!) > {noformat} > AIPG Carter , Rodney > AKMBBAAA Mcarthur, Emma > CBNHBAAA Wells , Ron > DBME Vera, Tina > DBME Vera, Tina > DHKGBAAA Scott , Pamela > EIIBBAAA Atkins , Susan > FKAH Batiste , Ernest > GHMA Mitchell, Gregory > IAODBAAA Murray , Karen > IEOK Solomon , Clyde > IIBO Owens , David > IPDC Wallace , Eric > IPIM Hayward , Benjamin > JCIK Ramos , Donald > KFJE Roberts , Yvonne > KPGBBAAA Moore , > LCLABAAA Whitaker, Lettie > MGME Sharp , Michael > MIGBBAAA Montgomery , Jesenia > MPDK Lopez , Isabel > NEOM Powell , Linda > NKPC Shaffer , Sergio > NOCK Vargas , James > OGJEBAAA Owens , Denice > {noformat} > The issue is with the "concat" function in Spark SQ
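As a minimal, hedged illustration of the NULL-propagation behaviour discussed above (assuming a Spark 2.x SparkSession named spark; the literals stand in for the c_last_name/c_first_name columns):

{noformat}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("concat-null-demo").getOrCreate()

// concat returns NULL as soon as any argument is NULL (Hive-compatible).
spark.sql("SELECT concat('Moore', ', ', CAST(NULL AS STRING)) AS customername").show()
// -> customername is NULL

// The minor query modification from the comment: coalesce the nullable
// first name to the empty string so the last name survives.
spark.sql("SELECT concat('Moore', ', ', coalesce(CAST(NULL AS STRING), '')) AS customername").show()
// -> customername is "Moore, "
{noformat}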
[jira] [Commented] (SPARK-15372) TPC-DS Query 84 returns wrong results against TPC official
[ https://issues.apache.org/jira/browse/SPARK-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289358#comment-15289358 ] JESSE CHEN commented on SPARK-15372: I will close this next; this is not a problem with Spark SQL. > TPC-DS Query 84 returns wrong results against TPC official > - > > Key: SPARK-15372 > URL: https://issues.apache.org/jira/browse/SPARK-15372 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: JESSE CHEN >Assignee: Herman van Hovell >Priority: Critical > Labels: SPARK-15071 > > The official TPC-DS query 84 returns wrong results when compared to its > official answer set. > The query itself is: > {noformat} > select c_customer_id as customer_id >,concat(c_last_name , ', ' , c_first_name) as customername > from customer > ,customer_address > ,customer_demographics > ,household_demographics > ,income_band > ,store_returns > where ca_city = 'Edgewood' >and c_current_addr_sk = ca_address_sk >and ib_lower_bound >= 38128 >and ib_upper_bound <= 38128 + 5 >and ib_income_band_sk = hd_income_band_sk >and cd_demo_sk = c_current_cdemo_sk >and hd_demo_sk = c_current_hdemo_sk >and sr_cdemo_sk = cd_demo_sk > order by c_customer_id > limit 100; > {noformat} > Spark 2.0 build 0517 returned the following result: > {noformat} > AIPG Carter, Rodney > AKMBBAAA Mcarthur, Emma > CBNHBAAA Wells, Ron > DBME Vera, Tina > DBME Vera, Tina > DHKGBAAA Scott, Pamela > EIIBBAAA Atkins, Susan > FKAH Batiste, Ernest > GHMA Mitchell, Gregory > IAODBAAA Murray, Karen > IEOK Solomon, Clyde > IIBO Owens, David > IPDC Wallace, Eric > IPIM Hayward, Benjamin > JCIK Ramos, Donald > KFJE Roberts, Yvonne > KPGBBAAA NULL < ??? questionable row > LCLABAAA Whitaker, Lettie > MGME Sharp, Michael > MIGBBAAA Montgomery, Jesenia > MPDK Lopez, Isabel > NEOM Powell, Linda > NKPC Shaffer, Sergio > NOCK Vargas, James > OGJEBAAA Owens, Denice > {noformat} > Official answer set (which is correct!) > {noformat} > AIPG Carter , Rodney > AKMBBAAA Mcarthur, Emma > CBNHBAAA Wells , Ron > DBME Vera, Tina > DBME Vera, Tina > DHKGBAAA Scott , Pamela > EIIBBAAA Atkins , Susan > FKAH Batiste , Ernest > GHMA Mitchell, Gregory > IAODBAAA Murray , Karen > IEOK Solomon , Clyde > IIBO Owens , David > IPDC Wallace , Eric > IPIM Hayward , Benjamin > JCIK Ramos , Donald > KFJE Roberts , Yvonne > KPGBBAAA Moore , > LCLABAAA Whitaker, Lettie > MGME Sharp , Michael > MIGBBAAA Montgomery , Jesenia > MPDK Lopez , Isabel > NEOM Powell , Linda > NKPC Shaffer , Sergio > NOCK Vargas , James > OGJEBAAA Owens , Denice > {noformat} > The issue is with the "concat" function in Spark SQL (also behaves the same > in Hive). When 'concat' meets any NULL string, it returns NULL as the answer. > But is this right? When I concatenate a person's last name and first name, if > the first name is missing (empty string or NULL), I should see the last name > still, not NULL, i.e., "Smith" + "" = "Smith", not NULL. > Simplest repeatable test: > {noformat} > hive> select c_first_name, c_last_name from customer where c_customer_id = > 'KPGBBAAA'; > OK > NULL Moore > Time taken: 0.07 seconds, Fetched: 1 row(s) > hive> select concat(c_last_name, ', ', c_first_name) from customer where > c_customer_id = 'KPGBBAAA'; > OK > NULL > Time taken: 0.1 seconds, Fetched: 1 row(s) > hive> select concat(c_last_name, c_first_name) from customer where > c_customer_id = 'KPGBBAAA'; > OK
[jira] [Updated] (SPARK-15372) TPC-DS Query 84 returns wrong results against TPC official
[ https://issues.apache.org/jira/browse/SPARK-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-15372: --- Description: The official TPC-DS query 84 returns wrong results when compared to its official answer set. The query itself is:
{noformat}
select c_customer_id as customer_id
      ,concat(c_last_name , ', ' , c_first_name) as customername
from customer
    ,customer_address
    ,customer_demographics
    ,household_demographics
    ,income_band
    ,store_returns
where ca_city = 'Edgewood'
  and c_current_addr_sk = ca_address_sk
  and ib_lower_bound >= 38128
  and ib_upper_bound <= 38128 + 5
  and ib_income_band_sk = hd_income_band_sk
  and cd_demo_sk = c_current_cdemo_sk
  and hd_demo_sk = c_current_hdemo_sk
  and sr_cdemo_sk = cd_demo_sk
order by c_customer_id
limit 100;
{noformat}
Spark 2.0 build 0517 returned the following result:
{noformat}
AIPG Carter, Rodney
AKMBBAAA Mcarthur, Emma
CBNHBAAA Wells, Ron
DBME Vera, Tina
DBME Vera, Tina
DHKGBAAA Scott, Pamela
EIIBBAAA Atkins, Susan
FKAH Batiste, Ernest
GHMA Mitchell, Gregory
IAODBAAA Murray, Karen
IEOK Solomon, Clyde
IIBO Owens, David
IPDC Wallace, Eric
IPIM Hayward, Benjamin
JCIK Ramos, Donald
KFJE Roberts, Yvonne
KPGBBAAA NULL          < ??? questionable row
LCLABAAA Whitaker, Lettie
MGME Sharp, Michael
MIGBBAAA Montgomery, Jesenia
MPDK Lopez, Isabel
NEOM Powell, Linda
NKPC Shaffer, Sergio
NOCK Vargas, James
OGJEBAAA Owens, Denice
{noformat}
Official answer set (which is correct!)
{noformat}
AIPG Carter, Rodney
AKMBBAAA Mcarthur , Emma
CBNHBAAA Wells , Ron
DBME Vera , Tina
DBME Vera , Tina
DHKGBAAA Scott , Pamela
EIIBBAAA Atkins, Susan
FKAH Batiste , Ernest
GHMA Mitchell , Gregory
IAODBAAA Murray, Karen
IEOK Solomon , Clyde
IIBO Owens , David
IPDC Wallace , Eric
IPIM Hayward , Benjamin
JCIK Ramos , Donald
KFJE Roberts , Yvonne
KPGBBAAA Moore ,
LCLABAAA Whitaker , Lettie
MGME Sharp , Michael
MIGBBAAA Montgomery, Jesenia
MPDK Lopez , Isabel
NEOM Powell, Linda
NKPC Shaffer , Sergio
NOCK Vargas, James
OGJEBAAA Owens , Denice
{noformat}
The issue is with the "concat" function in Spark SQL (it behaves the same in Hive). When 'concat' meets any NULL string, it returns NULL as the answer. But is this right? When I concatenate a person's last name and first name, if the first name is missing (empty string or NULL), I should still see the last name, not NULL, i.e., "Smith" + "" = "Smith", not NULL. Simplest repeatable test:
{noformat}
hive> select c_first_name, c_last_name from customer where c_customer_id = 'KPGBBAAA';
OK
NULL Moore
Time taken: 0.07 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_first_name) from customer where c_customer_id = 'KPGBBAAA';
OK
NULL
Time taken: 0.1 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, c_first_name) from customer where c_customer_id = 'KPGBBAAA';
OK
NULL
Time taken: 0.055 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_first_name) from customer where c_customer_id = 'KPGBBAAA';
OK
NULL
Time taken: 0.061 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_customer_id) from customer where c_customer_id = 'KPGBBAAA';
OK
Moore, KPGBBAAA
{noformat}
Same in 'spark-sql' shell:
...
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 45
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 46
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 47
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 48
select concat(c_last_name, c_first_name) from customer where
[jira] [Updated] (SPARK-15372) TPC-DS Query 84 returns wrong results against TPC official
[ https://issues.apache.org/jira/browse/SPARK-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-15372: --- Labels: SPARK-15071 (was: ) > TPC-DS Qury 84 returns wrong results against TPC official > - > > Key: SPARK-15372 > URL: https://issues.apache.org/jira/browse/SPARK-15372 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN >Assignee: Herman van Hovell >Priority: Critical > Labels: SPARK-15071 > Fix For: 2.0.0 > > > The official TPC-DS query 41 fails with the following error: > {noformat} > Error in query: The correlated scalar subquery can only contain equality > predicates: (((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) > && ((i_color#41 = powder) || (i_color#41 = khaki))) && (((i_units#42 = Ounce) > || (i_units#42 = Oz)) && ((i_size#39 = medium) || (i_size#39 = extra > large || (((i_category#36 = Women) && ((i_color#41 = brown) || > (i_color#41 = honeydew))) && (((i_units#42 = Bunch) || (i_units#42 = Ton)) && > ((i_size#39 = N/A) || (i_size#39 = small) || i_category#36 = Men) && > ((i_color#41 = floral) || (i_color#41 = deep))) && (((i_units#42 = N/A) || > (i_units#42 = Dozen)) && ((i_size#39 = petite) || (i_size#39 = large || > (((i_category#36 = Men) && ((i_color#41 = light) || (i_color#41 = > cornflower))) && (((i_units#42 = Box) || (i_units#42 = Pound)) && ((i_size#39 > = medium) || (i_size#39 = extra large))) || ((i_manufact#38 = > i_manufact#16) && (i_category#36 = Women) && ((i_color#41 = midnight) || > (i_color#41 = snow))) && (((i_units#42 = Pallet) || (i_units#42 = Gross)) && > ((i_size#39 = medium) || (i_size#39 = extra large || (((i_category#36 = > Women) && ((i_color#41 = cyan) || (i_color#41 = papaya))) && (((i_units#42 = > Cup) || (i_units#42 = Dram)) && ((i_size#39 = N/A) || (i_size#39 = small) > || i_category#36 = Men) && ((i_color#41 = orange) || (i_color#41 = > frosted))) && (((i_units#42 = Each) || (i_units#42 = Tbl)) && ((i_size#39 = > petite) || (i_size#39 = large || (((i_category#36 = Men) && ((i_color#41 > = forest) || (i_color#41 = ghost))) && (((i_units#42 = Lb) || (i_units#42 = > Bundle)) && ((i_size#39 = medium) || (i_size#39 = extra large; > {noformat} > The output plans showed the following errors > {noformat} > == Parsed Logical Plan == > 'GlobalLimit 100 > +- 'LocalLimit 100 >+- 'Sort ['i_product_name ASC], true > +- 'Distinct > +- 'Project ['i_product_name] > +- 'Filter ((('i_manufact_id >= 738) && ('i_manufact_id <= (738 + > 40))) && (scalar-subquery#1 [] > 0)) >: +- 'SubqueryAlias scalar-subquery#1 [] >: +- 'Project ['count(1) AS item_cnt#0] >:+- 'Filter ((('i_manufact = 'i1.i_manufact) && > ('i_category = Women) && (('i_color = powder) || ('i_color = khaki))) && > ((('i_units = Ounce) || ('i_units = Oz)) && (('i_size = medium) || ('i_size = > extra large || ((('i_category = Women) && (('i_color = brown) || > ('i_color = honeydew))) && ((('i_units = Bunch) || ('i_units = Ton)) && > (('i_size = N/A) || ('i_size = small) || 'i_category = Men) && > (('i_color = floral) || ('i_color = deep))) && ((('i_units = N/A) || > ('i_units = Dozen)) && (('i_size = petite) || ('i_size = large || > ((('i_category = Men) && (('i_color = light) || ('i_color = cornflower))) && > ((('i_units = Box) || ('i_units = Pound)) && (('i_size = medium) || ('i_size > = extra large))) || (('i_manufact = 'i1.i_manufact) && ('i_category = > Women) && (('i_color = midnight) || ('i_color = snow))) && ((('i_units = > Pallet) || ('i_units = Gross)) && (('i_size = medium) || 
('i_size = extra > large || ((('i_category = Women) && (('i_color = cyan) || ('i_color = > papaya))) && ((('i_units = Cup) || ('i_units = Dram)) && (('i_size = N/A) || > ('i_size = small) || 'i_category = Men) && (('i_color = orange) || > ('i_color = frosted))) && ((('i_units = Each) || ('i_units = Tbl)) && > (('i_size = petite) || ('i_size = large || ((('i_category = Men) && > (('i_color = forest) || ('i_color = ghost))) && ((('i_units = Lb) || > ('i_units = Bundle)) && (('i_size = medium) || ('i_size = extra large >: +- 'UnresolvedRelation `item`, None >+- 'UnresolvedRelation `item`, Some(i1) > == Analyzed Logical Plan == > i_product_name: string > GlobalLimit 100 > +- LocalLimit 100 >+- Sort [i_product_name#24 ASC], true > +- Distinct > +- Project [i_product_name#24] > +- Filter (((i_manufact_id#16L >= cast(738 as bigint)) && > (i_manufact_id#16L <= cast((738 + 40) as bigint))) && (scalar-subque
[jira] [Created] (SPARK-15372) TPC-DS Query 84 returns wrong results against TPC official
JESSE CHEN created SPARK-15372: -- Summary: TPC-DS Qury 84 returns wrong results against TPC official Key: SPARK-15372 URL: https://issues.apache.org/jira/browse/SPARK-15372 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: JESSE CHEN Assignee: Herman van Hovell Priority: Critical Fix For: 2.0.0 The official TPC-DS query 41 fails with the following error: {noformat} Error in query: The correlated scalar subquery can only contain equality predicates: (((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) && ((i_color#41 = powder) || (i_color#41 = khaki))) && (((i_units#42 = Ounce) || (i_units#42 = Oz)) && ((i_size#39 = medium) || (i_size#39 = extra large || (((i_category#36 = Women) && ((i_color#41 = brown) || (i_color#41 = honeydew))) && (((i_units#42 = Bunch) || (i_units#42 = Ton)) && ((i_size#39 = N/A) || (i_size#39 = small) || i_category#36 = Men) && ((i_color#41 = floral) || (i_color#41 = deep))) && (((i_units#42 = N/A) || (i_units#42 = Dozen)) && ((i_size#39 = petite) || (i_size#39 = large || (((i_category#36 = Men) && ((i_color#41 = light) || (i_color#41 = cornflower))) && (((i_units#42 = Box) || (i_units#42 = Pound)) && ((i_size#39 = medium) || (i_size#39 = extra large))) || ((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) && ((i_color#41 = midnight) || (i_color#41 = snow))) && (((i_units#42 = Pallet) || (i_units#42 = Gross)) && ((i_size#39 = medium) || (i_size#39 = extra large || (((i_category#36 = Women) && ((i_color#41 = cyan) || (i_color#41 = papaya))) && (((i_units#42 = Cup) || (i_units#42 = Dram)) && ((i_size#39 = N/A) || (i_size#39 = small) || i_category#36 = Men) && ((i_color#41 = orange) || (i_color#41 = frosted))) && (((i_units#42 = Each) || (i_units#42 = Tbl)) && ((i_size#39 = petite) || (i_size#39 = large || (((i_category#36 = Men) && ((i_color#41 = forest) || (i_color#41 = ghost))) && (((i_units#42 = Lb) || (i_units#42 = Bundle)) && ((i_size#39 = medium) || (i_size#39 = extra large; {noformat} The output plans showed the following errors {noformat} == Parsed Logical Plan == 'GlobalLimit 100 +- 'LocalLimit 100 +- 'Sort ['i_product_name ASC], true +- 'Distinct +- 'Project ['i_product_name] +- 'Filter ((('i_manufact_id >= 738) && ('i_manufact_id <= (738 + 40))) && (scalar-subquery#1 [] > 0)) : +- 'SubqueryAlias scalar-subquery#1 [] : +- 'Project ['count(1) AS item_cnt#0] :+- 'Filter ((('i_manufact = 'i1.i_manufact) && ('i_category = Women) && (('i_color = powder) || ('i_color = khaki))) && ((('i_units = Ounce) || ('i_units = Oz)) && (('i_size = medium) || ('i_size = extra large || ((('i_category = Women) && (('i_color = brown) || ('i_color = honeydew))) && ((('i_units = Bunch) || ('i_units = Ton)) && (('i_size = N/A) || ('i_size = small) || 'i_category = Men) && (('i_color = floral) || ('i_color = deep))) && ((('i_units = N/A) || ('i_units = Dozen)) && (('i_size = petite) || ('i_size = large || ((('i_category = Men) && (('i_color = light) || ('i_color = cornflower))) && ((('i_units = Box) || ('i_units = Pound)) && (('i_size = medium) || ('i_size = extra large))) || (('i_manufact = 'i1.i_manufact) && ('i_category = Women) && (('i_color = midnight) || ('i_color = snow))) && ((('i_units = Pallet) || ('i_units = Gross)) && (('i_size = medium) || ('i_size = extra large || ((('i_category = Women) && (('i_color = cyan) || ('i_color = papaya))) && ((('i_units = Cup) || ('i_units = Dram)) && (('i_size = N/A) || ('i_size = small) || 'i_category = Men) && (('i_color = orange) || ('i_color = frosted))) && ((('i_units = Each) || ('i_units 
= Tbl)) && (('i_size = petite) || ('i_size = large || ((('i_category = Men) && (('i_color = forest) || ('i_color = ghost))) && ((('i_units = Lb) || ('i_units = Bundle)) && (('i_size = medium) || ('i_size = extra large : +- 'UnresolvedRelation `item`, None +- 'UnresolvedRelation `item`, Some(i1) == Analyzed Logical Plan == i_product_name: string GlobalLimit 100 +- LocalLimit 100 +- Sort [i_product_name#24 ASC], true +- Distinct +- Project [i_product_name#24] +- Filter (((i_manufact_id#16L >= cast(738 as bigint)) && (i_manufact_id#16L <= cast((738 + 40) as bigint))) && (scalar-subquery#1 [(((i_manufact#39 = i_manufact#17) && (i_category#37 = Women) && ((i_color#42 = powder) || (i_color#42 = khaki))) && (((i_units#43 = Ounce) || (i_units#43 = Oz)) && ((i_size#40 = medium) || (i_size#40 = extra large || (((i_category#37 = Women) && ((i_color#42 = brown) || (i_color#42 = honeydew))) && (((i_units#43 = Bunch) || (i_units#43 = Ton)) && ((i_size#40 =
[jira] [Closed] (SPARK-15122) TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates
[ https://issues.apache.org/jira/browse/SPARK-15122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN closed SPARK-15122. -- Verified successfully in 0508 build. Thanks! > TPC-DS Qury 41 fails with The correlated scalar subquery can only contain > equality predicates > - > > Key: SPARK-15122 > URL: https://issues.apache.org/jira/browse/SPARK-15122 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN >Assignee: Herman van Hovell >Priority: Critical > Fix For: 2.0.0 > > > The official TPC-DS query 41 fails with the following error: > {noformat} > Error in query: The correlated scalar subquery can only contain equality > predicates: (((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) > && ((i_color#41 = powder) || (i_color#41 = khaki))) && (((i_units#42 = Ounce) > || (i_units#42 = Oz)) && ((i_size#39 = medium) || (i_size#39 = extra > large || (((i_category#36 = Women) && ((i_color#41 = brown) || > (i_color#41 = honeydew))) && (((i_units#42 = Bunch) || (i_units#42 = Ton)) && > ((i_size#39 = N/A) || (i_size#39 = small) || i_category#36 = Men) && > ((i_color#41 = floral) || (i_color#41 = deep))) && (((i_units#42 = N/A) || > (i_units#42 = Dozen)) && ((i_size#39 = petite) || (i_size#39 = large || > (((i_category#36 = Men) && ((i_color#41 = light) || (i_color#41 = > cornflower))) && (((i_units#42 = Box) || (i_units#42 = Pound)) && ((i_size#39 > = medium) || (i_size#39 = extra large))) || ((i_manufact#38 = > i_manufact#16) && (i_category#36 = Women) && ((i_color#41 = midnight) || > (i_color#41 = snow))) && (((i_units#42 = Pallet) || (i_units#42 = Gross)) && > ((i_size#39 = medium) || (i_size#39 = extra large || (((i_category#36 = > Women) && ((i_color#41 = cyan) || (i_color#41 = papaya))) && (((i_units#42 = > Cup) || (i_units#42 = Dram)) && ((i_size#39 = N/A) || (i_size#39 = small) > || i_category#36 = Men) && ((i_color#41 = orange) || (i_color#41 = > frosted))) && (((i_units#42 = Each) || (i_units#42 = Tbl)) && ((i_size#39 = > petite) || (i_size#39 = large || (((i_category#36 = Men) && ((i_color#41 > = forest) || (i_color#41 = ghost))) && (((i_units#42 = Lb) || (i_units#42 = > Bundle)) && ((i_size#39 = medium) || (i_size#39 = extra large; > {noformat} > The output plans showed the following errors > {noformat} > == Parsed Logical Plan == > 'GlobalLimit 100 > +- 'LocalLimit 100 >+- 'Sort ['i_product_name ASC], true > +- 'Distinct > +- 'Project ['i_product_name] > +- 'Filter ((('i_manufact_id >= 738) && ('i_manufact_id <= (738 + > 40))) && (scalar-subquery#1 [] > 0)) >: +- 'SubqueryAlias scalar-subquery#1 [] >: +- 'Project ['count(1) AS item_cnt#0] >:+- 'Filter ((('i_manufact = 'i1.i_manufact) && > ('i_category = Women) && (('i_color = powder) || ('i_color = khaki))) && > ((('i_units = Ounce) || ('i_units = Oz)) && (('i_size = medium) || ('i_size = > extra large || ((('i_category = Women) && (('i_color = brown) || > ('i_color = honeydew))) && ((('i_units = Bunch) || ('i_units = Ton)) && > (('i_size = N/A) || ('i_size = small) || 'i_category = Men) && > (('i_color = floral) || ('i_color = deep))) && ((('i_units = N/A) || > ('i_units = Dozen)) && (('i_size = petite) || ('i_size = large || > ((('i_category = Men) && (('i_color = light) || ('i_color = cornflower))) && > ((('i_units = Box) || ('i_units = Pound)) && (('i_size = medium) || ('i_size > = extra large))) || (('i_manufact = 'i1.i_manufact) && ('i_category = > Women) && (('i_color = midnight) || ('i_color = snow))) && ((('i_units = > Pallet) || ('i_units = 
Gross)) && (('i_size = medium) || ('i_size = extra > large || ((('i_category = Women) && (('i_color = cyan) || ('i_color = > papaya))) && ((('i_units = Cup) || ('i_units = Dram)) && (('i_size = N/A) || > ('i_size = small) || 'i_category = Men) && (('i_color = orange) || > ('i_color = frosted))) && ((('i_units = Each) || ('i_units = Tbl)) && > (('i_size = petite) || ('i_size = large || ((('i_category = Men) && > (('i_color = forest) || ('i_color = ghost))) && ((('i_units = Lb) || > ('i_units = Bundle)) && (('i_size = medium) || ('i_size = extra large >: +- 'UnresolvedRelation `item`, None >+- 'UnresolvedRelation `item`, Some(i1) > == Analyzed Logical Plan == > i_product_name: string > GlobalLimit 100 > +- LocalLimit 100 >+- Sort [i_product_name#24 ASC], true > +- Distinct > +- Project [i_product_name#24] > +- Filter (((i_manufact_id#16L >= cast(738 as bigint)) && > (i_manufact_id#16
[jira] [Commented] (SPARK-15122) TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates
[ https://issues.apache.org/jira/browse/SPARK-15122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15276600#comment-15276600 ] JESSE CHEN commented on SPARK-15122: works great! now all 99 queries pass! nicely done! > TPC-DS Qury 41 fails with The correlated scalar subquery can only contain > equality predicates > - > > Key: SPARK-15122 > URL: https://issues.apache.org/jira/browse/SPARK-15122 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN >Assignee: Herman van Hovell >Priority: Critical > Fix For: 2.0.0 > > > The official TPC-DS query 41 fails with the following error: > {noformat} > Error in query: The correlated scalar subquery can only contain equality > predicates: (((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) > && ((i_color#41 = powder) || (i_color#41 = khaki))) && (((i_units#42 = Ounce) > || (i_units#42 = Oz)) && ((i_size#39 = medium) || (i_size#39 = extra > large || (((i_category#36 = Women) && ((i_color#41 = brown) || > (i_color#41 = honeydew))) && (((i_units#42 = Bunch) || (i_units#42 = Ton)) && > ((i_size#39 = N/A) || (i_size#39 = small) || i_category#36 = Men) && > ((i_color#41 = floral) || (i_color#41 = deep))) && (((i_units#42 = N/A) || > (i_units#42 = Dozen)) && ((i_size#39 = petite) || (i_size#39 = large || > (((i_category#36 = Men) && ((i_color#41 = light) || (i_color#41 = > cornflower))) && (((i_units#42 = Box) || (i_units#42 = Pound)) && ((i_size#39 > = medium) || (i_size#39 = extra large))) || ((i_manufact#38 = > i_manufact#16) && (i_category#36 = Women) && ((i_color#41 = midnight) || > (i_color#41 = snow))) && (((i_units#42 = Pallet) || (i_units#42 = Gross)) && > ((i_size#39 = medium) || (i_size#39 = extra large || (((i_category#36 = > Women) && ((i_color#41 = cyan) || (i_color#41 = papaya))) && (((i_units#42 = > Cup) || (i_units#42 = Dram)) && ((i_size#39 = N/A) || (i_size#39 = small) > || i_category#36 = Men) && ((i_color#41 = orange) || (i_color#41 = > frosted))) && (((i_units#42 = Each) || (i_units#42 = Tbl)) && ((i_size#39 = > petite) || (i_size#39 = large || (((i_category#36 = Men) && ((i_color#41 > = forest) || (i_color#41 = ghost))) && (((i_units#42 = Lb) || (i_units#42 = > Bundle)) && ((i_size#39 = medium) || (i_size#39 = extra large; > {noformat} > The output plans showed the following errors > {noformat} > == Parsed Logical Plan == > 'GlobalLimit 100 > +- 'LocalLimit 100 >+- 'Sort ['i_product_name ASC], true > +- 'Distinct > +- 'Project ['i_product_name] > +- 'Filter ((('i_manufact_id >= 738) && ('i_manufact_id <= (738 + > 40))) && (scalar-subquery#1 [] > 0)) >: +- 'SubqueryAlias scalar-subquery#1 [] >: +- 'Project ['count(1) AS item_cnt#0] >:+- 'Filter ((('i_manufact = 'i1.i_manufact) && > ('i_category = Women) && (('i_color = powder) || ('i_color = khaki))) && > ((('i_units = Ounce) || ('i_units = Oz)) && (('i_size = medium) || ('i_size = > extra large || ((('i_category = Women) && (('i_color = brown) || > ('i_color = honeydew))) && ((('i_units = Bunch) || ('i_units = Ton)) && > (('i_size = N/A) || ('i_size = small) || 'i_category = Men) && > (('i_color = floral) || ('i_color = deep))) && ((('i_units = N/A) || > ('i_units = Dozen)) && (('i_size = petite) || ('i_size = large || > ((('i_category = Men) && (('i_color = light) || ('i_color = cornflower))) && > ((('i_units = Box) || ('i_units = Pound)) && (('i_size = medium) || ('i_size > = extra large))) || (('i_manufact = 'i1.i_manufact) && ('i_category = > Women) && (('i_color = midnight) || ('i_color = 
snow))) && ((('i_units = > Pallet) || ('i_units = Gross)) && (('i_size = medium) || ('i_size = extra > large || ((('i_category = Women) && (('i_color = cyan) || ('i_color = > papaya))) && ((('i_units = Cup) || ('i_units = Dram)) && (('i_size = N/A) || > ('i_size = small) || 'i_category = Men) && (('i_color = orange) || > ('i_color = frosted))) && ((('i_units = Each) || ('i_units = Tbl)) && > (('i_size = petite) || ('i_size = large || ((('i_category = Men) && > (('i_color = forest) || ('i_color = ghost))) && ((('i_units = Lb) || > ('i_units = Bundle)) && (('i_size = medium) || ('i_size = extra large >: +- 'UnresolvedRelation `item`, None >+- 'UnresolvedRelation `item`, Some(i1) > == Analyzed Logical Plan == > i_product_name: string > GlobalLimit 100 > +- LocalLimit 100 >+- Sort [i_product_name#24 ASC], true > +- Distinct > +- Project [i_product_name#24] > +- Filter (((
[jira] [Closed] (SPARK-14968) TPC-DS query 1 resolved attribute(s) missing
[ https://issues.apache.org/jira/browse/SPARK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN closed SPARK-14968. -- Resolution: Fixed Fix Version/s: 2.0.0 fixed per SPARK-14785 > TPC-DS query 1 resolved attribute(s) missing > > > Key: SPARK-14968 > URL: https://issues.apache.org/jira/browse/SPARK-14968 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN >Priority: Critical > Fix For: 2.0.0 > > > This is a regression from a week ago. Failed to generate plan for query 1 in > TPCDS using 0427 build from > people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/. > Was working in build from 0421. > The error is: > {noformat} > 16/04/27 07:00:59 INFO spark.SparkContext: Created broadcast 3 from > processCmd at CliDriver.java:376 > 16/04/27 07:00:59 INFO datasources.FileSourceStrategy: Planning scan with bin > packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 > bytes. > Error in query: resolved attribute(s) ctr_store_sk#2#535 missing from > ctr_store_sk#2,ctr_total_return#3 in operator !Filter (ctr_store_sk#2#535 = > ctr_store_sk#2); > 16/04/27 07:00:59 INFO handler.ContextHandler: stopped > o.s.j.s.ServletContextHandler{/static/sql,null} > 16/04/27 07:00:59 INFO handler.ContextHandler: stopped > o.s.j.s.ServletContextHandler{/SQL/execution/json,null} > {noformat} > The query is: > {noformat} > with customer_total_return as > (select sr_customer_sk as ctr_customer_sk > ,sr_store_sk as ctr_store_sk > ,sum(SR_RETURN_AMT) as ctr_total_return > from store_returns > ,date_dim > where sr_returned_date_sk = d_date_sk > and d_year =2000 > group by sr_customer_sk > ,sr_store_sk) > select c_customer_id > from customer_total_return ctr1 > ,store > ,customer > where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2 > from customer_total_return ctr2 > where ctr1.ctr_store_sk = ctr2.ctr_store_sk) > and s_store_sk = ctr1.ctr_store_sk > and s_state = 'TN' > and ctr1.ctr_customer_sk = c_customer_sk > order by c_customer_id > limit 100 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
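The pattern that tripped the analyzer here is a CTE instantiated twice: once in the outer FROM and once inside a correlated subquery, so the two instances must receive distinct attribute IDs for the correlated filter to resolve. A stripped-down sketch of the same shape, using a hypothetical table t(k, v) rather than the TPC-DS schema:
{noformat}
-- the CTE agg appears both in the outer query (a1) and in the
-- correlated subquery (a2), mirroring query 1's customer_total_return
with agg as
  (select k, sum(v) as total
   from t
   group by k)
select a1.k
from agg a1
where a1.total > (select avg(a2.total) * 1.2
                  from agg a2
                  where a1.k = a2.k)
order by a1.k;
{noformat}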
[jira] [Commented] (SPARK-15122) TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates
[ https://issues.apache.org/jira/browse/SPARK-15122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271128#comment-15271128 ] JESSE CHEN commented on SPARK-15122: Query41 official version: {noformat} select distinct(i_product_name) from item i1 where i_manufact_id between 738 and 738+40 and (select count(*) as item_cnt from item where (i_manufact = i1.i_manufact and ((i_category = 'Women' and (i_color = 'powder' or i_color = 'khaki') and (i_units = 'Ounce' or i_units = 'Oz') and (i_size = 'medium' or i_size = 'extra large') ) or (i_category = 'Women' and (i_color = 'brown' or i_color = 'honeydew') and (i_units = 'Bunch' or i_units = 'Ton') and (i_size = 'N/A' or i_size = 'small') ) or (i_category = 'Men' and (i_color = 'floral' or i_color = 'deep') and (i_units = 'N/A' or i_units = 'Dozen') and (i_size = 'petite' or i_size = 'large') ) or (i_category = 'Men' and (i_color = 'light' or i_color = 'cornflower') and (i_units = 'Box' or i_units = 'Pound') and (i_size = 'medium' or i_size = 'extra large') ))) or (i_manufact = i1.i_manufact and ((i_category = 'Women' and (i_color = 'midnight' or i_color = 'snow') and (i_units = 'Pallet' or i_units = 'Gross') and (i_size = 'medium' or i_size = 'extra large') ) or (i_category = 'Women' and (i_color = 'cyan' or i_color = 'papaya') and (i_units = 'Cup' or i_units = 'Dram') and (i_size = 'N/A' or i_size = 'small') ) or (i_category = 'Men' and (i_color = 'orange' or i_color = 'frosted') and (i_units = 'Each' or i_units = 'Tbl') and (i_size = 'petite' or i_size = 'large') ) or (i_category = 'Men' and (i_color = 'forest' or i_color = 'ghost') and (i_units = 'Lb' or i_units = 'Bundle') and (i_size = 'medium' or i_size = 'extra large') > 0 order by i_product_name limit 100; {noformat} > TPC-DS Qury 41 fails with The correlated scalar subquery can only contain > equality predicates > - > > Key: SPARK-15122 > URL: https://issues.apache.org/jira/browse/SPARK-15122 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN >Priority: Critical > > The official TPC-DS query 41 fails with the following error: > {noformat} > Error in query: The correlated scalar subquery can only contain equality > predicates: (((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) > && ((i_color#41 = powder) || (i_color#41 = khaki))) && (((i_units#42 = Ounce) > || (i_units#42 = Oz)) && ((i_size#39 = medium) || (i_size#39 = extra > large || (((i_category#36 = Women) && ((i_color#41 = brown) || > (i_color#41 = honeydew))) && (((i_units#42 = Bunch) || (i_units#42 = Ton)) && > ((i_size#39 = N/A) || (i_size#39 = small) || i_category#36 = Men) && > ((i_color#41 = floral) || (i_color#41 = deep))) && (((i_units#42 = N/A) || > (i_units#42 = Dozen)) && ((i_size#39 = petite) || (i_size#39 = large || > (((i_category#36 = Men) && ((i_color#41 = light) || (i_color#41 = > cornflower))) && (((i_units#42 = Box) || (i_units#42 = Pound)) && ((i_size#39 > = medium) || (i_size#39 = extra large))) || ((i_manufact#38 = > i_manufact#16) && (i_category#36 = Women) && ((i_color#41 = midnight) || > (i_color#41 = snow))) && (((i_units#42 = Pallet) || (i_units#42 = Gross)) && > ((i_size#39 = medium) || (i_size#39 = extra large || (((i_category#36 = > Women) && ((i_color#41 = cyan) || (i_color#41 = papaya))) && (((i_units#42 = > Cup) || (i_units#42 = Dram)) && ((i_size#39 = N/A) || (i_size#39 = small) > || i_category#36 = Men) && ((i_color#41 = orange) || (i_color#41 = > frosted))) && (((i_units#42 = Each) || (i_units#42 = 
Tbl)) && ((i_size#39 = > petite) || (i_size#39 = large || (((i_category#36 = Men) && ((i_color#41 > = forest) || (i_color#41 = ghost))) && (((i_units#42 = Lb) || (i_units#42 = > Bundle)) && ((i_size#39 = medium) || (i_size#39 = extra large; > {noformat} > The output plans showed the following errors > {noformat} > == Parsed Logical Plan == > 'GlobalLimit 100 > +- 'LocalLimit 100 >+- 'Sort ['i_product_name ASC], true > +- 'Distinct > +- 'Project ['i_product_name] > +- 'Filter ((('i_manufact_id >= 738) && ('i_manufact_id <= (738 + > 40))) && (scalar-subquery#1 [] > 0)) >: +- 'SubqueryAlias scalar-subquery#1 [] >: +- 'Project ['count(1) AS item_cnt#0] >:+- 'Filter ((('i_manufact = 'i
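Worth noting why this query is legal once the analyzer handles it: the only correlated predicate in query 41 is the equality i_manufact = i1.i_manufact, but it is repeated inside a large disjunction, which the pre-fix analyzer could not pull out into a plain equi-join condition. A stripped-down sketch of the same shape, using a hypothetical table t(k, a, b):
{noformat}
-- the correlation t2.k = t1.k is an equality, but it sits inside an OR,
-- mirroring the structure of query 41's count(*) subquery
select distinct t1.k
from t t1
where (select count(*)
       from t t2
       where (t2.k = t1.k and t2.a = 1)
          or (t2.k = t1.k and t2.b = 2)) > 0;
{noformat}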
[jira] [Updated] (SPARK-15122) TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates
[ https://issues.apache.org/jira/browse/SPARK-15122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-15122: --- Priority: Critical (was: Major) > TPC-DS Qury 41 fails with The correlated scalar subquery can only contain > equality predicates > - > > Key: SPARK-15122 > URL: https://issues.apache.org/jira/browse/SPARK-15122 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN >Priority: Critical > > The official TPC-DS query 41 fails with the following error: > {noformat} > Error in query: The correlated scalar subquery can only contain equality > predicates: (((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) > && ((i_color#41 = powder) || (i_color#41 = khaki))) && (((i_units#42 = Ounce) > || (i_units#42 = Oz)) && ((i_size#39 = medium) || (i_size#39 = extra > large || (((i_category#36 = Women) && ((i_color#41 = brown) || > (i_color#41 = honeydew))) && (((i_units#42 = Bunch) || (i_units#42 = Ton)) && > ((i_size#39 = N/A) || (i_size#39 = small) || i_category#36 = Men) && > ((i_color#41 = floral) || (i_color#41 = deep))) && (((i_units#42 = N/A) || > (i_units#42 = Dozen)) && ((i_size#39 = petite) || (i_size#39 = large || > (((i_category#36 = Men) && ((i_color#41 = light) || (i_color#41 = > cornflower))) && (((i_units#42 = Box) || (i_units#42 = Pound)) && ((i_size#39 > = medium) || (i_size#39 = extra large))) || ((i_manufact#38 = > i_manufact#16) && (i_category#36 = Women) && ((i_color#41 = midnight) || > (i_color#41 = snow))) && (((i_units#42 = Pallet) || (i_units#42 = Gross)) && > ((i_size#39 = medium) || (i_size#39 = extra large || (((i_category#36 = > Women) && ((i_color#41 = cyan) || (i_color#41 = papaya))) && (((i_units#42 = > Cup) || (i_units#42 = Dram)) && ((i_size#39 = N/A) || (i_size#39 = small) > || i_category#36 = Men) && ((i_color#41 = orange) || (i_color#41 = > frosted))) && (((i_units#42 = Each) || (i_units#42 = Tbl)) && ((i_size#39 = > petite) || (i_size#39 = large || (((i_category#36 = Men) && ((i_color#41 > = forest) || (i_color#41 = ghost))) && (((i_units#42 = Lb) || (i_units#42 = > Bundle)) && ((i_size#39 = medium) || (i_size#39 = extra large; > {noformat} > The output plans showed the following errors > {noformat} > == Parsed Logical Plan == > 'GlobalLimit 100 > +- 'LocalLimit 100 >+- 'Sort ['i_product_name ASC], true > +- 'Distinct > +- 'Project ['i_product_name] > +- 'Filter ((('i_manufact_id >= 738) && ('i_manufact_id <= (738 + > 40))) && (scalar-subquery#1 [] > 0)) >: +- 'SubqueryAlias scalar-subquery#1 [] >: +- 'Project ['count(1) AS item_cnt#0] >:+- 'Filter ((('i_manufact = 'i1.i_manufact) && > ('i_category = Women) && (('i_color = powder) || ('i_color = khaki))) && > ((('i_units = Ounce) || ('i_units = Oz)) && (('i_size = medium) || ('i_size = > extra large || ((('i_category = Women) && (('i_color = brown) || > ('i_color = honeydew))) && ((('i_units = Bunch) || ('i_units = Ton)) && > (('i_size = N/A) || ('i_size = small) || 'i_category = Men) && > (('i_color = floral) || ('i_color = deep))) && ((('i_units = N/A) || > ('i_units = Dozen)) && (('i_size = petite) || ('i_size = large || > ((('i_category = Men) && (('i_color = light) || ('i_color = cornflower))) && > ((('i_units = Box) || ('i_units = Pound)) && (('i_size = medium) || ('i_size > = extra large))) || (('i_manufact = 'i1.i_manufact) && ('i_category = > Women) && (('i_color = midnight) || ('i_color = snow))) && ((('i_units = > Pallet) || ('i_units = Gross)) && (('i_size = medium) || ('i_size = extra > large || 
((('i_category = Women) && (('i_color = cyan) || ('i_color = > papaya))) && ((('i_units = Cup) || ('i_units = Dram)) && (('i_size = N/A) || > ('i_size = small) || 'i_category = Men) && (('i_color = orange) || > ('i_color = frosted))) && ((('i_units = Each) || ('i_units = Tbl)) && > (('i_size = petite) || ('i_size = large || ((('i_category = Men) && > (('i_color = forest) || ('i_color = ghost))) && ((('i_units = Lb) || > ('i_units = Bundle)) && (('i_size = medium) || ('i_size = extra large >: +- 'UnresolvedRelation `item`, None >+- 'UnresolvedRelation `item`, Some(i1) > == Analyzed Logical Plan == > i_product_name: string > GlobalLimit 100 > +- LocalLimit 100 >+- Sort [i_product_name#24 ASC], true > +- Distinct > +- Project [i_product_name#24] > +- Filter (((i_manufact_id#16L >= cast(738 as bigint)) && > (i_manufact_id#16L <= cast((738 + 40) as bigint))) && (scalar-subquery#1 > [(((i_manufact#39 =
[jira] [Updated] (SPARK-15122) TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates
[ https://issues.apache.org/jira/browse/SPARK-15122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-15122: --- Description: The official TPC-DS query 41 fails with the following error: {noformat} Error in query: The correlated scalar subquery can only contain equality predicates: (((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) && ((i_color#41 = powder) || (i_color#41 = khaki))) && (((i_units#42 = Ounce) || (i_units#42 = Oz)) && ((i_size#39 = medium) || (i_size#39 = extra large || (((i_category#36 = Women) && ((i_color#41 = brown) || (i_color#41 = honeydew))) && (((i_units#42 = Bunch) || (i_units#42 = Ton)) && ((i_size#39 = N/A) || (i_size#39 = small) || i_category#36 = Men) && ((i_color#41 = floral) || (i_color#41 = deep))) && (((i_units#42 = N/A) || (i_units#42 = Dozen)) && ((i_size#39 = petite) || (i_size#39 = large || (((i_category#36 = Men) && ((i_color#41 = light) || (i_color#41 = cornflower))) && (((i_units#42 = Box) || (i_units#42 = Pound)) && ((i_size#39 = medium) || (i_size#39 = extra large))) || ((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) && ((i_color#41 = midnight) || (i_color#41 = snow))) && (((i_units#42 = Pallet) || (i_units#42 = Gross)) && ((i_size#39 = medium) || (i_size#39 = extra large || (((i_category#36 = Women) && ((i_color#41 = cyan) || (i_color#41 = papaya))) && (((i_units#42 = Cup) || (i_units#42 = Dram)) && ((i_size#39 = N/A) || (i_size#39 = small) || i_category#36 = Men) && ((i_color#41 = orange) || (i_color#41 = frosted))) && (((i_units#42 = Each) || (i_units#42 = Tbl)) && ((i_size#39 = petite) || (i_size#39 = large || (((i_category#36 = Men) && ((i_color#41 = forest) || (i_color#41 = ghost))) && (((i_units#42 = Lb) || (i_units#42 = Bundle)) && ((i_size#39 = medium) || (i_size#39 = extra large; {noformat} The output plans showed the following errors {noformat} == Parsed Logical Plan == 'GlobalLimit 100 +- 'LocalLimit 100 +- 'Sort ['i_product_name ASC], true +- 'Distinct +- 'Project ['i_product_name] +- 'Filter ((('i_manufact_id >= 738) && ('i_manufact_id <= (738 + 40))) && (scalar-subquery#1 [] > 0)) : +- 'SubqueryAlias scalar-subquery#1 [] : +- 'Project ['count(1) AS item_cnt#0] :+- 'Filter ((('i_manufact = 'i1.i_manufact) && ('i_category = Women) && (('i_color = powder) || ('i_color = khaki))) && ((('i_units = Ounce) || ('i_units = Oz)) && (('i_size = medium) || ('i_size = extra large || ((('i_category = Women) && (('i_color = brown) || ('i_color = honeydew))) && ((('i_units = Bunch) || ('i_units = Ton)) && (('i_size = N/A) || ('i_size = small) || 'i_category = Men) && (('i_color = floral) || ('i_color = deep))) && ((('i_units = N/A) || ('i_units = Dozen)) && (('i_size = petite) || ('i_size = large || ((('i_category = Men) && (('i_color = light) || ('i_color = cornflower))) && ((('i_units = Box) || ('i_units = Pound)) && (('i_size = medium) || ('i_size = extra large))) || (('i_manufact = 'i1.i_manufact) && ('i_category = Women) && (('i_color = midnight) || ('i_color = snow))) && ((('i_units = Pallet) || ('i_units = Gross)) && (('i_size = medium) || ('i_size = extra large || ((('i_category = Women) && (('i_color = cyan) || ('i_color = papaya))) && ((('i_units = Cup) || ('i_units = Dram)) && (('i_size = N/A) || ('i_size = small) || 'i_category = Men) && (('i_color = orange) || ('i_color = frosted))) && ((('i_units = Each) || ('i_units = Tbl)) && (('i_size = petite) || ('i_size = large || ((('i_category = Men) && (('i_color = forest) || ('i_color = ghost))) && ((('i_units = Lb) || ('i_units 
= Bundle)) && (('i_size = medium) || ('i_size = extra large : +- 'UnresolvedRelation `item`, None +- 'UnresolvedRelation `item`, Some(i1) == Analyzed Logical Plan == i_product_name: string GlobalLimit 100 +- LocalLimit 100 +- Sort [i_product_name#24 ASC], true +- Distinct +- Project [i_product_name#24] +- Filter (((i_manufact_id#16L >= cast(738 as bigint)) && (i_manufact_id#16L <= cast((738 + 40) as bigint))) && (scalar-subquery#1 [(((i_manufact#39 = i_manufact#17) && (i_category#37 = Women) && ((i_color#42 = powder) || (i_color#42 = khaki))) && (((i_units#43 = Ounce) || (i_units#43 = Oz)) && ((i_size#40 = medium) || (i_size#40 = extra large || (((i_category#37 = Women) && ((i_color#42 = brown) || (i_color#42 = honeydew))) && (((i_units#43 = Bunch) || (i_units#43 = Ton)) && ((i_size#40 = N/A) || (i_size#40 = small) || i_category#37 = Men) && ((i_color#42 = floral) || (i_color#42 = deep))) && (((i_units#43 = N/A) || (i_units#43 = Dozen)) && ((i_size#40 = petite) || (i_size#40 = large || (((i_category#37 = Men) && ((i_color#42 = light) || (i_color#42 =
[jira] [Created] (SPARK-15122) TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates
JESSE CHEN created SPARK-15122: -- Summary: TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates Key: SPARK-15122 URL: https://issues.apache.org/jira/browse/SPARK-15122 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.1 Reporter: JESSE CHEN
Hi, I am testing on spark 2.0 but don't see an option to select it yet. TPC-DS query 23 fails with the compile error:
Error in query: NoViableAltException(-1@[237:51: ( KW_AS )?])
line 4:33 cannot recognize input near '' '' '' in subquery source
; line 4 pos 33
I could narrow the error down to an aggregation on a subquery.
select max(csales) tpcds_cmax
from (select sum(ss_quantity*ss_sales_price) csales
      from store_sales
      group by ss_customer_sk) ;
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
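The NoViableAltException at "( KW_AS )?" is the old Hive-derived parser insisting that a derived table in FROM carry an alias; giving the subquery one lets the statement parse. A sketch of the aliased form (the alias name sub is arbitrary):
{noformat}
select max(csales) tpcds_cmax
from (select sum(ss_quantity * ss_sales_price) csales
      from store_sales
      group by ss_customer_sk) sub;
{noformat}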
[jira] [Updated] (SPARK-15122) TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates
[ https://issues.apache.org/jira/browse/SPARK-15122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-15122: --- Affects Version/s: (was: 1.6.1) 2.0.0 > TPC-DS Qury 41 fails with The correlated scalar subquery can only contain > equality predicates > - > > Key: SPARK-15122 > URL: https://issues.apache.org/jira/browse/SPARK-15122 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN > > Hi I am testing on spark 2.0 but dont see an option to select it yet. > TPC-DS query 23 fails with the compile error > Error in query: NoViableAltException(-1@[237:51: ( KW_AS )?]) > line 4:33 cannot recognize input near '' '' '' in subquery > source > ; line 4 pos 33 > I could narrow the error to an aggregation on a subquery. > select max(csales) tpcds_cmax > from (select sum(ss_quantity*ss_sales_price) csales > from store_sales > group by ss_customer_sk) ; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14968) TPC-DS query 1 resolved attribute(s) missing
[ https://issues.apache.org/jira/browse/SPARK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14968: --- Description: This is a regression from a week ago. Failed to generate plan for query 1 in TPCDS using 0427 build from people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/. Was working in build from 0421. The error is: {noformat} 16/04/27 07:00:59 INFO spark.SparkContext: Created broadcast 3 from processCmd at CliDriver.java:376 16/04/27 07:00:59 INFO datasources.FileSourceStrategy: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes. Error in query: resolved attribute(s) ctr_store_sk#2#535 missing from ctr_store_sk#2,ctr_total_return#3 in operator !Filter (ctr_store_sk#2#535 = ctr_store_sk#2); 16/04/27 07:00:59 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static/sql,null} 16/04/27 07:00:59 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/execution/json,null} {noformat} The query is: {noformat} with customer_total_return as (select sr_customer_sk as ctr_customer_sk ,sr_store_sk as ctr_store_sk ,sum(SR_RETURN_AMT) as ctr_total_return from store_returns ,date_dim where sr_returned_date_sk = d_date_sk and d_year =2000 group by sr_customer_sk ,sr_store_sk) select c_customer_id from customer_total_return ctr1 ,store ,customer where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2 from customer_total_return ctr2 where ctr1.ctr_store_sk = ctr2.ctr_store_sk) and s_store_sk = ctr1.ctr_store_sk and s_state = 'TN' and ctr1.ctr_customer_sk = c_customer_sk order by c_customer_id limit 100 {noformat} was: This is a regression from a week ago. Failed to generate plan for query 1 in TPCDS using 0427 build from people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/. Was working in build from 0421. The error is: {noformat} 16/04/27 07:00:59 INFO spark.SparkContext: Created broadcast 3 from processCmd at CliDriver.java:376 16/04/27 07:00:59 INFO datasources.FileSourceStrategy: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes. Error in query: resolved attribute(s) ctr_store_sk#2#535 missing from ctr_store_sk#2,ctr_total_return#3 in operator !Filter (ctr_store_sk#2#535 = ctr_store_sk#2); 16/04/27 07:00:59 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static/sql,null} 16/04/27 07:00:59 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/execution/json,null} {noformat} The query is: {noformat} (select sr_customer_sk as ctr_customer_sk ,sr_store_sk as ctr_store_sk ,sum(SR_RETURN_AMT) as ctr_total_return from store_returns ,date_dim where sr_returned_date_sk = d_date_sk and d_year =2000 group by sr_customer_sk ,sr_store_sk) select c_customer_id from customer_total_return ctr1 ,store ,customer where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2 from customer_total_return ctr2 where ctr1.ctr_store_sk = ctr2.ctr_store_sk) and s_store_sk = ctr1.ctr_store_sk and s_state = 'TN' and ctr1.ctr_customer_sk = c_customer_sk order by c_customer_id limit 100 {noformat} > TPC-DS query 1 resolved attribute(s) missing > > > Key: SPARK-14968 > URL: https://issues.apache.org/jira/browse/SPARK-14968 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: JESSE CHEN >Priority: Critical > > This is a regression from a week ago. 
Failed to generate plan for query 1 in > TPCDS using 0427 build from > people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/. > Was working in build from 0421. > The error is: > {noformat} > 16/04/27 07:00:59 INFO spark.SparkContext: Created broadcast 3 from > processCmd at CliDriver.java:376 > 16/04/27 07:00:59 INFO datasources.FileSourceStrategy: Planning scan with bin > packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 > bytes. > Error in query: resolved attribute(s) ctr_store_sk#2#535 missing from > ctr_store_sk#2,ctr_total_return#3 in operator !Filter (ctr_store_sk#2#535 = > ctr_store_sk#2); > 16/04/27 07:00:59 INFO handler.ContextHandler: stopped > o.s.j.s.ServletContextHandler{/static/sql,null} > 16/04/27 07:00:59 INFO handler.ContextHandler: stopped > o.s.j.s.ServletContextHandler{/SQL/execution/json,null} > {noformat} > The query is: > {noformat} > with customer_total_return as > (select sr_customer_sk as ctr_customer_sk > ,sr_store_sk as ctr_store_sk > ,sum(SR_RETURN_AMT) as ctr_total_return > from store_returns > ,date_dim > where sr_returned_date_sk = d_date_sk > and d_year =2000 > group by sr_customer_sk > ,sr_store_sk) > select c_customer_id > from customer_total_return ctr1 > ,store > ,customer > where ctr1.ctr_total_return > (select avg(ctr_tota
[jira] [Updated] (SPARK-14968) TPC-DS query 1 resolved attribute(s) missing
[ https://issues.apache.org/jira/browse/SPARK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14968: --- Summary: TPC-DS query 1 resolved attribute(s) missing (was: TPC-DS query 1 fails to generate plan) > TPC-DS query 1 resolved attribute(s) missing > > > Key: SPARK-14968 > URL: https://issues.apache.org/jira/browse/SPARK-14968 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: JESSE CHEN >Priority: Critical > > This is a regression from a week ago. Failed to generate plan for query 1 in > TPCDS using 0427 build from > people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/. > Was working in build from 0421. > The error is: > {noformat} > 16/04/27 07:00:59 INFO spark.SparkContext: Created broadcast 3 from > processCmd at CliDriver.java:376 > 16/04/27 07:00:59 INFO datasources.FileSourceStrategy: Planning scan with bin > packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 > bytes. > Error in query: resolved attribute(s) ctr_store_sk#2#535 missing from > ctr_store_sk#2,ctr_total_return#3 in operator !Filter (ctr_store_sk#2#535 = > ctr_store_sk#2); > 16/04/27 07:00:59 INFO handler.ContextHandler: stopped > o.s.j.s.ServletContextHandler{/static/sql,null} > 16/04/27 07:00:59 INFO handler.ContextHandler: stopped > o.s.j.s.ServletContextHandler{/SQL/execution/json,null} > {noformat} > The query is: > {noformat} > (select sr_customer_sk as ctr_customer_sk > ,sr_store_sk as ctr_store_sk > ,sum(SR_RETURN_AMT) as ctr_total_return > from store_returns > ,date_dim > where sr_returned_date_sk = d_date_sk > and d_year =2000 > group by sr_customer_sk > ,sr_store_sk) > select c_customer_id > from customer_total_return ctr1 > ,store > ,customer > where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2 > from customer_total_return ctr2 > where ctr1.ctr_store_sk = ctr2.ctr_store_sk) > and s_store_sk = ctr1.ctr_store_sk > and s_state = 'TN' > and ctr1.ctr_customer_sk = c_customer_sk > order by c_customer_id > limit 100 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14968) TPC-DS query 1 fails to generate plan
[ https://issues.apache.org/jira/browse/SPARK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14968: --- Description: This is a regression from a week ago. Failed to generate plan for query 1 in TPCDS using 0427 build from people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/. Was working in build from 0421. The error is: {noformat} 16/04/27 07:00:59 INFO spark.SparkContext: Created broadcast 3 from processCmd at CliDriver.java:376 16/04/27 07:00:59 INFO datasources.FileSourceStrategy: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes. Error in query: resolved attribute(s) ctr_store_sk#2#535 missing from ctr_store_sk#2,ctr_total_return#3 in operator !Filter (ctr_store_sk#2#535 = ctr_store_sk#2); 16/04/27 07:00:59 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static/sql,null} 16/04/27 07:00:59 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/execution/json,null} {noformat} The query is: {noformat} (select sr_customer_sk as ctr_customer_sk ,sr_store_sk as ctr_store_sk ,sum(SR_RETURN_AMT) as ctr_total_return from store_returns ,date_dim where sr_returned_date_sk = d_date_sk and d_year =2000 group by sr_customer_sk ,sr_store_sk) select c_customer_id from customer_total_return ctr1 ,store ,customer where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2 from customer_total_return ctr2 where ctr1.ctr_store_sk = ctr2.ctr_store_sk) and s_store_sk = ctr1.ctr_store_sk and s_state = 'TN' and ctr1.ctr_customer_sk = c_customer_sk order by c_customer_id limit 100 {noformat} was: This is a regression from a week ago. Failed to generate plan for query 1 in TPCDS using 0427 build from people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/. Was working in > TPC-DS query 1 fails to generate plan > - > > Key: SPARK-14968 > URL: https://issues.apache.org/jira/browse/SPARK-14968 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: JESSE CHEN >Priority: Critical > > This is a regression from a week ago. Failed to generate plan for query 1 in > TPCDS using 0427 build from > people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/. > Was working in build from 0421. > The error is: > {noformat} > 16/04/27 07:00:59 INFO spark.SparkContext: Created broadcast 3 from > processCmd at CliDriver.java:376 > 16/04/27 07:00:59 INFO datasources.FileSourceStrategy: Planning scan with bin > packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 > bytes. 
> Error in query: resolved attribute(s) ctr_store_sk#2#535 missing from > ctr_store_sk#2,ctr_total_return#3 in operator !Filter (ctr_store_sk#2#535 = > ctr_store_sk#2); > 16/04/27 07:00:59 INFO handler.ContextHandler: stopped > o.s.j.s.ServletContextHandler{/static/sql,null} > 16/04/27 07:00:59 INFO handler.ContextHandler: stopped > o.s.j.s.ServletContextHandler{/SQL/execution/json,null} > {noformat} > The query is: > {noformat} > (select sr_customer_sk as ctr_customer_sk > ,sr_store_sk as ctr_store_sk > ,sum(SR_RETURN_AMT) as ctr_total_return > from store_returns > ,date_dim > where sr_returned_date_sk = d_date_sk > and d_year =2000 > group by sr_customer_sk > ,sr_store_sk) > select c_customer_id > from customer_total_return ctr1 > ,store > ,customer > where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2 > from customer_total_return ctr2 > where ctr1.ctr_store_sk = ctr2.ctr_store_sk) > and s_store_sk = ctr1.ctr_store_sk > and s_state = 'TN' > and ctr1.ctr_customer_sk = c_customer_sk > order by c_customer_id > limit 100 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14968) TPC-DS query 1 fails to generate plan
[ https://issues.apache.org/jira/browse/SPARK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14968: --- Affects Version/s: (was: 1.6.1) 2.0.0 > TPC-DS query 1 fails to generate plan > - > > Key: SPARK-14968 > URL: https://issues.apache.org/jira/browse/SPARK-14968 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: JESSE CHEN > > This is a regression from a week ago. Failed to generate plan for query 1 in > TPCDS using 0427 build from > people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/. > Was working in -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14968) TPC-DS query 1 fails to generate plan
[ https://issues.apache.org/jira/browse/SPARK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14968: --- Priority: Critical (was: Major) > TPC-DS query 1 fails to generate plan > - > > Key: SPARK-14968 > URL: https://issues.apache.org/jira/browse/SPARK-14968 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: JESSE CHEN >Priority: Critical > > This is a regression from a week ago. Failed to generate plan for query 1 in > TPCDS using 0427 build from > people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/. > Was working in -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14968) TPC-DS query 1 fails to generate plan
[ https://issues.apache.org/jira/browse/SPARK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14968: --- Description: This is a regression from a week ago. Failed to generate plan for query 1 in TPCDS using 0427 build from people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/. Was working in was: Hi I am testing on spark 2.0 but dont see an option to select it yet. TPC-DS query 23 fails with the compile error Error in query: NoViableAltException(-1@[237:51: ( KW_AS )?]) line 4:33 cannot recognize input near '' '' '' in subquery source ; line 4 pos 33 I could narrow the error to an aggregation on a subquery. select max(csales) tpcds_cmax from (select sum(ss_quantity*ss_sales_price) csales from store_sales group by ss_customer_sk) ; > TPC-DS query 1 fails to generate plan > - > > Key: SPARK-14968 > URL: https://issues.apache.org/jira/browse/SPARK-14968 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: JESSE CHEN > > This is a regression from a week ago. Failed to generate plan for query 1 in > TPCDS using 0427 build from > people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/. > Was working in -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14968) TPC-DS query 1 fails to generate plan
JESSE CHEN created SPARK-14968: -- Summary: TPC-DS query 1 fails to generate plan Key: SPARK-14968 URL: https://issues.apache.org/jira/browse/SPARK-14968 Project: Spark Issue Type: Bug Affects Versions: 1.6.1 Reporter: JESSE CHEN Hi I am testing on spark 2.0 but don't see an option to select it yet. TPC-DS query 23 fails with the compile error Error in query: NoViableAltException(-1@[237:51: ( KW_AS )?]) line 4:33 cannot recognize input near '' '' '' in subquery source ; line 4 pos 33 I could narrow the error down to an aggregation on a subquery. select max(csales) tpcds_cmax from (select sum(ss_quantity*ss_sales_price) csales from store_sales group by ss_customer_sk) ; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
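A hedged note on the narrowed repro: "cannot recognize input near '' '' '' in subquery source" is the message the HiveQL-style parser emits when a FROM-clause subquery has no alias, so one plausible reading is that aliasing the subquery sidesteps the parse failure. A sketch under that assumption, with store_sales assumed to be registered already:
{code:python}
# "sub" is the only change from the narrowed-down query above; the missing
# subquery alias is one plausible cause of the NoViableAltException.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)   # assumes store_sales is registered as a table
sqlc.sql("""
select max(csales) tpcds_cmax
from (select sum(ss_quantity*ss_sales_price) csales
      from store_sales
      group by ss_customer_sk) sub
""").show()
{code}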
[jira] [Commented] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS
[ https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256836#comment-15256836 ] JESSE CHEN commented on SPARK-14521: This fix will allow us to use Kryo again (in spark-sql shell and spark-submit). Somehow the workaround described above did not work for me in spark 2.0. Did I miss something else? My actual workaround for now is to use the Java serializer, which incurs a performance hit. > StackOverflowError in Kryo when executing TPC-DS > > > Key: SPARK-14521 > URL: https://issues.apache.org/jira/browse/SPARK-14521 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Rajesh Balamohan >Priority: Blocker > > Build details: Spark build from master branch (Apr-10) > DataSet:TPC-DS at 200 GB scale in Parq format stored in hive. > Client: $SPARK_HOME/bin/beeline > Query: TPC-DS Query27 > spark.sql.sources.fileScan=true (this is the default value anyways) > Exception: > {noformat} > Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError > at > com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108) > at > com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99) > at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
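Since the StackOverflowError is Kryo recursing through a deeply nested object graph, one hedged stopgap, separate from the fix tracked in this issue, is to give executor threads a deeper stack. A sketch; the -Xss value is an illustrative guess, and the driver-side equivalent belongs in spark-defaults.conf or on the spark-submit line rather than in an already-running driver:
{code:python}
# Stopgap sketch, not the fix: deepen executor thread stacks so Kryo's
# recursion has more headroom before overflowing. -Xss8m is a guess.
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.executor.extraJavaOptions", "-Xss8m")
sc = SparkContext(conf=conf)
{code}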
[jira] [Closed] (SPARK-14096) SPARK-SQL CLI returns NPE
[ https://issues.apache.org/jira/browse/SPARK-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN closed SPARK-14096. -- Resolution: Duplicate SPARK-14521 > SPARK-SQL CLI returns NPE > - > > Key: SPARK-14096 > URL: https://issues.apache.org/jira/browse/SPARK-14096 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN > > Trying to run TPCDS query 06 in spark-sql shell received the following error > in the middle of a stage; but running another query 38 succeeded: > NPE: > {noformat} > 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, > whose tasks have all completed, from pool > 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 65.0 in stage > 10.0 (TID 622) in 171 ms on localhost (30/200) > 16/03/22 15:12:56 ERROR scheduler.TaskResultGetter: Exception while getting > task result > com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException > Serialization trace: > underlying (org.apache.spark.util.BoundedPriorityQueue) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312) > at > org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1790) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) > at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) > at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669) > at java.util.PriorityQueue.siftUp(PriorityQueue.java:645) > at java.util.PriorityQueue.offer(PriorityQueue.java:344) > at java.util.PriorityQueue.add(PriorityQueue.java:321) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31) > at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) > ... 
15 more > 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, > whose tasks have all completed, from pool > 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 66.0 in stage > 10.0 (TID 623) in 171 ms on localhost (31/200) > 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, > whose tasks have all completed, from pool > {noformat} > query 06 (caused the above NPE): > {noformat} > select a.ca_state state, count(*) cnt > from customer_address a > join customer c on a.ca_address_sk = c.c_current_addr_sk > join store_sales s on c.c_customer_sk = s.ss_customer_sk > join date_dim d on s.ss_sold_date_sk = d.d_date_sk > join item i on s.ss_item_sk = i.i_item_sk > join (select distinct d_month_seq > from date_dim >where d_year = 2001 > and d_moy = 1 ) tmp1 ON d.d_month_seq = tmp1.d_month_seq > join > (select j.i_category, avg(j.i_current_price) as avg_i_current_price >from item j group by j.i_category) tmp2 on tmp2.i_category = > i.i_category > where > i.i_current_price > 1.2 * tmp2.avg_i_current_price > group by a.ca_state > having count(*) >= 10 > order by cnt >limit 100; > {noformat} > query 38 (succeeded) > {noform
[jira] [Commented] (SPARK-14096) SPARK-SQL CLI returns NPE
[ https://issues.apache.org/jira/browse/SPARK-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252597#comment-15252597 ] JESSE CHEN commented on SPARK-14096: But the simplest workaround for now is to set spark.serializer to org.apache.spark.serializer.JavaSerializer. > SPARK-SQL CLI returns NPE > - > > Key: SPARK-14096 > URL: https://issues.apache.org/jira/browse/SPARK-14096 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN > > Trying to run TPCDS query 06 in spark-sql shell received the following error > in the middle of a stage; but running another query 38 succeeded: > NPE: > {noformat} > 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, > whose tasks have all completed, from pool > 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 65.0 in stage > 10.0 (TID 622) in 171 ms on localhost (30/200) > 16/03/22 15:12:56 ERROR scheduler.TaskResultGetter: Exception while getting > task result > com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException > Serialization trace: > underlying (org.apache.spark.util.BoundedPriorityQueue) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312) > at > org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1790) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) > at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) > at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669) > at java.util.PriorityQueue.siftUp(PriorityQueue.java:645) > at java.util.PriorityQueue.offer(PriorityQueue.java:344) > at java.util.PriorityQueue.add(PriorityQueue.java:321) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31) > at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) > ... 
15 more > 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, > whose tasks have all completed, from pool > 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 66.0 in stage > 10.0 (TID 623) in 171 ms on localhost (31/200) > 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, > whose tasks have all completed, from pool > {noformat} > query 06 (caused the above NPE): > {noformat} > select a.ca_state state, count(*) cnt > from customer_address a > join customer c on a.ca_address_sk = c.c_current_addr_sk > join store_sales s on c.c_customer_sk = s.ss_customer_sk > join date_dim d on s.ss_sold_date_sk = d.d_date_sk > join item i on s.ss_item_sk = i.i_item_sk > join (select distinct d_month_seq > from date_dim >where d_year = 2001 > and d_moy = 1 ) tmp1 ON d.d_month_seq = tmp1.d_month_seq > join > (select j.i_category, avg(j.i_current_price) as avg_i_current_price >from item j group by j.i_category) tmp2 on tmp2.i_category = > i.i_category > where > i.i_current_price > 1.2 * tmp2.avg_i_curren
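A minimal sketch of that workaround, applied before the SparkContext starts; it trades the Kryo NPE for the slower Java serialization noted in the SPARK-14521 comments:
{code:python}
# The workaround stated above: fall back from Kryo to the Java serializer.
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.serializer",
                       "org.apache.spark.serializer.JavaSerializer")
sc = SparkContext(conf=conf)
{code}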
[jira] [Commented] (SPARK-14096) SPARK-SQL CLI returns NPE
[ https://issues.apache.org/jira/browse/SPARK-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252592#comment-15252592 ] JESSE CHEN commented on SPARK-14096: duplicate of SPARK-14521 > SPARK-SQL CLI returns NPE > - > > Key: SPARK-14096 > URL: https://issues.apache.org/jira/browse/SPARK-14096 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN > > Trying to run TPCDS query 06 in spark-sql shell received the following error > in the middle of a stage; but running another query 38 succeeded: > NPE: > {noformat} > 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, > whose tasks have all completed, from pool > 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 65.0 in stage > 10.0 (TID 622) in 171 ms on localhost (30/200) > 16/03/22 15:12:56 ERROR scheduler.TaskResultGetter: Exception while getting > task result > com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException > Serialization trace: > underlying (org.apache.spark.util.BoundedPriorityQueue) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312) > at > org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1790) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) > at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) > at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669) > at java.util.PriorityQueue.siftUp(PriorityQueue.java:645) > at java.util.PriorityQueue.offer(PriorityQueue.java:344) > at java.util.PriorityQueue.add(PriorityQueue.java:321) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31) > at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) > ... 
15 more > 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, > whose tasks have all completed, from pool > 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 66.0 in stage > 10.0 (TID 623) in 171 ms on localhost (31/200) > 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, > whose tasks have all completed, from pool > {noformat} > query 06 (caused the above NPE): > {noformat} > select a.ca_state state, count(*) cnt > from customer_address a > join customer c on a.ca_address_sk = c.c_current_addr_sk > join store_sales s on c.c_customer_sk = s.ss_customer_sk > join date_dim d on s.ss_sold_date_sk = d.d_date_sk > join item i on s.ss_item_sk = i.i_item_sk > join (select distinct d_month_seq > from date_dim >where d_year = 2001 > and d_moy = 1 ) tmp1 ON d.d_month_seq = tmp1.d_month_seq > join > (select j.i_category, avg(j.i_current_price) as avg_i_current_price >from item j group by j.i_category) tmp2 on tmp2.i_category = > i.i_category > where > i.i_current_price > 1.2 * tmp2.avg_i_current_price > group by a.ca_state > having count(*) >= 10 > order by cnt >limit 100;
[jira] [Closed] (SPARK-14616) TreeNodeException running Q44 and 58 on Parquet tables
[ https://issues.apache.org/jira/browse/SPARK-14616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN closed SPARK-14616. -- Resolution: Not A Problem > TreeNodeException running Q44 and 58 on Parquet tables > -- > > Key: SPARK-14616 > URL: https://issues.apache.org/jira/browse/SPARK-14616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN > > {code:title=tpcds q44} > select asceding.rnk, i1.i_product_name best_performing, i2.i_product_name > worst_performing > from(select * > from (select item_sk,rank() over (order by rank_col asc) rnk >from (select ss_item_sk item_sk,avg(ss_net_profit) rank_col > from store_sales ss1 > where ss_store_sk = 4 > group by ss_item_sk > having avg(ss_net_profit) > 0.9*(select avg(ss_net_profit) > rank_col > from store_sales > where ss_store_sk = 4 > and ss_addr_sk is null > group by ss_store_sk))V1)V11 > where rnk < 11) asceding, > (select * > from (select item_sk,rank() over (order by rank_col desc) rnk >from (select ss_item_sk item_sk,avg(ss_net_profit) rank_col > from store_sales ss1 > where ss_store_sk = 4 > group by ss_item_sk > having avg(ss_net_profit) > 0.9*(select avg(ss_net_profit) > rank_col > from store_sales > where ss_store_sk = 4 > and ss_addr_sk is null > group by ss_store_sk))V2)V21 > where rnk < 11) descending, > item i1, > item i2 > where asceding.rnk = descending.rnk > and i1.i_item_sk=asceding.item_sk > and i2.i_item_sk=descending.item_sk > order by asceding.rnk > limit 100; > {code} > {noformat} > bin/spark-sql --driver-memory 10g --verbose --master yarn-client --packages > com.databricks:spark-csv_2.10:1.3.0 --executor-memory 4g --num-executors 80 > --executor-cores 2 --database hadoopds1g -f q44.sql > {noformat} > {noformat} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange SinglePartition, None > +- WholeStageCodegen >: +- Project [item_sk#0,rank_col#1] >: +- Filter havingCondition#219: boolean >:+- TungstenAggregate(key=[ss_item_sk#12], > functions=[(avg(ss_net_profit#32),mode=Final,isDistinct=false)], > output=[havingCondition#219,item_sk#0,rank_col#1]) >: +- INPUT >+- Exchange hashpartitioning(ss_item_sk#12,200), None > +- WholeStageCodegen > : +- TungstenAggregate(key=[ss_item_sk#12], > functions=[(avg(ss_net_profit#32),mode=Partial,isDistinct=false)], > output=[ss_item_sk#12,sum#612,count#613L]) > : +- Project [ss_item_sk#12,ss_net_profit#32] > :+- Filter (ss_store_sk#17 = 4) > : +- INPUT > +- Scan ParquetRelation: > hadoopds1g.store_sales[ss_item_sk#12,ss_net_profit#32,ss_store_sk#17] > InputPaths: > hdfs://bigaperf116.svl.ibm.com:8020/apps/hive/warehouse/hadoopds1g.db/store_sales, > PushedFilters: [EqualTo(ss_store_sk,4)] > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) > at > org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:105) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) > at org.apache.spark.sql.execution.Sort.doExecute(Sort.scala:60) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116) > at > 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) > at org.apache.spark.sql.execution.Window.doExecute(Window.scala:288) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116) > at > org.apache.spark.rdd.
[jira] [Commented] (SPARK-14616) TreeNodeException running Q44 and 58 on Parquet tables
[ https://issues.apache.org/jira/browse/SPARK-14616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241860#comment-15241860 ] JESSE CHEN commented on SPARK-14616: Build from yesterday did not have this problem. Closing. > TreeNodeException running Q44 and 58 on Parquet tables > -- > > Key: SPARK-14616 > URL: https://issues.apache.org/jira/browse/SPARK-14616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN > > {code:title=tpcds q44} > select asceding.rnk, i1.i_product_name best_performing, i2.i_product_name > worst_performing > from(select * > from (select item_sk,rank() over (order by rank_col asc) rnk >from (select ss_item_sk item_sk,avg(ss_net_profit) rank_col > from store_sales ss1 > where ss_store_sk = 4 > group by ss_item_sk > having avg(ss_net_profit) > 0.9*(select avg(ss_net_profit) > rank_col > from store_sales > where ss_store_sk = 4 > and ss_addr_sk is null > group by ss_store_sk))V1)V11 > where rnk < 11) asceding, > (select * > from (select item_sk,rank() over (order by rank_col desc) rnk >from (select ss_item_sk item_sk,avg(ss_net_profit) rank_col > from store_sales ss1 > where ss_store_sk = 4 > group by ss_item_sk > having avg(ss_net_profit) > 0.9*(select avg(ss_net_profit) > rank_col > from store_sales > where ss_store_sk = 4 > and ss_addr_sk is null > group by ss_store_sk))V2)V21 > where rnk < 11) descending, > item i1, > item i2 > where asceding.rnk = descending.rnk > and i1.i_item_sk=asceding.item_sk > and i2.i_item_sk=descending.item_sk > order by asceding.rnk > limit 100; > {code} > {noformat} > bin/spark-sql --driver-memory 10g --verbose --master yarn-client --packages > com.databricks:spark-csv_2.10:1.3.0 --executor-memory 4g --num-executors 80 > --executor-cores 2 --database hadoopds1g -f q44.sql > {noformat} > {noformat} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange SinglePartition, None > +- WholeStageCodegen >: +- Project [item_sk#0,rank_col#1] >: +- Filter havingCondition#219: boolean >:+- TungstenAggregate(key=[ss_item_sk#12], > functions=[(avg(ss_net_profit#32),mode=Final,isDistinct=false)], > output=[havingCondition#219,item_sk#0,rank_col#1]) >: +- INPUT >+- Exchange hashpartitioning(ss_item_sk#12,200), None > +- WholeStageCodegen > : +- TungstenAggregate(key=[ss_item_sk#12], > functions=[(avg(ss_net_profit#32),mode=Partial,isDistinct=false)], > output=[ss_item_sk#12,sum#612,count#613L]) > : +- Project [ss_item_sk#12,ss_net_profit#32] > :+- Filter (ss_store_sk#17 = 4) > : +- INPUT > +- Scan ParquetRelation: > hadoopds1g.store_sales[ss_item_sk#12,ss_net_profit#32,ss_store_sk#17] > InputPaths: > hdfs://bigaperf116.svl.ibm.com:8020/apps/hive/warehouse/hadoopds1g.db/store_sales, > PushedFilters: [EqualTo(ss_store_sk,4)] > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) > at > org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:105) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) > at org.apache.spark.sql.execution.Sort.doExecute(Sort.scala:60) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) > at org.apache.spark.sql.execution.Window.doExecute(Window.scala:288) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118) > at > org.apache.spark.sql.execution.SparkP
[jira] [Updated] (SPARK-14616) TreeNodeException running Q44 and 58 on Parquet tables
[ https://issues.apache.org/jira/browse/SPARK-14616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14616: --- Description: {code:title=tpcds q44} select asceding.rnk, i1.i_product_name best_performing, i2.i_product_name worst_performing from(select * from (select item_sk,rank() over (order by rank_col asc) rnk from (select ss_item_sk item_sk,avg(ss_net_profit) rank_col from store_sales ss1 where ss_store_sk = 4 group by ss_item_sk having avg(ss_net_profit) > 0.9*(select avg(ss_net_profit) rank_col from store_sales where ss_store_sk = 4 and ss_addr_sk is null group by ss_store_sk))V1)V11 where rnk < 11) asceding, (select * from (select item_sk,rank() over (order by rank_col desc) rnk from (select ss_item_sk item_sk,avg(ss_net_profit) rank_col from store_sales ss1 where ss_store_sk = 4 group by ss_item_sk having avg(ss_net_profit) > 0.9*(select avg(ss_net_profit) rank_col from store_sales where ss_store_sk = 4 and ss_addr_sk is null group by ss_store_sk))V2)V21 where rnk < 11) descending, item i1, item i2 where asceding.rnk = descending.rnk and i1.i_item_sk=asceding.item_sk and i2.i_item_sk=descending.item_sk order by asceding.rnk limit 100; {code} {noformat} bin/spark-sql --driver-memory 10g --verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 --executor-memory 4g --num-executors 80 --executor-cores 2 --database hadoopds1g -f q44.sql {noformat} {noformat} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: Exchange SinglePartition, None +- WholeStageCodegen : +- Project [item_sk#0,rank_col#1] : +- Filter havingCondition#219: boolean :+- TungstenAggregate(key=[ss_item_sk#12], functions=[(avg(ss_net_profit#32),mode=Final,isDistinct=false)], output=[havingCondition#219,item_sk#0,rank_col#1]) : +- INPUT +- Exchange hashpartitioning(ss_item_sk#12,200), None +- WholeStageCodegen : +- TungstenAggregate(key=[ss_item_sk#12], functions=[(avg(ss_net_profit#32),mode=Partial,isDistinct=false)], output=[ss_item_sk#12,sum#612,count#613L]) : +- Project [ss_item_sk#12,ss_net_profit#32] :+- Filter (ss_store_sk#17 = 4) : +- INPUT +- Scan ParquetRelation: hadoopds1g.store_sales[ss_item_sk#12,ss_net_profit#32,ss_store_sk#17] InputPaths: hdfs://bigaperf116.svl.ibm.com:8020/apps/hive/warehouse/hadoopds1g.db/store_sales, PushedFilters: [EqualTo(ss_store_sk,4)] at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:105) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at org.apache.spark.sql.execution.Sort.doExecute(Sort.scala:60) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at org.apache.spark.sql.execution.Window.doExecute(Window.scala:288) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116) at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at org.apache.spark.sql.execution.InputAdapter.upstream(WholeStageCodegen.scala:176) at org.apache.spark.sql.execution.Filter.upstream(basicOperators.scala:73) at org.apache.spark.sql.execution.Project.upstream(basicOperators.scala:35) at org.apache.spark.sql.execution.WholeStageCodegen.doExecute(WholeStageCodegen.scala:279) at org.apache.spark.sql.execution.SparkPlan$$anonfun$exec
[jira] [Updated] (SPARK-14616) TreeNodeException running Q44 and 58 on Parquet tables
[ https://issues.apache.org/jira/browse/SPARK-14616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14616: --- Environment: (was: spark 1.5.1 (official binary distribution) running on hadoop yarn 2.6 with parquet 1.5.0 (both from cdh5.4.8)) > TreeNodeException running Q44 and 58 on Parquet tables > -- > > Key: SPARK-14616 > URL: https://issues.apache.org/jira/browse/SPARK-14616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN > > {code:title=/tmp/bug.py} > from pyspark import SparkContext > from pyspark.sql import SQLContext, Row > sc = SparkContext() > sqlc = SQLContext(sc) > R = Row('id', 'foo') > r = sqlc.createDataFrame(sc.parallelize([R('abc', 'foo')])) > q = sqlc.createDataFrame(sc.parallelize([R('', > 'bar')])) > q.write.parquet('/tmp/1.parq') > q = sqlc.read.parquet('/tmp/1.parq') > j = r.join(q, r.id == q.id) > print j.count() > {code} > {noformat} > [user@sandbox test]$ spark-submit --executor-memory=32g /tmp/bug.py > [user@sandbox test]$ hadoop fs -rmr /tmp/1.parq > {noformat} > {noformat} > 15/11/04 04:28:38 INFO codegen.GenerateUnsafeProjection: Code generated in > 119.90324 ms > Traceback (most recent call last): > File "/tmp/bug.py", line 13, in > print j.count() > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line > 268, in count > File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 538, in __call__ > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, > in deco > File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", > line 300, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o148.count. > : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, > tree: > TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], > output=[count#10L]) > TungstenExchange SinglePartition > TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], > output=[currentCount#13L]) >TungstenProject > BroadcastHashJoin [id#0], [id#8], BuildRight > TungstenProject [id#0] > Scan PhysicalRDD[id#0,foo#1] > ConvertToUnsafe > Scan ParquetRelation[hdfs:///tmp/1.parq][id#8] > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385) > at > org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384) > at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {noformat} > Note this happens only under the following conditions: > # executor memory >= 32GB (doesn't fail with up to 31 GB) > # the ID in the q dataframe has exactly 24 chars (doesn't fail with less or > more than 24 chars) > # q is read from parquet -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
[jira] [Updated] (SPARK-14616) TreeNodeException running Q44 and 58 on Parquet tables
[ https://issues.apache.org/jira/browse/SPARK-14616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14616: --- Affects Version/s: (was: 1.5.1) 2.0.0 > TreeNodeException running Q44 and 58 on Parquet tables > -- > > Key: SPARK-14616 > URL: https://issues.apache.org/jira/browse/SPARK-14616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: spark 1.5.1 (official binary distribution) running on > hadoop yarn 2.6 with parquet 1.5.0 (both from cdh5.4.8) >Reporter: JESSE CHEN > > {code:title=/tmp/bug.py} > from pyspark import SparkContext > from pyspark.sql import SQLContext, Row > sc = SparkContext() > sqlc = SQLContext(sc) > R = Row('id', 'foo') > r = sqlc.createDataFrame(sc.parallelize([R('abc', 'foo')])) > q = sqlc.createDataFrame(sc.parallelize([R('', > 'bar')])) > q.write.parquet('/tmp/1.parq') > q = sqlc.read.parquet('/tmp/1.parq') > j = r.join(q, r.id == q.id) > print j.count() > {code} > {noformat} > [user@sandbox test]$ spark-submit --executor-memory=32g /tmp/bug.py > [user@sandbox test]$ hadoop fs -rmr /tmp/1.parq > {noformat} > {noformat} > 15/11/04 04:28:38 INFO codegen.GenerateUnsafeProjection: Code generated in > 119.90324 ms > Traceback (most recent call last): > File "/tmp/bug.py", line 13, in > print j.count() > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line > 268, in count > File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 538, in __call__ > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, > in deco > File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", > line 300, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o148.count. 
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, > tree: > TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], > output=[count#10L]) > TungstenExchange SinglePartition > TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], > output=[currentCount#13L]) >TungstenProject > BroadcastHashJoin [id#0], [id#8], BuildRight > TungstenProject [id#0] > Scan PhysicalRDD[id#0,foo#1] > ConvertToUnsafe > Scan ParquetRelation[hdfs:///tmp/1.parq][id#8] > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385) > at > org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384) > at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {noformat} > Note this happens only under the following conditions: > # executor memory >= 32GB (doesn't fail with up to 31 GB) > # the ID in the q dataframe has exactly 24 chars (doesn't fail with less or > more than 24 chars) > # q is read from parquet -- This message was sent by Atlassian JIRA (v6.3.4#6332) ---
[jira] [Created] (SPARK-14616) TreeNodeException running Q44 and 58 on Parquet tables
JESSE CHEN created SPARK-14616: -- Summary: TreeNodeException running Q44 and 58 on Parquet tables Key: SPARK-14616 URL: https://issues.apache.org/jira/browse/SPARK-14616 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1 Environment: spark 1.5.1 (official binary distribution) running on hadoop yarn 2.6 with parquet 1.5.0 (both from cdh5.4.8) Reporter: JESSE CHEN {code:title=/tmp/bug.py} from pyspark import SparkContext from pyspark.sql import SQLContext, Row sc = SparkContext() sqlc = SQLContext(sc) R = Row('id', 'foo') r = sqlc.createDataFrame(sc.parallelize([R('abc', 'foo')])) q = sqlc.createDataFrame(sc.parallelize([R('', 'bar')])) q.write.parquet('/tmp/1.parq') q = sqlc.read.parquet('/tmp/1.parq') j = r.join(q, r.id == q.id) print j.count() {code} {noformat} [user@sandbox test]$ spark-submit --executor-memory=32g /tmp/bug.py [user@sandbox test]$ hadoop fs -rmr /tmp/1.parq {noformat} {noformat} 15/11/04 04:28:38 INFO codegen.GenerateUnsafeProjection: Code generated in 119.90324 ms Traceback (most recent call last): File "/tmp/bug.py", line 13, in print j.count() File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 268, in count File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, in deco File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o148.count. : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#10L]) TungstenExchange SinglePartition TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#13L]) TungstenProject BroadcastHashJoin [id#0], [id#8], BuildRight TungstenProject [id#0] Scan PhysicalRDD[id#0,foo#1] ConvertToUnsafe Scan ParquetRelation[hdfs:///tmp/1.parq][id#8] at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) at org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:69) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:174) at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385) at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903) at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384) at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at 
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {noformat} Note this happens only under the following conditions: # executor memory >= 32GB (doesn't fail with up to 31 GB) # the ID in the q dataframe has exactly 24 chars (doesn't fail with less or more than 24 chars) # q is read from parquet -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
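A hedged observation on condition 1: 32 GB is exactly the heap size at which HotSpot disables compressed oops, which changes pointer and object layout. Whether that is the trigger here is unconfirmed, but it suggests keeping executor heaps just under the boundary as a stopgap:
{code:python}
# Stopgap sketch only: stay below the 32 GB compressed-oops boundary, which
# matches the ">= 32GB" repro condition; the causal link is an assumption.
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.executor.memory", "31g")
sc = SparkContext(conf=conf)
{code}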
[jira] [Commented] (SPARK-13860) TPCDS query 39 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235649#comment-15235649 ] JESSE CHEN commented on SPARK-13860: [~tsuresh] Could you advise on the correct course of action here, please? > TPCDS query 39 returns wrong results compared to TPC official result set > - > > Key: SPARK-13860 > URL: https://issues.apache.org/jira/browse/SPARK-13860 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 39 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > q39a - 3 extra rows in SparkSQL output (eg. > [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733]) ; q39b > - 3 extra rows in SparkSQL output (eg. > [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733]) > Actual results 39a: > {noformat} > [1,265,1,324.75,1.2438391781531353,1,265,2,329.0,1.0151581328149208] > [1,363,1,499.5,1.031941572270649,1,363,2,321.0,1.1411766752007977] > [1,679,1,373.75,1.0955498064867504,1,679,2,417.5,1.042970994259454] > [1,695,1,450.75,1.0835888283564505,1,695,2,368.75,1.1356494125569416] > [1,789,1,357.25,1.03450938027956,1,789,2,410.0,1.0284221852702604] > [1,815,1,216.5,1.1702270938111008,1,815,2,150.5,1.3057281471249382] > [1,827,1,271.75,1.1046890134130438,1,827,2,424.75,1.1653198631238286] > [1,1041,1,382.5,1.284808399803008,1,1041,2,424.75,1.000577271456812] > [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733] > [1,1569,1,212.0,1.630213519639535,1,1569,2,239.25,1.2641513267800557] > [1,1623,1,338.25,1.1285483279713715,1,1623,2,261.3,1.2717809002195564] > [1,2581,1,448.5,1.060429041250449,1,2581,2,476.25,1.0362984739390064] > [1,2705,1,246.25,1.0120308357959693,1,2705,2,294.7,1.0742134101583702] > [1,3131,1,393.75,1.0037613982687346,1,3131,2,480.5,1.0669144981482768] > [1,3291,1,374.5,1.195189833087008,1,3291,2,265.25,1.572972106948466] > [1,3687,1,279.75,1.4260909081999698,1,3687,2,157.25,1.4534340882531784] > [1,4955,1,495.25,1.0318296151625301,1,4955,2,322.5,1.1693842343776149] > [1,5627,1,282.75,1.5657032366359889,1,5627,2,297.5,1.2084286841430678] > [1,7017,1,175.5,1.0427454215644427,1,7017,2,321.3,1.0183356932936254] > [1,7317,1,366.3,1.025466403613547,1,7317,2,378.0,1.2172513189920555] > [1,7569,1,430.5,1.0874396852180854,1,7569,2,360.25,1.047005559314515] > [1,7999,1,166.25,1.7924231710846223,1,7999,2,375.3,1.008092263550718] > [1,8319,1,306.75,1.1615378040478215,1,8319,2,276.0,1.1420996385609428] > [1,8443,1,327.75,1.256718374192724,1,8443,2,332.5,1.0044167259988928] > [1,8583,1,319.5,1.024108893111539,1,8583,2,310.25,1.2358813775861328] > [1,8591,1,398.0,1.1478168692042447,1,8591,2,355.75,1.0024472149348966] > [1,8611,1,300.5,1.5191545184147954,1,8611,2,243.75,1.2342122780960432] > [1,9081,1,367.0,1.0878932141280895,1,9081,2,435.0,1.0330530776324107] > [1,9357,1,351.7,1.1902922622025887,1,9357,2,427.0,1.0438583026358363] > [1,9449,1,406.25,1.0183183104803557,1,9449,2,175.0,1.0544779796296408] > [1,9713,1,242.5,1.1035044355064203,1,9713,2,393.0,1.208474608738988] > [1,9809,1,479.0,1.0189602512117633,1,9809,2,317.5,1.0614142074924882] > [1,9993,1,417.75,1.0099832672435247,1,9993,2,204.5,1.552870745350107] > [1,10127,1,239.75,1.0561770587198123,1,10127,2,359.25,1.1857980403742183] > [1,11159,1,407.25,1.0785507154337637,1,11159,2,250.0,1.334757905639321] > [1,11277,1,211.25,1.2615858275316627,1,11277,2,330.75,1.0808767951625093] > 
[1,11937,1,344.5,1.085804026843784,1,11937,2,200.34,1.0638527063883725] > [1,12373,1,387.75,1.1014904822941258,1,12373,2,306.0,1.0761744390394028] > [1,12471,1,365.25,1.0607570183728479,1,12471,2,327.25,1.0547560580567852] > [1,12625,1,279.0,1.3016560542373208,1,12625,2,443.25,1.0604958838068959] > [1,12751,1,280.75,1.10833057888089,1,12751,2,369.3,1.3416504398884601] > [1,12779,1,331.0,1.041690207320035,1,12779,2,359.0,1.028978056175258] > [1,13077,1,367.7,1.345523904195734,1,13077,2,358.7,1.5132429058096555] > [1,13191,1,260.25,1.063569632291568,1,13191,2,405.0,1.0197999172180061] > [1,13561,1,335.25,1.2609616961776389,1,13561,2,240.0,1.0513604502245155] > [1,13935,1,311.75,1.0399289695412326,1,13935,2,275.0,1.0367527180321774] > [1,14687,1,358.0,1.4369356919381713,1,14687,2,187.0,1.5493631531474956] > [1,14719,1,209.0,1.0411509639707628,1,14719,2,489.0,1.376616882800804] > [1,15345,1,148.5,1.5295784035794024,1,15345,2,246.5,1.5087987747231526] > [1,15427,1,482.75,1.0124238928335043,1,15427,2,333.25,1.2724770126308678] > [1,15647,1,201.66,1.2857931876095743
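A hedged note on the extra rows: each carries a NaN column, and Spark SQL treats NaN as larger than any other numeric value (with NaN = NaN true), so a NaN coefficient of variation passes a greater-than filter that engines with IEEE/ANSI comparison semantics would drop. That would account for rows like [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733]. A small illustration:
{code:python}
# Spark keeps the NaN row under its NaN-is-largest ordering; a strict IEEE
# comparison would return no rows here. Data is illustrative.
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext()
sqlc = SQLContext(sc)
sqlc.createDataFrame([Row(cov=float('nan')), Row(cov=1.2)]) \
    .registerTempTable("t")
sqlc.sql("select * from t where cov > 1.5").show()
{code}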
[jira] [Closed] (SPARK-13307) TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
[ https://issues.apache.org/jira/browse/SPARK-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN closed SPARK-13307. -- Resolution: Fixed Fix Version/s: 2.0.0 Thanks. > TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1 > - > > Key: SPARK-13307 > URL: https://issues.apache.org/jira/browse/SPARK-13307 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Fix For: 2.0.0 > > > The majority of the TPCDS queries ran faster in 1.6.0 than in 1.4.1, on average > about 9% faster. A few degraded, and one that is definitely not > within the error margin is query 66. > Query 66 in 1.4.1: 699 seconds > Query 66 in 1.6.0: 918 seconds > 30% worse. > Collected the physical plans from both versions - the drastic difference may come > partially from using Tungsten in 1.6, but is anything else at play here? > Please see plans here: > https://ibm.box.com/spark-sql-q66-debug-160plan > https://ibm.box.com/spark-sql-q66-debug-141plan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13307) TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
[ https://issues.apache.org/jira/browse/SPARK-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235635#comment-15235635 ] JESSE CHEN commented on SPARK-13307: Performance is back on track in Spark 2.0. Closing this. > TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1 > - > > Key: SPARK-13307 > URL: https://issues.apache.org/jira/browse/SPARK-13307 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > > The majority of the TPCDS queries ran faster in 1.6.0 than in 1.4.1, on average > about 9% faster. A few degraded, and one that is definitely not > within the error margin is query 66. > Query 66 in 1.4.1: 699 seconds > Query 66 in 1.6.0: 918 seconds > 30% worse. > Collected the physical plans from both versions - the drastic difference may come > partially from using Tungsten in 1.6, but is anything else at play here? > Please see plans here: > https://ibm.box.com/spark-sql-q66-debug-160plan > https://ibm.box.com/spark-sql-q66-debug-141plan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang
[ https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN closed SPARK-14318. -- Resolution: Not A Problem > TPCDS query 14 causes Spark SQL to hang > --- > > Key: SPARK-14318 > URL: https://issues.apache.org/jira/browse/SPARK-14318 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: JESSE CHEN > Labels: hangs > Attachments: threaddump-1459461915668.tdump > > > TPCDS Q14 parses successfully, and plans created successfully. Spark tries to > run (I used only 1GB text file), but "hangs". Tasks are extremely slow to > process AND all CPUs are used 100% by the executor JVMs. > It is very easy to reproduce: > 1. Use the spark-sql CLI to run the query 14 (TPCDS) against a database of > 1GB text file (assuming you know how to generate the csv data). My command is > like this: > {noformat} > /TestAutomation/downloads/spark-master/bin/spark-sql --driver-memory 10g > --verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 > --executor-memory 8g --num-executors 4 --executor-cores 4 --conf > spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f > q14.out > {noformat} > The Spark console output: > {noformat} > 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Starting task 26.0 in stage > 17.0 (TID 65, bigaperf138.svl.ibm.com, partition 26,RACK_LOCAL, 4515 bytes) > 16/03/31 15:45:37 INFO cluster.YarnClientSchedulerBackend: Launching task 65 > on executor id: 4 hostname: bigaperf138.svl.ibm.com. > 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Finished task 23.0 in stage > 17.0 (TID 62) in 829687 ms on bigaperf138.svl.ibm.com (15/200) > 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Starting task 27.0 in stage > 17.0 (TID 66, bigaperf138.svl.ibm.com, partition 27,RACK_LOCAL, 4515 bytes) > 16/03/31 15:45:52 INFO cluster.YarnClientSchedulerBackend: Launching task 66 > on executor id: 4 hostname: bigaperf138.svl.ibm.com. > 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Finished task 26.0 in stage > 17.0 (TID 65) in 15505 ms on bigaperf138.svl.ibm.com (16/200) > 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Starting task 28.0 in stage > 17.0 (TID 67, bigaperf138.svl.ibm.com, partition 28,RACK_LOCAL, 4515 bytes) > 16/03/31 15:46:17 INFO cluster.YarnClientSchedulerBackend: Launching task 67 > on executor id: 4 hostname: bigaperf138.svl.ibm.com. > 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Finished task 27.0 in stage > 17.0 (TID 66) in 24929 ms on bigaperf138.svl.ibm.com (17/200) > 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Starting task 29.0 in stage > 17.0 (TID 68, bigaperf137.svl.ibm.com, partition 29,NODE_LOCAL, 4515 bytes) > 16/03/31 15:51:53 INFO cluster.YarnClientSchedulerBackend: Launching task 68 > on executor id: 2 hostname: bigaperf137.svl.ibm.com. > 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Finished task 10.0 in stage > 17.0 (TID 47) in 3775585 ms on bigaperf137.svl.ibm.com (18/200) > {noformat} > Notice that time durations between tasks are unusually long: 2~5 minutes. 
> When looking at the Linux 'perf' tool, two top CPU consumers are: > 86.48%java [unknown] > 12.41%libjvm.so > Using the Java hotspot profiling tools, I am able to show what hotspot > methods are (top 5): > {noformat} > org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() > 46.845276 9,654,179 ms (46.8%)9,654,179 ms9,654,179 ms > 9,654,179 ms > org.apache.spark.unsafe.Platform.copyMemory() 18.631157 3,848,442 ms > (18.6%)3,848,442 ms3,848,442 ms3,848,442 ms > org.apache.spark.util.collection.CompactBuffer.$plus$eq() 6.8570185 > 1,418,411 ms (6.9%) 1,418,411 ms1,517,960 ms1,517,960 ms > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeValue() >4.6126328 955,495 ms (4.6%) 955,495 ms 2,153,910 ms > 2,153,910 ms > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write() > 4.581077949,930 ms (4.6%) 949,930 ms 19,967,510 ms > 19,967,510 ms > {noformat} > So as you can see, the test has been running for 1.5 hours...with 46% CPU > spent in the > org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() method. > The stacks for top two are: > {noformat} > Marshalling > I > java/io/DataOutputStream.writeInt() line 197 > org.apache.spark.sql > I > org/apache/spark/sql/execution/UnsafeRowSerializerInstance$$anon$2.writeValue() > line 60 > org.apache.spark.storage > I > org/apache/spark/storage/DiskBlockObjectWriter.write() line 185 > org.apache.spark.shuffle > I > org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.write() line 150 > org.apache.spark.sc
[jira] [Commented] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang
[ https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221943#comment-15221943 ] JESSE CHEN commented on SPARK-14318: intersect should be used. Investigating further. > TPCDS query 14 causes Spark SQL to hang > --- > > Key: SPARK-14318 > URL: https://issues.apache.org/jira/browse/SPARK-14318 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: JESSE CHEN > Labels: hangs > Attachments: threaddump-1459461915668.tdump > > > TPCDS Q14 parses successfully, and plans created successfully. Spark tries to > run (I used only 1GB text file), but "hangs". Tasks are extremely slow to > process AND all CPUs are used 100% by the executor JVMs. > It is very easy to reproduce: > 1. Use the spark-sql CLI to run the query 14 (TPCDS) against a database of > 1GB text file (assuming you know how to generate the csv data). My command is > like this: > {noformat} > /TestAutomation/downloads/spark-master/bin/spark-sql --driver-memory 10g > --verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 > --executor-memory 8g --num-executors 4 --executor-cores 4 --conf > spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f > q14.out > {noformat} > The Spark console output: > {noformat} > 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Starting task 26.0 in stage > 17.0 (TID 65, bigaperf138.svl.ibm.com, partition 26,RACK_LOCAL, 4515 bytes) > 16/03/31 15:45:37 INFO cluster.YarnClientSchedulerBackend: Launching task 65 > on executor id: 4 hostname: bigaperf138.svl.ibm.com. > 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Finished task 23.0 in stage > 17.0 (TID 62) in 829687 ms on bigaperf138.svl.ibm.com (15/200) > 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Starting task 27.0 in stage > 17.0 (TID 66, bigaperf138.svl.ibm.com, partition 27,RACK_LOCAL, 4515 bytes) > 16/03/31 15:45:52 INFO cluster.YarnClientSchedulerBackend: Launching task 66 > on executor id: 4 hostname: bigaperf138.svl.ibm.com. > 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Finished task 26.0 in stage > 17.0 (TID 65) in 15505 ms on bigaperf138.svl.ibm.com (16/200) > 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Starting task 28.0 in stage > 17.0 (TID 67, bigaperf138.svl.ibm.com, partition 28,RACK_LOCAL, 4515 bytes) > 16/03/31 15:46:17 INFO cluster.YarnClientSchedulerBackend: Launching task 67 > on executor id: 4 hostname: bigaperf138.svl.ibm.com. > 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Finished task 27.0 in stage > 17.0 (TID 66) in 24929 ms on bigaperf138.svl.ibm.com (17/200) > 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Starting task 29.0 in stage > 17.0 (TID 68, bigaperf137.svl.ibm.com, partition 29,NODE_LOCAL, 4515 bytes) > 16/03/31 15:51:53 INFO cluster.YarnClientSchedulerBackend: Launching task 68 > on executor id: 2 hostname: bigaperf137.svl.ibm.com. > 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Finished task 10.0 in stage > 17.0 (TID 47) in 3775585 ms on bigaperf137.svl.ibm.com (18/200) > {noformat} > Notice that time durations between tasks are unusually long: 2~5 minutes. 
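For reference, the INTERSECT formulation the comment above points to would shape the cross_items CTE roughly as follows. This is a sketch reconstructed from the JOIN version quoted in this issue, not the exact TPC-DS template text:
{noformat}
with cross_items as
 (select i_item_sk ss_item_sk
  from item,
       (select iss.i_brand_id brand_id, iss.i_class_id class_id, iss.i_category_id category_id
        from store_sales, item iss, date_dim d1
        where ss_item_sk = iss.i_item_sk
          and ss_sold_date_sk = d1.d_date_sk
          and d1.d_year between 1999 AND 1999 + 2
        intersect
        select ics.i_brand_id, ics.i_class_id, ics.i_category_id
        from catalog_sales, item ics, date_dim d2
        where cs_item_sk = ics.i_item_sk
          and cs_sold_date_sk = d2.d_date_sk
          and d2.d_year between 1999 AND 1999 + 2
        intersect
        select iws.i_brand_id, iws.i_class_id, iws.i_category_id
        from web_sales, item iws, date_dim d3
        where ws_item_sk = iws.i_item_sk
          and ws_sold_date_sk = d3.d_date_sk
          and d3.d_year between 1999 AND 1999 + 2) x
  where i_brand_id = x.brand_id
    and i_class_id = x.class_id
    and i_category_id = x.category_id)
{noformat}
INTERSECT deduplicates the (brand_id, class_id, category_id) sets before they are matched back against item, while the JOIN chain can multiply matching rows, which may explain the very large shuffle intermediates seen in this hang.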
[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang
[ https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14318: --- Attachment: threaddump-1459461915668.tdump here is the thread dump taken during the high CPU usage on the executor. > TPCDS query 14 causes Spark SQL to hang > --- > > Key: SPARK-14318 > URL: https://issues.apache.org/jira/browse/SPARK-14318 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: JESSE CHEN > Labels: hangs > Attachments: threaddump-1459461915668.tdump > > > TPCDS Q14 parses successfully, and plans created successfully. Spark tries to > run (I used only 1GB text file), but "hangs". Tasks are extremely slow to > process AND all CPUs are used 100% by the executor JVMs. > It is very easy to reproduce: > 1. Use the spark-sql CLI to run the query 14 (TPCDS) against a database of > 1GB text file (assuming you know how to generate the csv data). My command is > like this: > {noformat} > /TestAutomation/downloads/spark-master/bin/spark-sql --driver-memory 10g > --verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 > --executor-memory 8g --num-executors 4 --executor-cores 4 --conf > spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f > q14.out > {noformat} > The Spark console output: > {noformat} > 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Starting task 26.0 in stage > 17.0 (TID 65, bigaperf138.svl.ibm.com, partition 26,RACK_LOCAL, 4515 bytes) > 16/03/31 15:45:37 INFO cluster.YarnClientSchedulerBackend: Launching task 65 > on executor id: 4 hostname: bigaperf138.svl.ibm.com. > 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Finished task 23.0 in stage > 17.0 (TID 62) in 829687 ms on bigaperf138.svl.ibm.com (15/200) > 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Starting task 27.0 in stage > 17.0 (TID 66, bigaperf138.svl.ibm.com, partition 27,RACK_LOCAL, 4515 bytes) > 16/03/31 15:45:52 INFO cluster.YarnClientSchedulerBackend: Launching task 66 > on executor id: 4 hostname: bigaperf138.svl.ibm.com. > 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Finished task 26.0 in stage > 17.0 (TID 65) in 15505 ms on bigaperf138.svl.ibm.com (16/200) > 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Starting task 28.0 in stage > 17.0 (TID 67, bigaperf138.svl.ibm.com, partition 28,RACK_LOCAL, 4515 bytes) > 16/03/31 15:46:17 INFO cluster.YarnClientSchedulerBackend: Launching task 67 > on executor id: 4 hostname: bigaperf138.svl.ibm.com. > 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Finished task 27.0 in stage > 17.0 (TID 66) in 24929 ms on bigaperf138.svl.ibm.com (17/200) > 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Starting task 29.0 in stage > 17.0 (TID 68, bigaperf137.svl.ibm.com, partition 29,NODE_LOCAL, 4515 bytes) > 16/03/31 15:51:53 INFO cluster.YarnClientSchedulerBackend: Launching task 68 > on executor id: 2 hostname: bigaperf137.svl.ibm.com. > 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Finished task 10.0 in stage > 17.0 (TID 47) in 3775585 ms on bigaperf137.svl.ibm.com (18/200) > {noformat} > Notice that time durations between tasks are unusually long: 2~5 minutes. 
[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang
[ https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14318: --- Description: TPCDS Q14 parses successfully, and plans created successfully. Spark tries to run (I used only 1GB text file), but "hangs". Tasks are extremely slow to process AND all CPUs are used 100% by the executor JVMs. It is very easy to reproduce: 1. Use the spark-sql CLI to run the query 14 (TPCDS) against a database of 1GB text file (assuming you know how to generate the csv data). My command is like this: {noformat} /TestAutomation/downloads/spark-master/bin/spark-sql --driver-memory 10g --verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 --executor-memory 8g --num-executors 4 --executor-cores 4 --conf spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f > q14.out {noformat} The Spark console output: {noformat} 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Starting task 26.0 in stage 17.0 (TID 65, bigaperf138.svl.ibm.com, partition 26,RACK_LOCAL, 4515 bytes) 16/03/31 15:45:37 INFO cluster.YarnClientSchedulerBackend: Launching task 65 on executor id: 4 hostname: bigaperf138.svl.ibm.com. 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Finished task 23.0 in stage 17.0 (TID 62) in 829687 ms on bigaperf138.svl.ibm.com (15/200) 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Starting task 27.0 in stage 17.0 (TID 66, bigaperf138.svl.ibm.com, partition 27,RACK_LOCAL, 4515 bytes) 16/03/31 15:45:52 INFO cluster.YarnClientSchedulerBackend: Launching task 66 on executor id: 4 hostname: bigaperf138.svl.ibm.com. 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Finished task 26.0 in stage 17.0 (TID 65) in 15505 ms on bigaperf138.svl.ibm.com (16/200) 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Starting task 28.0 in stage 17.0 (TID 67, bigaperf138.svl.ibm.com, partition 28,RACK_LOCAL, 4515 bytes) 16/03/31 15:46:17 INFO cluster.YarnClientSchedulerBackend: Launching task 67 on executor id: 4 hostname: bigaperf138.svl.ibm.com. 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Finished task 27.0 in stage 17.0 (TID 66) in 24929 ms on bigaperf138.svl.ibm.com (17/200) 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Starting task 29.0 in stage 17.0 (TID 68, bigaperf137.svl.ibm.com, partition 29,NODE_LOCAL, 4515 bytes) 16/03/31 15:51:53 INFO cluster.YarnClientSchedulerBackend: Launching task 68 on executor id: 2 hostname: bigaperf137.svl.ibm.com. 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 17.0 (TID 47) in 3775585 ms on bigaperf137.svl.ibm.com (18/200) {noformat} Notice that time durations between tasks are unusually long: 2~5 minutes. 
When looking at the Linux 'perf' tool, two top CPU consumers are:
86.48% java [unknown]
12.41% libjvm.so
Using the Java hotspot profiling tools, I am able to show what hotspot methods are (top 5):
{noformat}
org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten()               46.845276   9,654,179 ms (46.8%)    9,654,179 ms    9,654,179 ms    9,654,179 ms
org.apache.spark.unsafe.Platform.copyMemory()                                     18.631157   3,848,442 ms (18.6%)    3,848,442 ms    3,848,442 ms    3,848,442 ms
org.apache.spark.util.collection.CompactBuffer.$plus$eq()                          6.8570185  1,418,411 ms (6.9%)     1,418,411 ms    1,517,960 ms    1,517,960 ms
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeValue()    4.6126328    955,495 ms (4.6%)       955,495 ms    2,153,910 ms    2,153,910 ms
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write()                 4.5810779    949,930 ms (4.6%)       949,930 ms   19,967,510 ms   19,967,510 ms
{noformat}
So as you can see, the test has been running for 1.5 hours...with 46% CPU spent in the org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() method. The stacks for top two are:
{noformat}
Marshalling
I java/io/DataOutputStream.writeInt() line 197
org.apache.spark.sql
I org/apache/spark/sql/execution/UnsafeRowSerializerInstance$$anon$2.writeValue() line 60
org.apache.spark.storage
I org/apache/spark/storage/DiskBlockObjectWriter.write() line 185
org.apache.spark.shuffle
I org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.write() line 150
org.apache.spark.scheduler
I org/apache/spark/scheduler/ShuffleMapTask.runTask() line 78
I org/apache/spark/scheduler/ShuffleMapTask.runTask() line 46
I org/apache/spark/scheduler/Task.run() line 82
org.apache.spark.executor
I org/apache/spark/executor/Executor$TaskRunner.run() line 231
Dispatching Overhead, Standard Library Worker Dispatching
I java/util/concurrent/ThreadPoolExecutor.runWorker() line 1142
I java/util/concurrent/ThreadPoolExecutor$Worker.run() line 617
I java/lang/Thread.run() line 745
{noformat}
and
{noformat}
org.apache.spark.unsafe
I org/apache/spark/u
[jira] [Commented] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang
[ https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220864#comment-15220864 ] JESSE CHEN commented on SPARK-14318: Q14 is as follows:
{noformat}
with cross_items as
 (select i_item_sk ss_item_sk
  from item
  JOIN (select brand_id, class_id, category_id
        from (select iss.i_brand_id brand_id, iss.i_class_id class_id, iss.i_category_id category_id
              from store_sales, item iss, date_dim d1
              where ss_item_sk = iss.i_item_sk
                and ss_sold_date_sk = d1.d_date_sk
                and d1.d_year between 1999 AND 1999 + 2) x1
        JOIN (select ics.i_brand_id, ics.i_class_id, ics.i_category_id
              from catalog_sales, item ics, date_dim d2
              where cs_item_sk = ics.i_item_sk
                and cs_sold_date_sk = d2.d_date_sk
                and d2.d_year between 1999 AND 1999 + 2) x2
          ON x1.brand_id = x2.i_brand_id
         and x1.class_id = x2.i_class_id
         and x1.category_id = x2.i_category_id
        JOIN (select iws.i_brand_id, iws.i_class_id, iws.i_category_id
              from web_sales, item iws, date_dim d3
              where ws_item_sk = iws.i_item_sk
                and ws_sold_date_sk = d3.d_date_sk
                and d3.d_year between 1999 AND 1999 + 2) x3
          ON x1.brand_id = x3.i_brand_id
         and x1.class_id = x3.i_class_id
         and x1.category_id = x3.i_category_id
       ) x4
  where i_brand_id = x4.brand_id
    and i_class_id = x4.class_id
    and i_category_id = x4.category_id),
avg_sales as
 (select avg(quantity*list_price) average_sales
  from (select ss_quantity quantity, ss_list_price list_price
        from store_sales, date_dim
        where ss_sold_date_sk = d_date_sk and d_year between 1999 and 1999 + 2
        union all
        select cs_quantity quantity, cs_list_price list_price
        from catalog_sales, date_dim
        where cs_sold_date_sk = d_date_sk and d_year between 1999 and 1999 + 2
        union all
        select ws_quantity quantity, ws_list_price list_price
        from web_sales, date_dim
        where ws_sold_date_sk = d_date_sk and d_year between 1999 and 1999 + 2) x)
select * from
 (select 'store' channel, i_brand_id, i_class_id, i_category_id,
         sum(ss1.ss_quantity*ss1.ss_list_price) sales, count(*) number_sales
  from store_sales ss1
  JOIN item ON ss1.ss_item_sk = i_item_sk
  JOIN date_dim dd1 ON ss1.ss_sold_date_sk = dd1.d_date_sk
  JOIN cross_items ON ss1.ss_item_sk = cross_items.ss_item_sk
  JOIN avg_sales
  JOIN date_dim dd2 ON dd1.d_week_seq = dd2.d_week_seq
  where dd2.d_year = 1999 + 1 and dd2.d_moy = 12 and dd2.d_dom = 11
  group by average_sales, i_brand_id, i_class_id, i_category_id
  having sum(ss1.ss_quantity*ss1.ss_list_price) > avg_sales.average_sales) this_year,
 (select 'store' channel, i_brand_id, i_class_id, i_category_id,
         sum(ss1.ss_quantity*ss1.ss_list_price) sales, count(*) number_sales
  from store_sales ss1
  JOIN item ON ss1.ss_item_sk = i_item_sk
  JOIN date_dim dd1 ON ss1.ss_sold_date_sk = dd1.d_date_sk
  JOIN cross_items ON ss1.ss_item_sk = cross_items.ss_item_sk
  JOIN avg_sales
  JOIN date_dim dd2 ON dd1.d_week_seq = dd2.d_week_seq
  where dd2.d_year = 1999 and dd2.d_moy = 12 and dd2.d_dom = 11
  group by average_sales, i_brand_id, i_class_id, i_category_id
  having sum(ss1.ss_quantity*ss1.ss_list_price) > avg_sales.average_sales) last_year
where this_year.i_brand_id = last_year.i_brand_id
  and this_year.i_class_id = last_year.i_class_id
  and this_year.i_category_id = last_year.i_category_id
order by this_year.channel, this_year.i_brand_id, this_year.i_class_id, this_year.i_category_id
limit 100
{noformat}
> TPCDS query 14 causes Spark SQL to hang > --- > > Key: SPARK-14318 > URL: https://issues.apache.org/jira/browse/SPARK-14318 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: JESSE CHEN > 
Labels: hangs > > TPCDS Q14 parses successfully, and plans created successfully. Spark tries to > run (I used only 1GB text file), but "hangs". Tasks are extremely slow to > process AND all CPUs are used 100% by the executor JVMs. > It is very easy to reproduce: > 1. Use the spark-sql CLI to run the query 14 (TPCDS) against a database of > 1GB text file (assuming you know how to generate the csv data). My command is > like this: > {noformat} > /TestAutomation/downloads/spark-master/bin/spark-sql --driver-memory 10g > --verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 > --executor-memory 8g --num-executors 4 --executor-cores 4 --conf > spark.sql.join.preferSortMergeJoin=true --database hadoopds
[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang
[ https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14318: --- Labels: hangs (was: tpcds-result-mismatch) > TPCDS query 14 causes Spark SQL to hang > --- > > Key: SPARK-14318 > URL: https://issues.apache.org/jira/browse/SPARK-14318 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: JESSE CHEN > Labels: hangs > > Testing Spark SQL using TPC queries. Query 21 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL missing at least one row (grep for ABDA) ; I believe 2 > other rows are missing as well. > Actual results: > {noformat} > [null,AABD,2565,1922] > [null,AAHD,2956,2052] > [null,AALA,2042,1793] > [null,ACGC,2373,1771] > [null,ACKC,2321,1856] > [null,ACOB,1504,1397] > [null,ADKB,1820,2163] > [null,AEAD,2631,1965] > [null,AEOC,1659,1798] > [null,AFAC,1965,1705] > [null,AFAD,1769,1313] > [null,AHDE,2700,1985] > [null,AHHA,1578,1082] > [null,AIEC,1756,1804] > [null,AIMC,3603,2951] > [null,AJAC,2109,1989] > [null,AJKB,2573,3540] > [null,ALBE,3458,2992] > [null,ALCE,1720,1810] > [null,ALEC,2569,1946] > [null,ALNB,2552,1750] > [null,ANFE,2022,2269] > [null,AOIB,2982,2540] > [null,APJB,2344,2593] > [null,BAPD,2182,2787] > [null,BDCE,2844,2069] > [null,BDDD,2417,2537] > [null,BDJA,1584,1666] > [null,BEOD,2141,2649] > [null,BFCC,2745,2020] > [null,BFMB,1642,1364] > [null,BHPC,1923,1780] > [null,BIDB,1956,2836] > [null,BIGB,2023,2344] > [null,BIJB,1977,2728] > [null,BJFE,1891,2390] > [null,BLDE,1983,1797] > [null,BNID,2485,2324] > [null,BNLD,2385,2786] > [null,BOMB,2291,2092] > [null,CAAA,2233,2560] > [null,CBCD,1540,2012] > [null,CBIA,2394,2122] > [null,CBPB,1790,1661] > [null,CCMD,2654,2691] > [null,CDBC,1804,2072] > [null,CFEA,1941,1567] > [null,CGFD,2123,2265] > [null,CHPC,2933,2174] > [null,CIGD,2618,2399] > [null,CJCB,2728,2367] > [null,CJLA,1350,1732] > [null,CLAE,2578,2329] > [null,CLGA,1842,1588] > [null,CLLB,3418,2657] > [null,CLOB,3115,2560] > [null,CMAD,1991,2243] > [null,CMJA,1261,1855] > [null,CMLA,3288,2753] > [null,CMPD,1320,1676] > [null,CNGB,2340,2118] > [null,CNHD,3519,3348] > [null,CNPC,2561,1948] > [null,DCPC,2664,2627] > [null,DDHA,1313,1926] > [null,DDND,1109,835] > [null,DEAA,2141,1847] > [null,DEJA,3142,2723] > [null,DFKB,1470,1650] > [null,DGCC,2113,2331] > [null,DGFC,2201,2928] > [null,DHPA,2467,2133] > [null,DMBA,3085,2087] > [null,DPAB,3494,3081] > [null,EAEC,2133,2148] > [null,EAPA,1560,1275] > [null,ECGC,2815,3307] > [null,EDPD,2731,1883] > [null,EEEC,2024,1902] > [null,EEMC,2624,2387] > [null,EFFA,2047,1878] > [null,EGJA,2403,2633] > [null,EGMA,2784,2772] > [null,EGOC,2389,1753] > [null,EHFD,1940,1420] > [null,EHLB,2320,2057] > [null,EHPA,1898,1853] > [null,EIPB,2930,2326] > [null,EJAE,2582,1836] > [null,EJIB,2257,1681] > [null,EJJA,2791,1941] > [null,EJJD,3410,2405] > [null,EJNC,2472,2067] > [null,EJPD,1219,1229] > [null,EKEB,2047,1713] > [null,EMEA,2502,1897] > [null,EMKC,2362,2042] > [null,ENAC,2011,1909] > [null,ENFB,2507,2162] > [null,ENOD,3371,2709] > {noformat} > Expected results: > {noformat} > +--+--++---+ > | W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER | > +--+--++---+ > | Bad cards must make. | AACD | 1889 | 2168 | > | Bad cards must make. | AAHD | 2739 | 2039 | > | Bad cards must make. | ABDA | 1717 | 1782 | > | Bad cards must mak
[jira] [Created] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang
JESSE CHEN created SPARK-14318: -- Summary: TPCDS query 14 causes Spark SQL to hang Key: SPARK-14318 URL: https://issues.apache.org/jira/browse/SPARK-14318 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: JESSE CHEN Testing Spark SQL using TPC queries. Query 21 returns wrong results compared to official result set. This is at 1GB SF (validation run). SparkSQL missing at least one row (grep for ABDA) ; I believe 2 other rows are missing as well. Actual results: {noformat} [null,AABD,2565,1922] [null,AAHD,2956,2052] [null,AALA,2042,1793] [null,ACGC,2373,1771] [null,ACKC,2321,1856] [null,ACOB,1504,1397] [null,ADKB,1820,2163] [null,AEAD,2631,1965] [null,AEOC,1659,1798] [null,AFAC,1965,1705] [null,AFAD,1769,1313] [null,AHDE,2700,1985] [null,AHHA,1578,1082] [null,AIEC,1756,1804] [null,AIMC,3603,2951] [null,AJAC,2109,1989] [null,AJKB,2573,3540] [null,ALBE,3458,2992] [null,ALCE,1720,1810] [null,ALEC,2569,1946] [null,ALNB,2552,1750] [null,ANFE,2022,2269] [null,AOIB,2982,2540] [null,APJB,2344,2593] [null,BAPD,2182,2787] [null,BDCE,2844,2069] [null,BDDD,2417,2537] [null,BDJA,1584,1666] [null,BEOD,2141,2649] [null,BFCC,2745,2020] [null,BFMB,1642,1364] [null,BHPC,1923,1780] [null,BIDB,1956,2836] [null,BIGB,2023,2344] [null,BIJB,1977,2728] [null,BJFE,1891,2390] [null,BLDE,1983,1797] [null,BNID,2485,2324] [null,BNLD,2385,2786] [null,BOMB,2291,2092] [null,CAAA,2233,2560] [null,CBCD,1540,2012] [null,CBIA,2394,2122] [null,CBPB,1790,1661] [null,CCMD,2654,2691] [null,CDBC,1804,2072] [null,CFEA,1941,1567] [null,CGFD,2123,2265] [null,CHPC,2933,2174] [null,CIGD,2618,2399] [null,CJCB,2728,2367] [null,CJLA,1350,1732] [null,CLAE,2578,2329] [null,CLGA,1842,1588] [null,CLLB,3418,2657] [null,CLOB,3115,2560] [null,CMAD,1991,2243] [null,CMJA,1261,1855] [null,CMLA,3288,2753] [null,CMPD,1320,1676] [null,CNGB,2340,2118] [null,CNHD,3519,3348] [null,CNPC,2561,1948] [null,DCPC,2664,2627] [null,DDHA,1313,1926] [null,DDND,1109,835] [null,DEAA,2141,1847] [null,DEJA,3142,2723] [null,DFKB,1470,1650] [null,DGCC,2113,2331] [null,DGFC,2201,2928] [null,DHPA,2467,2133] [null,DMBA,3085,2087] [null,DPAB,3494,3081] [null,EAEC,2133,2148] [null,EAPA,1560,1275] [null,ECGC,2815,3307] [null,EDPD,2731,1883] [null,EEEC,2024,1902] [null,EEMC,2624,2387] [null,EFFA,2047,1878] [null,EGJA,2403,2633] [null,EGMA,2784,2772] [null,EGOC,2389,1753] [null,EHFD,1940,1420] [null,EHLB,2320,2057] [null,EHPA,1898,1853] [null,EIPB,2930,2326] [null,EJAE,2582,1836] [null,EJIB,2257,1681] [null,EJJA,2791,1941] [null,EJJD,3410,2405] [null,EJNC,2472,2067] [null,EJPD,1219,1229] [null,EKEB,2047,1713] [null,EMEA,2502,1897] [null,EMKC,2362,2042] [null,ENAC,2011,1909] [null,ENFB,2507,2162] [null,ENOD,3371,2709] {noformat} Expected results: {noformat} +--+--++---+ | W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER | +--+--++---+ | Bad cards must make. | AACD | 1889 | 2168 | | Bad cards must make. | AAHD | 2739 | 2039 | | Bad cards must make. | ABDA | 1717 | 1782 | | Bad cards must make. | ACGC | 2296 | 2276 | | Bad cards must make. | ACKC | 2443 | 1878 | | Bad cards must make. | ACOB | 2705 | 2428 | | Bad cards must make. | ADGB | 2242 | 2759 | | Bad cards must make. | ADKB | 2138 | 2456 | | Bad cards must make. | AEAD | 2914 | 2237 | | Bad cards must make. | AEOC | 1797 | 2073 | | Bad
[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang
[ https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14318: --- Affects Version/s: 2.0.0 > TPCDS query 14 causes Spark SQL to hang > --- > > Key: SPARK-14318 > URL: https://issues.apache.org/jira/browse/SPARK-14318 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: JESSE CHEN > Labels: hangs > > Testing Spark SQL using TPC queries. Query 21 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL missing at least one row (grep for ABDA) ; I believe 2 > other rows are missing as well. > Actual results: > {noformat} > [null,AABD,2565,1922] > [null,AAHD,2956,2052] > [null,AALA,2042,1793] > [null,ACGC,2373,1771] > [null,ACKC,2321,1856] > [null,ACOB,1504,1397] > [null,ADKB,1820,2163] > [null,AEAD,2631,1965] > [null,AEOC,1659,1798] > [null,AFAC,1965,1705] > [null,AFAD,1769,1313] > [null,AHDE,2700,1985] > [null,AHHA,1578,1082] > [null,AIEC,1756,1804] > [null,AIMC,3603,2951] > [null,AJAC,2109,1989] > [null,AJKB,2573,3540] > [null,ALBE,3458,2992] > [null,ALCE,1720,1810] > [null,ALEC,2569,1946] > [null,ALNB,2552,1750] > [null,ANFE,2022,2269] > [null,AOIB,2982,2540] > [null,APJB,2344,2593] > [null,BAPD,2182,2787] > [null,BDCE,2844,2069] > [null,BDDD,2417,2537] > [null,BDJA,1584,1666] > [null,BEOD,2141,2649] > [null,BFCC,2745,2020] > [null,BFMB,1642,1364] > [null,BHPC,1923,1780] > [null,BIDB,1956,2836] > [null,BIGB,2023,2344] > [null,BIJB,1977,2728] > [null,BJFE,1891,2390] > [null,BLDE,1983,1797] > [null,BNID,2485,2324] > [null,BNLD,2385,2786] > [null,BOMB,2291,2092] > [null,CAAA,2233,2560] > [null,CBCD,1540,2012] > [null,CBIA,2394,2122] > [null,CBPB,1790,1661] > [null,CCMD,2654,2691] > [null,CDBC,1804,2072] > [null,CFEA,1941,1567] > [null,CGFD,2123,2265] > [null,CHPC,2933,2174] > [null,CIGD,2618,2399] > [null,CJCB,2728,2367] > [null,CJLA,1350,1732] > [null,CLAE,2578,2329] > [null,CLGA,1842,1588] > [null,CLLB,3418,2657] > [null,CLOB,3115,2560] > [null,CMAD,1991,2243] > [null,CMJA,1261,1855] > [null,CMLA,3288,2753] > [null,CMPD,1320,1676] > [null,CNGB,2340,2118] > [null,CNHD,3519,3348] > [null,CNPC,2561,1948] > [null,DCPC,2664,2627] > [null,DDHA,1313,1926] > [null,DDND,1109,835] > [null,DEAA,2141,1847] > [null,DEJA,3142,2723] > [null,DFKB,1470,1650] > [null,DGCC,2113,2331] > [null,DGFC,2201,2928] > [null,DHPA,2467,2133] > [null,DMBA,3085,2087] > [null,DPAB,3494,3081] > [null,EAEC,2133,2148] > [null,EAPA,1560,1275] > [null,ECGC,2815,3307] > [null,EDPD,2731,1883] > [null,EEEC,2024,1902] > [null,EEMC,2624,2387] > [null,EFFA,2047,1878] > [null,EGJA,2403,2633] > [null,EGMA,2784,2772] > [null,EGOC,2389,1753] > [null,EHFD,1940,1420] > [null,EHLB,2320,2057] > [null,EHPA,1898,1853] > [null,EIPB,2930,2326] > [null,EJAE,2582,1836] > [null,EJIB,2257,1681] > [null,EJJA,2791,1941] > [null,EJJD,3410,2405] > [null,EJNC,2472,2067] > [null,EJPD,1219,1229] > [null,EKEB,2047,1713] > [null,EMEA,2502,1897] > [null,EMKC,2362,2042] > [null,ENAC,2011,1909] > [null,ENFB,2507,2162] > [null,ENOD,3371,2709] > {noformat} > Expected results: > {noformat} > +--+--++---+ > | W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER | > +--+--++---+ > | Bad cards must make. | AACD | 1889 | 2168 | > | Bad cards must make. | AAHD | 2739 | 2039 | > | Bad cards must make. | ABDA | 1717 | 1782 | > | Bad cards must make. | ACGCAA
[jira] [Commented] (SPARK-13820) TPC-DS Query 10 fails to compile
[ https://issues.apache.org/jira/browse/SPARK-13820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218937#comment-15218937 ] JESSE CHEN commented on SPARK-13820: We are able to run 93 now. We should shoot for all 99. And this JIRA will fix 2 more :) > TPC-DS Query 10 fails to compile > > > Key: SPARK-13820 > URL: https://issues.apache.org/jira/browse/SPARK-13820 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS Query 10 fails to compile with the following error. > Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( > TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );]) > at org.antlr.runtime.DFA.noViableAlt(DFA.java:158) > at org.antlr.runtime.DFA.predict(DFA.java:144) > at > org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155) > at > org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177) > Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( > TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );]) > at org.antlr.runtime.DFA.noViableAlt(DFA.java:158) > at org.antlr.runtime.DFA.predict(DFA.java:144) > at > org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155) > at > org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177) > Query is pasted here for easy reproduction > select > cd_gender, > cd_marital_status, > cd_education_status, > count(*) cnt1, > cd_purchase_estimate, > count(*) cnt2, > cd_credit_rating, > count(*) cnt3, > cd_dep_count, > count(*) cnt4, > cd_dep_employed_count, > count(*) cnt5, > cd_dep_college_count, > count(*) cnt6 > from > customer c > JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk > JOIN customer_demographics ON cd_demo_sk = c.c_current_cdemo_sk > LEFT SEMI JOIN (select ss_customer_sk > from store_sales >JOIN date_dim ON ss_sold_date_sk = d_date_sk > where > d_year = 2002 and > d_moy between 1 and 1+3) ss_wh1 ON c.c_customer_sk = > ss_wh1.ss_customer_sk > where > ca_county in ('Rush County','Toole County','Jefferson County','Dona Ana > County','La Porte County') and >exists ( > select tmp.customer_sk from ( > select ws_bill_customer_sk as customer_sk > from web_sales,date_dim > where > web_sales.ws_sold_date_sk = date_dim.d_date_sk and > d_year = 2002 and > d_moy between 1 and 1+3 > UNION ALL > select cs_ship_customer_sk as customer_sk > from catalog_sales,date_dim > where > catalog_sales.cs_sold_date_sk = date_dim.d_date_sk and > d_year = 2002 and > d_moy between 1 and 1+3 > ) tmp where c.c_customer_sk = tmp.customer_sk > ) > group by cd_gender, > cd_marital_status, > cd_education_status, > cd_purchase_estimate, > cd_credit_rating, > cd_dep_count, > cd_dep_employed_count, > cd_dep_college_count > order by cd_gender, > cd_marital_status, > cd_education_status, > cd_purchase_estimate, > cd_credit_rating, > cd_dep_count, > cd_dep_employed_count, > cd_dep_college_count > limit 100; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional 
commands, e-mail: issues-h...@spark.apache.org
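One possible workaround while EXISTS subqueries fail to parse is to fold the exists(...) predicate into a second LEFT SEMI JOIN, mirroring the one the query already uses for ss_wh1. This is an untested sketch, not a verified fix:
{noformat}
-- in place of:  and exists (select tmp.customer_sk from (...) tmp
--                           where c.c_customer_sk = tmp.customer_sk)
LEFT SEMI JOIN (select ws_bill_customer_sk as customer_sk
                from web_sales, date_dim
                where web_sales.ws_sold_date_sk = date_dim.d_date_sk
                  and d_year = 2002
                  and d_moy between 1 and 1+3
                UNION ALL
                select cs_ship_customer_sk as customer_sk
                from catalog_sales, date_dim
                where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
                  and d_year = 2002
                  and d_moy between 1 and 1+3) tmp
  ON c.c_customer_sk = tmp.customer_sk
{noformat}
A semi join keeps each customer row at most once no matter how many matches tmp contains, so this preserves the EXISTS semantics.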
[jira] [Commented] (SPARK-13862) TPCDS query 49 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216500#comment-15216500 ] JESSE CHEN commented on SPARK-13862: PR fixed the issue. New result is ordered correctly.
{noformat}
catalog  17543  0.57142857142857142857   1   1
catalog  14513  0.63541667               2   2
catalog  12577  0.65591397849462365591   3   3
catalog   3411  0.71641791044776119403   4   4
catalog    361  0.74647887323943661972   5   5
catalog   8189  0.74698795180722891566   6   6
catalog   8929  0.7625                   7   7
catalog  14869  0.7717391304347826087    8   8
catalog   9295  0.77894736842105263158   9   9
catalog  16215  0.79069767441860465116  10  10
store     9471  0.775                    1   1
store     9797  0.8                      2   2
store    12641  0.81609195402298850575   3   3
store    15839  0.81632653061224489796   4   4
store     1171  0.82417582417582417582   5   5
store    11589  0.82653061224489795918   6   6
store     6661  0.92207792207792207792   7   7
store    13013  0.94202898550724637681   8   8
store    14925  0.96470588235294117647   9   9
store     9029  1                       10  10
store     4063  1                       10  10
web       7539  0.59                     1   1
web       3337  0.62650602409638554217   2   2
web      15597  0.66197183098591549296   3   3
web       2915  0.69863013698630136986   4   4
web      11933  0.71717171717171717172   5   5
web       3305  0.7375                   6  16
web        483  0.8                      7   6
web         85  0.85714285714285714286   8   7
web         97  0.9036144578313253012    9   8
web        117  0.925                   10   9
web       5299  0.92708333              11  10
{noformat}
> TPCDS query 49 returns wrong results compared to TPC official result set > - > > Key: SPARK-13862 > URL: https://issues.apache.org/jira/browse/SPARK-13862 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 49 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL has right answer but in wrong order (and there is an 'order by' in > the query). 
> Actual results: > {noformat} > store,9797,0.8000,2,2] > [store,12641,0.81609195402298850575,3,3] > [store,6661,0.92207792207792207792,7,7] > [store,13013,0.94202898550724637681,8,8] > [store,9029,1.,10,10] > [web,15597,0.66197183098591549296,3,3] > [store,14925,0.96470588235294117647,9,9] > [store,4063,1.,10,10] > [catalog,8929,0.7625,7,7] > [store,11589,0.82653061224489795918,6,6] > [store,1171,0.82417582417582417582,5,5] > [store,9471,0.7750,1,1] > [catalog,12577,0.65591397849462365591,3,3] > [web,97,0.90361445783132530120,9,8] > [web,85,0.85714285714285714286,8,7] > [catalog,361,0.74647887323943661972,5,5] > [web,2915,0.69863013698630136986,4,4] > [web,117,0.9250,10,9] > [catalog,9295,0.77894736842105263158,9,9] > [web,3305,0.7375,6,16] > [catalog,16215,0.79069767441860465116,10,10] > [web,7539,0.5900,1,1] > [catalog,17543,0.57142857142857142857,1,1] > [catalog,3411,0.71641791044776119403,4,4] > [web,11933,0.71717171717171717172,5,5] > [catalog,14513,0.63541667,2,2] > [store,15839,0.81632653061224489796,4,4] > [web,3337,0.62650602409638554217,2,2] > [web,5299,0.92708333,11,10] > [catalog,8189,0.74698795180722891566,6,6] > [catalog,14869,0.77173913043478260870,8,8] > [web,483,0.8000,7,6] > {noformat} > Expected results: > {noformat} > +-+---++-+---+ > | CHANNEL | ITEM | RETURN_RATIO | RETURN_RANK | CURRENCY_RANK | > +-+---++-+---+ > | catalog | 17543 | .5714285714285714 | 1 | 1 | > | catalog | 14513 | .63541666 | 2 | 2 | > | catalog | 12577 | .6559139784946236 | 3 | 3 | > | catalog | 3411 | .7164179104477611 | 4 | 4 | > | catalog | 361 | .7464788732394366 | 5 | 5 | > | catalog | 8189 | .7469879518072289 | 6 | 6 | > | catalog | 8929 | .7625 | 7 | 7 | > | catalog | 14869 | .7717391304347826 | 8 | 8 | > | catalog | 9295 | .7789473684210526 | 9 | 9 | > | catalog | 16215 | .7906976744186046 | 10 |10 | > | store | 9471 | .7750 | 1 | 1 | > | store | 9797 | .8000 | 2 |
[jira] [Closed] (SPARK-13862) TPCDS query 49 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN closed SPARK-13862. -- PR fixed this issue. Thanks, [~smilegator] > TPCDS query 49 returns wrong results compared to TPC official result set > - > > Key: SPARK-13862 > URL: https://issues.apache.org/jira/browse/SPARK-13862 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 49 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL has right answer but in wrong order (and there is an 'order by' in > the query). > Actual results: > {noformat} > store,9797,0.8000,2,2] > [store,12641,0.81609195402298850575,3,3] > [store,6661,0.92207792207792207792,7,7] > [store,13013,0.94202898550724637681,8,8] > [store,9029,1.,10,10] > [web,15597,0.66197183098591549296,3,3] > [store,14925,0.96470588235294117647,9,9] > [store,4063,1.,10,10] > [catalog,8929,0.7625,7,7] > [store,11589,0.82653061224489795918,6,6] > [store,1171,0.82417582417582417582,5,5] > [store,9471,0.7750,1,1] > [catalog,12577,0.65591397849462365591,3,3] > [web,97,0.90361445783132530120,9,8] > [web,85,0.85714285714285714286,8,7] > [catalog,361,0.74647887323943661972,5,5] > [web,2915,0.69863013698630136986,4,4] > [web,117,0.9250,10,9] > [catalog,9295,0.77894736842105263158,9,9] > [web,3305,0.7375,6,16] > [catalog,16215,0.79069767441860465116,10,10] > [web,7539,0.5900,1,1] > [catalog,17543,0.57142857142857142857,1,1] > [catalog,3411,0.71641791044776119403,4,4] > [web,11933,0.71717171717171717172,5,5] > [catalog,14513,0.63541667,2,2] > [store,15839,0.81632653061224489796,4,4] > [web,3337,0.62650602409638554217,2,2] > [web,5299,0.92708333,11,10] > [catalog,8189,0.74698795180722891566,6,6] > [catalog,14869,0.77173913043478260870,8,8] > [web,483,0.8000,7,6] > {noformat} > Expected results: > {noformat} > +-+---++-+---+ > | CHANNEL | ITEM | RETURN_RATIO | RETURN_RANK | CURRENCY_RANK | > +-+---++-+---+ > | catalog | 17543 | .5714285714285714 | 1 | 1 | > | catalog | 14513 | .63541666 | 2 | 2 | > | catalog | 12577 | .6559139784946236 | 3 | 3 | > | catalog | 3411 | .7164179104477611 | 4 | 4 | > | catalog | 361 | .7464788732394366 | 5 | 5 | > | catalog | 8189 | .7469879518072289 | 6 | 6 | > | catalog | 8929 | .7625 | 7 | 7 | > | catalog | 14869 | .7717391304347826 | 8 | 8 | > | catalog | 9295 | .7789473684210526 | 9 | 9 | > | catalog | 16215 | .7906976744186046 | 10 |10 | > | store | 9471 | .7750 | 1 | 1 | > | store | 9797 | .8000 | 2 | 2 | > | store | 12641 | .8160919540229885 | 3 | 3 | > | store | 15839 | .8163265306122448 | 4 | 4 | > | store | 1171 | .8241758241758241 | 5 | 5 | > | store | 11589 | .8265306122448979 | 6 | 6 | > | store | 6661 | .9220779220779220 | 7 | 7 | > | store | 13013 | .9420289855072463 | 8 | 8 | > | store | 14925 | .9647058823529411 | 9 | 9 | > | store | 4063 | 1. | 10 |10 | > | store | 9029 | 1. | 10 |10 | > | web | 7539 | .5900 | 1 | 1 | > | web | 3337 | .6265060240963855 | 2 | 2 | > | web | 15597 | .6619718309859154 | 3 | 3 | > | web | 2915 | .6986301369863013 | 4 | 4 | > | web | 11933 | .7171717171717171 | 5 | 5 | > | web | 3305 | .7375 | 6 |16 | > | web | 483 | .8000 | 7 | 6 | > | web |85 | .8571428571428571 | 8 | 7 | > | web |97 | .9036144578313253 | 9 | 8 | > | web | 117 | .9250 | 10 | 9 | > | web | 5299 | .92708333 | 11 |10 | > +-+---++-+---+ > {noformat}
[jira] [Closed] (SPARK-13864) TPCDS query 74 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN closed SPARK-13864. -- PR fixed the issue. Nice work, [~smilegator] > TPCDS query 74 returns wrong results compared to TPC official result set > - > > Key: SPARK-13864 > URL: https://issues.apache.org/jira/browse/SPARK-13864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 74 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > Spark SQL has right answer but in wrong order (and there is an 'order by' in > the query). > Actual results: > {noformat} > [BLEIBAAA,Paula,Wakefield] > [DFIEBAAA,John,Gray] > [OCLBBAAA,null,null] > [PKBCBAAA,Andrea,White] > [EJDL,Alice,Wright] > [FACE,Priscilla,Miller] > [LFKK,Ignacio,Miller] > [LJNCBAAA,George,Gamez] > [LIOP,Derek,Allen] > [EADJ,Ruth,Carroll] > [JGMM,Richard,Larson] > [PKIK,Wendy,Horvath] > [FJHF,Larissa,Roy] > [EPOG,Felisha,Mendes] > [EKJL,Aisha,Carlson] > [HNFH,Rebecca,Wilson] > [IBFCBAAA,Ruth,Grantham] > [OPDL,Ann,Pence] > [NIPL,Eric,Lawrence] > [OCIC,Zachary,Pennington] > [OFLC,James,Taylor] > [GEHI,Tyler,Miller] > [CADP,Cristobal,Thomas] > [JIAL,Santos,Gutierrez] > [PMMBBAAA,Paul,Jordan] > [DIIO,David,Carroll] > [DFKABAAA,Latoya,Craft] > [HMOI,Grace,Henderson] > [PPIBBAAA,Candice,Lee] > [JONHBAAA,Warren,Orozco] > [GNDA,Terry,Mcdowell] > [CIJM,Elizabeth,Thomas] > [DIJGBAAA,Ruth,Sanders] > [NFBDBAAA,Vernice,Fernandez] > [IDKF,Michael,Mack] > [IMHB,Kathy,Knowles] > [LHMC,Brooke,Nelson] > [CFCGBAAA,Marcus,Sanders] > [NJHCBAAA,Christopher,Schreiber] > [PDFB,Terrance,Banks] > [ANFA,Philip,Banks] > [IADEBAAA,Diane,Aldridge] > [ICHF,Linda,Mccoy] > [CFEN,Christopher,Dawson] > [KOJJ,Gracie,Mendoza] > [FOJA,Don,Castillo] > [FGPG,Albert,Wadsworth] > [KJBK,Georgia,Scott] > [EKFP,Annika,Chin] > [IBAEBAAA,Sandra,Wilson] > [MFFL,Margret,Gray] > [KNAK,Gladys,Banks] > [CJDI,James,Kerr] > [OBADBAAA,Elizabeth,Burnham] > [AMGD,Kenneth,Harlan] > [HJLA,Audrey,Beltran] > [AOPFBAAA,Jerry,Fields] > [CNAGBAAA,Virginia,May] > [HGOABAAA,Sonia,White] > [KBCABAAA,Debra,Bell] > [NJAG,Allen,Hood] > [MMOBBAAA,Margaret,Smith] > [NGDBBAAA,Carlos,Jewell] > [FOGI,Michelle,Greene] > [JEKFBAAA,Norma,Burkholder] > [OCAJ,Jenna,Staton] > [PFCL,Felicia,Neville] > [DLHBBAAA,Henry,Bertrand] > [DBEFBAAA,Bennie,Bowers] > [DCKO,Robert,Gonzalez] > [KKGE,Katie,Dunbar] > [GFMDBAAA,Kathleen,Gibson] > [IJEM,Charlie,Cummings] > [KJBL,Kerry,Davis] > [JKBN,Julie,Kern] > [MDCA,Louann,Hamel] > [EOAK,Molly,Benjamin] > [IBHH,Jennifer,Ballard] > [PJEN,Ashley,Norton] > [KLHHBAAA,Manuel,Castaneda] > [IMHHBAAA,Lillian,Davidson] > [GHPBBAAA,Nick,Mendez] > [BNBB,Irma,Smith] > [FBAH,Michael,Williams] > [PEHEBAAA,Edith,Molina] > [FMHI,Emilio,Darling] > [KAEC,Milton,Mackey] > [OCDJ,Nina,Sanchez] > [FGIG,Eduardo,Miller] > [FHACBAAA,null,null] > [HMJN,Ryan,Baptiste] > [HHCABAAA,William,Stewart] > {noformat} > Expected results: > {noformat} > +--+-++ > | CUSTOMER_ID | CUSTOMER_FIRST_NAME | CUSTOMER_LAST_NAME | > +--+-++ > | AMGD | Kenneth | Harlan | > | ANFA | Philip | Banks | > | AOPFBAAA | Jerry | Fields | > | BLEIBAAA | Paula | Wakefield | > | BNBB | Irma| Smith | > | CADP | Cristobal | Thomas | > | CFCGBAAA | Marcus | Sanders| > | CFEN | Christopher | Dawson | > | CIJM | Eliz
[jira] [Commented] (SPARK-13864) TPCDS query 74 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216483#comment-15216483 ] JESSE CHEN commented on SPARK-13864: Validated successfully. Returned the correct result set in order:
{noformat}
AMGD      Kenneth      Harlan
ANFA      Philip       Banks
AOPFBAAA  Jerry        Fields
BLEIBAAA  Paula        Wakefield
BNBB      Irma         Smith
CADP      Cristobal    Thomas
CFCGBAAA  Marcus       Sanders
CFEN      Christopher  Dawson
CIJM      Elizabeth    Thomas
CJDI      James        Kerr
CNAGBAAA  Virginia     May
DBEFBAAA  Bennie       Bowers
DCKO      Robert       Gonzalez
DFIEBAAA  John         Gray
DFKABAAA  Latoya       Craft
DIIO      David        Carroll
DIJGBAAA  Ruth         Sanders
DLHBBAAA  Henry        Bertrand
EADJ      Ruth         Carroll
EJDL      Alice        Wright
EKFP      Annika       Chin
EKJL      Aisha        Carlson
EOAK      Molly        Benjamin
EPOG      Felisha      Mendes
FACE      Priscilla    Miller
FBAH      Michael      Williams
FGIG      Eduardo      Miller
FGPG      Albert       Wadsworth
FHACBAAA
FJHF      Larissa      Roy
FMHI      Emilio       Darling
FOGI      Michelle     Greene
FOJA      Don          Castillo
GEHI      Tyler        Miller
GFMDBAAA  Kathleen     Gibson
GHPBBAAA  Nick         Mendez
GNDA      Terry        Mcdowell
HGOABAAA  Sonia        White
HHCABAAA  William      Stewart
HJLA      Audrey       Beltran
HMJN      Ryan         Baptiste
HMOI      Grace        Henderson
HNFH      Rebecca      Wilson
IADEBAAA  Diane        Aldridge
IBAEBAAA  Sandra       Wilson
IBFCBAAA  Ruth         Grantham
IBHH      Jennifer     Ballard
ICHF      Linda        Mccoy
IDKF      Michael      Mack
IJEM      Charlie      Cummings
IMHB      Kathy        Knowles
IMHHBAAA  Lillian      Davidson
JEKFBAAA  Norma        Burkholder
JGMM      Richard      Larson
JIAL      Santos       Gutierrez
JKBN      Julie        Kern
JONHBAAA  Warren       Orozco
KAEC      Milton       Mackey
KBCABAAA  Debra        Bell
KJBK      Georgia      Scott
KJBL      Kerry        Davis
KKGE      Katie        Dunbar
KLHHBAAA  Manuel       Castaneda
KNAK      Gladys       Banks
KOJJ      Gracie       Mendoza
LFKK      Ignacio      Miller
LHMC      Brooke       Nelson
LIOP      Derek        Allen
LJNCBAAA  George       Gamez
MDCA      Louann       Hamel
MFFL      Margret      Gray
MMOBBAAA  Margaret     Smith
NFBDBAAA  Vernice      Fernandez
NGDBBAAA  Carlos       Jewell
NIPL      Eric         Lawrence
NJAG      Allen        Hood
NJHCBAAA  Christopher  Schreiber
OBADBAAA  Elizabeth    Burnham
OCAJ      Jenna        Staton
OCDJ      Nina         Sanchez
OCIC      Zachary      Pennington
OCLBBAAA
OFLC      James        Taylor
OPDL      Ann          Pence
PDFB      Terrance     Banks
PEHEBAAA  Edith        Molina
PFCL      Felicia      Neville
PJEN      Ashley       Norton
PKBCBAAA  Andrea       White
PKIK      Wendy        Horvath
PMMBBAAA  Paul         Jordan
PPIBBAAA  Candice      Lee
{noformat}
> TPCDS query 74 returns wrong results compared to TPC official result set > - > > Key: SPARK-13864 > URL: https://issues.apache.org/jira/browse/SPARK-13864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 74 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > Spark SQL has right answer but in wrong order (and there is an 'order by' in > the query). > Actual results: > {noformat} > [BLEIBAAA,Paula,Wakefield] > [DFIEBAAA,John,Gray] > [OCLBBAAA,null,null] > [PKBCBAAA,Andrea,White] > [EJDL,Alice,Wright] > [FACE,Priscilla,Miller] > [LFKK,Ignacio,Mil
[jira] [Commented] (SPARK-13831) TPC-DS Query 35 fails with the following compile error
[ https://issues.apache.org/jira/browse/SPARK-13831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214834#comment-15214834 ] JESSE CHEN commented on SPARK-13831: Same in spark 2.0. Query 41 also returns the same error. > TPC-DS Query 35 fails with the following compile error > -- > > Key: SPARK-13831 > URL: https://issues.apache.org/jira/browse/SPARK-13831 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Roy Cecil > > TPC-DS Query 35 fails with the following compile error. > Scala.NotImplementedError: > scala.NotImplementedError: No parse rules for ASTNode type: 864, text: > TOK_SUBQUERY_EXPR : > TOK_SUBQUERY_EXPR 1, 439,797, 1370 > TOK_SUBQUERY_OP 1, 439,439, 1370 > exists 1, 439,439, 1370 > TOK_QUERY 1, 441,797, 1508 > Pasting Query 35 for easy reference. > select > ca_state, > cd_gender, > cd_marital_status, > cd_dep_count, > count(*) cnt1, > min(cd_dep_count) cd_dep_count1, > max(cd_dep_count) cd_dep_count2, > avg(cd_dep_count) cd_dep_count3, > cd_dep_employed_count, > count(*) cnt2, > min(cd_dep_employed_count) cd_dep_employed_count1, > max(cd_dep_employed_count) cd_dep_employed_count2, > avg(cd_dep_employed_count) cd_dep_employed_count3, > cd_dep_college_count, > count(*) cnt3, > min(cd_dep_college_count) cd_dep_college_count1, > max(cd_dep_college_count) cd_dep_college_count2, > avg(cd_dep_college_count) cd_dep_college_count3 > from > customer c > JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk > JOIN customer_demographics ON cd_demo_sk = c.c_current_cdemo_sk > LEFT SEMI JOIN > (select ss_customer_sk > from store_sales >JOIN date_dim ON ss_sold_date_sk = d_date_sk > where > d_year = 2002 and > d_qoy < 4) ss_wh1 > ON c.c_customer_sk = ss_wh1.ss_customer_sk > where >exists ( > select tmp.customer_sk from ( > select ws_bill_customer_sk as customer_sk > from web_sales,date_dim > where > ws_sold_date_sk = d_date_sk and > d_year = 2002 and > d_qoy < 4 >UNION ALL > select cs_ship_customer_sk as customer_sk > from catalog_sales,date_dim > where > cs_sold_date_sk = d_date_sk and > d_year = 2002 and > d_qoy < 4 > ) tmp where c.c_customer_sk = tmp.customer_sk > ) > group by ca_state, > cd_gender, > cd_marital_status, > cd_dep_count, > cd_dep_employed_count, > cd_dep_college_count > order by ca_state, > cd_gender, > cd_marital_status, > cd_dep_count, > cd_dep_employed_count, > cd_dep_college_count > limit 100; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
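The same rewrite sketched above for SPARK-13820 should apply here as well (again untested, built from this query's own subquery): replace the exists(...) predicate with
{noformat}
LEFT SEMI JOIN (select ws_bill_customer_sk as customer_sk
                from web_sales, date_dim
                where ws_sold_date_sk = d_date_sk and d_year = 2002 and d_qoy < 4
                UNION ALL
                select cs_ship_customer_sk as customer_sk
                from catalog_sales, date_dim
                where cs_sold_date_sk = d_date_sk and d_year = 2002 and d_qoy < 4) tmp
  ON c.c_customer_sk = tmp.customer_sk
{noformat}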
[jira] [Commented] (SPARK-13820) TPC-DS Query 10 fails to compile
[ https://issues.apache.org/jira/browse/SPARK-13820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214795#comment-15214795 ] JESSE CHEN commented on SPARK-13820: Happens in Spark 2.0 as well, in Spark 2.0 error is {noformat} == Error == scala.NotImplementedError: [Expression]: No parse rules for ASTNode type: 918, tree: TOK_SUBQUERY_EXPR 22, 172, 292, 2 :- TOK_SUBQUERY_OP 22, 172, 172, 2 : +- exists 22, 172, 172, 2 +- TOK_QUERY 23, 174, 292, 15 {noformat} This feature affects a few TPCDS queries (so far 93 queries work...getting really close to 99 here). > TPC-DS Query 10 fails to compile > > > Key: SPARK-13820 > URL: https://issues.apache.org/jira/browse/SPARK-13820 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS Query 10 fails to compile with the following error. > Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( > TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );]) > at org.antlr.runtime.DFA.noViableAlt(DFA.java:158) > at org.antlr.runtime.DFA.predict(DFA.java:144) > at > org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155) > at > org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177) > Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( > TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );]) > at org.antlr.runtime.DFA.noViableAlt(DFA.java:158) > at org.antlr.runtime.DFA.predict(DFA.java:144) > at > org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155) > at > org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177) > Query is pasted here for easy reproduction > select > cd_gender, > cd_marital_status, > cd_education_status, > count(*) cnt1, > cd_purchase_estimate, > count(*) cnt2, > cd_credit_rating, > count(*) cnt3, > cd_dep_count, > count(*) cnt4, > cd_dep_employed_count, > count(*) cnt5, > cd_dep_college_count, > count(*) cnt6 > from > customer c > JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk > JOIN customer_demographics ON cd_demo_sk = c.c_current_cdemo_sk > LEFT SEMI JOIN (select ss_customer_sk > from store_sales >JOIN date_dim ON ss_sold_date_sk = d_date_sk > where > d_year = 2002 and > d_moy between 1 and 1+3) ss_wh1 ON c.c_customer_sk = > ss_wh1.ss_customer_sk > where > ca_county in ('Rush County','Toole County','Jefferson County','Dona Ana > County','La Porte County') and >exists ( > select tmp.customer_sk from ( > select ws_bill_customer_sk as customer_sk > from web_sales,date_dim > where > web_sales.ws_sold_date_sk = date_dim.d_date_sk and > d_year = 2002 and > d_moy between 1 and 1+3 > UNION ALL > select cs_ship_customer_sk as customer_sk > from catalog_sales,date_dim > where > catalog_sales.cs_sold_date_sk = date_dim.d_date_sk and > d_year = 2002 and > d_moy between 1 and 1+3 > ) tmp where c.c_customer_sk = tmp.customer_sk > ) > group by cd_gender, > cd_marital_status, > cd_education_status, > cd_purchase_estimate, > cd_credit_rating, > cd_dep_count, > cd_dep_employed_count, > cd_dep_college_count > order 
by cd_gender, > cd_marital_status, > cd_education_status, > cd_purchase_estimate, > cd_credit_rating, > cd_dep_count, > cd_dep_employed_count, > cd_dep_college_count > limit 100; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
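The construct that trips the parser here is the correlated EXISTS subquery in the WHERE clause (the TOK_SUBQUERY_EXPR node in the 2.0 error above); the joins and aggregation in query 10 are otherwise unremarkable. A minimal sketch of the failing shape, reduced by hand from the query above (illustrative only, not the official TPC-DS text):
{noformat}
-- correlated EXISTS predicate in a WHERE clause: the shape the
-- Hive-based parser rejects with TOK_SUBQUERY_EXPR above
select c.c_customer_sk
from customer c
where exists (select 1
              from web_sales ws
              where ws.ws_bill_customer_sk = c.c_customer_sk);
{noformat}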
[jira] [Updated] (SPARK-14096) SPARK-SQL CLI returns NPE
[ https://issues.apache.org/jira/browse/SPARK-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14096: --- Labels: (was: tpcds-result-mismatch) > SPARK-SQL CLI returns NPE > - > > Key: SPARK-14096 > URL: https://issues.apache.org/jira/browse/SPARK-14096 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN > > Trying to run TPCDS query 06 in spark-sql shell received the following error > in the middle of a stage; but running another query 38 succeeded: > NPE: > {noformat} > 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, > whose tasks have all completed, from pool > 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 65.0 in stage > 10.0 (TID 622) in 171 ms on localhost (30/200) > 16/03/22 15:12:56 ERROR scheduler.TaskResultGetter: Exception while getting > task result > com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException > Serialization trace: > underlying (org.apache.spark.util.BoundedPriorityQueue) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312) > at > org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1790) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) > at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) > at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669) > at java.util.PriorityQueue.siftUp(PriorityQueue.java:645) > at java.util.PriorityQueue.offer(PriorityQueue.java:344) > at java.util.PriorityQueue.add(PriorityQueue.java:321) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31) > at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) > ... 
15 more > 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, > whose tasks have all completed, from pool > 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 66.0 in stage > 10.0 (TID 623) in 171 ms on localhost (31/200) > 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, > whose tasks have all completed, from pool > {noformat} > query 06 (caused the above NPE): > {noformat} > select a.ca_state state, count(*) cnt > from customer_address a > join customer c on a.ca_address_sk = c.c_current_addr_sk > join store_sales s on c.c_customer_sk = s.ss_customer_sk > join date_dim d on s.ss_sold_date_sk = d.d_date_sk > join item i on s.ss_item_sk = i.i_item_sk > join (select distinct d_month_seq > from date_dim >where d_year = 2001 > and d_moy = 1 ) tmp1 ON d.d_month_seq = tmp1.d_month_seq > join > (select j.i_category, avg(j.i_current_price) as avg_i_current_price >from item j group by j.i_category) tmp2 on tmp2.i_category = > i.i_category > where > i.i_current_price > 1.2 * tmp2.avg_i_current_price > group by a.ca_state > having count(*) >= 10 > order by cnt >limit 100; > {noformat} > query 38 (succeeded) > {
[jira] [Updated] (SPARK-14096) SPARK-SQL CLI returns NPE
[ https://issues.apache.org/jira/browse/SPARK-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14096: --- Affects Version/s: (was: 1.6.0) 2.0.0 Description: Trying to run TPCDS query 06 in spark-sql shell received the following error in the middle of a stage; but running another query 38 succeeded: NPE: {noformat} 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, whose tasks have all completed, from pool 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 65.0 in stage 10.0 (TID 622) in 171 ms on localhost (30/200) 16/03/22 15:12:56 ERROR scheduler.TaskResultGetter: Exception while getting task result com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: underlying (org.apache.spark.util.BoundedPriorityQueue) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25) at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1790) at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669) at java.util.PriorityQueue.siftUp(PriorityQueue.java:645) at java.util.PriorityQueue.offer(PriorityQueue.java:344) at java.util.PriorityQueue.add(PriorityQueue.java:321) at com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78) at com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) ... 
15 more 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, whose tasks have all completed, from pool 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 66.0 in stage 10.0 (TID 623) in 171 ms on localhost (31/200) 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, whose tasks have all completed, from pool {noformat} query 06 (caused the above NPE): {noformat} select a.ca_state state, count(*) cnt from customer_address a join customer c on a.ca_address_sk = c.c_current_addr_sk join store_sales s on c.c_customer_sk = s.ss_customer_sk join date_dim d on s.ss_sold_date_sk = d.d_date_sk join item i on s.ss_item_sk = i.i_item_sk join (select distinct d_month_seq from date_dim where d_year = 2001 and d_moy = 1 ) tmp1 ON d.d_month_seq = tmp1.d_month_seq join (select j.i_category, avg(j.i_current_price) as avg_i_current_price from item j group by j.i_category) tmp2 on tmp2.i_category = i.i_category where i.i_current_price > 1.2 * tmp2.avg_i_current_price group by a.ca_state having count(*) >= 10 order by cnt limit 100; {noformat} query 38 (succeeded) {noformat} select count(*) from ( select distinct c_last_name, c_first_name, d_date from store_sales, date_dim, customer where store_sales.ss_sold_date_sk = date_dim.d_date_sk and store_sales.ss_customer_sk = customer.c_customer_sk and d_month_seq between 1200 and 1200 + 11 intersect select distinct c_last_name, c_first_name, d_date
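One detail worth noting in the stack trace: BoundedPriorityQueue together with a generated ordering is the machinery Spark uses for top-K plans (ORDER BY ... LIMIT), and query 06 ends in exactly that pattern, while query 38 ends in a bare LIMIT. Assuming that reading is correct, a much smaller statement should exercise the same code path; this is a hypothetical reduction, not a confirmed reproducer:
{noformat}
-- ORDER BY + LIMIT compiles to a top-K (TakeOrderedAndProject) plan;
-- each task's partial result is a BoundedPriorityQueue that is sent
-- back through Kryo, the deserialization step where the NPE surfaces
select ca_state, count(*) cnt
from customer_address
group by ca_state
order by cnt
limit 100;
{noformat}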
[jira] [Created] (SPARK-14096) SPARK-SQL CLI returns NPE
JESSE CHEN created SPARK-14096: -- Summary: SPARK-SQL CLI returns NPE Key: SPARK-14096 URL: https://issues.apache.org/jira/browse/SPARK-14096 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: JESSE CHEN Testing Spark SQL using TPC queries. Query 49 returns wrong results compared to official result set. This is at 1GB SF (validation run). SparkSQL has right answer but in wrong order (and there is an 'order by' in the query). Actual results: {noformat} store,9797,0.8000,2,2] [store,12641,0.81609195402298850575,3,3] [store,6661,0.92207792207792207792,7,7] [store,13013,0.94202898550724637681,8,8] [store,9029,1.,10,10] [web,15597,0.66197183098591549296,3,3] [store,14925,0.96470588235294117647,9,9] [store,4063,1.,10,10] [catalog,8929,0.7625,7,7] [store,11589,0.82653061224489795918,6,6] [store,1171,0.82417582417582417582,5,5] [store,9471,0.7750,1,1] [catalog,12577,0.65591397849462365591,3,3] [web,97,0.90361445783132530120,9,8] [web,85,0.85714285714285714286,8,7] [catalog,361,0.74647887323943661972,5,5] [web,2915,0.69863013698630136986,4,4] [web,117,0.9250,10,9] [catalog,9295,0.77894736842105263158,9,9] [web,3305,0.7375,6,16] [catalog,16215,0.79069767441860465116,10,10] [web,7539,0.5900,1,1] [catalog,17543,0.57142857142857142857,1,1] [catalog,3411,0.71641791044776119403,4,4] [web,11933,0.71717171717171717172,5,5] [catalog,14513,0.63541667,2,2] [store,15839,0.81632653061224489796,4,4] [web,3337,0.62650602409638554217,2,2] [web,5299,0.92708333,11,10] [catalog,8189,0.74698795180722891566,6,6] [catalog,14869,0.77173913043478260870,8,8] [web,483,0.8000,7,6] {noformat} Expected results: {noformat} +-+---++-+---+ | CHANNEL | ITEM | RETURN_RATIO | RETURN_RANK | CURRENCY_RANK | +-+---++-+---+ | catalog | 17543 | .5714285714285714 | 1 | 1 | | catalog | 14513 | .63541666 | 2 | 2 | | catalog | 12577 | .6559139784946236 | 3 | 3 | | catalog | 3411 | .7164179104477611 | 4 | 4 | | catalog | 361 | .7464788732394366 | 5 | 5 | | catalog | 8189 | .7469879518072289 | 6 | 6 | | catalog | 8929 | .7625 | 7 | 7 | | catalog | 14869 | .7717391304347826 | 8 | 8 | | catalog | 9295 | .7789473684210526 | 9 | 9 | | catalog | 16215 | .7906976744186046 | 10 |10 | | store | 9471 | .7750 | 1 | 1 | | store | 9797 | .8000 | 2 | 2 | | store | 12641 | .8160919540229885 | 3 | 3 | | store | 15839 | .8163265306122448 | 4 | 4 | | store | 1171 | .8241758241758241 | 5 | 5 | | store | 11589 | .8265306122448979 | 6 | 6 | | store | 6661 | .9220779220779220 | 7 | 7 | | store | 13013 | .9420289855072463 | 8 | 8 | | store | 14925 | .9647058823529411 | 9 | 9 | | store | 4063 | 1. | 10 |10 | | store | 9029 | 1. | 10 |10 | | web | 7539 | .5900 | 1 | 1 | | web | 3337 | .6265060240963855 | 2 | 2 | | web | 15597 | .6619718309859154 | 3 | 3 | | web | 2915 | .6986301369863013 | 4 | 4 | | web | 11933 | .7171717171717171 | 5 | 5 | | web | 3305 | .7375 | 6 |16 | | web | 483 | .8000 | 7 | 6 | | web |85 | .8571428571428571 | 8 | 7 | | web |97 | .9036144578313253 | 9 | 8 | | web | 117 | .9250 | 10 | 9 | | web | 5299 | .92708333 | 11 |10 | +-+---++-+---+ {noformat} Query used: {noformat} -- start query 49 in stream 0 using template query49.tpl and seed QUALIFICATION select 'web' as channel ,web.item ,web.return_ratio ,web.return_rank ,web.currency_rank from ( select item ,return_ratio ,currency_ratio ,rank() over (order by return_ratio) as return_rank ,rank() over (order by currency_ratio) as currency_rank from ( select ws.ws_item_sk as item
[jira] [Commented] (SPARK-13864) TPCDS query 74 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207239#comment-15207239 ] JESSE CHEN commented on SPARK-13864: Tried on two recent builds having issues running to completion. Something is broken. Looking into why... > TPCDS query 74 returns wrong results compared to TPC official result set > - > > Key: SPARK-13864 > URL: https://issues.apache.org/jira/browse/SPARK-13864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 74 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > Spark SQL has right answer but in wrong order (and there is an 'order by' in > the query). > Actual results: > {noformat} > [BLEIBAAA,Paula,Wakefield] > [DFIEBAAA,John,Gray] > [OCLBBAAA,null,null] > [PKBCBAAA,Andrea,White] > [EJDL,Alice,Wright] > [FACE,Priscilla,Miller] > [LFKK,Ignacio,Miller] > [LJNCBAAA,George,Gamez] > [LIOP,Derek,Allen] > [EADJ,Ruth,Carroll] > [JGMM,Richard,Larson] > [PKIK,Wendy,Horvath] > [FJHF,Larissa,Roy] > [EPOG,Felisha,Mendes] > [EKJL,Aisha,Carlson] > [HNFH,Rebecca,Wilson] > [IBFCBAAA,Ruth,Grantham] > [OPDL,Ann,Pence] > [NIPL,Eric,Lawrence] > [OCIC,Zachary,Pennington] > [OFLC,James,Taylor] > [GEHI,Tyler,Miller] > [CADP,Cristobal,Thomas] > [JIAL,Santos,Gutierrez] > [PMMBBAAA,Paul,Jordan] > [DIIO,David,Carroll] > [DFKABAAA,Latoya,Craft] > [HMOI,Grace,Henderson] > [PPIBBAAA,Candice,Lee] > [JONHBAAA,Warren,Orozco] > [GNDA,Terry,Mcdowell] > [CIJM,Elizabeth,Thomas] > [DIJGBAAA,Ruth,Sanders] > [NFBDBAAA,Vernice,Fernandez] > [IDKF,Michael,Mack] > [IMHB,Kathy,Knowles] > [LHMC,Brooke,Nelson] > [CFCGBAAA,Marcus,Sanders] > [NJHCBAAA,Christopher,Schreiber] > [PDFB,Terrance,Banks] > [ANFA,Philip,Banks] > [IADEBAAA,Diane,Aldridge] > [ICHF,Linda,Mccoy] > [CFEN,Christopher,Dawson] > [KOJJ,Gracie,Mendoza] > [FOJA,Don,Castillo] > [FGPG,Albert,Wadsworth] > [KJBK,Georgia,Scott] > [EKFP,Annika,Chin] > [IBAEBAAA,Sandra,Wilson] > [MFFL,Margret,Gray] > [KNAK,Gladys,Banks] > [CJDI,James,Kerr] > [OBADBAAA,Elizabeth,Burnham] > [AMGD,Kenneth,Harlan] > [HJLA,Audrey,Beltran] > [AOPFBAAA,Jerry,Fields] > [CNAGBAAA,Virginia,May] > [HGOABAAA,Sonia,White] > [KBCABAAA,Debra,Bell] > [NJAG,Allen,Hood] > [MMOBBAAA,Margaret,Smith] > [NGDBBAAA,Carlos,Jewell] > [FOGI,Michelle,Greene] > [JEKFBAAA,Norma,Burkholder] > [OCAJ,Jenna,Staton] > [PFCL,Felicia,Neville] > [DLHBBAAA,Henry,Bertrand] > [DBEFBAAA,Bennie,Bowers] > [DCKO,Robert,Gonzalez] > [KKGE,Katie,Dunbar] > [GFMDBAAA,Kathleen,Gibson] > [IJEM,Charlie,Cummings] > [KJBL,Kerry,Davis] > [JKBN,Julie,Kern] > [MDCA,Louann,Hamel] > [EOAK,Molly,Benjamin] > [IBHH,Jennifer,Ballard] > [PJEN,Ashley,Norton] > [KLHHBAAA,Manuel,Castaneda] > [IMHHBAAA,Lillian,Davidson] > [GHPBBAAA,Nick,Mendez] > [BNBB,Irma,Smith] > [FBAH,Michael,Williams] > [PEHEBAAA,Edith,Molina] > [FMHI,Emilio,Darling] > [KAEC,Milton,Mackey] > [OCDJ,Nina,Sanchez] > [FGIG,Eduardo,Miller] > [FHACBAAA,null,null] > [HMJN,Ryan,Baptiste] > [HHCABAAA,William,Stewart] > {noformat} > Expected results: > {noformat} > +--+-++ > | CUSTOMER_ID | CUSTOMER_FIRST_NAME | CUSTOMER_LAST_NAME | > +--+-++ > | AMGD | Kenneth | Harlan | > | ANFA | Philip | Banks | > | AOPFBAAA | Jerry | Fields | > | BLEIBAAA | Paula | Wakefield | > | BNBB | Irma| Smith | > | CADP | Cristobal | Thomas | > | CFCGBAAA | Marcus
[jira] [Closed] (SPARK-13858) TPCDS query 21 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN closed SPARK-13858. -- Resolution: Not A Bug Schema updates generated correct results in both spark 1.6 and 2.0. Good to close. > TPCDS query 21 returns wrong results compared to TPC official result set > - > > Key: SPARK-13858 > URL: https://issues.apache.org/jira/browse/SPARK-13858 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 21 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL missing at least one row (grep for ABDA) ; I believe 2 > other rows are missing as well. > Actual results: > {noformat} > [null,AABD,2565,1922] > [null,AAHD,2956,2052] > [null,AALA,2042,1793] > [null,ACGC,2373,1771] > [null,ACKC,2321,1856] > [null,ACOB,1504,1397] > [null,ADKB,1820,2163] > [null,AEAD,2631,1965] > [null,AEOC,1659,1798] > [null,AFAC,1965,1705] > [null,AFAD,1769,1313] > [null,AHDE,2700,1985] > [null,AHHA,1578,1082] > [null,AIEC,1756,1804] > [null,AIMC,3603,2951] > [null,AJAC,2109,1989] > [null,AJKB,2573,3540] > [null,ALBE,3458,2992] > [null,ALCE,1720,1810] > [null,ALEC,2569,1946] > [null,ALNB,2552,1750] > [null,ANFE,2022,2269] > [null,AOIB,2982,2540] > [null,APJB,2344,2593] > [null,BAPD,2182,2787] > [null,BDCE,2844,2069] > [null,BDDD,2417,2537] > [null,BDJA,1584,1666] > [null,BEOD,2141,2649] > [null,BFCC,2745,2020] > [null,BFMB,1642,1364] > [null,BHPC,1923,1780] > [null,BIDB,1956,2836] > [null,BIGB,2023,2344] > [null,BIJB,1977,2728] > [null,BJFE,1891,2390] > [null,BLDE,1983,1797] > [null,BNID,2485,2324] > [null,BNLD,2385,2786] > [null,BOMB,2291,2092] > [null,CAAA,2233,2560] > [null,CBCD,1540,2012] > [null,CBIA,2394,2122] > [null,CBPB,1790,1661] > [null,CCMD,2654,2691] > [null,CDBC,1804,2072] > [null,CFEA,1941,1567] > [null,CGFD,2123,2265] > [null,CHPC,2933,2174] > [null,CIGD,2618,2399] > [null,CJCB,2728,2367] > [null,CJLA,1350,1732] > [null,CLAE,2578,2329] > [null,CLGA,1842,1588] > [null,CLLB,3418,2657] > [null,CLOB,3115,2560] > [null,CMAD,1991,2243] > [null,CMJA,1261,1855] > [null,CMLA,3288,2753] > [null,CMPD,1320,1676] > [null,CNGB,2340,2118] > [null,CNHD,3519,3348] > [null,CNPC,2561,1948] > [null,DCPC,2664,2627] > [null,DDHA,1313,1926] > [null,DDND,1109,835] > [null,DEAA,2141,1847] > [null,DEJA,3142,2723] > [null,DFKB,1470,1650] > [null,DGCC,2113,2331] > [null,DGFC,2201,2928] > [null,DHPA,2467,2133] > [null,DMBA,3085,2087] > [null,DPAB,3494,3081] > [null,EAEC,2133,2148] > [null,EAPA,1560,1275] > [null,ECGC,2815,3307] > [null,EDPD,2731,1883] > [null,EEEC,2024,1902] > [null,EEMC,2624,2387] > [null,EFFA,2047,1878] > [null,EGJA,2403,2633] > [null,EGMA,2784,2772] > [null,EGOC,2389,1753] > [null,EHFD,1940,1420] > [null,EHLB,2320,2057] > [null,EHPA,1898,1853] > [null,EIPB,2930,2326] > [null,EJAE,2582,1836] > [null,EJIB,2257,1681] > [null,EJJA,2791,1941] > [null,EJJD,3410,2405] > [null,EJNC,2472,2067] > [null,EJPD,1219,1229] > [null,EKEB,2047,1713] > [null,EMEA,2502,1897] > [null,EMKC,2362,2042] > [null,ENAC,2011,1909] > [null,ENFB,2507,2162] > [null,ENOD,3371,2709] > {noformat} > Expected results: > {noformat} > +--+--++---+ > | W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER | > +--+--++---+ > | Bad cards must make. | AACD | 1889 | 2168 | > | Bad cards must make.
[jira] [Closed] (SPARK-13861) TPCDS query 40 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN closed SPARK-13861. -- Resolution: Duplicate Fixed all schema discrepancies. > TPCDS query 40 returns wrong results compared to TPC official result set > - > > Key: SPARK-13861 > URL: https://issues.apache.org/jira/browse/SPARK-13861 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 40 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL missing at least one row (grep for ABBD) ; I believe 5 > rows are missing in total. > Actual results: > {noformat} > [TN,AABD,0.0,-82.060899353] > [TN,AACD,-216.54000234603882,158.0399932861328] > [TN,AAHD,186.54999542236328,0.0] > [TN,AALA,0.0,48.2254223633] > [TN,ACGC,63.67999863624573,0.0] > [TN,ACHC,102.6830517578,51.8838964844] > [TN,ACKC,128.9235150146,44.8169482422] > [TN,ACLD,205.43999433517456,-948.619930267334] > [TN,ACOB,207.32000732421875,24.88389648438] > [TN,ACPD,87.75,53.9900016784668] > [TN,ADGB,44.310001373291016,222.4800033569336] > [TN,ADKB,0.0,-471.8699951171875] > [TN,AEAD,58.2400016784668,0.0] > [TN,AEOC,19.9084741211,214.7076293945] > [TN,AFAC,271.8199977874756,163.1699981689453] > [TN,AFAD,2.349046325684,28.3169482422] > [TN,AFDC,-378.0499496459961,-303.26999282836914] > [TN,AGID,307.6099967956543,-19.29915527344] > [TN,AHDE,80.574468689,-476.7200012207031] > [TN,AHHA,8.27457763672,155.1276565552] > [TN,AHJB,39.23999857902527,0.0] > [TN,AIEC,82.3675750732,3.910858306885] > [TN,AIEE,20.39618530273,-151.08999633789062] > [TN,AIMC,24.46313354492,-150.330517578] > [TN,AJAC,49.0915258789,82.084741211] > [TN,AJCA,121.18000221252441,63.779998779296875] > [TN,AJKB,27.94534057617,8.97267028809] > [TN,ALBE,88.2599983215332,30.22542236328] > [TN,ALCE,93.5245776367,92.0198092651] > [TN,ALEC,64.179019165,15.1584741211] > [TN,ALNB,4.19809265137,148.27000427246094] > [TN,AMBE,28.44534057617,0.0] > [TN,AMPB,0.0,131.92999839782715] > [TN,ANFE,0.0,-137.3400115966797] > [TN,AOIB,150.40999603271484,254.288058548] > [TN,APJB,45.2745776367,334.482015991] > [TN,APLA,50.2076293945,29.150001049041748] > [TN,APLD,0.0,32.3838964844] > [TN,BAPD,93.41999816894531,145.8699951171875] > [TN,BBID,296.774577637,30.95084472656] > [TN,BDCE,-1771.0800704956055,-54.779998779296875] > [TN,BDDD,111.12000274658203,280.5899963378906] > [TN,BDJA,0.0,79.5423706055] > [TN,BEFD,0.0,3.429475479126] > [TN,BEOD,269.838964844,297.5800061225891] > [TN,BFMB,110.82999801635742,-941.4000930786133] > [TN,BFNA,47.8661035156,0.0] > [TN,BFOC,46.3415258789,83.5245776367] > [TN,BHPC,27.378392334,77.61999893188477] > [TN,BIDB,196.6199951171875,5.57171661377] > [TN,BIGB,425.3399963378906,0.0] > [TN,BIJB,209.6300048828125,0.0] > [TN,BJFE,7.32923706055,55.1584741211] > [TN,BKFA,0.0,138.14000129699707] > [TN,BKMC,27.17076293945,54.970001220703125] > [TN,BLDE,170.28999400138855,0.0] > [TN,BNHB,58.0594277954,-337.8899841308594] > [TN,BNID,54.41525878906,35.01504089355] > [TN,BNLA,0.0,168.37999629974365] > [TN,BNLD,0.0,96.4084741211] > [TN,BNMC,202.40999698638916,49.52999830245972] > [TN,BOCC,4.73019073486,69.83999633789062] > [TN,BOMB,63.66999816894531,163.49000668525696] > [TN,CAAA,121.91000366210938,0.0] > [TN,CAAD,-1107.6099338531494,0.0] > [TN,CAJC,115.8046594238,173.0519073486] > [TN,CBCD,18.94534057617,226.38000106811523] > [TN,CBFA,0.0,97.41000366210938] > 
[TN,CBIA,2.14104904175,84.66000366210938] > [TN,CBPB,95.44000244140625,26.6830517578] > [TN,CCAB,160.43000602722168,135.8661035156] > [TN,CCHD,0.0,12
[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200637#comment-15200637 ] JESSE CHEN commented on SPARK-13865: This may be a TPC toolkit issue. Will be looking into this with John on my team, who is one of the TPC board members. > TPCDS query 87 returns wrong results compared to TPC official result set > - > > Key: SPARK-13865 > URL: https://issues.apache.org/jira/browse/SPARK-13865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 87 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 47555, answer set expects 47298. > Actual results: > {noformat} > [47555] > {noformat} > {noformat} > Expected: > +---+ > | 1 | > +---+ > | 47298 | > +---+ > {noformat} > Query used: > {noformat} > -- start query 87 in stream 0 using template query87.tpl and seed > QUALIFICATION > select count(*) > from > (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as > ddate1, 1 as notnull1 >from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp1 >left outer join > (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as > ddate2, 1 as notnull2 >from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp2 > on (tmp1.cln1 = tmp2.cln2) > and (tmp1.cfn1 = tmp2.cfn2) > and (tmp1.ddate1= tmp2.ddate2) >left outer join > (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as > ddate3, 1 as notnull3 >from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp3 > on (tmp1.cln1 = tmp3.cln3) > and (tmp1.cfn1 = tmp3.cfn3) > and (tmp1.ddate1= tmp3.ddate3) > where > notnull2 is null and notnull3 is null > ; > -- end query 87 in stream 0 using template query87.tpl > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200886#comment-15200886 ] JESSE CHEN commented on SPARK-13865: You rock! > TPCDS query 87 returns wrong results compared to TPC official result set > - > > Key: SPARK-13865 > URL: https://issues.apache.org/jira/browse/SPARK-13865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 87 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 47555, answer set expects 47298. > Actual results: > {noformat} > [47555] > {noformat} > {noformat} > Expected: > +---+ > | 1 | > +---+ > | 47298 | > +---+ > {noformat} > Query used: > {noformat} > -- start query 87 in stream 0 using template query87.tpl and seed > QUALIFICATION > select count(*) > from > (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as > ddate1, 1 as notnull1 >from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp1 >left outer join > (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as > ddate2, 1 as notnull2 >from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp2 > on (tmp1.cln1 = tmp2.cln2) > and (tmp1.cfn1 = tmp2.cfn2) > and (tmp1.ddate1= tmp2.ddate2) >left outer join > (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as > ddate3, 1 as notnull3 >from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp3 > on (tmp1.cln1 = tmp3.cln3) > and (tmp1.cfn1 = tmp3.cfn3) > and (tmp1.ddate1= tmp3.ddate3) > where > notnull2 is null and notnull3 is null > ; > -- end query 87 in stream 0 using template query87.tpl > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200869#comment-15200869 ] JESSE CHEN commented on SPARK-13865: I am onto that. Thanks. Also, good to know the parsing error is gone in 2.0. Can't wait to get my hands on that soon. > TPCDS query 87 returns wrong results compared to TPC official result set > - > > Key: SPARK-13865 > URL: https://issues.apache.org/jira/browse/SPARK-13865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 87 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 47555, answer set expects 47298. > Actual results: > {noformat} > [47555] > {noformat} > {noformat} > Expected: > +---+ > | 1 | > +---+ > | 47298 | > +---+ > {noformat} > Query used: > {noformat} > -- start query 87 in stream 0 using template query87.tpl and seed > QUALIFICATION > select count(*) > from > (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as > ddate1, 1 as notnull1 >from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp1 >left outer join > (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as > ddate2, 1 as notnull2 >from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp2 > on (tmp1.cln1 = tmp2.cln2) > and (tmp1.cfn1 = tmp2.cfn2) > and (tmp1.ddate1= tmp2.ddate2) >left outer join > (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as > ddate3, 1 as notnull3 >from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp3 > on (tmp1.cln1 = tmp3.cln3) > and (tmp1.cfn1 = tmp3.cfn3) > and (tmp1.ddate1= tmp3.ddate3) > where > notnull2 is null and notnull3 is null > ; > -- end query 87 in stream 0 using template query87.tpl > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13858) TPCDS query 21 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198558#comment-15198558 ] JESSE CHEN commented on SPARK-13858: Good job, Bo! I would like to test this on my cluster if you have a fix. > TPCDS query 21 returns wrong results compared to TPC official result set > - > > Key: SPARK-13858 > URL: https://issues.apache.org/jira/browse/SPARK-13858 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 21 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL missing at least one row (grep for ABDA) ; I believe 2 > other rows are missing as well. > Actual results: > {noformat} > [null,AABD,2565,1922] > [null,AAHD,2956,2052] > [null,AALA,2042,1793] > [null,ACGC,2373,1771] > [null,ACKC,2321,1856] > [null,ACOB,1504,1397] > [null,ADKB,1820,2163] > [null,AEAD,2631,1965] > [null,AEOC,1659,1798] > [null,AFAC,1965,1705] > [null,AFAD,1769,1313] > [null,AHDE,2700,1985] > [null,AHHA,1578,1082] > [null,AIEC,1756,1804] > [null,AIMC,3603,2951] > [null,AJAC,2109,1989] > [null,AJKB,2573,3540] > [null,ALBE,3458,2992] > [null,ALCE,1720,1810] > [null,ALEC,2569,1946] > [null,ALNB,2552,1750] > [null,ANFE,2022,2269] > [null,AOIB,2982,2540] > [null,APJB,2344,2593] > [null,BAPD,2182,2787] > [null,BDCE,2844,2069] > [null,BDDD,2417,2537] > [null,BDJA,1584,1666] > [null,BEOD,2141,2649] > [null,BFCC,2745,2020] > [null,BFMB,1642,1364] > [null,BHPC,1923,1780] > [null,BIDB,1956,2836] > [null,BIGB,2023,2344] > [null,BIJB,1977,2728] > [null,BJFE,1891,2390] > [null,BLDE,1983,1797] > [null,BNID,2485,2324] > [null,BNLD,2385,2786] > [null,BOMB,2291,2092] > [null,CAAA,2233,2560] > [null,CBCD,1540,2012] > [null,CBIA,2394,2122] > [null,CBPB,1790,1661] > [null,CCMD,2654,2691] > [null,CDBC,1804,2072] > [null,CFEA,1941,1567] > [null,CGFD,2123,2265] > [null,CHPC,2933,2174] > [null,CIGD,2618,2399] > [null,CJCB,2728,2367] > [null,CJLA,1350,1732] > [null,CLAE,2578,2329] > [null,CLGA,1842,1588] > [null,CLLB,3418,2657] > [null,CLOB,3115,2560] > [null,CMAD,1991,2243] > [null,CMJA,1261,1855] > [null,CMLA,3288,2753] > [null,CMPD,1320,1676] > [null,CNGB,2340,2118] > [null,CNHD,3519,3348] > [null,CNPC,2561,1948] > [null,DCPC,2664,2627] > [null,DDHA,1313,1926] > [null,DDND,1109,835] > [null,DEAA,2141,1847] > [null,DEJA,3142,2723] > [null,DFKB,1470,1650] > [null,DGCC,2113,2331] > [null,DGFC,2201,2928] > [null,DHPA,2467,2133] > [null,DMBA,3085,2087] > [null,DPAB,3494,3081] > [null,EAEC,2133,2148] > [null,EAPA,1560,1275] > [null,ECGC,2815,3307] > [null,EDPD,2731,1883] > [null,EEEC,2024,1902] > [null,EEMC,2624,2387] > [null,EFFA,2047,1878] > [null,EGJA,2403,2633] > [null,EGMA,2784,2772] > [null,EGOC,2389,1753] > [null,EHFD,1940,1420] > [null,EHLB,2320,2057] > [null,EHPA,1898,1853] > [null,EIPB,2930,2326] > [null,EJAE,2582,1836] > [null,EJIB,2257,1681] > [null,EJJA,2791,1941] > [null,EJJD,3410,2405] > [null,EJNC,2472,2067] > [null,EJPD,1219,1229] > [null,EKEB,2047,1713] > [null,EMEA,2502,1897] > [null,EMKC,2362,2042] > [null,ENAC,2011,1909] > [null,ENFB,2507,2162] > [null,ENOD,3371,2709] > {noformat} > Expected results: > {noformat} > +--+--++---+ > | W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER | > +--+--++---+ > | Bad cards must make. | AACD | 1889 | 2168 | > |
[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200850#comment-15200850 ] JESSE CHEN commented on SPARK-13865: Hive, Big SQL, and DB2 queries are all generated off their corresponding query templates. Hive apparently generated the one I listed in the initial report (with JOINs). So I am asking TPC why this variant exists in the templates. Meanwhile, I tested the query, and as expected, Spark SQL isn't able to parse it with the following errors: {noformat} 16/03/17 19:17:57 INFO parse.ParseDriver: Parsing command: explain select count(*) from ((select distinct c_last_name, c_first_name, d_date from store_sales, date_dim, customer where store_sales.ss_sold_date_sk = date_dim.d_date_sk and store_sales.ss_customer_sk = customer.c_customer_sk and d_month_seq between 1200 and 1200+11) except (select distinct c_last_name, c_first_name, d_date from catalog_sales, date_dim, customer where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk and d_month_seq between 1200 and 1200+11) except (select distinct c_last_name, c_first_name, d_date from web_sales, date_dim, customer where web_sales.ws_sold_date_sk = date_dim.d_date_sk and web_sales.ws_bill_customer_sk = customer.c_customer_sk and d_month_seq between 1200 and 1200+11) ) cool_cust NoViableAltException(296@[150:5: ( ( Identifier LPAREN )=> partitionedTableFunction | tableSource | subQuerySource | virtualTableSource )]) at org.antlr.runtime.DFA.noViableAlt(DFA.java:158) at org.antlr.runtime.DFA.predict(DFA.java:144) at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3711) at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1873) at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1518) {noformat} > TPCDS query 87 returns wrong results compared to TPC official result set > - > > Key: SPARK-13865 > URL: https://issues.apache.org/jira/browse/SPARK-13865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 87 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 47555, answer set expects 47298.
> Actual results: > {noformat} > [47555] > {noformat} > {noformat} > Expected: > +---+ > | 1 | > +---+ > | 47298 | > +---+ > {noformat} > Query used: > {noformat} > -- start query 87 in stream 0 using template query87.tpl and seed > QUALIFICATION > select count(*) > from > (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as > ddate1, 1 as notnull1 >from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp1 >left outer join > (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as > ddate2, 1 as notnull2 >from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp2 > on (tmp1.cln1 = tmp2.cln2) > and (tmp1.cfn1 = tmp2.cfn2) > and (tmp1.ddate1= tmp2.ddate2) >left outer join > (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as > ddate3, 1 as notnull3 >from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp3 > on (tmp1.cln1 = tmp3.cln3) > and (tmp1.cfn1 = tmp3.cfn3) > and (tmp1.ddate1= tmp3.ddate3) > where > notnull2 is null and notnull3 is null > ; > -- end query 87 in stream 0 using template query87.tpl > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200850#comment-15200850 ] JESSE CHEN edited comment on SPARK-13865 at 3/18/16 2:24 AM: - Hive, Big SQL, and DB2 queries are all generated off their corresponding query templates. Hive apparently generated the one I listed in the initial report (with JOINs). So I am asking TPC why this variant exists in the templates. Meanwhile, I tested the query you found, and as expected, Spark SQL isn't able to parse it with the following errors: Query: {noformat} select count(*) from ((select distinct c_last_name, c_first_name, d_date from store_sales, date_dim, customer where store_sales.ss_sold_date_sk = date_dim.d_date_sk and store_sales.ss_customer_sk = customer.c_customer_sk and d_month_seq between 1200 and 1200+11) except (select distinct c_last_name, c_first_name, d_date from catalog_sales, date_dim, customer where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk and d_month_seq between 1200 and 1200+11) except (select distinct c_last_name, c_first_name, d_date from web_sales, date_dim, customer where web_sales.ws_sold_date_sk = date_dim.d_date_sk and web_sales.ws_bill_customer_sk = customer.c_customer_sk and d_month_seq between 1200 and 1200+11) ) cool_cust ; {noformat} Error: {noformat} 16/03/17 19:17:57 INFO parse.ParseDriver: Parsing command: explain select count(*) from ((select distinct c_last_name, c_first_name, d_date from store_sales, date_dim, customer where store_sales.ss_sold_date_sk = date_dim.d_date_sk and store_sales.ss_customer_sk = customer.c_customer_sk and d_month_seq between 1200 and 1200+11) except (select distinct c_last_name, c_first_name, d_date from catalog_sales, date_dim, customer where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk and d_month_seq between 1200 and 1200+11) except (select distinct c_last_name, c_first_name, d_date from web_sales, date_dim, customer where web_sales.ws_sold_date_sk = date_dim.d_date_sk and web_sales.ws_bill_customer_sk = customer.c_customer_sk and d_month_seq between 1200 and 1200+11) ) cool_cust NoViableAltException(296@[150:5: ( ( Identifier LPAREN )=> partitionedTableFunction | tableSource | subQuerySource | virtualTableSource )]) at org.antlr.runtime.DFA.noViableAlt(DFA.java:158) at org.antlr.runtime.DFA.predict(DFA.java:144) at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3711) at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1873) at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1518) {noformat} was (Author: jfc...@us.ibm.com): Hive, Big SQL, and DB2 queries are all generated off their corresponding query templates. Hive apparently generated the one I listed in the initial report (with JOINs). So I am asking TPC why this variant exists in the templates.
Meanwhile, I tested the query, and as expected, Spark SQL isn't able to parse it with the following errors: {noformat} 16/03/17 19:17:57 INFO parse.ParseDriver: Parsing command: explain select count(*) from ((select distinct c_last_name, c_first_name, d_date from store_sales, date_dim, customer where store_sales.ss_sold_date_sk = date_dim.d_date_sk and store_sales.ss_customer_sk = customer.c_customer_sk and d_month_seq between 1200 and 1200+11) except (select distinct c_last_name, c_first_name, d_date from catalog_sales, date_dim, customer where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk and d_month_seq between 1200 and 1200+11) except (select distinct c_last_name, c_first_name, d_date from web_sales, date_dim, customer where web_sales.ws_sold_date_sk = date_dim.d_date_sk and web_sales.ws_bill_customer_sk = customer.c_customer_sk and d_month_seq between 1200 and 1200+11) ) cool_cust NoViableAltException(296@[150:5: ( ( Identifier LPAREN )=> partitionedTableFunction | tableSource | subQuerySource | virtualTableSource )]) at org.antlr.runtime.DFA.noViableAlt(DFA.java:158) at org.antlr.runtime.DFA.predict(DFA.java:144) at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3711)
[jira] [Commented] (SPARK-13859) TPCDS query 38 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200177#comment-15200177 ] JESSE CHEN commented on SPARK-13859: Tested both q87 and q38 on the lab's cluster. With this modification (i.e., null-safe equals), both q87 and q38 returned correct results (per TPC) on both text and parquet. Without this modification, both queries returned the wrong results. Per TPC rules on vendor-specific syntax: 4.2.3.4 The following query modifications are minor: c) Operators 2. Relational operators - Relational operators used in queries such as "<", ">", "<>", "<=", and "=", may be replaced by equivalent vendor-specific operators, for example ".LT.", ".GT.", "!=" or "^=", ".LE.", and "==", respectively. This proposed modification, however, seems outside the allowed modifications because it is a workaround for an issue where "Spark does not deal with nulls correctly under certain conditions." If you look at the other TPC queries (72 of which returned correct results), this type of equals is used all over. So there is an inherent unsafe null operation in Spark that is **not related** to a) wrong table definition, or b) wrong query syntax, or c) file format. Spark should do this "=" correctly and automatically. These two queries provide excellent test cases for finding that bug and fixing it. Jesse > TPCDS query 38 returns wrong results compared to TPC official result set > - > > Key: SPARK-13859 > URL: https://issues.apache.org/jira/browse/SPARK-13859 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 38 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 0, answer set reports 107. > Actual results: > {noformat} > [0] > {noformat} > Expected: > {noformat} > +-+ > | 1 | > +-+ > | 107 | > +-+ > {noformat} > query used: > {noformat} > -- start query 38 in stream 0 using template query38.tpl and seed > QUALIFICATION > select count(*) from ( > select distinct c_last_name, c_first_name, d_date > from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp1 > JOIN > (select distinct c_last_name, c_first_name, d_date > from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk = > customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp2 ON (tmp1.c_last_name = > tmp2.c_last_name) and (tmp1.c_first_name = tmp2.c_first_name) and > (tmp1.d_date = tmp2.d_date) > JOIN > ( > select distinct c_last_name, c_first_name, d_date > from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk = > customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp3 ON (tmp1.c_last_name = > tmp3.c_last_name) and (tmp1.c_first_name = tmp3.c_first_name) and > (tmp1.d_date = tmp3.d_date) > limit 100 > ; > -- end query 38 in stream 0 using template query38.tpl > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
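For anyone replaying the null-safe variant: in Spark SQL, as in standard SQL, "=" evaluates to null when either side is null, and a join or WHERE condition that evaluates to null behaves as a non-match; the null-safe operator "<=>" instead treats two nulls as equal. A quick illustration of the difference (plain literals, nothing TPC-specific):
{noformat}
-- '=' is null-hostile: a null operand yields null, which a join
-- condition treats as a non-match
select null = null;    -- NULL
select 1 = null;       -- NULL

-- '<=>' is null-safe: two nulls compare equal
select null <=> null;  -- true
select 1 <=> null;     -- false

-- so ON tmp1.c_last_name <=> tmp2.c_last_name keeps customers whose
-- last name is null, while '=' silently drops those rows
{noformat}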
[jira] [Closed] (SPARK-13859) TPCDS query 38 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN closed SPARK-13859. -- Resolution: Not A Bug Fix Version/s: 2.0.0 Solution is to revert to the original TPC query with INTERSECT & EXCEPT; validated with correct results in Spark 2.0. The null-safe version will remain a variant for this query (for Hive). Internal toolkit defect opened: RTC 124749. > TPCDS query 38 returns wrong results compared to TPC official result set > - > > Key: SPARK-13859 > URL: https://issues.apache.org/jira/browse/SPARK-13859 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > Fix For: 2.0.0 > > > Testing Spark SQL using TPC queries. Query 38 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 0, answer set reports 107. > Actual results: > {noformat} > [0] > {noformat} > Expected: > {noformat} > +-+ > | 1 | > +-+ > | 107 | > +-+ > {noformat} > query used: > {noformat} > -- start query 38 in stream 0 using template query38.tpl and seed > QUALIFICATION > select count(*) from ( > select distinct c_last_name, c_first_name, d_date > from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp1 > JOIN > (select distinct c_last_name, c_first_name, d_date > from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk = > customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp2 ON (tmp1.c_last_name = > tmp2.c_last_name) and (tmp1.c_first_name = tmp2.c_first_name) and > (tmp1.d_date = tmp2.d_date) > JOIN > ( > select distinct c_last_name, c_first_name, d_date > from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk = > customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp3 ON (tmp1.c_last_name = > tmp3.c_last_name) and (tmp1.c_first_name = tmp3.c_first_name) and > (tmp1.d_date = tmp3.d_date) > limit 100 > ; > -- end query 38 in stream 0 using template query38.tpl > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
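The reason the revert works: SQL set operators use distinct-value semantics, under which two nulls count as the same value, so INTERSECT matches the null-name rows that the equi-join rewrite drops. For reference, a sketch of the set-operator shape of query 38 (abbreviated; the alias and layout are reconstructed from the template from memory, not copied from the validated run):
{noformat}
select count(*) from (
    select distinct c_last_name, c_first_name, d_date
    from store_sales, date_dim, customer
    where store_sales.ss_sold_date_sk = date_dim.d_date_sk
      and store_sales.ss_customer_sk = customer.c_customer_sk
      and d_month_seq between 1200 and 1200 + 11
  intersect
    select distinct c_last_name, c_first_name, d_date
    from catalog_sales, date_dim, customer
    where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
      and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
      and d_month_seq between 1200 and 1200 + 11
  intersect
    select distinct c_last_name, c_first_name, d_date
    from web_sales, date_dim, customer
    where web_sales.ws_sold_date_sk = date_dim.d_date_sk
      and web_sales.ws_bill_customer_sk = customer.c_customer_sk
      and d_month_seq between 1200 and 1200 + 11
) hot_cust;
{noformat}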
[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200877#comment-15200877 ] JESSE CHEN commented on SPARK-13865: I will open a bug against TPCDS toolkit for this. Will add bug report number here. > TPCDS query 87 returns wrong results compared to TPC official result set > - > > Key: SPARK-13865 > URL: https://issues.apache.org/jira/browse/SPARK-13865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 87 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 47555, answer set expects 47298. > Actual results: > {noformat} > [47555] > {noformat} > {noformat} > Expected: > +---+ > | 1 | > +---+ > | 47298 | > +---+ > {noformat} > Query used: > {noformat} > -- start query 87 in stream 0 using template query87.tpl and seed > QUALIFICATION > select count(*) > from > (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as > ddate1, 1 as notnull1 >from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp1 >left outer join > (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as > ddate2, 1 as notnull2 >from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp2 > on (tmp1.cln1 = tmp2.cln2) > and (tmp1.cfn1 = tmp2.cfn2) > and (tmp1.ddate1= tmp2.ddate2) >left outer join > (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as > ddate3, 1 as notnull3 >from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp3 > on (tmp1.cln1 = tmp3.cln3) > and (tmp1.cfn1 = tmp3.cfn3) > and (tmp1.ddate1= tmp3.ddate3) > where > notnull2 is null and notnull3 is null > ; > -- end query 87 in stream 0 using template query87.tpl > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13863) TPCDS query 66 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200149#comment-15200149 ] JESSE CHEN commented on SPARK-13863: Going to validate this also on my cluster. Nice find. > TPCDS query 66 returns wrong results compared to TPC official result set > - > > Key: SPARK-13863 > URL: https://issues.apache.org/jira/browse/SPARK-13863 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 66 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > Aggregations slightly off -- eg. JAN_SALES column of "Doors canno" row - > SparkSQL returns 6355232.185385704, expected 6355232.31 > Actual results: > {noformat} > [null,null,Fairview,Williamson County,TN,United > States,DHL,BARIAN,2001,9597806.850651741,1.1121820530080795E7,8670867.81564045,8994785.945689201,1.088724806326294E7,1.4187671518377304E7,9732598.460139751,1.9798897020946026E7,2.1007842467959404E7,2.149551364927292E7,3.479566905774999E7,3.3122997954660416E7,null,null,null,null,null,null,null,null,null,null,null,null,2.191359469742E7,3.2518476414670944E7,2.48856624883976E7,2.5698343830046654E7,3.373591080598068E7,3.552703167087555E7,2.5465193481492043E7,5.362323870799959E7,5.1409986978201866E7,5.415917383586836E7,9.222704311805725E7,8.343539111531019E7] > [Bad cards must make.,621234,Fairview,Williamson County,TN,United > States,DHL,BARIAN,2001,9506753.593884468,8008140.429557085,6116769.711647987,1.1973045160133362E7,7756254.925520897,5352978.574095726,1.373399613500309E7,1.6418794411203384E7,1.7212743279764652E7,1.704270732417488E7,3.43049358570323E7,3.532416421229005E7,15.30301560102066,12.890698882477594,9.846160563729589,19.273003667109915,12.485238936569628,8.61668642427125,22.107605403121994,26.429323590150222,27.707342611261865,27.433635834765774,55.22063482847413,56.86128610521969,3.0534943928382874E7,2.4481686250203133E7,2.217871080008793E7,2.569579825610423E7,2.995490355044937E7,1.8084140250833035E7,3.0805576178061485E7,4.7156887432252884E7,5.115858869637826E7,5.5759943171424866E7,8.625354428184557E7,8.345155532035494E7] > [Conventional childr,977787,Fairview,Williamson County,TN,United > States,DHL,BARIAN,2001,8860645.460736752,1.441581376543355E7,6761497.232810497,1.1820654735879421E7,8246260.600341797,6636877.482845306,1.1434492123092413E7,2.5673812070380323E7,2.307420611785E7,2.1834582007320404E7,2.6894900596512794E7,3.357509177109933E7,9.061938296108202,14.743306840276613,6.9151024024767125,12.08919195681618,8.43359606984118,6.787651587559771,11.694256645969329,26.257060147435304,23.598398219562938,22.330611889215547,27.505888906799534,34.337838170377935,2.3836085704864502E7,3.20733132298584E7,2.503790437837982E7,2.2659895963564873E7,2.175740087420273E7,2.4451608012176514E7,2.1933001734852314E7,5.59967034604629E7,5.737188052299309E7,6.208721474336243E7,8.284991027382469E7,8.897031933202875E7] > [Doors canno,294242,Fairview,Williamson County,TN,United > 
States,DHL,BARIAN,2001,6355232.185385704,1.0198920296742141E7,1.0246200903741479E7,1.2209716492156029E7,8566998.262890816,8806316.75278151,9789405.6993227,1.646658496404171E7,2.6443785668474197E7,2.701604788320923E7,3.366058958298761E7,2.7462468750599384E7,21.59865751791282,34.66167405313361,34.822360178837414,41.495491779406166,29.115484067165177,29.928823053070296,33.26991285854059,55.96272783641258,89.87087386734116,91.81574310672585,114.39763726112386,93.33293258813964,2.2645142994330406E7,2.448725452685547E7,2.4925759290207863E7,3.0503655031727314E7,2.6558160276379585E7,2.0976233452690125E7,2.9895796101181984E7,5.600219855566597E7,5.348815865275085E7,7.628723580410767E7,8.248374754962921E7,8.808826726185608E7] > [Important issues liv,138504,Fairview,Williamson County,TN,United > States,DHL,BARIAN,2001,1.1748784594717264E7,1.435130566355586E7,9896470.867572784,7990874.805492401,8879247.840401173,7362383.04259038,1.0011144724414349E7,1.7741201390372872E7,2.1346976135887742E7,1.8074978020030975E7,2.967512567988676E7,3.2545325348875403E7,84.8263197793368,103.6165429414014,71.45259969078715,57.694180713137534,64.10824120892663,53.156465102743454,72.28054586448297,128.09161750110374,154.12534032149065,130.5014874662896,214.25464737398747,234.97751219369408,2.7204167203903973E7,2.598037822457385E7,1.9943398915802002E7,2.5710421112384796E7,1.948448105346489E7,2.6346611484448195E7,2.5075158296625137E7,5.409477817043829E7,4.106673223178029E7,5.454705814340496E7,7.246596285337901E7,9.277032812079096E7] > {noformat} > Expected results: > {noformat} > +--+---+--+---+-+---
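A pattern worth flagging in these slightly-off aggregates: 6355232.185385704 has the long tail of a floating-point sum, while the expected 6355232.31 is an exact two-decimal value, so one plausible explanation (consistent with the schema fixes that closed SPARK-13858 and SPARK-13861, though not confirmed for this query) is that the run's schema declared the money columns as float/double instead of decimal. A sketch of the kind of difference that produces this drift (hypothetical casts, not the actual q66 fix):
{noformat}
-- summing money as double accumulates binary rounding error;
-- summing as decimal keeps exact two-decimal arithmetic
select sum(cast(ws_ext_sales_price as double))        as jan_sales_double,
       sum(cast(ws_ext_sales_price as decimal(15,2))) as jan_sales_decimal
from web_sales;
{noformat}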
[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198559#comment-15198559 ] JESSE CHEN commented on SPARK-13865: yes sir. > TPCDS query 87 returns wrong results compared to TPC official result set > - > > Key: SPARK-13865 > URL: https://issues.apache.org/jira/browse/SPARK-13865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 87 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 47555, answer set expects 47298. > Actual results: > {noformat} > [47555] > {noformat} > {noformat} > Expected: > +---+ > | 1 | > +---+ > | 47298 | > +---+ > {noformat} > Query used: > {noformat} > -- start query 87 in stream 0 using template query87.tpl and seed > QUALIFICATION > select count(*) > from > (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as > ddate1, 1 as notnull1 >from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp1 >left outer join > (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as > ddate2, 1 as notnull2 >from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp2 > on (tmp1.cln1 = tmp2.cln2) > and (tmp1.cfn1 = tmp2.cfn2) > and (tmp1.ddate1= tmp2.ddate2) >left outer join > (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as > ddate3, 1 as notnull3 >from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp3 > on (tmp1.cln1 = tmp3.cln3) > and (tmp1.cfn1 = tmp3.cfn3) > and (tmp1.ddate1= tmp3.ddate3) > where > notnull2 is null and notnull3 is null > ; > -- end query 87 in stream 0 using template query87.tpl > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
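A note for context on the rewritten form above: one plausible source of the 47555 vs 47298 mismatch is NULL handling. The join predicates use plain `=`, and in SQL `NULL = NULL` evaluates to unknown, so customers with NULL first or last names can never be matched (and therefore never eliminated) by the anti-join conditions, while the set operators in the original template compare rows null-safely. A minimal sketch of the difference using Spark SQL's null-safe equality operator `<=>`, on toy values rather than benchmark data:
{noformat}
-- plain_eq yields 'no match': NULL = NULL is unknown, not true.
-- null_safe_eq yields 'match': <=> treats two NULLs as equal.
select case when cast(null as string) =   cast(null as string)
            then 'match' else 'no match' end as plain_eq,
       case when cast(null as string) <=> cast(null as string)
            then 'match' else 'no match' end as null_safe_eq;
{noformat}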
[jira] [Commented] (SPARK-13859) TPCDS query 38 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199885#comment-15199885 ] JESSE CHEN commented on SPARK-13859: Testing both Q87 and Q38. Back shortly with results. > TPCDS query 38 returns wrong results compared to TPC official result set > - > > Key: SPARK-13859 > URL: https://issues.apache.org/jira/browse/SPARK-13859 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 38 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 0, answer set reports 107. > Actual results: > {noformat} > [0] > {noformat} > Expected: > {noformat} > +-+ > | 1 | > +-+ > | 107 | > +-+ > {noformat} > query used: > {noformat} > -- start query 38 in stream 0 using template query38.tpl and seed > QUALIFICATION > select count(*) from ( > select distinct c_last_name, c_first_name, d_date > from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp1 > JOIN > (select distinct c_last_name, c_first_name, d_date > from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk = > customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp2 ON (tmp1.c_last_name = > tmp2.c_last_name) and (tmp1.c_first_name = tmp2.c_first_name) and > (tmp1.d_date = tmp2.d_date) > JOIN > ( > select distinct c_last_name, c_first_name, d_date > from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk = > customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp3 ON (tmp1.c_last_name = > tmp3.c_last_name) and (tmp1.c_first_name = tmp3.c_first_name) and > (tmp1.d_date = tmp3.d_date) > limit 100 > ; > -- end query 38 in stream 0 using template query38.tpl > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
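For comparison, the unmodified TPC-DS template expresses query 38 with INTERSECT rather than the inner-join rewrite above; INTERSECT compares whole rows null-safely, so customers with NULL names are treated differently than under the `=` join predicates. A sketch reconstructed from the template (treat alias and layout details as approximate):
{noformat}
select count(*) from (
    select distinct c_last_name, c_first_name, d_date
    from store_sales, date_dim, customer
    where store_sales.ss_sold_date_sk = date_dim.d_date_sk
      and store_sales.ss_customer_sk = customer.c_customer_sk
      and d_month_seq between 1200 and 1200 + 11
  intersect
    select distinct c_last_name, c_first_name, d_date
    from catalog_sales, date_dim, customer
    where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
      and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
      and d_month_seq between 1200 and 1200 + 11
  intersect
    select distinct c_last_name, c_first_name, d_date
    from web_sales, date_dim, customer
    where web_sales.ws_sold_date_sk = date_dim.d_date_sk
      and web_sales.ws_bill_customer_sk = customer.c_customer_sk
      and d_month_seq between 1200 and 1200 + 11
) hot_cust
limit 100;
{noformat}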
[jira] [Commented] (SPARK-13832) TPC-DS Query 36 fails with Parser error
[ https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202272#comment-15202272 ] JESSE CHEN commented on SPARK-13832: This is the vanilla TPC query: {noformat} select sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin ,i_category ,i_class ,grouping(i_category)+grouping(i_class) as lochierarchy ,rank() over ( partition by grouping(i_category)+grouping(i_class), case when grouping(i_class) = 0 then i_category end order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as rank_within_parent from store_sales ,date_dim d1 ,item ,store where d1.d_year = 2001 and d1.d_date_sk = ss_sold_date_sk and i_item_sk = ss_item_sk and s_store_sk = ss_store_sk and s_state in ('TN','TN','TN','TN', 'TN','TN','TN','TN') group by rollup(i_category,i_class) order by lochierarchy desc ,case when lochierarchy = 0 then i_category end ,rank_within_parent limit 100; {noformat} The query fails in Spark 2.0 with the following error: {noformat} 16/03/18 15:09:37 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool 16/03/18 15:09:37 ERROR scheduler.TaskResultGetter: Exception while getting task result com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: underlying (org.apache.spark.util.BoundedPriorityQueue) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25) at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:311) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:65) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:56) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:56) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1789) at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:55) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} With grouping_id(), the query is: {noformat} select sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin ,i_category ,i_class ,grouping_id(i_category)+grouping_id(i_class) as lochierarchy ,rank() over ( partition by grouping_id(i_category)+grouping_id(i_class), case when grouping_id(i_class) = 0 then i_category end order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as rank_within_parent from store_sales ,date_dim d1 ,item ,store where d1.d_year = 2001 and d1.d_date_sk = ss_sold_date_sk and i_item_sk = ss_item_sk and s_store_sk = ss_store_sk and s_state in ('TN','TN','TN','TN', 'TN','TN','TN','TN') group by rollup(i_category,i_class) order by lochierarchy desc ,case when lochierarchy = 0 then i_category end ,rank_within_parent limit 100; -- end query 36 in stream 0 using template query36.tpl {noformat} Returned error: {noformat} 16/03/18 15:13:01 INFO parser.ParseDriver: Parse completed. Error in query: Columns of grouping_id (i_category#674) does not match grouping columns (i_category#674,i_class#672); {noformat} Something still fails at logical plan generation. > TPC-DS Query 36 fails with Parser error > --- > > Key: SPARK-13832 > URL: https://issues.apache.org/jira/browse/SPARK-13832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS query 36 fails with the following error > Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed > Exception in thread "main" org.apache.spark.s
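For anyone reproducing SPARK-13832, one clarifying point: grouping() takes a single column and returns 0 or 1 depending on whether that column has been rolled up in the current output row, while grouping_id() computes a single bit vector over all the grouping columns, so the two are not drop-in replacements, and the analyzer error above is consistent with that. A toy sketch of the intended grouping() behavior, on a hypothetical table t(cat, cls, profit):
{noformat}
-- lochierarchy is 0 for detail (cat, cls) rows, 1 for cat subtotals,
-- and 2 for the grand-total row produced by rollup.
select cat,
       cls,
       grouping(cat) + grouping(cls) as lochierarchy,
       sum(profit) as total_profit
from t
group by rollup(cat, cls);
{noformat}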
[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202280#comment-15202280 ] JESSE CHEN commented on SPARK-13865: The solution is to revert to the original TPC query, which uses INTERSECT and EXCEPT; it has been validated to return the correct results in Spark 2.0. The null-safe rewrite will remain a variant of this query (for Hive). An internal toolkit defect has been opened: RTC 124749. > TPCDS query 87 returns wrong results compared to TPC official result set > - > > Key: SPARK-13865 > URL: https://issues.apache.org/jira/browse/SPARK-13865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > Fix For: 2.0.0 > > > Testing Spark SQL using TPC queries. Query 87 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 47555, answer set expects 47298. > Actual results: > {noformat} > [47555] > {noformat} > {noformat} > Expected: > +---+ > | 1 | > +---+ > | 47298 | > +---+ > {noformat} > Query used: > {noformat} > -- start query 87 in stream 0 using template query87.tpl and seed > QUALIFICATION > select count(*) > from > (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as > ddate1, 1 as notnull1 >from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp1 >left outer join > (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as > ddate2, 1 as notnull2 >from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp2 > on (tmp1.cln1 = tmp2.cln2) > and (tmp1.cfn1 = tmp2.cfn2) > and (tmp1.ddate1= tmp2.ddate2) >left outer join > (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as > ddate3, 1 as notnull3 >from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp3 > on (tmp1.cln1 = tmp3.cln3) > and (tmp1.cfn1 = tmp3.cfn3) > and (tmp1.ddate1= tmp3.ddate3) > where > notnull2 is null and notnull3 is null > ; > -- end query 87 in stream 0 using template query87.tpl > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
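For reference, EXCEPT (set difference) compares entire rows null-safely, which is exactly the behavior the `=`-based outer-join rewrite loses; that is consistent with the reverted query returning the expected 47298. A toy illustration of the null handling, on literal values rather than benchmark data:
{noformat}
-- The NULL row is matched and removed by EXCEPT, leaving only 'alice'.
select name from (
  select cast(null as string) as name
  union all
  select 'alice' as name
) a
except
select cast(null as string) as name;
{noformat}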
[jira] [Closed] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN closed SPARK-13865. -- Resolution: Not A Bug Fix Version/s: 2.0.0 > TPCDS query 87 returns wrong results compared to TPC official result set > - > > Key: SPARK-13865 > URL: https://issues.apache.org/jira/browse/SPARK-13865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > Fix For: 2.0.0 > > > Testing Spark SQL using TPC queries. Query 87 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 47555, answer set expects 47298. > Actual results: > {noformat} > [47555] > {noformat} > {noformat} > Expected: > +---+ > | 1 | > +---+ > | 47298 | > +---+ > {noformat} > Query used: > {noformat} > -- start query 87 in stream 0 using template query87.tpl and seed > QUALIFICATION > select count(*) > from > (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as > ddate1, 1 as notnull1 >from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp1 >left outer join > (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as > ddate2, 1 as notnull2 >from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp2 > on (tmp1.cln1 = tmp2.cln2) > and (tmp1.cfn1 = tmp2.cfn2) > and (tmp1.ddate1= tmp2.ddate2) >left outer join > (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as > ddate3, 1 as notnull3 >from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp3 > on (tmp1.cln1 = tmp3.cln3) > and (tmp1.cfn1 = tmp3.cfn3) > and (tmp1.ddate1= tmp3.ddate3) > where > notnull2 is null and notnull3 is null > ; > -- end query 87 in stream 0 using template query87.tpl > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-13863) TPCDS query 66 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN closed SPARK-13863. -- Resolution: Workaround fixed schema. > TPCDS query 66 returns wrong results compared to TPC official result set > - > > Key: SPARK-13863 > URL: https://issues.apache.org/jira/browse/SPARK-13863 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 66 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > Aggregations slightly off -- eg. JAN_SALES column of "Doors canno" row - > SparkSQL returns 6355232.185385704, expected 6355232.31 > Actual results: > {noformat} > [null,null,Fairview,Williamson County,TN,United > States,DHL,BARIAN,2001,9597806.850651741,1.1121820530080795E7,8670867.81564045,8994785.945689201,1.088724806326294E7,1.4187671518377304E7,9732598.460139751,1.9798897020946026E7,2.1007842467959404E7,2.149551364927292E7,3.479566905774999E7,3.3122997954660416E7,null,null,null,null,null,null,null,null,null,null,null,null,2.191359469742E7,3.2518476414670944E7,2.48856624883976E7,2.5698343830046654E7,3.373591080598068E7,3.552703167087555E7,2.5465193481492043E7,5.362323870799959E7,5.1409986978201866E7,5.415917383586836E7,9.222704311805725E7,8.343539111531019E7] > [Bad cards must make.,621234,Fairview,Williamson County,TN,United > States,DHL,BARIAN,2001,9506753.593884468,8008140.429557085,6116769.711647987,1.1973045160133362E7,7756254.925520897,5352978.574095726,1.373399613500309E7,1.6418794411203384E7,1.7212743279764652E7,1.704270732417488E7,3.43049358570323E7,3.532416421229005E7,15.30301560102066,12.890698882477594,9.846160563729589,19.273003667109915,12.485238936569628,8.61668642427125,22.107605403121994,26.429323590150222,27.707342611261865,27.433635834765774,55.22063482847413,56.86128610521969,3.0534943928382874E7,2.4481686250203133E7,2.217871080008793E7,2.569579825610423E7,2.995490355044937E7,1.8084140250833035E7,3.0805576178061485E7,4.7156887432252884E7,5.115858869637826E7,5.5759943171424866E7,8.625354428184557E7,8.345155532035494E7] > [Conventional childr,977787,Fairview,Williamson County,TN,United > States,DHL,BARIAN,2001,8860645.460736752,1.441581376543355E7,6761497.232810497,1.1820654735879421E7,8246260.600341797,6636877.482845306,1.1434492123092413E7,2.5673812070380323E7,2.307420611785E7,2.1834582007320404E7,2.6894900596512794E7,3.357509177109933E7,9.061938296108202,14.743306840276613,6.9151024024767125,12.08919195681618,8.43359606984118,6.787651587559771,11.694256645969329,26.257060147435304,23.598398219562938,22.330611889215547,27.505888906799534,34.337838170377935,2.3836085704864502E7,3.20733132298584E7,2.503790437837982E7,2.2659895963564873E7,2.175740087420273E7,2.4451608012176514E7,2.1933001734852314E7,5.59967034604629E7,5.737188052299309E7,6.208721474336243E7,8.284991027382469E7,8.897031933202875E7] > [Doors canno,294242,Fairview,Williamson County,TN,United > 
States,DHL,BARIAN,2001,6355232.185385704,1.0198920296742141E7,1.0246200903741479E7,1.2209716492156029E7,8566998.262890816,8806316.75278151,9789405.6993227,1.646658496404171E7,2.6443785668474197E7,2.701604788320923E7,3.366058958298761E7,2.7462468750599384E7,21.59865751791282,34.66167405313361,34.822360178837414,41.495491779406166,29.115484067165177,29.928823053070296,33.26991285854059,55.96272783641258,89.87087386734116,91.81574310672585,114.39763726112386,93.33293258813964,2.2645142994330406E7,2.448725452685547E7,2.4925759290207863E7,3.0503655031727314E7,2.6558160276379585E7,2.0976233452690125E7,2.9895796101181984E7,5.600219855566597E7,5.348815865275085E7,7.628723580410767E7,8.248374754962921E7,8.808826726185608E7] > [Important issues liv,138504,Fairview,Williamson County,TN,United > States,DHL,BARIAN,2001,1.1748784594717264E7,1.435130566355586E7,9896470.867572784,7990874.805492401,8879247.840401173,7362383.04259038,1.0011144724414349E7,1.7741201390372872E7,2.1346976135887742E7,1.8074978020030975E7,2.967512567988676E7,3.2545325348875403E7,84.8263197793368,103.6165429414014,71.45259969078715,57.694180713137534,64.10824120892663,53.156465102743454,72.28054586448297,128.09161750110374,154.12534032149065,130.5014874662896,214.25464737398747,234.97751219369408,2.7204167203903973E7,2.598037822457385E7,1.9943398915802002E7,2.5710421112384796E7,1.948448105346489E7,2.6346611484448195E7,2.5075158296625137E7,5.409477817043829E7,4.106673223178029E7,5.454705814340496E7,7.246596285337901E7,9.277032812079096E7] > {noformat} > Expected results: > {noformat} > +--+---+--+---+-+---+---+--+++--
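A note on the "fixed schema" workaround above: aggregates that drift only in the trailing digits (6355232.185385704 vs the expected 6355232.31) are characteristic of money columns loaded as DOUBLE, where binary floating point cannot represent decimal cents exactly and parallel summation order perturbs the total; the TPC-DS schema defines pricing columns as DECIMAL(7,2), which SUM() aggregates exactly. A hedged sketch of the kind of DDL change implied (column list heavily abbreviated, not the actual fix applied):
{noformat}
-- Loading price columns as decimal instead of double keeps SUM() exact.
create table web_sales (
  ws_sold_date_sk    int,
  ws_item_sk         int,
  ws_quantity        int,
  ws_ext_sales_price decimal(7,2),  -- not double
  ws_net_paid        decimal(7,2)   -- not double
);
{noformat}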
[jira] [Updated] (SPARK-13859) TPCDS query 38 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-13859: --- Description: Testing Spark SQL using TPC queries. Query 38 returns wrong results compared to official result set. This is at 1GB SF (validation run). SparkSQL returns count of 0, answer set reports 107. Actual results: [0] Expected: +-+ | 1 | +-+ | 107 | +-+ query used: -- start query 38 in stream 0 using template query38.tpl and seed QUALIFICATION select count(*) from ( select distinct c_last_name, c_first_name, d_date from store_sales JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk where d_month_seq between 1200 and 1200 + 11) tmp1 JOIN (select distinct c_last_name, c_first_name, d_date from catalog_sales JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk JOIN customer ON catalog_sales.cs_bill_customer_sk = customer.c_customer_sk where d_month_seq between 1200 and 1200 + 11) tmp2 ON (tmp1.c_last_name = tmp2.c_last_name) and (tmp1.c_first_name = tmp2.c_first_name) and (tmp1.d_date = tmp2.d_date) JOIN ( select distinct c_last_name, c_first_name, d_date from web_sales JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk JOIN customer ON web_sales.ws_bill_customer_sk = customer.c_customer_sk where d_month_seq between 1200 and 1200 + 11) tmp3 ON (tmp1.c_last_name = tmp3.c_last_name) and (tmp1.c_first_name = tmp3.c_first_name) and (tmp1.d_date = tmp3.d_date) limit 100 ; -- end query 38 in stream 0 using template query38.tpl was: Testing Spark SQL using TPC queries. Query 38 returns wrong results compared to official result set. This is at 1GB SF (validation run). SparkSQL returns count of 0, answer set reports 107. Actual results: [0] Expected: +-+ | 1 | +-+ | 107 | +-+ > TPCDS query 38 returns wrong results compared to TPC official result set > - > > Key: SPARK-13859 > URL: https://issues.apache.org/jira/browse/SPARK-13859 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 38 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 0, answer set reports 107. 
> Actual results: > [0] > Expected: > +-+ > | 1 | > +-+ > | 107 | > +-+ > query used: > -- start query 38 in stream 0 using template query38.tpl and seed > QUALIFICATION > select count(*) from ( > select distinct c_last_name, c_first_name, d_date > from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp1 > JOIN > (select distinct c_last_name, c_first_name, d_date > from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk = > customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp2 ON (tmp1.c_last_name = > tmp2.c_last_name) and (tmp1.c_first_name = tmp2.c_first_name) and > (tmp1.d_date = tmp2.d_date) > JOIN > ( > select distinct c_last_name, c_first_name, d_date > from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk = > customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp3 ON (tmp1.c_last_name = > tmp3.c_last_name) and (tmp1.c_first_name = tmp3.c_first_name) and > (tmp1.d_date = tmp3.d_date) > limit 100 > ; > -- end query 38 in stream 0 using template query38.tpl -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13858) TPCDS query 21 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-13858: --- Description: Testing Spark SQL using TPC queries. Query 21 returns wrong results compared to official result set. This is at 1GB SF (validation run). SparkSQL missing at least one row (grep for ABDA) ; I believe 2 other rows are missing as well. Actual results: [null,AABD,2565,1922] [null,AAHD,2956,2052] [null,AALA,2042,1793] [null,ACGC,2373,1771] [null,ACKC,2321,1856] [null,ACOB,1504,1397] [null,ADKB,1820,2163] [null,AEAD,2631,1965] [null,AEOC,1659,1798] [null,AFAC,1965,1705] [null,AFAD,1769,1313] [null,AHDE,2700,1985] [null,AHHA,1578,1082] [null,AIEC,1756,1804] [null,AIMC,3603,2951] [null,AJAC,2109,1989] [null,AJKB,2573,3540] [null,ALBE,3458,2992] [null,ALCE,1720,1810] [null,ALEC,2569,1946] [null,ALNB,2552,1750] [null,ANFE,2022,2269] [null,AOIB,2982,2540] [null,APJB,2344,2593] [null,BAPD,2182,2787] [null,BDCE,2844,2069] [null,BDDD,2417,2537] [null,BDJA,1584,1666] [null,BEOD,2141,2649] [null,BFCC,2745,2020] [null,BFMB,1642,1364] [null,BHPC,1923,1780] [null,BIDB,1956,2836] [null,BIGB,2023,2344] [null,BIJB,1977,2728] [null,BJFE,1891,2390] [null,BLDE,1983,1797] [null,BNID,2485,2324] [null,BNLD,2385,2786] [null,BOMB,2291,2092] [null,CAAA,2233,2560] [null,CBCD,1540,2012] [null,CBIA,2394,2122] [null,CBPB,1790,1661] [null,CCMD,2654,2691] [null,CDBC,1804,2072] [null,CFEA,1941,1567] [null,CGFD,2123,2265] [null,CHPC,2933,2174] [null,CIGD,2618,2399] [null,CJCB,2728,2367] [null,CJLA,1350,1732] [null,CLAE,2578,2329] [null,CLGA,1842,1588] [null,CLLB,3418,2657] [null,CLOB,3115,2560] [null,CMAD,1991,2243] [null,CMJA,1261,1855] [null,CMLA,3288,2753] [null,CMPD,1320,1676] [null,CNGB,2340,2118] [null,CNHD,3519,3348] [null,CNPC,2561,1948] [null,DCPC,2664,2627] [null,DDHA,1313,1926] [null,DDND,1109,835] [null,DEAA,2141,1847] [null,DEJA,3142,2723] [null,DFKB,1470,1650] [null,DGCC,2113,2331] [null,DGFC,2201,2928] [null,DHPA,2467,2133] [null,DMBA,3085,2087] [null,DPAB,3494,3081] [null,EAEC,2133,2148] [null,EAPA,1560,1275] [null,ECGC,2815,3307] [null,EDPD,2731,1883] [null,EEEC,2024,1902] [null,EEMC,2624,2387] [null,EFFA,2047,1878] [null,EGJA,2403,2633] [null,EGMA,2784,2772] [null,EGOC,2389,1753] [null,EHFD,1940,1420] [null,EHLB,2320,2057] [null,EHPA,1898,1853] [null,EIPB,2930,2326] [null,EJAE,2582,1836] [null,EJIB,2257,1681] [null,EJJA,2791,1941] [null,EJJD,3410,2405] [null,EJNC,2472,2067] [null,EJPD,1219,1229] [null,EKEB,2047,1713] [null,EMEA,2502,1897] [null,EMKC,2362,2042] [null,ENAC,2011,1909] [null,ENFB,2507,2162] [null,ENOD,3371,2709] Expected results: +--+--++---+ | W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER | +--+--++---+ | Bad cards must make. | AACD | 1889 | 2168 | | Bad cards must make. | AAHD | 2739 | 2039 | | Bad cards must make. | ABDA | 1717 | 1782 | | Bad cards must make. | ACGC | 2296 | 2276 | | Bad cards must make. | ACKC | 2443 | 1878 | | Bad cards must make. | ACOB | 2705 | 2428 | | Bad cards must make. | ADGB | 2242 | 2759 | | Bad cards must make. | ADKB | 2138 | 2456 | | Bad cards must make. | AEAD | 2914 | 2237 | | Bad cards must make. | AEOC | 1797 | 2073 | | Bad cards must make. | AFAC | 2058 | 2734 | | Bad cards must make. | AFAD | 2173 | 2515 | | Bad cards must make. | AFDC | 2309 | 2277 |
[jira] [Updated] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-13865: --- Labels: tpcds-result-mismatch (was: ) > TPCDS query 87 returns wrong results compared to TPC official result set > - > > Key: SPARK-13865 > URL: https://issues.apache.org/jira/browse/SPARK-13865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 87 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 47555, answer set expects 47298. > Actual results: > [47555] > Expected: > +---+ > | 1 | > +---+ > | 47298 | > +---+ > Query used: > -- start query 87 in stream 0 using template query87.tpl and seed > QUALIFICATION > select count(*) > from > (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as > ddate1, 1 as notnull1 >from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp1 >left outer join > (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as > ddate2, 1 as notnull2 >from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp2 > on (tmp1.cln1 = tmp2.cln2) > and (tmp1.cfn1 = tmp2.cfn2) > and (tmp1.ddate1= tmp2.ddate2) >left outer join > (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as > ddate3, 1 as notnull3 >from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp3 > on (tmp1.cln1 = tmp3.cln3) > and (tmp1.cfn1 = tmp3.cfn3) > and (tmp1.ddate1= tmp3.ddate3) > where > notnull2 is null and notnull3 is null > ; > -- end query 87 in stream 0 using template query87.tpl -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13864) TPCDS query 74 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-13864: --- Labels: tpcds-result-mismatch (was: ) > TPCDS query 74 returns wrong results compared to TPC official result set > - > > Key: SPARK-13864 > URL: https://issues.apache.org/jira/browse/SPARK-13864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 74 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > Spark SQL has right answer but in wrong order (and there is an 'order by' in > the query). > Actual results: > [BLEIBAAA,Paula,Wakefield] > [DFIEBAAA,John,Gray] > [OCLBBAAA,null,null] > [PKBCBAAA,Andrea,White] > [EJDL,Alice,Wright] > [FACE,Priscilla,Miller] > [LFKK,Ignacio,Miller] > [LJNCBAAA,George,Gamez] > [LIOP,Derek,Allen] > [EADJ,Ruth,Carroll] > [JGMM,Richard,Larson] > [PKIK,Wendy,Horvath] > [FJHF,Larissa,Roy] > [EPOG,Felisha,Mendes] > [EKJL,Aisha,Carlson] > [HNFH,Rebecca,Wilson] > [IBFCBAAA,Ruth,Grantham] > [OPDL,Ann,Pence] > [NIPL,Eric,Lawrence] > [OCIC,Zachary,Pennington] > [OFLC,James,Taylor] > [GEHI,Tyler,Miller] > [CADP,Cristobal,Thomas] > [JIAL,Santos,Gutierrez] > [PMMBBAAA,Paul,Jordan] > [DIIO,David,Carroll] > [DFKABAAA,Latoya,Craft] > [HMOI,Grace,Henderson] > [PPIBBAAA,Candice,Lee] > [JONHBAAA,Warren,Orozco] > [GNDA,Terry,Mcdowell] > [CIJM,Elizabeth,Thomas] > [DIJGBAAA,Ruth,Sanders] > [NFBDBAAA,Vernice,Fernandez] > [IDKF,Michael,Mack] > [IMHB,Kathy,Knowles] > [LHMC,Brooke,Nelson] > [CFCGBAAA,Marcus,Sanders] > [NJHCBAAA,Christopher,Schreiber] > [PDFB,Terrance,Banks] > [ANFA,Philip,Banks] > [IADEBAAA,Diane,Aldridge] > [ICHF,Linda,Mccoy] > [CFEN,Christopher,Dawson] > [KOJJ,Gracie,Mendoza] > [FOJA,Don,Castillo] > [FGPG,Albert,Wadsworth] > [KJBK,Georgia,Scott] > [EKFP,Annika,Chin] > [IBAEBAAA,Sandra,Wilson] > [MFFL,Margret,Gray] > [KNAK,Gladys,Banks] > [CJDI,James,Kerr] > [OBADBAAA,Elizabeth,Burnham] > [AMGD,Kenneth,Harlan] > [HJLA,Audrey,Beltran] > [AOPFBAAA,Jerry,Fields] > [CNAGBAAA,Virginia,May] > [HGOABAAA,Sonia,White] > [KBCABAAA,Debra,Bell] > [NJAG,Allen,Hood] > [MMOBBAAA,Margaret,Smith] > [NGDBBAAA,Carlos,Jewell] > [FOGI,Michelle,Greene] > [JEKFBAAA,Norma,Burkholder] > [OCAJ,Jenna,Staton] > [PFCL,Felicia,Neville] > [DLHBBAAA,Henry,Bertrand] > [DBEFBAAA,Bennie,Bowers] > [DCKO,Robert,Gonzalez] > [KKGE,Katie,Dunbar] > [GFMDBAAA,Kathleen,Gibson] > [IJEM,Charlie,Cummings] > [KJBL,Kerry,Davis] > [JKBN,Julie,Kern] > [MDCA,Louann,Hamel] > [EOAK,Molly,Benjamin] > [IBHH,Jennifer,Ballard] > [PJEN,Ashley,Norton] > [KLHHBAAA,Manuel,Castaneda] > [IMHHBAAA,Lillian,Davidson] > [GHPBBAAA,Nick,Mendez] > [BNBB,Irma,Smith] > [FBAH,Michael,Williams] > [PEHEBAAA,Edith,Molina] > [FMHI,Emilio,Darling] > [KAEC,Milton,Mackey] > [OCDJ,Nina,Sanchez] > [FGIG,Eduardo,Miller] > [FHACBAAA,null,null] > [HMJN,Ryan,Baptiste] > [HHCABAAA,William,Stewart] > Expected results: > +--+-++ > | CUSTOMER_ID | CUSTOMER_FIRST_NAME | CUSTOMER_LAST_NAME | > +--+-++ > | AMGD | Kenneth | Harlan | > | ANFA | Philip | Banks | > | AOPFBAAA | Jerry | Fields | > | BLEIBAAA | Paula | Wakefield | > | BNBB | Irma| Smith | > | CADP | Cristobal | Thomas | > | CFCGBAAA | Marcus | Sanders| > | CFEN | Christopher | Dawson | > | CIJM | Elizabeth | Thomas | >
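On the ordering mismatches reported here and in SPARK-13862: the rows agree as an unordered set, so until the ordering bug itself is fixed, one workaround for validation is to re-sort both the answer set and the engine output on the full select list before diffing. A small sketch, where result_q74 is a hypothetical table holding the query 74 output:
{noformat}
select customer_id, customer_first_name, customer_last_name
from result_q74
order by customer_id, customer_first_name, customer_last_name;
{noformat}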
[jira] [Updated] (SPARK-13862) TPCDS query 49 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-13862: --- Labels: tpcds-result-mismatch (was: ) > TPCDS query 49 returns wrong results compared to TPC official result set > - > > Key: SPARK-13862 > URL: https://issues.apache.org/jira/browse/SPARK-13862 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 49 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL has right answer but in wrong order (and there is an 'order by' in > the query). > Actual results: > store,9797,0.8000,2,2] > [store,12641,0.81609195402298850575,3,3] > [store,6661,0.92207792207792207792,7,7] > [store,13013,0.94202898550724637681,8,8] > [store,9029,1.,10,10] > [web,15597,0.66197183098591549296,3,3] > [store,14925,0.96470588235294117647,9,9] > [store,4063,1.,10,10] > [catalog,8929,0.7625,7,7] > [store,11589,0.82653061224489795918,6,6] > [store,1171,0.82417582417582417582,5,5] > [store,9471,0.7750,1,1] > [catalog,12577,0.65591397849462365591,3,3] > [web,97,0.90361445783132530120,9,8] > [web,85,0.85714285714285714286,8,7] > [catalog,361,0.74647887323943661972,5,5] > [web,2915,0.69863013698630136986,4,4] > [web,117,0.9250,10,9] > [catalog,9295,0.77894736842105263158,9,9] > [web,3305,0.7375,6,16] > [catalog,16215,0.79069767441860465116,10,10] > [web,7539,0.5900,1,1] > [catalog,17543,0.57142857142857142857,1,1] > [catalog,3411,0.71641791044776119403,4,4] > [web,11933,0.71717171717171717172,5,5] > [catalog,14513,0.63541667,2,2] > [store,15839,0.81632653061224489796,4,4] > [web,3337,0.62650602409638554217,2,2] > [web,5299,0.92708333,11,10] > [catalog,8189,0.74698795180722891566,6,6] > [catalog,14869,0.77173913043478260870,8,8] > [web,483,0.8000,7,6] > Expected results: > +-+---++-+---+ > | CHANNEL | ITEM | RETURN_RATIO | RETURN_RANK | CURRENCY_RANK | > +-+---++-+---+ > | catalog | 17543 | .5714285714285714 | 1 | 1 | > | catalog | 14513 | .63541666 | 2 | 2 | > | catalog | 12577 | .6559139784946236 | 3 | 3 | > | catalog | 3411 | .7164179104477611 | 4 | 4 | > | catalog | 361 | .7464788732394366 | 5 | 5 | > | catalog | 8189 | .7469879518072289 | 6 | 6 | > | catalog | 8929 | .7625 | 7 | 7 | > | catalog | 14869 | .7717391304347826 | 8 | 8 | > | catalog | 9295 | .7789473684210526 | 9 | 9 | > | catalog | 16215 | .7906976744186046 | 10 |10 | > | store | 9471 | .7750 | 1 | 1 | > | store | 9797 | .8000 | 2 | 2 | > | store | 12641 | .8160919540229885 | 3 | 3 | > | store | 15839 | .8163265306122448 | 4 | 4 | > | store | 1171 | .8241758241758241 | 5 | 5 | > | store | 11589 | .8265306122448979 | 6 | 6 | > | store | 6661 | .9220779220779220 | 7 | 7 | > | store | 13013 | .9420289855072463 | 8 | 8 | > | store | 14925 | .9647058823529411 | 9 | 9 | > | store | 4063 | 1. | 10 |10 | > | store | 9029 | 1. | 10 |10 | > | web | 7539 | .5900 | 1 | 1 | > | web | 3337 | .6265060240963855 | 2 | 2 | > | web | 15597 | .6619718309859154 | 3 | 3 | > | web | 2915 | .6986301369863013 | 4 | 4 | > | web | 11933 | .7171717171717171 | 5 | 5 | > | web | 3305 | .7375 | 6 |16 | > | web | 483 | .8000 | 7 | 6 | > | web |85 | .8571428571428571 | 8 | 7 | > | web |97 | .9036144578313253 | 9 | 8 | > | web | 117 | .9250 | 10 | 9 | > | web | 5299 | .92708333 | 11 |10 | > +-+---++-+---+ > Query used: > -- start query 49 in stream 0 usin
[jira] [Updated] (SPARK-13861) TPCDS query 40 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-13861: --- Labels: tpcds-result-mismatch (was: ) > TPCDS query 40 returns wrong results compared to TPC official result set > - > > Key: SPARK-13861 > URL: https://issues.apache.org/jira/browse/SPARK-13861 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 40 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL missing at least one row (grep for ABBD) ; I believe 5 > rows are missing in total. > Actual results: > [TN,AABD,0.0,-82.060899353] > [TN,AACD,-216.54000234603882,158.0399932861328] > [TN,AAHD,186.54999542236328,0.0] > [TN,AALA,0.0,48.2254223633] > [TN,ACGC,63.67999863624573,0.0] > [TN,ACHC,102.6830517578,51.8838964844] > [TN,ACKC,128.9235150146,44.8169482422] > [TN,ACLD,205.43999433517456,-948.619930267334] > [TN,ACOB,207.32000732421875,24.88389648438] > [TN,ACPD,87.75,53.9900016784668] > [TN,ADGB,44.310001373291016,222.4800033569336] > [TN,ADKB,0.0,-471.8699951171875] > [TN,AEAD,58.2400016784668,0.0] > [TN,AEOC,19.9084741211,214.7076293945] > [TN,AFAC,271.8199977874756,163.1699981689453] > [TN,AFAD,2.349046325684,28.3169482422] > [TN,AFDC,-378.0499496459961,-303.26999282836914] > [TN,AGID,307.6099967956543,-19.29915527344] > [TN,AHDE,80.574468689,-476.7200012207031] > [TN,AHHA,8.27457763672,155.1276565552] > [TN,AHJB,39.23999857902527,0.0] > [TN,AIEC,82.3675750732,3.910858306885] > [TN,AIEE,20.39618530273,-151.08999633789062] > [TN,AIMC,24.46313354492,-150.330517578] > [TN,AJAC,49.0915258789,82.084741211] > [TN,AJCA,121.18000221252441,63.779998779296875] > [TN,AJKB,27.94534057617,8.97267028809] > [TN,ALBE,88.2599983215332,30.22542236328] > [TN,ALCE,93.5245776367,92.0198092651] > [TN,ALEC,64.179019165,15.1584741211] > [TN,ALNB,4.19809265137,148.27000427246094] > [TN,AMBE,28.44534057617,0.0] > [TN,AMPB,0.0,131.92999839782715] > [TN,ANFE,0.0,-137.3400115966797] > [TN,AOIB,150.40999603271484,254.288058548] > [TN,APJB,45.2745776367,334.482015991] > [TN,APLA,50.2076293945,29.150001049041748] > [TN,APLD,0.0,32.3838964844] > [TN,BAPD,93.41999816894531,145.8699951171875] > [TN,BBID,296.774577637,30.95084472656] > [TN,BDCE,-1771.0800704956055,-54.779998779296875] > [TN,BDDD,111.12000274658203,280.5899963378906] > [TN,BDJA,0.0,79.5423706055] > [TN,BEFD,0.0,3.429475479126] > [TN,BEOD,269.838964844,297.5800061225891] > [TN,BFMB,110.82999801635742,-941.4000930786133] > [TN,BFNA,47.8661035156,0.0] > [TN,BFOC,46.3415258789,83.5245776367] > [TN,BHPC,27.378392334,77.61999893188477] > [TN,BIDB,196.6199951171875,5.57171661377] > [TN,BIGB,425.3399963378906,0.0] > [TN,BIJB,209.6300048828125,0.0] > [TN,BJFE,7.32923706055,55.1584741211] > [TN,BKFA,0.0,138.14000129699707] > [TN,BKMC,27.17076293945,54.970001220703125] > [TN,BLDE,170.28999400138855,0.0] > [TN,BNHB,58.0594277954,-337.8899841308594] > [TN,BNID,54.41525878906,35.01504089355] > [TN,BNLA,0.0,168.37999629974365] > [TN,BNLD,0.0,96.4084741211] > [TN,BNMC,202.40999698638916,49.52999830245972] > [TN,BOCC,4.73019073486,69.83999633789062] > [TN,BOMB,63.66999816894531,163.49000668525696] > [TN,CAAA,121.91000366210938,0.0] > [TN,CAAD,-1107.6099338531494,0.0] > [TN,CAJC,115.8046594238,173.0519073486] > [TN,CBCD,18.94534057617,226.38000106811523] > [TN,CBFA,0.0,97.41000366210938] > [TN,CBIA,2.14104904175,84.66000366210938] > 
[TN,CBPB,95.44000244140625,26.6830517578] > [TN,CCAB,160.43000602722168,135.8661035156] > [TN,CCHD,0.0,121.62000274658203] > [TN,
[jira] [Updated] (SPARK-13863) TPCDS query 66 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-13863: --- Labels: tpcds-result-mismatch (was: ) > TPCDS query 66 returns wrong results compared to TPC official result set > - > > Key: SPARK-13863 > URL: https://issues.apache.org/jira/browse/SPARK-13863 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 66 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > Aggregations slightly off -- eg. JAN_SALES column of "Doors canno" row - > SparkSQL returns 6355232.185385704, expected 6355232.31 > Actual results: > [null,null,Fairview,Williamson County,TN,United > States,DHL,BARIAN,2001,9597806.850651741,1.1121820530080795E7,8670867.81564045,8994785.945689201,1.088724806326294E7,1.4187671518377304E7,9732598.460139751,1.9798897020946026E7,2.1007842467959404E7,2.149551364927292E7,3.479566905774999E7,3.3122997954660416E7,null,null,null,null,null,null,null,null,null,null,null,null,2.191359469742E7,3.2518476414670944E7,2.48856624883976E7,2.5698343830046654E7,3.373591080598068E7,3.552703167087555E7,2.5465193481492043E7,5.362323870799959E7,5.1409986978201866E7,5.415917383586836E7,9.222704311805725E7,8.343539111531019E7] > [Bad cards must make.,621234,Fairview,Williamson County,TN,United > States,DHL,BARIAN,2001,9506753.593884468,8008140.429557085,6116769.711647987,1.1973045160133362E7,7756254.925520897,5352978.574095726,1.373399613500309E7,1.6418794411203384E7,1.7212743279764652E7,1.704270732417488E7,3.43049358570323E7,3.532416421229005E7,15.30301560102066,12.890698882477594,9.846160563729589,19.273003667109915,12.485238936569628,8.61668642427125,22.107605403121994,26.429323590150222,27.707342611261865,27.433635834765774,55.22063482847413,56.86128610521969,3.0534943928382874E7,2.4481686250203133E7,2.217871080008793E7,2.569579825610423E7,2.995490355044937E7,1.8084140250833035E7,3.0805576178061485E7,4.7156887432252884E7,5.115858869637826E7,5.5759943171424866E7,8.625354428184557E7,8.345155532035494E7] > [Conventional childr,977787,Fairview,Williamson County,TN,United > States,DHL,BARIAN,2001,8860645.460736752,1.441581376543355E7,6761497.232810497,1.1820654735879421E7,8246260.600341797,6636877.482845306,1.1434492123092413E7,2.5673812070380323E7,2.307420611785E7,2.1834582007320404E7,2.6894900596512794E7,3.357509177109933E7,9.061938296108202,14.743306840276613,6.9151024024767125,12.08919195681618,8.43359606984118,6.787651587559771,11.694256645969329,26.257060147435304,23.598398219562938,22.330611889215547,27.505888906799534,34.337838170377935,2.3836085704864502E7,3.20733132298584E7,2.503790437837982E7,2.2659895963564873E7,2.175740087420273E7,2.4451608012176514E7,2.1933001734852314E7,5.59967034604629E7,5.737188052299309E7,6.208721474336243E7,8.284991027382469E7,8.897031933202875E7] > [Doors canno,294242,Fairview,Williamson County,TN,United > 
States,DHL,BARIAN,2001,6355232.185385704,1.0198920296742141E7,1.0246200903741479E7,1.2209716492156029E7,8566998.262890816,8806316.75278151,9789405.6993227,1.646658496404171E7,2.6443785668474197E7,2.701604788320923E7,3.366058958298761E7,2.7462468750599384E7,21.59865751791282,34.66167405313361,34.822360178837414,41.495491779406166,29.115484067165177,29.928823053070296,33.26991285854059,55.96272783641258,89.87087386734116,91.81574310672585,114.39763726112386,93.33293258813964,2.2645142994330406E7,2.448725452685547E7,2.4925759290207863E7,3.0503655031727314E7,2.6558160276379585E7,2.0976233452690125E7,2.9895796101181984E7,5.600219855566597E7,5.348815865275085E7,7.628723580410767E7,8.248374754962921E7,8.808826726185608E7] > [Important issues liv,138504,Fairview,Williamson County,TN,United > States,DHL,BARIAN,2001,1.1748784594717264E7,1.435130566355586E7,9896470.867572784,7990874.805492401,8879247.840401173,7362383.04259038,1.0011144724414349E7,1.7741201390372872E7,2.1346976135887742E7,1.8074978020030975E7,2.967512567988676E7,3.2545325348875403E7,84.8263197793368,103.6165429414014,71.45259969078715,57.694180713137534,64.10824120892663,53.156465102743454,72.28054586448297,128.09161750110374,154.12534032149065,130.5014874662896,214.25464737398747,234.97751219369408,2.7204167203903973E7,2.598037822457385E7,1.9943398915802002E7,2.5710421112384796E7,1.948448105346489E7,2.6346611484448195E7,2.5075158296625137E7,5.409477817043829E7,4.106673223178029E7,5.454705814340496E7,7.246596285337901E7,9.277032812079096E7] > Expected results: > +--+---+--+---+-+---+---+--+++++
[jira] [Commented] (SPARK-13862) TPCDS query 49 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193800#comment-15193800 ] JESSE CHEN commented on SPARK-13862: tpcds-result-mismatch > TPCDS query 49 returns wrong results compared to TPC official result set > - > > Key: SPARK-13862 > URL: https://issues.apache.org/jira/browse/SPARK-13862 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > > Testing Spark SQL using TPC queries. Query 49 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL has right answer but in wrong order (and there is an 'order by' in > the query). > Actual results: > store,9797,0.8000,2,2] > [store,12641,0.81609195402298850575,3,3] > [store,6661,0.92207792207792207792,7,7] > [store,13013,0.94202898550724637681,8,8] > [store,9029,1.,10,10] > [web,15597,0.66197183098591549296,3,3] > [store,14925,0.96470588235294117647,9,9] > [store,4063,1.,10,10] > [catalog,8929,0.7625,7,7] > [store,11589,0.82653061224489795918,6,6] > [store,1171,0.82417582417582417582,5,5] > [store,9471,0.7750,1,1] > [catalog,12577,0.65591397849462365591,3,3] > [web,97,0.90361445783132530120,9,8] > [web,85,0.85714285714285714286,8,7] > [catalog,361,0.74647887323943661972,5,5] > [web,2915,0.69863013698630136986,4,4] > [web,117,0.9250,10,9] > [catalog,9295,0.77894736842105263158,9,9] > [web,3305,0.7375,6,16] > [catalog,16215,0.79069767441860465116,10,10] > [web,7539,0.5900,1,1] > [catalog,17543,0.57142857142857142857,1,1] > [catalog,3411,0.71641791044776119403,4,4] > [web,11933,0.71717171717171717172,5,5] > [catalog,14513,0.63541667,2,2] > [store,15839,0.81632653061224489796,4,4] > [web,3337,0.62650602409638554217,2,2] > [web,5299,0.92708333,11,10] > [catalog,8189,0.74698795180722891566,6,6] > [catalog,14869,0.77173913043478260870,8,8] > [web,483,0.8000,7,6] > Expected results: > +-+---++-+---+ > | CHANNEL | ITEM | RETURN_RATIO | RETURN_RANK | CURRENCY_RANK | > +-+---++-+---+ > | catalog | 17543 | .5714285714285714 | 1 | 1 | > | catalog | 14513 | .63541666 | 2 | 2 | > | catalog | 12577 | .6559139784946236 | 3 | 3 | > | catalog | 3411 | .7164179104477611 | 4 | 4 | > | catalog | 361 | .7464788732394366 | 5 | 5 | > | catalog | 8189 | .7469879518072289 | 6 | 6 | > | catalog | 8929 | .7625 | 7 | 7 | > | catalog | 14869 | .7717391304347826 | 8 | 8 | > | catalog | 9295 | .7789473684210526 | 9 | 9 | > | catalog | 16215 | .7906976744186046 | 10 |10 | > | store | 9471 | .7750 | 1 | 1 | > | store | 9797 | .8000 | 2 | 2 | > | store | 12641 | .8160919540229885 | 3 | 3 | > | store | 15839 | .8163265306122448 | 4 | 4 | > | store | 1171 | .8241758241758241 | 5 | 5 | > | store | 11589 | .8265306122448979 | 6 | 6 | > | store | 6661 | .9220779220779220 | 7 | 7 | > | store | 13013 | .9420289855072463 | 8 | 8 | > | store | 14925 | .9647058823529411 | 9 | 9 | > | store | 4063 | 1. | 10 |10 | > | store | 9029 | 1. | 10 |10 | > | web | 7539 | .5900 | 1 | 1 | > | web | 3337 | .6265060240963855 | 2 | 2 | > | web | 15597 | .6619718309859154 | 3 | 3 | > | web | 2915 | .6986301369863013 | 4 | 4 | > | web | 11933 | .7171717171717171 | 5 | 5 | > | web | 3305 | .7375 | 6 |16 | > | web | 483 | .8000 | 7 | 6 | > | web |85 | .8571428571428571 | 8 | 7 | > | web |97 | .9036144578313253 | 9 | 8 | > | web | 117 | .9250 | 10 | 9 | > | web | 5299 | .92708333 | 11 |10 | > +-+---++-+---+ > Query used: > -- start query 49 in stream 0 using templa
[jira] [Updated] (SPARK-13860) TPCDS query 39 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-13860: --- Labels: tpcds-result-mismatch (was: ) > TPCDS query 39 returns wrong results compared to TPC official result set > - > > Key: SPARK-13860 > URL: https://issues.apache.org/jira/browse/SPARK-13860 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 39 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > q39a - 3 extra rows in SparkSQL output (eg. > [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733]) ; q39b > - 3 extra rows in SparkSQL output (eg. > [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733]) > Actual results 39a: > [1,265,1,324.75,1.2438391781531353,1,265,2,329.0,1.0151581328149208] > [1,363,1,499.5,1.031941572270649,1,363,2,321.0,1.1411766752007977] > [1,679,1,373.75,1.0955498064867504,1,679,2,417.5,1.042970994259454] > [1,695,1,450.75,1.0835888283564505,1,695,2,368.75,1.1356494125569416] > [1,789,1,357.25,1.03450938027956,1,789,2,410.0,1.0284221852702604] > [1,815,1,216.5,1.1702270938111008,1,815,2,150.5,1.3057281471249382] > [1,827,1,271.75,1.1046890134130438,1,827,2,424.75,1.1653198631238286] > [1,1041,1,382.5,1.284808399803008,1,1041,2,424.75,1.000577271456812] > [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733] > [1,1569,1,212.0,1.630213519639535,1,1569,2,239.25,1.2641513267800557] > [1,1623,1,338.25,1.1285483279713715,1,1623,2,261.3,1.2717809002195564] > [1,2581,1,448.5,1.060429041250449,1,2581,2,476.25,1.0362984739390064] > [1,2705,1,246.25,1.0120308357959693,1,2705,2,294.7,1.0742134101583702] > [1,3131,1,393.75,1.0037613982687346,1,3131,2,480.5,1.0669144981482768] > [1,3291,1,374.5,1.195189833087008,1,3291,2,265.25,1.572972106948466] > [1,3687,1,279.75,1.4260909081999698,1,3687,2,157.25,1.4534340882531784] > [1,4955,1,495.25,1.0318296151625301,1,4955,2,322.5,1.1693842343776149] > [1,5627,1,282.75,1.5657032366359889,1,5627,2,297.5,1.2084286841430678] > [1,7017,1,175.5,1.0427454215644427,1,7017,2,321.3,1.0183356932936254] > [1,7317,1,366.3,1.025466403613547,1,7317,2,378.0,1.2172513189920555] > [1,7569,1,430.5,1.0874396852180854,1,7569,2,360.25,1.047005559314515] > [1,7999,1,166.25,1.7924231710846223,1,7999,2,375.3,1.008092263550718] > [1,8319,1,306.75,1.1615378040478215,1,8319,2,276.0,1.1420996385609428] > [1,8443,1,327.75,1.256718374192724,1,8443,2,332.5,1.0044167259988928] > [1,8583,1,319.5,1.024108893111539,1,8583,2,310.25,1.2358813775861328] > [1,8591,1,398.0,1.1478168692042447,1,8591,2,355.75,1.0024472149348966] > [1,8611,1,300.5,1.5191545184147954,1,8611,2,243.75,1.2342122780960432] > [1,9081,1,367.0,1.0878932141280895,1,9081,2,435.0,1.0330530776324107] > [1,9357,1,351.7,1.1902922622025887,1,9357,2,427.0,1.0438583026358363] > [1,9449,1,406.25,1.0183183104803557,1,9449,2,175.0,1.0544779796296408] > [1,9713,1,242.5,1.1035044355064203,1,9713,2,393.0,1.208474608738988] > [1,9809,1,479.0,1.0189602512117633,1,9809,2,317.5,1.0614142074924882] > [1,9993,1,417.75,1.0099832672435247,1,9993,2,204.5,1.552870745350107] > [1,10127,1,239.75,1.0561770587198123,1,10127,2,359.25,1.1857980403742183] > [1,11159,1,407.25,1.0785507154337637,1,11159,2,250.0,1.334757905639321] > [1,11277,1,211.25,1.2615858275316627,1,11277,2,330.75,1.0808767951625093] > [1,11937,1,344.5,1.085804026843784,1,11937,2,200.34,1.0638527063883725] > 
[1,12373,1,387.75,1.1014904822941258,1,12373,2,306.0,1.0761744390394028] > [1,12471,1,365.25,1.0607570183728479,1,12471,2,327.25,1.0547560580567852] > [1,12625,1,279.0,1.3016560542373208,1,12625,2,443.25,1.0604958838068959] > [1,12751,1,280.75,1.10833057888089,1,12751,2,369.3,1.3416504398884601] > [1,12779,1,331.0,1.041690207320035,1,12779,2,359.0,1.028978056175258] > [1,13077,1,367.7,1.345523904195734,1,13077,2,358.7,1.5132429058096555] > [1,13191,1,260.25,1.063569632291568,1,13191,2,405.0,1.0197999172180061] > [1,13561,1,335.25,1.2609616961776389,1,13561,2,240.0,1.0513604502245155] > [1,13935,1,311.75,1.0399289695412326,1,13935,2,275.0,1.0367527180321774] > [1,14687,1,358.0,1.4369356919381713,1,14687,2,187.0,1.5493631531474956] > [1,14719,1,209.0,1.0411509639707628,1,14719,2,489.0,1.376616882800804] > [1,15345,1,148.5,1.5295784035794024,1,15345,2,246.5,1.5087987747231526] > [1,15427,1,482.75,1.0124238928335043,1,15427,2,333.25,1.2724770126308678] > [1,15647,1,201.66,1.2857931876095743,1,15647,2,249.25,1.3648172990142162] > [1,16079,1,280.5,1.2444757416128578,1,16079,2,361.25,1.0737
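One observation on q39: the extra SparkSQL rows all carry NaN in the covariance position. Spark SQL's NaN semantics treat NaN as equal to itself and larger than any other numeric value, so a NaN coefficient of variation would pass a cov-greater-than-threshold filter that engines with standard floating-point comparison semantics reject. If that is the cause here, the extra rows could be excluded with isnan(). A hedged sketch:
{noformat}
-- Under Spark's NaN ordering this comparison is expected to be true:
select cast('NaN' as double) > 1.0;

-- Hypothetical guard added to the q39 filter to drop NaN covariances:
-- ... where inv1.cov > 1.0 and not isnan(inv1.cov)
{noformat}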
[jira] [Updated] (SPARK-13859) TPCDS query 38 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-13859: --- Labels: tpcds-result-mismatch (was: ) > TPCDS query 38 returns wrong results compared to TPC official result set > - > > Key: SPARK-13859 > URL: https://issues.apache.org/jira/browse/SPARK-13859 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 38 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 0, answer set reports 107. > Actual results: > [0] > Expected: > +-+ > | 1 | > +-+ > | 107 | > +-+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13858) TPCDS query 21 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-13858: --- Labels: tpcds-result-mismatch (was: ) > TPCDS query 21 returns wrong results compared to TPC official result set > - > > Key: SPARK-13858 > URL: https://issues.apache.org/jira/browse/SPARK-13858 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 21 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL missing at least one row (grep for ABDA) ; I believe 2 > other rows are missing as well. > Actual results: > [null,AABD,2565,1922] > [null,AAHD,2956,2052] > [null,AALA,2042,1793] > [null,ACGC,2373,1771] > [null,ACKC,2321,1856] > [null,ACOB,1504,1397] > [null,ADKB,1820,2163] > [null,AEAD,2631,1965] > [null,AEOC,1659,1798] > [null,AFAC,1965,1705] > [null,AFAD,1769,1313] > [null,AHDE,2700,1985] > [null,AHHA,1578,1082] > [null,AIEC,1756,1804] > [null,AIMC,3603,2951] > [null,AJAC,2109,1989] > [null,AJKB,2573,3540] > [null,ALBE,3458,2992] > [null,ALCE,1720,1810] > [null,ALEC,2569,1946] > [null,ALNB,2552,1750] > [null,ANFE,2022,2269] > [null,AOIB,2982,2540] > [null,APJB,2344,2593] > [null,BAPD,2182,2787] > [null,BDCE,2844,2069] > [null,BDDD,2417,2537] > [null,BDJA,1584,1666] > [null,BEOD,2141,2649] > [null,BFCC,2745,2020] > [null,BFMB,1642,1364] > [null,BHPC,1923,1780] > [null,BIDB,1956,2836] > [null,BIGB,2023,2344] > [null,BIJB,1977,2728] > [null,BJFE,1891,2390] > [null,BLDE,1983,1797] > [null,BNID,2485,2324] > [null,BNLD,2385,2786] > [null,BOMB,2291,2092] > [null,CAAA,2233,2560] > [null,CBCD,1540,2012] > [null,CBIA,2394,2122] > [null,CBPB,1790,1661] > [null,CCMD,2654,2691] > [null,CDBC,1804,2072] > [null,CFEA,1941,1567] > [null,CGFD,2123,2265] > [null,CHPC,2933,2174] > [null,CIGD,2618,2399] > [null,CJCB,2728,2367] > [null,CJLA,1350,1732] > [null,CLAE,2578,2329] > [null,CLGA,1842,1588] > [null,CLLB,3418,2657] > [null,CLOB,3115,2560] > [null,CMAD,1991,2243] > [null,CMJA,1261,1855] > [null,CMLA,3288,2753] > [null,CMPD,1320,1676] > [null,CNGB,2340,2118] > [null,CNHD,3519,3348] > [null,CNPC,2561,1948] > [null,DCPC,2664,2627] > [null,DDHA,1313,1926] > [null,DDND,1109,835] > [null,DEAA,2141,1847] > [null,DEJA,3142,2723] > [null,DFKB,1470,1650] > [null,DGCC,2113,2331] > [null,DGFC,2201,2928] > [null,DHPA,2467,2133] > [null,DMBA,3085,2087] > [null,DPAB,3494,3081] > [null,EAEC,2133,2148] > [null,EAPA,1560,1275] > [null,ECGC,2815,3307] > [null,EDPD,2731,1883] > [null,EEEC,2024,1902] > [null,EEMC,2624,2387] > [null,EFFA,2047,1878] > [null,EGJA,2403,2633] > [null,EGMA,2784,2772] > [null,EGOC,2389,1753] > [null,EHFD,1940,1420] > [null,EHLB,2320,2057] > [null,EHPA,1898,1853] > [null,EIPB,2930,2326] > [null,EJAE,2582,1836] > [null,EJIB,2257,1681] > [null,EJJA,2791,1941] > [null,EJJD,3410,2405] > [null,EJNC,2472,2067] > [null,EJPD,1219,1229] > [null,EKEB,2047,1713] > [null,EMEA,2502,1897] > [null,EMKC,2362,2042] > [null,ENAC,2011,1909] > [null,ENFB,2507,2162] > [null,ENOD,3371,2709] > Expected results: > +--+--++---+ > | W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER | > +--+--++---+ > | Bad cards must make. | AACD | 1889 | 2168 | > | Bad cards must make. | AAHD | 2739 | 2039 | > | Bad cards must make. | ABDA | 1717 |
[jira] [Updated] (SPARK-13863) TPCDS query 66 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-13863:
---
Description: Testing Spark SQL using TPC queries. Query 66 returns wrong results compared to the official result set. This is at 1GB SF (validation run).
Aggregations are slightly off -- e.g., in the JAN_SALES column of the "Doors canno" row, Spark SQL returns 6355232.185385704; expected 6355232.31.
Actual results:
[null,null,Fairview,Williamson County,TN,United States,DHL,BARIAN,2001,9597806.850651741,1.1121820530080795E7,8670867.81564045,8994785.945689201,1.088724806326294E7,1.4187671518377304E7,9732598.460139751,1.9798897020946026E7,2.1007842467959404E7,2.149551364927292E7,3.479566905774999E7,3.3122997954660416E7,null,null,null,null,null,null,null,null,null,null,null,null,2.191359469742E7,3.2518476414670944E7,2.48856624883976E7,2.5698343830046654E7,3.373591080598068E7,3.552703167087555E7,2.5465193481492043E7,5.362323870799959E7,5.1409986978201866E7,5.415917383586836E7,9.222704311805725E7,8.343539111531019E7]
[Bad cards must make.,621234,Fairview,Williamson County,TN,United States,DHL,BARIAN,2001,9506753.593884468,8008140.429557085,6116769.711647987,1.1973045160133362E7,7756254.925520897,5352978.574095726,1.373399613500309E7,1.6418794411203384E7,1.7212743279764652E7,1.704270732417488E7,3.43049358570323E7,3.532416421229005E7,15.30301560102066,12.890698882477594,9.846160563729589,19.273003667109915,12.485238936569628,8.61668642427125,22.107605403121994,26.429323590150222,27.707342611261865,27.433635834765774,55.22063482847413,56.86128610521969,3.0534943928382874E7,2.4481686250203133E7,2.217871080008793E7,2.569579825610423E7,2.995490355044937E7,1.8084140250833035E7,3.0805576178061485E7,4.7156887432252884E7,5.115858869637826E7,5.5759943171424866E7,8.625354428184557E7,8.345155532035494E7]
[Conventional childr,977787,Fairview,Williamson County,TN,United States,DHL,BARIAN,2001,8860645.460736752,1.441581376543355E7,6761497.232810497,1.1820654735879421E7,8246260.600341797,6636877.482845306,1.1434492123092413E7,2.5673812070380323E7,2.307420611785E7,2.1834582007320404E7,2.6894900596512794E7,3.357509177109933E7,9.061938296108202,14.743306840276613,6.9151024024767125,12.08919195681618,8.43359606984118,6.787651587559771,11.694256645969329,26.257060147435304,23.598398219562938,22.330611889215547,27.505888906799534,34.337838170377935,2.3836085704864502E7,3.20733132298584E7,2.503790437837982E7,2.2659895963564873E7,2.175740087420273E7,2.4451608012176514E7,2.1933001734852314E7,5.59967034604629E7,5.737188052299309E7,6.208721474336243E7,8.284991027382469E7,8.897031933202875E7]
[Doors canno,294242,Fairview,Williamson County,TN,United States,DHL,BARIAN,2001,6355232.185385704,1.0198920296742141E7,1.0246200903741479E7,1.2209716492156029E7,8566998.262890816,8806316.75278151,9789405.6993227,1.646658496404171E7,2.6443785668474197E7,2.701604788320923E7,3.366058958298761E7,2.7462468750599384E7,21.59865751791282,34.66167405313361,34.822360178837414,41.495491779406166,29.115484067165177,29.928823053070296,33.26991285854059,55.96272783641258,89.87087386734116,91.81574310672585,114.39763726112386,93.33293258813964,2.2645142994330406E7,2.448725452685547E7,2.4925759290207863E7,3.0503655031727314E7,2.6558160276379585E7,2.0976233452690125E7,2.9895796101181984E7,5.600219855566597E7,5.348815865275085E7,7.628723580410767E7,8.248374754962921E7,8.808826726185608E7]
[Important issues liv,138504,Fairview,Williamson County,TN,United States,DHL,BARIAN,2001,1.1748784594717264E7,1.435130566355586E7,9896470.867572784,7990874.805492401,8879247.840401173,7362383.04259038,1.0011144724414349E7,1.7741201390372872E7,2.1346976135887742E7,1.8074978020030975E7,2.967512567988676E7,3.2545325348875403E7,84.8263197793368,103.6165429414014,71.45259969078715,57.694180713137534,64.10824120892663,53.156465102743454,72.28054586448297,128.09161750110374,154.12534032149065,130.5014874662896,214.25464737398747,234.97751219369408,2.7204167203903973E7,2.598037822457385E7,1.9943398915802002E7,2.5710421112384796E7,1.948448105346489E7,2.6346611484448195E7,2.5075158296625137E7,5.409477817043829E7,4.106673223178029E7,5.454705814340496E7,7.246596285337901E7,9.277032812079096E7]
Expected results: (wide result table truncated in the original report; only a garbled table border survives)
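The deltas above look like the signature of double-precision summation rather than a logic bug: the expected values carry exactly two decimal places (decimal arithmetic in the reference answer set), while the actuals drift in the low-order digits, and the size of the drift depends on the order in which partial sums are combined. A minimal illustration of the difference follows; the table and column names are hypothetical, not from the benchmark:

{noformat}
-- Hypothetical repro: the same aggregation carried out in double vs decimal.
-- Partial-sum order affects the double result; the decimal result is exact.
select sum(cast(ws_sales_price as double))        as jan_sales_double,
       sum(cast(ws_sales_price as decimal(38,2))) as jan_sales_decimal
from   monthly_sales            -- hypothetical table
where  sales_month = 1;
{noformat}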
[jira] [Updated] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-13865:
---
Description: Testing Spark SQL using TPC queries. Query 87 returns wrong results compared to the official result set. This is at 1GB SF (validation run).
Spark SQL returns a count of 47555; the answer set expects 47298.
Actual results:
[47555]
Expected:
+-------+
|   1   |
+-------+
| 47298 |
+-------+
Query used:
-- start query 87 in stream 0 using template query87.tpl and seed QUALIFICATION
select count(*)
from (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as ddate1, 1 as notnull1
      from store_sales
      JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
      JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
      where d_month_seq between 1200 and 1200+11
     ) tmp1
left outer join
     (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as ddate2, 1 as notnull2
      from catalog_sales
      JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
      JOIN customer ON catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
      where d_month_seq between 1200 and 1200+11
     ) tmp2
  on (tmp1.cln1 = tmp2.cln2) and (tmp1.cfn1 = tmp2.cfn2) and (tmp1.ddate1 = tmp2.ddate2)
left outer join
     (select distinct c_last_name as cln3, c_first_name as cfn3, d_date as ddate3, 1 as notnull3
      from web_sales
      JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
      JOIN customer ON web_sales.ws_bill_customer_sk = customer.c_customer_sk
      where d_month_seq between 1200 and 1200+11
     ) tmp3
  on (tmp1.cln1 = tmp3.cln3) and (tmp1.cfn1 = tmp3.cfn3) and (tmp1.ddate1 = tmp3.ddate3)
where notnull2 is null and notnull3 is null;
-- end query 87 in stream 0 using template query87.tpl
was: Testing Spark SQL using TPC queries. Query 74 returns wrong results compared to the official result set. This is at 1GB SF (validation run). Spark SQL has the right answer but in the wrong order (and there is an 'order by' in the query).
Actual results:
[BLEIBAAA,Paula,Wakefield]
[DFIEBAAA,John,Gray]
[OCLBBAAA,null,null]
[PKBCBAAA,Andrea,White]
[EJDL,Alice,Wright]
[FACE,Priscilla,Miller]
[LFKK,Ignacio,Miller]
[LJNCBAAA,George,Gamez]
[LIOP,Derek,Allen]
[EADJ,Ruth,Carroll]
[JGMM,Richard,Larson]
[PKIK,Wendy,Horvath]
[FJHF,Larissa,Roy]
[EPOG,Felisha,Mendes]
[EKJL,Aisha,Carlson]
[HNFH,Rebecca,Wilson]
[IBFCBAAA,Ruth,Grantham]
[OPDL,Ann,Pence]
[NIPL,Eric,Lawrence]
[OCIC,Zachary,Pennington]
[OFLC,James,Taylor]
[GEHI,Tyler,Miller]
[CADP,Cristobal,Thomas]
[JIAL,Santos,Gutierrez]
[PMMBBAAA,Paul,Jordan]
[DIIO,David,Carroll]
[DFKABAAA,Latoya,Craft]
[HMOI,Grace,Henderson]
[PPIBBAAA,Candice,Lee]
[JONHBAAA,Warren,Orozco]
[GNDA,Terry,Mcdowell]
[CIJM,Elizabeth,Thomas]
[DIJGBAAA,Ruth,Sanders]
[NFBDBAAA,Vernice,Fernandez]
[IDKF,Michael,Mack]
[IMHB,Kathy,Knowles]
[LHMC,Brooke,Nelson]
[CFCGBAAA,Marcus,Sanders]
[NJHCBAAA,Christopher,Schreiber]
[PDFB,Terrance,Banks]
[ANFA,Philip,Banks]
[IADEBAAA,Diane,Aldridge]
[ICHF,Linda,Mccoy]
[CFEN,Christopher,Dawson]
[KOJJ,Gracie,Mendoza]
[FOJA,Don,Castillo]
[FGPG,Albert,Wadsworth]
[KJBK,Georgia,Scott]
[EKFP,Annika,Chin]
[IBAEBAAA,Sandra,Wilson]
[MFFL,Margret,Gray]
[KNAK,Gladys,Banks]
[CJDI,James,Kerr]
[OBADBAAA,Elizabeth,Burnham]
[AMGD,Kenneth,Harlan]
[HJLA,Audrey,Beltran]
[AOPFBAAA,Jerry,Fields]
[CNAGBAAA,Virginia,May]
[HGOABAAA,Sonia,White]
[KBCABAAA,Debra,Bell]
[NJAG,Allen,Hood]
[MMOBBAAA,Margaret,Smith]
[NGDBBAAA,Carlos,Jewell]
[FOGI,Michelle,Greene]
[JEKFBAAA,Norma,Burkholder]
[OCAJ,Jenna,Staton]
[PFCL,Felicia,Neville]
[DLHBBAAA,Henry,Bertrand]
[DBEFBAAA,Bennie,Bowers]
[DCKO,Robert,Gonzalez]
[KKGE,Katie,Dunbar]
[GFMDBAAA,Kathleen,Gibson]
[IJEM,Charlie,Cummings]
[KJBL,Kerry,Davis]
[JKBN,Julie,Kern]
[MDCA,Louann,Hamel]
[EOAK,Molly,Benjamin]
[IBHH,Jennifer,Ballard]
[PJEN,Ashley,Norton]
[KLHHBAAA,Manuel,Castaneda]
[IMHHBAAA,Lillian,Davidson]
[GHPBBAAA,N
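Worth noting on the 47555-vs-47298 gap: the official query87.tpl is written with EXCEPT, and the left-outer-join rewrite quoted above is only equivalent when no name column is NULL. EXCEPT compares rows with null-safe semantics (two NULLs compare equal), whereas the rewrite's plain equality predicates (tmp1.cln1 = tmp2.cln2, etc.) never match a NULL, so NULL-named customers can survive the anti-join and inflate the count. A sketch of the set-operator form for comparison, reconstructed from the standard template, so treat the details as assumptions:

{noformat}
-- Reconstructed shape of the official query87.tpl (EXCEPT form, illustrative):
select count(*)
from ((select distinct c_last_name, c_first_name, d_date
       from store_sales
       join date_dim on store_sales.ss_sold_date_sk = date_dim.d_date_sk
       join customer on store_sales.ss_customer_sk = customer.c_customer_sk
       where d_month_seq between 1200 and 1200+11)
      except
      (select distinct c_last_name, c_first_name, d_date
       from catalog_sales
       join date_dim on catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
       join customer on catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
       where d_month_seq between 1200 and 1200+11)
      except
      (select distinct c_last_name, c_first_name, d_date
       from web_sales
       join date_dim on web_sales.ws_sold_date_sk = date_dim.d_date_sk
       join customer on web_sales.ws_bill_customer_sk = customer.c_customer_sk
       where d_month_seq between 1200 and 1200+11)
     ) cool_cust;
{noformat}

In Spark SQL the rewrite could also be probed with the null-safe operator (<=>) in the join conditions to check whether NULL names account for the extra 257 rows.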
[jira] [Created] (SPARK-13863) TPCDS query 66 returns wrong results compared to TPC official result set
JESSE CHEN created SPARK-13863:
-----------------------------------
Summary: TPCDS query 66 returns wrong results compared to TPC official result set
Key: SPARK-13863
URL: https://issues.apache.org/jira/browse/SPARK-13863
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.6.0
Reporter: JESSE CHEN
Testing Spark SQL using TPC queries. Query 49 returns wrong results compared to the official result set. This is at 1GB SF (validation run).
Spark SQL has the right answer but in the wrong order (and there is an 'order by' in the query).
Actual results:
[store,9797,0.8000,2,2]
[store,12641,0.81609195402298850575,3,3]
[store,6661,0.92207792207792207792,7,7]
[store,13013,0.94202898550724637681,8,8]
[store,9029,1.,10,10]
[web,15597,0.66197183098591549296,3,3]
[store,14925,0.96470588235294117647,9,9]
[store,4063,1.,10,10]
[catalog,8929,0.7625,7,7]
[store,11589,0.82653061224489795918,6,6]
[store,1171,0.82417582417582417582,5,5]
[store,9471,0.7750,1,1]
[catalog,12577,0.65591397849462365591,3,3]
[web,97,0.90361445783132530120,9,8]
[web,85,0.85714285714285714286,8,7]
[catalog,361,0.74647887323943661972,5,5]
[web,2915,0.69863013698630136986,4,4]
[web,117,0.9250,10,9]
[catalog,9295,0.77894736842105263158,9,9]
[web,3305,0.7375,6,16]
[catalog,16215,0.79069767441860465116,10,10]
[web,7539,0.5900,1,1]
[catalog,17543,0.57142857142857142857,1,1]
[catalog,3411,0.71641791044776119403,4,4]
[web,11933,0.71717171717171717172,5,5]
[catalog,14513,0.63541667,2,2]
[store,15839,0.81632653061224489796,4,4]
[web,3337,0.62650602409638554217,2,2]
[web,5299,0.92708333,11,10]
[catalog,8189,0.74698795180722891566,6,6]
[catalog,14869,0.77173913043478260870,8,8]
[web,483,0.8000,7,6]
Expected results:
+---------+-------+-------------------+-------------+---------------+
| CHANNEL | ITEM  | RETURN_RATIO      | RETURN_RANK | CURRENCY_RANK |
+---------+-------+-------------------+-------------+---------------+
| catalog | 17543 | .5714285714285714 |           1 |             1 |
| catalog | 14513 | .63541666         |           2 |             2 |
| catalog | 12577 | .6559139784946236 |           3 |             3 |
| catalog |  3411 | .7164179104477611 |           4 |             4 |
| catalog |   361 | .7464788732394366 |           5 |             5 |
| catalog |  8189 | .7469879518072289 |           6 |             6 |
| catalog |  8929 | .7625             |           7 |             7 |
| catalog | 14869 | .7717391304347826 |           8 |             8 |
| catalog |  9295 | .7789473684210526 |           9 |             9 |
| catalog | 16215 | .7906976744186046 |          10 |            10 |
| store   |  9471 | .7750             |           1 |             1 |
| store   |  9797 | .8000             |           2 |             2 |
| store   | 12641 | .8160919540229885 |           3 |             3 |
| store   | 15839 | .8163265306122448 |           4 |             4 |
| store   |  1171 | .8241758241758241 |           5 |             5 |
| store   | 11589 | .8265306122448979 |           6 |             6 |
| store   |  6661 | .9220779220779220 |           7 |             7 |
| store   | 13013 | .9420289855072463 |           8 |             8 |
| store   | 14925 | .9647058823529411 |           9 |             9 |
| store   |  4063 | 1.                |          10 |            10 |
| store   |  9029 | 1.                |          10 |            10 |
| web     |  7539 | .5900             |           1 |             1 |
| web     |  3337 | .6265060240963855 |           2 |             2 |
| web     | 15597 | .6619718309859154 |           3 |             3 |
| web     |  2915 | .6986301369863013 |           4 |             4 |
| web     | 11933 | .7171717171717171 |           5 |             5 |
| web     |  3305 | .7375             |           6 |            16 |
| web     |   483 | .8000             |           7 |             6 |
| web     |    85 | .8571428571428571 |           8 |             7 |
| web     |    97 | .9036144578313253 |           9 |             8 |
| web     |   117 | .9250             |          10 |             9 |
| web     |  5299 | .92708333         |          11 |            10 |
+---------+-------+-------------------+-------------+---------------+
Query used:
-- start query 49 in stream 0 using template query49.tpl and seed QUALIFICATION
select 'web' as channel, web.item, web.return_ratio, web.return_rank, web.currency_rank
from (select item, return_ratio, currency_ratio,
             rank() over (order by return_ratio)   as return_rank,
             rank() over (order by currency_ratio) as currency_rank
      from (select ws.ws_item_sk as item
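On the ordering point: rank() assigns equal ranks to ties (note store items 4063 and 9029 both at rank 10 above), so unless the final ORDER BY breaks ties on every remaining column, the order of rows within a tied rank is implementation-defined and Spark may legitimately emit a different permutation than the reference engine. A small illustration of making such an ordering deterministic; the table and column names are hypothetical, not from query49.tpl:

{noformat}
-- Hypothetical illustration: without the trailing 'item' tie-breaker, rows
-- sharing the same return_rank may come back in any order across engines.
select channel, item, return_ratio,
       rank() over (partition by channel order by return_ratio) as return_rank
from   return_rates              -- hypothetical table
order  by channel, return_rank, item;
{noformat}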