[jira] [Commented] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(41,

2017-01-05 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15803628#comment-15803628
 ] 

JESSE CHEN commented on SPARK-19068:


Well, though it does not affect the correctness of the results, a query that 
seemingly takes only 30 minutes now taking 2.5 hours is a concern to Spark 
users. I used the 'spark-sql' shell, so until the shell exits, normal users 
will not know that the query has actually finished. Plus, Spark is hogging 
resources (memory and cores) until SparkContext exits, so this is a usability 
and trust issue.

I also think this always occurs at high data volume and on a large cluster. As 
Spark is adopted by enterprise users, this issue will be at the forefront. 

I do think there is a fundamental timing issue here. 
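To make the suspected timing issue concrete, here is a minimal Scala sketch of 
the pattern (illustrative stand-ins only, not Spark's actual LiveListenerBus 
internals): a bounded event queue whose post() drops events once the bus is 
stopped, while many executors are still reporting metrics.

{noformat}
import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.atomic.AtomicBoolean

// Toy stand-in for a listener bus: events posted after stop() are dropped
// with an error message, so a large executor count still heartbeating
// metrics during shutdown produces a flood of "already stopped" lines.
class ToyListenerBus(capacity: Int) {
  private val queue = new LinkedBlockingQueue[String](capacity)
  private val stopped = new AtomicBoolean(false)

  def post(event: String): Unit =
    if (stopped.get()) System.err.println(s"bus already stopped! dropping $event")
    else if (!queue.offer(event)) System.err.println(s"queue full, dropping $event")

  def stop(): Unit = stopped.set(true)
}

object ToyBusDemo extends App {
  val bus = new ToyListenerBus(capacity = 10000)
  bus.stop()
  // 1008 executors still sending metrics updates while the driver shuts down:
  (1 to 1008).foreach(id => bus.post(s"ExecutorMetricsUpdate($id)"))
}
{noformat}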

> Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: 
> SparkListenerBus has already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(41,WrappedArray())
> --
>
> Key: SPARK-19068
> URL: https://issues.apache.org/jira/browse/SPARK-19068
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
> Environment: RHEL 7.2
>Reporter: JESSE CHEN
> Attachments: sparklog.tar.gz
>
>
> On a large cluster with 45TB RAM and 1,000 cores, we used 1008 executors in 
> order to use all RAM and cores for a 100TB Spark SQL workload. Long-running 
> queries tend to report the following ERRORs
> {noformat}
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(136,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(853,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(395,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(736,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(439,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(16,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(307,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(51,WrappedArray())
> 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(535,WrappedArray())
> 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(63,WrappedArray())
> 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(333,WrappedArray())
> 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(484,WrappedArray())
> (omitted) 
> {noformat}
> The message itself may be a reasonable response to an already stopped 
> SparkListenerBus (so subsequent events are thrown away with that ERROR 
> message). The issue is that SparkContext does NOT exit until all these 
> ERROR/events are reported, which is a huge number in our setup -- and this 
> can take, in some cases, hours!!!
> We tried increasing the event queue size from the 10K default (Adding 
> default property: spark.scheduler.listenerbus.eventqueue.size=13), but this 
> still occurs. 






[jira] [Updated] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(41,Wr

2017-01-03 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-19068:
---
Attachment: sparklog.tar.gz

This is the Spark console output, in which you can find the settings and the 
sequence of events. At the end you will see the "never-ending" event-dropping 
messages. 

> Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: 
> SparkListenerBus has already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(41,WrappedArray())
> --
>
> Key: SPARK-19068
> URL: https://issues.apache.org/jira/browse/SPARK-19068
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
> Environment: RHEL 7.2
>Reporter: JESSE CHEN
> Attachments: sparklog.tar.gz
>
>
> On a large cluster with 45TB RAM and 1,000 cores, we used 1008 executors in 
> order to use all RAM and cores for a 100TB Spark SQL workload. Long-running 
> queries tend to report the following ERRORs
> {noformat}
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(136,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(853,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(395,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(736,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(439,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(16,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(307,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(51,WrappedArray())
> 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(535,WrappedArray())
> 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(63,WrappedArray())
> 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(333,WrappedArray())
> 16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(484,WrappedArray())
> (omitted) 
> {noformat}
> The message itself may be a reasonable response to an already stopped 
> SparkListenerBus (so subsequent events are thrown away with that ERROR 
> message). The issue is that SparkContext does NOT exit until all these 
> ERROR/events are reported, which is a huge number in our setup -- and this 
> can take, in some cases, hours!!!
> We tried increasing the event queue size from the 10K default (Adding 
> default property: spark.scheduler.listenerbus.eventqueue.size=13), but this 
> still occurs. 






[jira] [Updated] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(41,Wr

2017-01-03 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-19068:
---
Description: 
On a large cluster with 45TB RAM and 1,000 cores, we used 1008 executors in 
order to use all RAM and cores for a 100TB Spark SQL workload. Long-running 
queries tend to report the following ERRORs

{noformat}
16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already 
stopped! Dropping event SparkListenerExecutorMetricsUpdate(136,WrappedArray())
16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already 
stopped! Dropping event SparkListenerExecutorMetricsUpdate(853,WrappedArray())
16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already 
stopped! Dropping event SparkListenerExecutorMetricsUpdate(395,WrappedArray())
16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already 
stopped! Dropping event SparkListenerExecutorMetricsUpdate(736,WrappedArray())
16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already 
stopped! Dropping event SparkListenerExecutorMetricsUpdate(439,WrappedArray())
16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already 
stopped! Dropping event SparkListenerExecutorMetricsUpdate(16,WrappedArray())
16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already 
stopped! Dropping event SparkListenerExecutorMetricsUpdate(307,WrappedArray())
16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has already 
stopped! Dropping event SparkListenerExecutorMetricsUpdate(51,WrappedArray())
16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has already 
stopped! Dropping event SparkListenerExecutorMetricsUpdate(535,WrappedArray())
16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has already 
stopped! Dropping event SparkListenerExecutorMetricsUpdate(63,WrappedArray())
16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has already 
stopped! Dropping event SparkListenerExecutorMetricsUpdate(333,WrappedArray())
16/12/27 12:44:29 ERROR scheduler.LiveListenerBus: SparkListenerBus has already 
stopped! Dropping event SparkListenerExecutorMetricsUpdate(484,WrappedArray())
(omitted) 
{noformat}

The message itself may be a reasonable response to an already stopped 
SparkListenerBus (so subsequent events are thrown away with that ERROR 
message). The issue is that SparkContext does NOT exit until all these 
ERROR/events are reported, which is a huge number in our setup -- and this can 
take, in some cases, hours!!!

We tried increasing the event queue size from the 10K default (Adding default 
property: spark.scheduler.listenerbus.eventqueue.size=13), but this still 
occurs. 
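For reference, a hedged sketch of supplying that queue-size property 
programmatically; the property name comes from the log line above, while the 
value 100000 is purely illustrative (not the value used in this report):

{noformat}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative only: raise the listener bus event queue above its 10K default.
val conf = new SparkConf()
  .set("spark.scheduler.listenerbus.eventqueue.size", "100000")

val spark = SparkSession.builder().config(conf).getOrCreate()
{noformat}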



> Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: 
> SparkListenerBus has already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(41,WrappedArray())
> --
>
> Key: SPARK-19068
> URL: https://issues.apache.org/jira/browse/SPARK-19068
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
> Environment: RHEL 7.2
>Reporter: JESSE CHEN
>
> On a large cluster with 45TB RAM and 1,000 cores, we used 1008 executors in 
> order to use all RAM and cores for a 100TB Spark SQL workload. Long-running 
> queries tend to report the following ERRORs
> {noformat}
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(136,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(853,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(395,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(736,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(439,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(16,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(307,WrappedArray())
> 16/12/27 12:44:28 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpda

[jira] [Created] (SPARK-19068) Large number of executors causing a ton of ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(41,Wr

2017-01-03 Thread JESSE CHEN (JIRA)
JESSE CHEN created SPARK-19068:
--

 Summary: Large number of executors causing a ton of ERROR 
scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event 
SparkListenerExecutorMetricsUpdate(41,WrappedArray())
 Key: SPARK-19068
 URL: https://issues.apache.org/jira/browse/SPARK-19068
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.1.0
 Environment: RHEL 7.2
Reporter: JESSE CHEN









[jira] [Updated] (SPARK-18745) java.lang.IndexOutOfBoundsException running query 68 Spark SQL on (100TB)

2016-12-06 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-18745:
---
Labels:   (was: core dump)

> java.lang.IndexOutOfBoundsException running query 68 Spark SQL on (100TB)
> -
>
> Key: SPARK-18745
> URL: https://issues.apache.org/jira/browse/SPARK-18745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: JESSE CHEN
>Assignee: Kazuaki Ishizaki
>Priority: Critical
> Fix For: 2.1.0
>
>
> Running query 68 with decreased executor memory (using 12GB executors instead 
> of 24GB) on a 100TB parquet database using the Spark master dated 11/04 gave 
> an IndexOutOfBoundsException.
> The query is as follows:
> {noformat}
> [select  c_last_name
>,c_first_name
>,ca_city
>,bought_city
>,ss_ticket_number
>,extended_price
>,extended_tax
>,list_price
>  from (select ss_ticket_number
>  ,ss_customer_sk
>  ,ca_city bought_city
>  ,sum(ss_ext_sales_price) extended_price 
>  ,sum(ss_ext_list_price) list_price
>  ,sum(ss_ext_tax) extended_tax 
>from store_sales
>,date_dim
>,store
>,household_demographics
>,customer_address 
>where store_sales.ss_sold_date_sk = date_dim.d_date_sk
>  and store_sales.ss_store_sk = store.s_store_sk  
> and store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk
> and store_sales.ss_addr_sk = customer_address.ca_address_sk
> and date_dim.d_dom between 1 and 2 
> and (household_demographics.hd_dep_count = 8 or
>  household_demographics.hd_vehicle_count= -1)
> and date_dim.d_year in (2000,2000+1,2000+2)
> and store.s_city in ('Plainview','Rogers')
>group by ss_ticket_number
>,ss_customer_sk
>,ss_addr_sk,ca_city) dn
>   ,customer
>   ,customer_address current_addr
>  where ss_customer_sk = c_customer_sk
>and customer.c_current_addr_sk = current_addr.ca_address_sk
>and current_addr.ca_city <> bought_city
>  order by c_last_name
>  ,ss_ticket_number
>   limit 100]
> {noformat}
> Spark output that showed the exception:
> {noformat}
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at 
> org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:215)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123)
>   at 
> org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecuteBroadcast(Exchange.scala:61)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:36)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:68)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodeg

[jira] [Updated] (SPARK-18745) java.lang.IndexOutOfBoundsException running query 68 Spark SQL on (100TB)

2016-12-06 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-18745:
---
Description: 
Running query 68 with decreased executor memory (using 12GB executors instead 
of 24GB) on a 100TB parquet database using the Spark master dated 11/04 gave 
an IndexOutOfBoundsException.
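As a hedged sketch of the configuration dimension being varied here 
(spark.executor.memory is the standard property; the values are the ones 
stated above):

{noformat}
import org.apache.spark.SparkConf

// Baseline runs used 24GB executors; the decreased-memory run that hit the
// IndexOutOfBoundsException used 12GB executors.
val failingConf  = new SparkConf().set("spark.executor.memory", "12g")
val baselineConf = new SparkConf().set("spark.executor.memory", "24g")
{noformat}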

The query is as follows:
{noformat}
[select  c_last_name
   ,c_first_name
   ,ca_city
   ,bought_city
   ,ss_ticket_number
   ,extended_price
   ,extended_tax
   ,list_price
 from (select ss_ticket_number
 ,ss_customer_sk
 ,ca_city bought_city
 ,sum(ss_ext_sales_price) extended_price 
 ,sum(ss_ext_list_price) list_price
 ,sum(ss_ext_tax) extended_tax 
   from store_sales
   ,date_dim
   ,store
   ,household_demographics
   ,customer_address 
   where store_sales.ss_sold_date_sk = date_dim.d_date_sk
 and store_sales.ss_store_sk = store.s_store_sk  
and store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk
and store_sales.ss_addr_sk = customer_address.ca_address_sk
and date_dim.d_dom between 1 and 2 
and (household_demographics.hd_dep_count = 8 or
 household_demographics.hd_vehicle_count= -1)
and date_dim.d_year in (2000,2000+1,2000+2)
and store.s_city in ('Plainview','Rogers')
   group by ss_ticket_number
   ,ss_customer_sk
   ,ss_addr_sk,ca_city) dn
  ,customer
  ,customer_address current_addr
 where ss_customer_sk = c_customer_sk
   and customer.c_current_addr_sk = current_addr.ca_address_sk
   and current_addr.ca_city <> bought_city
 order by c_last_name
 ,ss_ticket_number
  limit 100]
{noformat}

Spark output that showed the exception:
{noformat}
org.apache.spark.SparkException: Exception thrown in awaitResult: 
at 
org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:215)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123)
at 
org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecuteBroadcast(Exchange.scala:61)
at 
org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at 
org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:36)
at 
org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:68)
at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec.consume(SortMergeJoinExec.scala:35)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec.doProduce(SortMergeJoinExec.scala:560)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org

[jira] [Created] (SPARK-18745) java.lang.IndexOutOfBoundsException running query 68 Spark SQL on (100TB)

2016-12-06 Thread JESSE CHEN (JIRA)
JESSE CHEN created SPARK-18745:
--

 Summary: java.lang.IndexOutOfBoundsException running query 68 
Spark SQL on (100TB)
 Key: SPARK-18745
 URL: https://issues.apache.org/jira/browse/SPARK-18745
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: JESSE CHEN
Assignee: Kazuaki Ishizaki
Priority: Critical
 Fix For: 2.1.0


Running a query on a 100TB parquet database using the Spark master dated 11/04 
dumps cores on Spark executors.

The query is TPCDS query 82 (though this query is not the only one that can 
produce this core dump, just the easiest one with which to re-create the error).

Spark output that showed the exception:
{noformat}
16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: 
Container marked as failed: container_e68_1478924651089_0018_01_74 on host: 
mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from 
container-launch.
Container id: container_e68_1478924651089_0018_01_74
Exit code: 134
Exception message: /bin/bash: line 1: 4031216 Aborted (core 
dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java 
-server -Xmx24576m 
-Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
 '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
-Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
 -XX:OnOutOfMemoryError='kill %p' 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 --hostname 
mer05x.svl.ibm.com --cores 2 --app-id application_1478924651089_0018 
--user-class-path 
file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
 --user-class-path 
file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
 --user-class-path 
file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
 --user-class-path 
file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
 > 
/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
 2> 
/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr

Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 Aborted 
(core dumped) 
/usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server 
-Xmx24576m 
-Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
 '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
-Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
 -XX:OnOutOfMemoryError='kill %p' 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 --hostname 
mer05x.svl.ibm.com --cores 2 --app-id application_1478924651089_0018 
--user-class-path 
file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
 --user-class-path 
file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
 --user-class-path 
file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
 --user-class-path 
file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
 > 
/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
 2> 
/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr

at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.la

[jira] [Updated] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)

2016-11-15 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-18458:
---
Fix Version/s: (was: 2.0.0)

> core dumped running Spark SQL on large data volume (100TB)
> --
>
> Key: SPARK-18458
> URL: https://issues.apache.org/jira/browse/SPARK-18458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: JESSE CHEN
>  Labels: core, dump
>
> Running a query on a 100TB parquet database using the Spark master dated 
> 11/04 dumps cores on Spark executors.
> The query is TPCDS query 82 (though this query is not the only one that can 
> produce this core dump, just the easiest one with which to re-create the 
> error).
> Spark output that showed the exception:
> {noformat}
> 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: 
> Container marked as failed: container_e68_1478924651089_0018_01_74 on 
> host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from 
> container-launch.
> Container id: container_e68_1478924651089_0018_01_74
> Exit code: 134
> Exception message: /bin/bash: line 1: 4031216 Aborted (core 
> dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java 
> -server -Xmx24576m 
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 
> --hostname mer05x.svl.ibm.com --cores 2 --app-id 
> application_1478924651089_0018 --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
>  > 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
>  2> 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr
> Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 
> Aborted (core dumped) 
> /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server 
> -Xmx24576m 
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 
> --hostname mer05x.svl.ibm.com --cores 2 --app-id 
> application_1478924651089_0018 --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
>  > 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
>  2> 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
> at org.apache.hadoop.util.Shell.run(Shell.java:456)
> at 
> org.apache.hadoop.util.Shel

[jira] [Updated] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)

2016-11-15 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-18458:
---
Description: 
Running a query on a 100TB parquet database using the Spark master dated 11/04 
dumps cores on Spark executors.

The query is TPCDS query 82 (though this query is not the only one that can 
produce this core dump, just the easiest one with which to re-create the error).

Spark output that showed the exception:
{noformat}
16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: 
Container marked as failed: container_e68_1478924651089_0018_01_74 on host: 
mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from 
container-launch.
Container id: container_e68_1478924651089_0018_01_74
Exit code: 134
Exception message: /bin/bash: line 1: 4031216 Aborted (core 
dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java 
-server -Xmx24576m 
-Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
 '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
-Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
 -XX:OnOutOfMemoryError='kill %p' 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 --hostname 
mer05x.svl.ibm.com --cores 2 --app-id application_1478924651089_0018 
--user-class-path 
file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
 --user-class-path 
file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
 --user-class-path 
file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
 --user-class-path 
file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
 > 
/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
 2> 
/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr

Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 Aborted 
(core dumped) 
/usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server 
-Xmx24576m 
-Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
 '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
-Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
 -XX:OnOutOfMemoryError='kill %p' 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 --hostname 
mer05x.svl.ibm.com --cores 2 --app-id application_1478924651089_0018 
--user-class-path 
file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
 --user-class-path 
file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
 --user-class-path 
file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
 --user-class-path 
file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
 > 
/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
 2> 
/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr

at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.r

[jira] [Updated] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)

2016-11-15 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-18458:
---
Labels: core dump  (was: tpcds-result-mismatch)

> core dumped running Spark SQL on large data volume (100TB)
> --
>
> Key: SPARK-18458
> URL: https://issues.apache.org/jira/browse/SPARK-18458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: JESSE CHEN
>  Labels: core, dump
> Fix For: 2.0.0
>
>
> Running a query on a 100TB parquet database using the Spark master dated 
> 11/04 dumps cores on Spark executors.
> The query is TPCDS query 82 (though this query is not the only one that can 
> produce this core dump, just the easiest one with which to re-create the 
> error).
> Spark output that showed the exception:
> {noformat}
> 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: 
> Container marked as failed: container_e68_1478924651089_0018_01_74 on 
> host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from 
> container-launch.
> Container id: container_e68_1478924651089_0018_01_74
> Exit code: 134
> Exception message: /bin/bash: line 1: 4031216 Aborted (core 
> dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java 
> -server -Xmx24576m 
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 
> --hostname mer05x.svl.ibm.com --cores 2 --app-id 
> application_1478924651089_0018 --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
>  > 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
>  2> 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr
> Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 
> Aborted (core dumped) 
> /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server 
> -Xmx24576m 
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 
> --hostname mer05x.svl.ibm.com --cores 2 --app-id 
> application_1478924651089_0018 --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
>  > 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
>  2> 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
> at org.apache.hadoop.util.Shell.run(Shell.java:4

[jira] [Updated] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)

2016-11-15 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-18458:
---
Affects Version/s: (was: 1.6.0)
   2.1.0

> core dumped running Spark SQL on large data volume (100TB)
> --
>
> Key: SPARK-18458
> URL: https://issues.apache.org/jira/browse/SPARK-18458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: JESSE CHEN
>  Labels: core, dump
> Fix For: 2.0.0
>
>
> Running a query on a 100TB parquet database using the Spark master dated 
> 11/04 dumps cores on Spark executors.
> The query is TPCDS query 82 (though this query is not the only one that can 
> produce this core dump, just the easiest one with which to re-create the 
> error).
> Spark output that showed the exception:
> {noformat}
> 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: 
> Container marked as failed: container_e68_1478924651089_0018_01_74 on 
> host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from 
> container-launch.
> Container id: container_e68_1478924651089_0018_01_74
> Exit code: 134
> Exception message: /bin/bash: line 1: 4031216 Aborted (core 
> dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java 
> -server -Xmx24576m 
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 
> --hostname mer05x.svl.ibm.com --cores 2 --app-id 
> application_1478924651089_0018 --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
>  > 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
>  2> 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr
> Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 
> Aborted (core dumped) 
> /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server 
> -Xmx24576m 
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 
> --hostname mer05x.svl.ibm.com --cores 2 --app-id 
> application_1478924651089_0018 --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
>  > 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
>  2> 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
> at org.apache.hadoop.util.Shell

[jira] [Created] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)

2016-11-15 Thread JESSE CHEN (JIRA)
JESSE CHEN created SPARK-18458:
--

 Summary: core dumped running Spark SQL on large data volume (100TB)
 Key: SPARK-18458
 URL: https://issues.apache.org/jira/browse/SPARK-18458
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: JESSE CHEN
 Fix For: 2.0.0


Testing Spark SQL using TPC queries. Query 49 returns wrong results compared to 
the official result set. This is at 1GB SF (validation run).

SparkSQL has the right answers but in the wrong order (and there is an 'order 
by' in the query).

Actual results:
{noformat}
[store,9797,0.8000,2,2]
[store,12641,0.81609195402298850575,3,3]
[store,6661,0.92207792207792207792,7,7]
[store,13013,0.94202898550724637681,8,8]
[store,9029,1.,10,10]
[web,15597,0.66197183098591549296,3,3]
[store,14925,0.96470588235294117647,9,9]
[store,4063,1.,10,10]
[catalog,8929,0.7625,7,7]
[store,11589,0.82653061224489795918,6,6]
[store,1171,0.82417582417582417582,5,5]
[store,9471,0.7750,1,1]
[catalog,12577,0.65591397849462365591,3,3]
[web,97,0.90361445783132530120,9,8]
[web,85,0.85714285714285714286,8,7]
[catalog,361,0.74647887323943661972,5,5]
[web,2915,0.69863013698630136986,4,4]
[web,117,0.9250,10,9]
[catalog,9295,0.77894736842105263158,9,9]
[web,3305,0.7375,6,16]
[catalog,16215,0.79069767441860465116,10,10]
[web,7539,0.5900,1,1]
[catalog,17543,0.57142857142857142857,1,1]
[catalog,3411,0.71641791044776119403,4,4]
[web,11933,0.71717171717171717172,5,5]
[catalog,14513,0.63541667,2,2]
[store,15839,0.81632653061224489796,4,4]
[web,3337,0.62650602409638554217,2,2]
[web,5299,0.92708333,11,10]
[catalog,8189,0.74698795180722891566,6,6]
[catalog,14869,0.77173913043478260870,8,8]
[web,483,0.8000,7,6]
{noformat}


Expected results:
{noformat}
+-+---++-+---+
| CHANNEL |  ITEM |   RETURN_RATIO | RETURN_RANK | CURRENCY_RANK |
+-+---++-+---+
| catalog | 17543 |  .5714285714285714 |   1 | 1 |
| catalog | 14513 |  .63541666 |   2 | 2 |
| catalog | 12577 |  .6559139784946236 |   3 | 3 |
| catalog |  3411 |  .7164179104477611 |   4 | 4 |
| catalog |   361 |  .7464788732394366 |   5 | 5 |
| catalog |  8189 |  .7469879518072289 |   6 | 6 |
| catalog |  8929 |  .7625 |   7 | 7 |
| catalog | 14869 |  .7717391304347826 |   8 | 8 |
| catalog |  9295 |  .7789473684210526 |   9 | 9 |
| catalog | 16215 |  .7906976744186046 |  10 |10 |
| store   |  9471 |  .7750 |   1 | 1 |
| store   |  9797 |  .8000 |   2 | 2 |
| store   | 12641 |  .8160919540229885 |   3 | 3 |
| store   | 15839 |  .8163265306122448 |   4 | 4 |
| store   |  1171 |  .8241758241758241 |   5 | 5 |
| store   | 11589 |  .8265306122448979 |   6 | 6 |
| store   |  6661 |  .9220779220779220 |   7 | 7 |
| store   | 13013 |  .9420289855072463 |   8 | 8 |
| store   | 14925 |  .9647058823529411 |   9 | 9 |
| store   |  4063 | 1. |  10 |10 |
| store   |  9029 | 1. |  10 |10 |
| web |  7539 |  .5900 |   1 | 1 |
| web |  3337 |  .6265060240963855 |   2 | 2 |
| web | 15597 |  .6619718309859154 |   3 | 3 |
| web |  2915 |  .6986301369863013 |   4 | 4 |
| web | 11933 |  .7171717171717171 |   5 | 5 |
| web |  3305 |  .7375 |   6 |16 |
| web |   483 |  .8000 |   7 | 6 |
| web |85 |  .8571428571428571 |   8 | 7 |
| web |97 |  .9036144578313253 |   9 | 8 |
| web |   117 |  .9250 |  10 | 9 |
| web |  5299 |  .92708333 |  11 |10 |
+-+---++-+---+
{noformat}

Query used:
{noformat}
-- start query 49 in stream 0 using template query49.tpl and seed QUALIFICATION
  select  
 'web' as channel
 ,web.item
 ,web.return_ratio
 ,web.return_rank
 ,web.currency_rank
 from (
select 
 item
,return_ratio
,currency_ratio
,rank() over (order by return_ratio) as return_rank
,rank() over (order by currency_ratio) as currency_rank
   

[jira] [Commented] (SPARK-13288) [1.6.0] Memory leak in Spark streaming

2016-06-23 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347338#comment-15347338
 ] 

JESSE CHEN commented on SPARK-13288:


[~AlexSparkJiang] The code is:
{noformat}
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// one direct Kafka stream per index, unioned below into a single stream
val multiTweetStreams = (1 to numStreams).map { i =>
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topicsSet)
}
// unified stream
val tweetStream = ssc.union(multiTweetStreams)
{noformat}


> [1.6.0] Memory leak in Spark streaming
> --
>
> Key: SPARK-13288
> URL: https://issues.apache.org/jira/browse/SPARK-13288
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: Bare metal cluster
> RHEL 6.6
>Reporter: JESSE CHEN
>  Labels: streaming
>
> Streaming in 1.6 seems to have a memory leak.
> Running the same streaming app in Spark 1.5.1 and 1.6, all things equal, 1.6 
> showed a gradually increasing processing time. 
> The app is simple: 1 Kafka receiver of a tweet stream and 20 executors 
> processing the tweets in 5-second batches. 
> Spark 1.5.1 handled this smoothly and did not show increasing processing time 
> in the 40-minute test; but 1.6 showed increasing time about 8 minutes into 
> the test. Please see the chart here:
> https://ibm.box.com/s/7q4ulik70iwtvyfhoj1dcl4nc469b116
> I captured heap dumps in the two versions and did a comparison. I noticed 
> byte arrays (class [B) are using 50X more space in 1.6.0.
> Here are some top classes in heap histogram and references. 
> {noformat}
> Heap Histogram -- All Classes (excluding platform)
> 
> 1.6.0 Streaming                                       1.5.1 Streaming
> Class                           Count  Total Size     Class                   Count  Total Size
> class [B                         8453  3,227,649,599  class [B                 5095  62,938,466
> class [C                        44682  4,255,502      class [C               130482  12,844,182
> class java.lang.reflect.Method   9059  1,177,670      class java.lang.String 130171  1,562,052
> 
> References by Type
> class [B [0x640039e38]                                class [B [0x6c020bb08]
> 
> Referrers by Type
> Class                               Count             Class                               Count
> java.nio.HeapByteBuffer              3239             sun.security.util.DerInputBuffer     1233
> sun.security.util.DerInputBuffer     1233             sun.security.util.ObjectIdentifier    620
> sun.security.util.ObjectIdentifier    620             [[B                                   397
> [Ljava.lang.Object;                   408             java.lang.reflect.Method              326
> {noformat}
> 
> The total size of class [B is 3GB in 1.6.0 and only 60MB in 1.5.1.
> The java.nio.HeapByteBuffer referrer class did not show up near the top in 
> 1.5.1. 
> I have also placed the jstack output for 1.5.1 and 1.6.0 online; you can get 
> them here:
> https://ibm.box.com/sparkstreaming-jstack160
> https://ibm.box.com/sparkstreaming-jstack151
> Jesse 
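
As an aside, one way to capture heap dumps like the ones compared above (a 
hedged sketch; the report does not say which tool was used, and jmap works 
equally well):

{noformat}
import java.lang.management.ManagementFactory
import com.sun.management.HotSpotDiagnosticMXBean

// Programmatic heap dump via the HotSpot diagnostic MXBean; the output path
// here is illustrative.
val mx = ManagementFactory.newPlatformMXBeanProxy(
  ManagementFactory.getPlatformMBeanServer,
  "com.sun.management:type=HotSpotDiagnostic",
  classOf[HotSpotDiagnosticMXBean])
mx.dumpHeap("/tmp/streaming-1.6.0.hprof", true) // true = live objects only
{noformat}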






[jira] [Commented] (SPARK-15372) TPC-DS Query 84 returns wrong results against TPC official

2016-05-18 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289356#comment-15289356
 ] 

JESSE CHEN commented on SPARK-15372:


[~freiss] I agree with you. 
In addition, in order to match the TPC official result, we are allowed to use a 
minor query modification for the treatment of null strings (e.g., coalesce), so 
the following query now runs and returns results that match TPC's:
{noformat}
  select  c_customer_id as customer_id
   ,concat(c_last_name , ', ' , coalesce(c_first_name,'')) as customername
 from customer
 ,customer_address
 ,customer_demographics
 ,household_demographics
 ,income_band
 ,store_returns
 where ca_city  =  'Edgewood'
   and c_current_addr_sk = ca_address_sk
   and ib_lower_bound   >=  38128
   and ib_upper_bound   <=  38128 + 5
   and ib_income_band_sk = hd_income_band_sk
   and cd_demo_sk = c_current_cdemo_sk
   and hd_demo_sk = c_current_hdemo_sk
   and sr_cdemo_sk = cd_demo_sk
 order by c_customer_id
  limit 100;
{noformat}
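
To illustrate why the coalesce matters (a hedged example, assuming an existing 
SparkSession named spark): Spark SQL's concat returns NULL when any argument is 
NULL, which is what produced the NULL customername row in the original output.

{noformat}
// concat propagates NULL: a NULL c_first_name nulls out the whole name.
spark.sql("SELECT concat('Moore', ', ', NULL)").show()                // NULL
// coalesce substitutes '' first, matching the official answer set.
spark.sql("SELECT concat('Moore', ', ', coalesce(NULL, ''))").show()  // Moore,
{noformat}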

> TPC-DS Query 84 returns wrong results against TPC official
> -
>
> Key: SPARK-15372
> URL: https://issues.apache.org/jira/browse/SPARK-15372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: JESSE CHEN
>Assignee: Herman van Hovell
>Priority: Critical
>  Labels: SPARK-15071
>
> The official TPC-DS query 84 returns wrong results when compared to its 
> official answer set.
> The query itself is:
> {noformat}
>   select  c_customer_id as customer_id
>,concat(c_last_name , ', ' , c_first_name) as customername
>  from customer
>  ,customer_address
>  ,customer_demographics
>  ,household_demographics
>  ,income_band
>  ,store_returns
>  where ca_city  =  'Edgewood'
>and c_current_addr_sk = ca_address_sk
>and ib_lower_bound   >=  38128
>and ib_upper_bound   <=  38128 + 5
>and ib_income_band_sk = hd_income_band_sk
>and cd_demo_sk = c_current_cdemo_sk
>and hd_demo_sk = c_current_hdemo_sk
>and sr_cdemo_sk = cd_demo_sk
>  order by c_customer_id
>   limit 100;
> {noformat}
> Spark 2.0 build 0517 returned the following result:
> {noformat}
> AIPG  Carter, Rodney
> AKMBBAAA  Mcarthur, Emma
> CBNHBAAA  Wells, Ron
> DBME  Vera, Tina
> DBME  Vera, Tina
> DHKGBAAA  Scott, Pamela
> EIIBBAAA  Atkins, Susan
> FKAH  Batiste, Ernest
> GHMA  Mitchell, Gregory
> IAODBAAA  Murray, Karen
> IEOK  Solomon, Clyde
> IIBO  Owens, David
> IPDC  Wallace, Eric
> IPIM  Hayward, Benjamin
> JCIK  Ramos, Donald
> KFJE  Roberts, Yvonne
> KPGBBAAA  NULL < ??? questionable row
> LCLABAAA  Whitaker, Lettie
> MGME  Sharp, Michael
> MIGBBAAA  Montgomery, Jesenia
> MPDK  Lopez, Isabel
> NEOM  Powell, Linda
> NKPC  Shaffer, Sergio
> NOCK  Vargas, James
> OGJEBAAA  Owens, Denice
> {noformat}
> Official answer set (which is correct!)
> {noformat}
> AIPG Carter  , Rodney
> AKMBBAAA Mcarthur, Emma
> CBNHBAAA Wells   , Ron
> DBME Vera, Tina
> DBME Vera, Tina
> DHKGBAAA Scott   , Pamela
> EIIBBAAA Atkins  , Susan
> FKAH Batiste , Ernest
> GHMA Mitchell, Gregory
> IAODBAAA Murray  , Karen
> IEOK Solomon , Clyde
> IIBO Owens   , David
> IPDC Wallace , Eric
> IPIM Hayward , Benjamin
> JCIK Ramos   , Donald
> KFJE Roberts , Yvonne
> KPGBBAAA Moore   ,
> LCLABAAA Whitaker, Lettie
> MGME Sharp   , Michael
> MIGBBAAA Montgomery  , Jesenia
> MPDK Lopez   , Isabel
> NEOM Powell  , Linda
> NKPC Shaffer , Sergio
> NOCK Vargas  , James
> OGJEBAAA Owens   , Denice
> {noformat}
> The issue is with the "concat" function in Spark SQL

[jira] [Commented] (SPARK-15372) TPC-DS Query 84 returns wrong results against TPC official

2016-05-18 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289358#comment-15289358
 ] 

JESSE CHEN commented on SPARK-15372:


I will close this next; this is not a problem with Spark SQL.

> TPC-DS Query 84 returns wrong results against TPC official
> -
>
> Key: SPARK-15372
> URL: https://issues.apache.org/jira/browse/SPARK-15372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: JESSE CHEN
>Assignee: Herman van Hovell
>Priority: Critical
>  Labels: SPARK-15071
>
> The official TPC-DS query 84 returns wrong results when compared to its 
> official answer set.
> The query itself is:
> {noformat}
>   select  c_customer_id as customer_id
>,concat(c_last_name , ', ' , c_first_name) as customername
>  from customer
>  ,customer_address
>  ,customer_demographics
>  ,household_demographics
>  ,income_band
>  ,store_returns
>  where ca_city  =  'Edgewood'
>and c_current_addr_sk = ca_address_sk
>and ib_lower_bound   >=  38128
>and ib_upper_bound   <=  38128 + 5
>and ib_income_band_sk = hd_income_band_sk
>and cd_demo_sk = c_current_cdemo_sk
>and hd_demo_sk = c_current_hdemo_sk
>and sr_cdemo_sk = cd_demo_sk
>  order by c_customer_id
>   limit 100;
> {noformat}
> Spark 2.0 build 0517 returned the following result:
> {noformat}
> AIPG  Carter, Rodney
> AKMBBAAA  Mcarthur, Emma
> CBNHBAAA  Wells, Ron
> DBME  Vera, Tina
> DBME  Vera, Tina
> DHKGBAAA  Scott, Pamela
> EIIBBAAA  Atkins, Susan
> FKAH  Batiste, Ernest
> GHMA  Mitchell, Gregory
> IAODBAAA  Murray, Karen
> IEOK  Solomon, Clyde
> IIBO  Owens, David
> IPDC  Wallace, Eric
> IPIM  Hayward, Benjamin
> JCIK  Ramos, Donald
> KFJE  Roberts, Yvonne
> KPGBBAAA  NULL < ??? questionable row
> LCLABAAA  Whitaker, Lettie
> MGME  Sharp, Michael
> MIGBBAAA  Montgomery, Jesenia
> MPDK  Lopez, Isabel
> NEOM  Powell, Linda
> NKPC  Shaffer, Sergio
> NOCK  Vargas, James
> OGJEBAAA  Owens, Denice
> {noformat}
> Official answer set (which is correct!)
> {noformat}
> AIPG Carter  , Rodney
> AKMBBAAA Mcarthur, Emma
> CBNHBAAA Wells   , Ron
> DBME Vera, Tina
> DBME Vera, Tina
> DHKGBAAA Scott   , Pamela
> EIIBBAAA Atkins  , Susan
> FKAH Batiste , Ernest
> GHMA Mitchell, Gregory
> IAODBAAA Murray  , Karen
> IEOK Solomon , Clyde
> IIBO Owens   , David
> IPDC Wallace , Eric
> IPIM Hayward , Benjamin
> JCIK Ramos   , Donald
> KFJE Roberts , Yvonne
> KPGBBAAA Moore   ,
> LCLABAAA Whitaker, Lettie
> MGME Sharp   , Michael
> MIGBBAAA Montgomery  , Jesenia
> MPDK Lopez   , Isabel
> NEOM Powell  , Linda
> NKPC Shaffer , Sergio
> NOCK Vargas  , James
> OGJEBAAA Owens   , Denice
> {noformat}
> The issue is with the "concat" function in Spark SQL (also behaves the same 
> in Hive). When 'concat' meets any NULL string, it returns NULL as the answer. 
> But is this right? When I concatenate a person's last name and first name, if 
> the first name is missing (empty string or NULL), I should see the last name 
> still, not NULL, i.e., "Smith" + "" = "Smith", not NULL. 
> Simplest repeatable test:
> {noformat}
> hive> select c_first_name, c_last_name from customer where c_customer_id = 
> 'KPGBBAAA';
> OK
> NULL Moore
> Time taken: 0.07 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, ', ', c_first_name) from customer where 
> c_customer_id = 'KPGBBAAA';
> OK
> NULL
> Time taken: 0.1 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, c_first_name) from customer where 
> c_customer_id = 'KPGBBAAA';
> OK

[jira] [Updated] (SPARK-15372) TPC-DS Query 84 returns wrong results against TPC official

2016-05-17 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-15372:
---
Description: 
The official TPC-DS query 84 returns wrong results when compared to its 
official answer set.

The query itself is:
{noformat}
  select  c_customer_id as customer_id
   ,concat(c_last_name , ', ' , c_first_name) as customername
 from customer
 ,customer_address
 ,customer_demographics
 ,household_demographics
 ,income_band
 ,store_returns
 where ca_city  =  'Edgewood'
   and c_current_addr_sk = ca_address_sk
   and ib_lower_bound   >=  38128
   and ib_upper_bound   <=  38128 + 5
   and ib_income_band_sk = hd_income_band_sk
   and cd_demo_sk = c_current_cdemo_sk
   and hd_demo_sk = c_current_hdemo_sk
   and sr_cdemo_sk = cd_demo_sk
 order by c_customer_id
  limit 100;

{noformat}

Spark 2.0 build 0517 returned the following result:
{noformat}
AIPG  Carter, Rodney
AKMBBAAA  Mcarthur, Emma
CBNHBAAA  Wells, Ron
DBME  Vera, Tina
DBME  Vera, Tina
DHKGBAAA  Scott, Pamela
EIIBBAAA  Atkins, Susan
FKAH  Batiste, Ernest
GHMA  Mitchell, Gregory
IAODBAAA  Murray, Karen
IEOK  Solomon, Clyde
IIBO  Owens, David
IPDC  Wallace, Eric
IPIM  Hayward, Benjamin
JCIK  Ramos, Donald
KFJE  Roberts, Yvonne
KPGBBAAA  NULL < ??? questionable row
LCLABAAA  Whitaker, Lettie
MGME  Sharp, Michael
MIGBBAAA  Montgomery, Jesenia
MPDK  Lopez, Isabel
NEOM  Powell, Linda
NKPC  Shaffer, Sergio
NOCK  Vargas, James
OGJEBAAA  Owens, Denice

{noformat}

Official answer set (which is correct!)
{noformat}
AIPG Carter, Rodney
AKMBBAAA Mcarthur  , Emma
CBNHBAAA Wells , Ron
DBME Vera  , Tina
DBME Vera  , Tina
DHKGBAAA Scott , Pamela
EIIBBAAA Atkins, Susan
FKAH Batiste   , Ernest
GHMA Mitchell  , Gregory
IAODBAAA Murray, Karen
IEOK Solomon   , Clyde
IIBO Owens , David
IPDC Wallace   , Eric
IPIM Hayward   , Benjamin
JCIK Ramos , Donald
KFJE Roberts   , Yvonne
KPGBBAAA Moore ,
LCLABAAA Whitaker  , Lettie
MGME Sharp , Michael
MIGBBAAA Montgomery, Jesenia
MPDK Lopez , Isabel
NEOM Powell, Linda
NKPC Shaffer   , Sergio
NOCK Vargas, James
OGJEBAAA Owens , Denice

{noformat}

The issue is with the "concat" function in Spark SQL (also behaves the same in 
Hive). When 'concat' meets any NULL string, it returns NULL as the answer. But 
is this right? When I concatenate a person's last name and first name, if the 
first name is missing (empty string or NULL), I should see the last name still, 
not NULL, i.e., "Smith" + "" = "Smith", not NULL. 

Simplest repeatable test:
{noformat}
hive> select c_first_name, c_last_name from customer where c_customer_id = 
'KPGBBAAA';
OK
NULL Moore
Time taken: 0.07 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_first_name) from customer where 
c_customer_id = 'KPGBBAAA';
OK
NULL
Time taken: 0.1 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, c_first_name) from customer where 
c_customer_id = 'KPGBBAAA';
OK
NULL
Time taken: 0.055 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_first_name) from customer where 
c_customer_id = 'KPGBBAAA';
OK
NULL
Time taken: 0.061 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_customer_id) from customer where 
c_customer_id = 'KPGBBAAA';
OK
Moore, KPGBBAAA
{noformat}

Same in 'spark-sql' shell:

...
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 45
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 46
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 47
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 48
select concat(c_last_name, c_first_name) from customer where 

[jira] [Updated] (SPARK-15372) TPC-DS Query 84 returns wrong results against TPC official

2016-05-17 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-15372:
---
Description: 
The official TPC-DS query 84 returns wrong results when compared to its 
official answer set.

The query itself is:
{noformat}
  select  c_customer_id as customer_id
   ,concat(c_last_name , ', ' , c_first_name) as customername
 from customer
 ,customer_address
 ,customer_demographics
 ,household_demographics
 ,income_band
 ,store_returns
 where ca_city  =  'Edgewood'
   and c_current_addr_sk = ca_address_sk
   and ib_lower_bound   >=  38128
   and ib_upper_bound   <=  38128 + 5
   and ib_income_band_sk = hd_income_band_sk
   and cd_demo_sk = c_current_cdemo_sk
   and hd_demo_sk = c_current_hdemo_sk
   and sr_cdemo_sk = cd_demo_sk
 order by c_customer_id
  limit 100;

{noformat}

Spark 2.0 build 0517 returned the following result:
{noformat}
AIPG  Carter, Rodney
AKMBBAAA  Mcarthur, Emma
CBNHBAAA  Wells, Ron
DBME  Vera, Tina
DBME  Vera, Tina
DHKGBAAA  Scott, Pamela
EIIBBAAA  Atkins, Susan
FKAH  Batiste, Ernest
GHMA  Mitchell, Gregory
IAODBAAA  Murray, Karen
IEOK  Solomon, Clyde
IIBO  Owens, David
IPDC  Wallace, Eric
IPIM  Hayward, Benjamin
JCIK  Ramos, Donald
KFJE  Roberts, Yvonne
KPGBBAAA  NULL < ??? questionable row
LCLABAAA  Whitaker, Lettie
MGME  Sharp, Michael
MIGBBAAA  Montgomery, Jesenia
MPDK  Lopez, Isabel
NEOM  Powell, Linda
NKPC  Shaffer, Sergio
NOCK  Vargas, James
OGJEBAAA  Owens, Denice

{noformat}

Official answer set (which is correct!)
{noformat}
AIPG Carter, Rodney
AKMBBAAA Mcarthur  , Emma
CBNHBAAA Wells , Ron
DBME Vera  , Tina
DBME Vera  , Tina
DHKGBAAA Scott , Pamela
EIIBBAAA Atkins, Susan
FKAH Batiste   , Ernest
GHMA Mitchell  , Gregory
IAODBAAA Murray, Karen
IEOK Solomon   , Clyde
IIBO Owens , David
IPDC Wallace   , Eric
IPIM Hayward   , Benjamin
JCIK Ramos , Donald
KFJE Roberts   , Yvonne
KPGBBAAA Moore ,
LCLABAAA Whitaker  , Lettie
MGME Sharp , Michael
MIGBBAAA Montgomery, Jesenia
MPDK Lopez , Isabel
NEOM Powell, Linda
NKPC Shaffer   , Sergio
NOCK Vargas, James
OGJEBAAA Owens , Denice

{noformat}

The issue is with the "concat" function in Spark SQL (also behaves the same in 
Hive). When 'concat' meets any NULL string, it returns NULL as the answer. But 
is this right? When I concatenate a person's last name and first name, if the 
first name is missing (empty string or NULL), I should see the last name still, 
not NULL, i.e., "Smith" + "" = "Smith", not NULL. 

Simplest repeatable test:
{noformat}
hive> select c_first_name, c_last_name from customer where c_customer_id = 
'KPGBBAAA';
OK
NULL Moore
Time taken: 0.07 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_first_name) from customer where 
c_customer_id = 'KPGBBAAA';
OK
NULL
Time taken: 0.1 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, c_first_name) from customer where 
c_customer_id = 'KPGBBAAA';
OK
NULL
Time taken: 0.055 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_first_name) from customer where 
c_customer_id = 'KPGBBAAA';
OK
NULL
Time taken: 0.061 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_customer_id) from customer where 
c_customer_id = 'KPGBBAAA';
OK
Moore, KPGBBAAA

Same in 'spark-sql' shell:

...
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 45
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 46
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 47
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 48
select concat(c_last_name, c_first_name) from cust

[jira] [Updated] (SPARK-15372) TPC-DS Query 84 returns wrong results against TPC official

2016-05-17 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-15372:
---
Labels: SPARK-15071  (was: )

> TPC-DS Query 84 returns wrong results against TPC official
> -
>
> Key: SPARK-15372
> URL: https://issues.apache.org/jira/browse/SPARK-15372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>Assignee: Herman van Hovell
>Priority: Critical
>  Labels: SPARK-15071
> Fix For: 2.0.0
>
>
> The official TPC-DS query 41 fails with the following error:
> {noformat}
> Error in query: The correlated scalar subquery can only contain equality 
> predicates: (((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) 
> && ((i_color#41 = powder) || (i_color#41 = khaki))) && (((i_units#42 = Ounce) 
> || (i_units#42 = Oz)) && ((i_size#39 = medium) || (i_size#39 = extra 
> large || (((i_category#36 = Women) && ((i_color#41 = brown) || 
> (i_color#41 = honeydew))) && (((i_units#42 = Bunch) || (i_units#42 = Ton)) && 
> ((i_size#39 = N/A) || (i_size#39 = small) || i_category#36 = Men) && 
> ((i_color#41 = floral) || (i_color#41 = deep))) && (((i_units#42 = N/A) || 
> (i_units#42 = Dozen)) && ((i_size#39 = petite) || (i_size#39 = large || 
> (((i_category#36 = Men) && ((i_color#41 = light) || (i_color#41 = 
> cornflower))) && (((i_units#42 = Box) || (i_units#42 = Pound)) && ((i_size#39 
> = medium) || (i_size#39 = extra large))) || ((i_manufact#38 = 
> i_manufact#16) && (i_category#36 = Women) && ((i_color#41 = midnight) || 
> (i_color#41 = snow))) && (((i_units#42 = Pallet) || (i_units#42 = Gross)) && 
> ((i_size#39 = medium) || (i_size#39 = extra large || (((i_category#36 = 
> Women) && ((i_color#41 = cyan) || (i_color#41 = papaya))) && (((i_units#42 = 
> Cup) || (i_units#42 = Dram)) && ((i_size#39 = N/A) || (i_size#39 = small) 
> || i_category#36 = Men) && ((i_color#41 = orange) || (i_color#41 = 
> frosted))) && (((i_units#42 = Each) || (i_units#42 = Tbl)) && ((i_size#39 = 
> petite) || (i_size#39 = large || (((i_category#36 = Men) && ((i_color#41 
> = forest) || (i_color#41 = ghost))) && (((i_units#42 = Lb) || (i_units#42 = 
> Bundle)) && ((i_size#39 = medium) || (i_size#39 = extra large;
> {noformat}
> The output plans showed the following errors
> {noformat}
> == Parsed Logical Plan ==
> 'GlobalLimit 100
> +- 'LocalLimit 100
>+- 'Sort ['i_product_name ASC], true
>   +- 'Distinct
>  +- 'Project ['i_product_name]
> +- 'Filter ((('i_manufact_id >= 738) && ('i_manufact_id <= (738 + 
> 40))) && (scalar-subquery#1 [] > 0))
>:  +- 'SubqueryAlias scalar-subquery#1 []
>: +- 'Project ['count(1) AS item_cnt#0]
>:+- 'Filter ((('i_manufact = 'i1.i_manufact) && 
> ('i_category = Women) && (('i_color = powder) || ('i_color = khaki))) && 
> ((('i_units = Ounce) || ('i_units = Oz)) && (('i_size = medium) || ('i_size = 
> extra large || ((('i_category = Women) && (('i_color = brown) || 
> ('i_color = honeydew))) && ((('i_units = Bunch) || ('i_units = Ton)) && 
> (('i_size = N/A) || ('i_size = small) || 'i_category = Men) && 
> (('i_color = floral) || ('i_color = deep))) && ((('i_units = N/A) || 
> ('i_units = Dozen)) && (('i_size = petite) || ('i_size = large || 
> ((('i_category = Men) && (('i_color = light) || ('i_color = cornflower))) && 
> ((('i_units = Box) || ('i_units = Pound)) && (('i_size = medium) || ('i_size 
> = extra large))) || (('i_manufact = 'i1.i_manufact) && ('i_category = 
> Women) && (('i_color = midnight) || ('i_color = snow))) && ((('i_units = 
> Pallet) || ('i_units = Gross)) && (('i_size = medium) || ('i_size = extra 
> large || ((('i_category = Women) && (('i_color = cyan) || ('i_color = 
> papaya))) && ((('i_units = Cup) || ('i_units = Dram)) && (('i_size = N/A) || 
> ('i_size = small) || 'i_category = Men) && (('i_color = orange) || 
> ('i_color = frosted))) && ((('i_units = Each) || ('i_units = Tbl)) && 
> (('i_size = petite) || ('i_size = large || ((('i_category = Men) && 
> (('i_color = forest) || ('i_color = ghost))) && ((('i_units = Lb) || 
> ('i_units = Bundle)) && (('i_size = medium) || ('i_size = extra large
>:   +- 'UnresolvedRelation `item`, None
>+- 'UnresolvedRelation `item`, Some(i1)
> == Analyzed Logical Plan ==
> i_product_name: string
> GlobalLimit 100
> +- LocalLimit 100
>+- Sort [i_product_name#24 ASC], true
>   +- Distinct
>  +- Project [i_product_name#24]
> +- Filter (((i_manufact_id#16L >= cast(738 as bigint)) && 
> (i_manufact_id#16L <= cast((738 + 40) as bigint))) && (scalar-subque

[jira] [Created] (SPARK-15372) TPC-DS Query 84 returns wrong results against TPC official

2016-05-17 Thread JESSE CHEN (JIRA)
JESSE CHEN created SPARK-15372:
--

 Summary: TPC-DS Query 84 returns wrong results against TPC official
 Key: SPARK-15372
 URL: https://issues.apache.org/jira/browse/SPARK-15372
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: JESSE CHEN
Assignee: Herman van Hovell
Priority: Critical
 Fix For: 2.0.0


The official TPC-DS query 41 fails with the following error:

{noformat}
Error in query: The correlated scalar subquery can only contain equality 
predicates: (((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) && 
((i_color#41 = powder) || (i_color#41 = khaki))) && (((i_units#42 = Ounce) || 
(i_units#42 = Oz)) && ((i_size#39 = medium) || (i_size#39 = extra large || 
(((i_category#36 = Women) && ((i_color#41 = brown) || (i_color#41 = honeydew))) 
&& (((i_units#42 = Bunch) || (i_units#42 = Ton)) && ((i_size#39 = N/A) || 
(i_size#39 = small) || i_category#36 = Men) && ((i_color#41 = floral) 
|| (i_color#41 = deep))) && (((i_units#42 = N/A) || (i_units#42 = Dozen)) && 
((i_size#39 = petite) || (i_size#39 = large || (((i_category#36 = Men) && 
((i_color#41 = light) || (i_color#41 = cornflower))) && (((i_units#42 = Box) || 
(i_units#42 = Pound)) && ((i_size#39 = medium) || (i_size#39 = extra 
large))) || ((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) 
&& ((i_color#41 = midnight) || (i_color#41 = snow))) && (((i_units#42 = Pallet) 
|| (i_units#42 = Gross)) && ((i_size#39 = medium) || (i_size#39 = extra 
large || (((i_category#36 = Women) && ((i_color#41 = cyan) || (i_color#41 = 
papaya))) && (((i_units#42 = Cup) || (i_units#42 = Dram)) && ((i_size#39 = N/A) 
|| (i_size#39 = small) || i_category#36 = Men) && ((i_color#41 = 
orange) || (i_color#41 = frosted))) && (((i_units#42 = Each) || (i_units#42 = 
Tbl)) && ((i_size#39 = petite) || (i_size#39 = large || (((i_category#36 = 
Men) && ((i_color#41 = forest) || (i_color#41 = ghost))) && (((i_units#42 = Lb) 
|| (i_units#42 = Bundle)) && ((i_size#39 = medium) || (i_size#39 = extra 
large;
{noformat}

The output plans showed the following errors
{noformat}
== Parsed Logical Plan ==
'GlobalLimit 100
+- 'LocalLimit 100
   +- 'Sort ['i_product_name ASC], true
  +- 'Distinct
 +- 'Project ['i_product_name]
+- 'Filter ((('i_manufact_id >= 738) && ('i_manufact_id <= (738 + 
40))) && (scalar-subquery#1 [] > 0))
   :  +- 'SubqueryAlias scalar-subquery#1 []
   : +- 'Project ['count(1) AS item_cnt#0]
   :+- 'Filter ((('i_manufact = 'i1.i_manufact) && 
('i_category = Women) && (('i_color = powder) || ('i_color = khaki))) && 
((('i_units = Ounce) || ('i_units = Oz)) && (('i_size = medium) || ('i_size = 
extra large || ((('i_category = Women) && (('i_color = brown) || ('i_color 
= honeydew))) && ((('i_units = Bunch) || ('i_units = Ton)) && (('i_size = N/A) 
|| ('i_size = small) || 'i_category = Men) && (('i_color = floral) || 
('i_color = deep))) && ((('i_units = N/A) || ('i_units = Dozen)) && (('i_size = 
petite) || ('i_size = large || ((('i_category = Men) && (('i_color = light) 
|| ('i_color = cornflower))) && ((('i_units = Box) || ('i_units = Pound)) && 
(('i_size = medium) || ('i_size = extra large))) || (('i_manufact = 
'i1.i_manufact) && ('i_category = Women) && (('i_color = midnight) || 
('i_color = snow))) && ((('i_units = Pallet) || ('i_units = Gross)) && 
(('i_size = medium) || ('i_size = extra large || ((('i_category = Women) && 
(('i_color = cyan) || ('i_color = papaya))) && ((('i_units = Cup) || ('i_units 
= Dram)) && (('i_size = N/A) || ('i_size = small) || 'i_category = Men) 
&& (('i_color = orange) || ('i_color = frosted))) && ((('i_units = Each) || 
('i_units = Tbl)) && (('i_size = petite) || ('i_size = large || 
((('i_category = Men) && (('i_color = forest) || ('i_color = ghost))) && 
((('i_units = Lb) || ('i_units = Bundle)) && (('i_size = medium) || ('i_size = 
extra large
   :   +- 'UnresolvedRelation `item`, None
   +- 'UnresolvedRelation `item`, Some(i1)

== Analyzed Logical Plan ==
i_product_name: string
GlobalLimit 100
+- LocalLimit 100
   +- Sort [i_product_name#24 ASC], true
  +- Distinct
 +- Project [i_product_name#24]
+- Filter (((i_manufact_id#16L >= cast(738 as bigint)) && 
(i_manufact_id#16L <= cast((738 + 40) as bigint))) && (scalar-subquery#1 
[(((i_manufact#39 = i_manufact#17) && (i_category#37 = Women) && 
((i_color#42 = powder) || (i_color#42 = khaki))) && (((i_units#43 = Ounce) || 
(i_units#43 = Oz)) && ((i_size#40 = medium) || (i_size#40 = extra large || 
(((i_category#37 = Women) && ((i_color#42 = brown) || (i_color#42 = honeydew))) 
&& (((i_units#43 = Bunch) || (i_units#43 = Ton)) && ((i_size#40 =

[jira] [Closed] (SPARK-15122) TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates

2016-05-09 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN closed SPARK-15122.
--

Verified successfully in the 0508 build. Thanks!

> TPC-DS Query 41 fails with The correlated scalar subquery can only contain 
> equality predicates
> -
>
> Key: SPARK-15122
> URL: https://issues.apache.org/jira/browse/SPARK-15122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>Assignee: Herman van Hovell
>Priority: Critical
> Fix For: 2.0.0
>
>
> The official TPC-DS query 41 fails with the following error:
> {noformat}
> Error in query: The correlated scalar subquery can only contain equality 
> predicates: (((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) 
> && ((i_color#41 = powder) || (i_color#41 = khaki))) && (((i_units#42 = Ounce) 
> || (i_units#42 = Oz)) && ((i_size#39 = medium) || (i_size#39 = extra 
> large || (((i_category#36 = Women) && ((i_color#41 = brown) || 
> (i_color#41 = honeydew))) && (((i_units#42 = Bunch) || (i_units#42 = Ton)) && 
> ((i_size#39 = N/A) || (i_size#39 = small) || i_category#36 = Men) && 
> ((i_color#41 = floral) || (i_color#41 = deep))) && (((i_units#42 = N/A) || 
> (i_units#42 = Dozen)) && ((i_size#39 = petite) || (i_size#39 = large || 
> (((i_category#36 = Men) && ((i_color#41 = light) || (i_color#41 = 
> cornflower))) && (((i_units#42 = Box) || (i_units#42 = Pound)) && ((i_size#39 
> = medium) || (i_size#39 = extra large))) || ((i_manufact#38 = 
> i_manufact#16) && (i_category#36 = Women) && ((i_color#41 = midnight) || 
> (i_color#41 = snow))) && (((i_units#42 = Pallet) || (i_units#42 = Gross)) && 
> ((i_size#39 = medium) || (i_size#39 = extra large || (((i_category#36 = 
> Women) && ((i_color#41 = cyan) || (i_color#41 = papaya))) && (((i_units#42 = 
> Cup) || (i_units#42 = Dram)) && ((i_size#39 = N/A) || (i_size#39 = small) 
> || i_category#36 = Men) && ((i_color#41 = orange) || (i_color#41 = 
> frosted))) && (((i_units#42 = Each) || (i_units#42 = Tbl)) && ((i_size#39 = 
> petite) || (i_size#39 = large || (((i_category#36 = Men) && ((i_color#41 
> = forest) || (i_color#41 = ghost))) && (((i_units#42 = Lb) || (i_units#42 = 
> Bundle)) && ((i_size#39 = medium) || (i_size#39 = extra large;
> {noformat}
> The output plans showed the following errors
> {noformat}
> == Parsed Logical Plan ==
> 'GlobalLimit 100
> +- 'LocalLimit 100
>+- 'Sort ['i_product_name ASC], true
>   +- 'Distinct
>  +- 'Project ['i_product_name]
> +- 'Filter ((('i_manufact_id >= 738) && ('i_manufact_id <= (738 + 
> 40))) && (scalar-subquery#1 [] > 0))
>:  +- 'SubqueryAlias scalar-subquery#1 []
>: +- 'Project ['count(1) AS item_cnt#0]
>:+- 'Filter ((('i_manufact = 'i1.i_manufact) && 
> ('i_category = Women) && (('i_color = powder) || ('i_color = khaki))) && 
> ((('i_units = Ounce) || ('i_units = Oz)) && (('i_size = medium) || ('i_size = 
> extra large || ((('i_category = Women) && (('i_color = brown) || 
> ('i_color = honeydew))) && ((('i_units = Bunch) || ('i_units = Ton)) && 
> (('i_size = N/A) || ('i_size = small) || 'i_category = Men) && 
> (('i_color = floral) || ('i_color = deep))) && ((('i_units = N/A) || 
> ('i_units = Dozen)) && (('i_size = petite) || ('i_size = large || 
> ((('i_category = Men) && (('i_color = light) || ('i_color = cornflower))) && 
> ((('i_units = Box) || ('i_units = Pound)) && (('i_size = medium) || ('i_size 
> = extra large))) || (('i_manufact = 'i1.i_manufact) && ('i_category = 
> Women) && (('i_color = midnight) || ('i_color = snow))) && ((('i_units = 
> Pallet) || ('i_units = Gross)) && (('i_size = medium) || ('i_size = extra 
> large || ((('i_category = Women) && (('i_color = cyan) || ('i_color = 
> papaya))) && ((('i_units = Cup) || ('i_units = Dram)) && (('i_size = N/A) || 
> ('i_size = small) || 'i_category = Men) && (('i_color = orange) || 
> ('i_color = frosted))) && ((('i_units = Each) || ('i_units = Tbl)) && 
> (('i_size = petite) || ('i_size = large || ((('i_category = Men) && 
> (('i_color = forest) || ('i_color = ghost))) && ((('i_units = Lb) || 
> ('i_units = Bundle)) && (('i_size = medium) || ('i_size = extra large
>:   +- 'UnresolvedRelation `item`, None
>+- 'UnresolvedRelation `item`, Some(i1)
> == Analyzed Logical Plan ==
> i_product_name: string
> GlobalLimit 100
> +- LocalLimit 100
>+- Sort [i_product_name#24 ASC], true
>   +- Distinct
>  +- Project [i_product_name#24]
> +- Filter (((i_manufact_id#16L >= cast(738 as bigint)) && 
> (i_manufact_id#16

[jira] [Commented] (SPARK-15122) TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates

2016-05-09 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15276600#comment-15276600
 ] 

JESSE CHEN commented on SPARK-15122:


Works great! Now all 99 queries pass. Nicely done!

> TPC-DS Query 41 fails with The correlated scalar subquery can only contain 
> equality predicates
> -
>
> Key: SPARK-15122
> URL: https://issues.apache.org/jira/browse/SPARK-15122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>Assignee: Herman van Hovell
>Priority: Critical
> Fix For: 2.0.0
>
>
> The official TPC-DS query 41 fails with the following error:
> {noformat}
> Error in query: The correlated scalar subquery can only contain equality 
> predicates: (((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) 
> && ((i_color#41 = powder) || (i_color#41 = khaki))) && (((i_units#42 = Ounce) 
> || (i_units#42 = Oz)) && ((i_size#39 = medium) || (i_size#39 = extra 
> large || (((i_category#36 = Women) && ((i_color#41 = brown) || 
> (i_color#41 = honeydew))) && (((i_units#42 = Bunch) || (i_units#42 = Ton)) && 
> ((i_size#39 = N/A) || (i_size#39 = small) || i_category#36 = Men) && 
> ((i_color#41 = floral) || (i_color#41 = deep))) && (((i_units#42 = N/A) || 
> (i_units#42 = Dozen)) && ((i_size#39 = petite) || (i_size#39 = large || 
> (((i_category#36 = Men) && ((i_color#41 = light) || (i_color#41 = 
> cornflower))) && (((i_units#42 = Box) || (i_units#42 = Pound)) && ((i_size#39 
> = medium) || (i_size#39 = extra large))) || ((i_manufact#38 = 
> i_manufact#16) && (i_category#36 = Women) && ((i_color#41 = midnight) || 
> (i_color#41 = snow))) && (((i_units#42 = Pallet) || (i_units#42 = Gross)) && 
> ((i_size#39 = medium) || (i_size#39 = extra large || (((i_category#36 = 
> Women) && ((i_color#41 = cyan) || (i_color#41 = papaya))) && (((i_units#42 = 
> Cup) || (i_units#42 = Dram)) && ((i_size#39 = N/A) || (i_size#39 = small) 
> || i_category#36 = Men) && ((i_color#41 = orange) || (i_color#41 = 
> frosted))) && (((i_units#42 = Each) || (i_units#42 = Tbl)) && ((i_size#39 = 
> petite) || (i_size#39 = large || (((i_category#36 = Men) && ((i_color#41 
> = forest) || (i_color#41 = ghost))) && (((i_units#42 = Lb) || (i_units#42 = 
> Bundle)) && ((i_size#39 = medium) || (i_size#39 = extra large;
> {noformat}
> The output plans showed the following errors
> {noformat}
> == Parsed Logical Plan ==
> 'GlobalLimit 100
> +- 'LocalLimit 100
>+- 'Sort ['i_product_name ASC], true
>   +- 'Distinct
>  +- 'Project ['i_product_name]
> +- 'Filter ((('i_manufact_id >= 738) && ('i_manufact_id <= (738 + 
> 40))) && (scalar-subquery#1 [] > 0))
>:  +- 'SubqueryAlias scalar-subquery#1 []
>: +- 'Project ['count(1) AS item_cnt#0]
>:+- 'Filter ((('i_manufact = 'i1.i_manufact) && 
> ('i_category = Women) && (('i_color = powder) || ('i_color = khaki))) && 
> ((('i_units = Ounce) || ('i_units = Oz)) && (('i_size = medium) || ('i_size = 
> extra large || ((('i_category = Women) && (('i_color = brown) || 
> ('i_color = honeydew))) && ((('i_units = Bunch) || ('i_units = Ton)) && 
> (('i_size = N/A) || ('i_size = small) || 'i_category = Men) && 
> (('i_color = floral) || ('i_color = deep))) && ((('i_units = N/A) || 
> ('i_units = Dozen)) && (('i_size = petite) || ('i_size = large || 
> ((('i_category = Men) && (('i_color = light) || ('i_color = cornflower))) && 
> ((('i_units = Box) || ('i_units = Pound)) && (('i_size = medium) || ('i_size 
> = extra large))) || (('i_manufact = 'i1.i_manufact) && ('i_category = 
> Women) && (('i_color = midnight) || ('i_color = snow))) && ((('i_units = 
> Pallet) || ('i_units = Gross)) && (('i_size = medium) || ('i_size = extra 
> large || ((('i_category = Women) && (('i_color = cyan) || ('i_color = 
> papaya))) && ((('i_units = Cup) || ('i_units = Dram)) && (('i_size = N/A) || 
> ('i_size = small) || 'i_category = Men) && (('i_color = orange) || 
> ('i_color = frosted))) && ((('i_units = Each) || ('i_units = Tbl)) && 
> (('i_size = petite) || ('i_size = large || ((('i_category = Men) && 
> (('i_color = forest) || ('i_color = ghost))) && ((('i_units = Lb) || 
> ('i_units = Bundle)) && (('i_size = medium) || ('i_size = extra large
>:   +- 'UnresolvedRelation `item`, None
>+- 'UnresolvedRelation `item`, Some(i1)
> == Analyzed Logical Plan ==
> i_product_name: string
> GlobalLimit 100
> +- LocalLimit 100
>+- Sort [i_product_name#24 ASC], true
>   +- Distinct
>  +- Project [i_product_name#24]
> +- Filter (((

[jira] [Closed] (SPARK-14968) TPC-DS query 1 resolved attribute(s) missing

2016-05-05 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN closed SPARK-14968.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

fixed per SPARK-14785

> TPC-DS query 1 resolved attribute(s) missing
> 
>
> Key: SPARK-14968
> URL: https://issues.apache.org/jira/browse/SPARK-14968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>Priority: Critical
> Fix For: 2.0.0
>
>
> This is a regression from a week ago. Failed to generate plan for query 1 in 
> TPCDS using 0427 build from 
> people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/.
> Was working in build from 0421.
> The error is:
> {noformat}
> 16/04/27 07:00:59 INFO spark.SparkContext: Created broadcast 3 from 
> processCmd at CliDriver.java:376
> 16/04/27 07:00:59 INFO datasources.FileSourceStrategy: Planning scan with bin 
> packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 
> bytes.
> Error in query: resolved attribute(s) ctr_store_sk#2#535 missing from 
> ctr_store_sk#2,ctr_total_return#3 in operator !Filter (ctr_store_sk#2#535 = 
> ctr_store_sk#2);
> 16/04/27 07:00:59 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/static/sql,null}
> 16/04/27 07:00:59 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/SQL/execution/json,null}
> {noformat}
> The query is:
> {noformat}
> with customer_total_return as
> (select sr_customer_sk as ctr_customer_sk
> ,sr_store_sk as ctr_store_sk
> ,sum(SR_RETURN_AMT) as ctr_total_return
> from store_returns
> ,date_dim
> where sr_returned_date_sk = d_date_sk
> and d_year =2000
> group by sr_customer_sk
> ,sr_store_sk)
>  select  c_customer_id
> from customer_total_return ctr1
> ,store
> ,customer
> where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2
> from customer_total_return ctr2
> where ctr1.ctr_store_sk = ctr2.ctr_store_sk)
> and s_store_sk = ctr1.ctr_store_sk
> and s_state = 'TN'
> and ctr1.ctr_customer_sk = c_customer_sk
> order by c_customer_id
>  limit 100
> {noformat}
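
For reference, a minimal sketch of the regression shape (hypothetical table r(k, v); the names are ours, not the TPC-DS schema): a CTE referenced twice and correlated through a scalar subquery, which is the pattern the duplicated ctr_store_sk attribute in the !Filter points at.
{noformat}
with agg as (select k, sum(v) as total from r group by k)
select a1.k
  from agg a1
 where a1.total > (select avg(total) * 1.2
                     from agg a2
                    where a2.k = a1.k);
{noformat}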






[jira] [Commented] (SPARK-15122) TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates

2016-05-04 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271128#comment-15271128
 ] 

JESSE CHEN commented on SPARK-15122:


Query 41, official version:
{noformat}
  select  distinct(i_product_name)
 from item i1
 where i_manufact_id between 738 and 738+40
   and (select count(*) as item_cnt
from item
where (i_manufact = i1.i_manufact and
((i_category = 'Women' and
(i_color = 'powder' or i_color = 'khaki') and
(i_units = 'Ounce' or i_units = 'Oz') and
(i_size = 'medium' or i_size = 'extra large')
) or
(i_category = 'Women' and
(i_color = 'brown' or i_color = 'honeydew') and
(i_units = 'Bunch' or i_units = 'Ton') and
(i_size = 'N/A' or i_size = 'small')
) or
(i_category = 'Men' and
(i_color = 'floral' or i_color = 'deep') and
(i_units = 'N/A' or i_units = 'Dozen') and
(i_size = 'petite' or i_size = 'large')
) or
(i_category = 'Men' and
(i_color = 'light' or i_color = 'cornflower') and
(i_units = 'Box' or i_units = 'Pound') and
(i_size = 'medium' or i_size = 'extra large')
))) or
   (i_manufact = i1.i_manufact and
((i_category = 'Women' and
(i_color = 'midnight' or i_color = 'snow') and
(i_units = 'Pallet' or i_units = 'Gross') and
(i_size = 'medium' or i_size = 'extra large')
) or
(i_category = 'Women' and
(i_color = 'cyan' or i_color = 'papaya') and
(i_units = 'Cup' or i_units = 'Dram') and
(i_size = 'N/A' or i_size = 'small')
) or
(i_category = 'Men' and
(i_color = 'orange' or i_color = 'frosted') and
(i_units = 'Each' or i_units = 'Tbl') and
(i_size = 'petite' or i_size = 'large')
) or
(i_category = 'Men' and
(i_color = 'forest' or i_color = 'ghost') and
(i_units = 'Lb' or i_units = 'Bundle') and
(i_size = 'medium' or i_size = 'extra large')
)))) > 0
 order by i_product_name
  limit 100;
{noformat}
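
For anyone skimming, a stripped-down sketch of the shape that was rejected (hypothetical table t(k, name, color); the names are ours): a correlated scalar subquery whose filter mixes the correlating equality with disjunctive predicates.
{noformat}
select t1.name
  from t t1
 where (select count(*)
          from t t2
         where t2.k = t1.k
           and (t2.color = 'powder' or t2.color = 'khaki')) > 0;
{noformat}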


> TPC-DS Query 41 fails with The correlated scalar subquery can only contain 
> equality predicates
> -
>
> Key: SPARK-15122
> URL: https://issues.apache.org/jira/browse/SPARK-15122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>Priority: Critical
>
> The official TPC-DS query 41 fails with the following error:
> {noformat}
> Error in query: The correlated scalar subquery can only contain equality 
> predicates: (((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) 
> && ((i_color#41 = powder) || (i_color#41 = khaki))) && (((i_units#42 = Ounce) 
> || (i_units#42 = Oz)) && ((i_size#39 = medium) || (i_size#39 = extra 
> large || (((i_category#36 = Women) && ((i_color#41 = brown) || 
> (i_color#41 = honeydew))) && (((i_units#42 = Bunch) || (i_units#42 = Ton)) && 
> ((i_size#39 = N/A) || (i_size#39 = small) || i_category#36 = Men) && 
> ((i_color#41 = floral) || (i_color#41 = deep))) && (((i_units#42 = N/A) || 
> (i_units#42 = Dozen)) && ((i_size#39 = petite) || (i_size#39 = large || 
> (((i_category#36 = Men) && ((i_color#41 = light) || (i_color#41 = 
> cornflower))) && (((i_units#42 = Box) || (i_units#42 = Pound)) && ((i_size#39 
> = medium) || (i_size#39 = extra large))) || ((i_manufact#38 = 
> i_manufact#16) && (i_category#36 = Women) && ((i_color#41 = midnight) || 
> (i_color#41 = snow))) && (((i_units#42 = Pallet) || (i_units#42 = Gross)) && 
> ((i_size#39 = medium) || (i_size#39 = extra large || (((i_category#36 = 
> Women) && ((i_color#41 = cyan) || (i_color#41 = papaya))) && (((i_units#42 = 
> Cup) || (i_units#42 = Dram)) && ((i_size#39 = N/A) || (i_size#39 = small) 
> || i_category#36 = Men) && ((i_color#41 = orange) || (i_color#41 = 
> frosted))) && (((i_units#42 = Each) || (i_units#42 = Tbl)) && ((i_size#39 = 
> petite) || (i_size#39 = large || (((i_category#36 = Men) && ((i_color#41 
> = forest) || (i_color#41 = ghost))) && (((i_units#42 = Lb) || (i_units#42 = 
> Bundle)) && ((i_size#39 = medium) || (i_size#39 = extra large;
> {noformat}
> The output plans showed the following errors
> {noformat}
> == Parsed Logical Plan ==
> 'GlobalLimit 100
> +- 'LocalLimit 100
>+- 'Sort ['i_product_name ASC], true
>   +- 'Distinct
>  +- 'Project ['i_product_name]
> +- 'Filter ((('i_manufact_id >= 738) && ('i_manufact_id <= (738 + 
> 40))) && (scalar-subquery#1 [] > 0))
>:  +- 'SubqueryAlias scalar-subquery#1 []
>: +- 'Project ['count(1) AS item_cnt#0]
>:+- 'Filter ((('i_manufact = 'i

[jira] [Updated] (SPARK-15122) TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates

2016-05-04 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-15122:
---
Priority: Critical  (was: Major)

> TPC-DS Query 41 fails with The correlated scalar subquery can only contain 
> equality predicates
> -
>
> Key: SPARK-15122
> URL: https://issues.apache.org/jira/browse/SPARK-15122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>Priority: Critical
>
> The official TPC-DS query 41 fails with the following error:
> {noformat}
> Error in query: The correlated scalar subquery can only contain equality 
> predicates: (((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) 
> && ((i_color#41 = powder) || (i_color#41 = khaki))) && (((i_units#42 = Ounce) 
> || (i_units#42 = Oz)) && ((i_size#39 = medium) || (i_size#39 = extra 
> large || (((i_category#36 = Women) && ((i_color#41 = brown) || 
> (i_color#41 = honeydew))) && (((i_units#42 = Bunch) || (i_units#42 = Ton)) && 
> ((i_size#39 = N/A) || (i_size#39 = small) || i_category#36 = Men) && 
> ((i_color#41 = floral) || (i_color#41 = deep))) && (((i_units#42 = N/A) || 
> (i_units#42 = Dozen)) && ((i_size#39 = petite) || (i_size#39 = large || 
> (((i_category#36 = Men) && ((i_color#41 = light) || (i_color#41 = 
> cornflower))) && (((i_units#42 = Box) || (i_units#42 = Pound)) && ((i_size#39 
> = medium) || (i_size#39 = extra large))) || ((i_manufact#38 = 
> i_manufact#16) && (i_category#36 = Women) && ((i_color#41 = midnight) || 
> (i_color#41 = snow))) && (((i_units#42 = Pallet) || (i_units#42 = Gross)) && 
> ((i_size#39 = medium) || (i_size#39 = extra large || (((i_category#36 = 
> Women) && ((i_color#41 = cyan) || (i_color#41 = papaya))) && (((i_units#42 = 
> Cup) || (i_units#42 = Dram)) && ((i_size#39 = N/A) || (i_size#39 = small) 
> || i_category#36 = Men) && ((i_color#41 = orange) || (i_color#41 = 
> frosted))) && (((i_units#42 = Each) || (i_units#42 = Tbl)) && ((i_size#39 = 
> petite) || (i_size#39 = large || (((i_category#36 = Men) && ((i_color#41 
> = forest) || (i_color#41 = ghost))) && (((i_units#42 = Lb) || (i_units#42 = 
> Bundle)) && ((i_size#39 = medium) || (i_size#39 = extra large;
> {noformat}
> The output plans showed the following errors
> {noformat}
> == Parsed Logical Plan ==
> 'GlobalLimit 100
> +- 'LocalLimit 100
>+- 'Sort ['i_product_name ASC], true
>   +- 'Distinct
>  +- 'Project ['i_product_name]
> +- 'Filter ((('i_manufact_id >= 738) && ('i_manufact_id <= (738 + 
> 40))) && (scalar-subquery#1 [] > 0))
>:  +- 'SubqueryAlias scalar-subquery#1 []
>: +- 'Project ['count(1) AS item_cnt#0]
>:+- 'Filter ((('i_manufact = 'i1.i_manufact) && 
> ('i_category = Women) && (('i_color = powder) || ('i_color = khaki))) && 
> ((('i_units = Ounce) || ('i_units = Oz)) && (('i_size = medium) || ('i_size = 
> extra large || ((('i_category = Women) && (('i_color = brown) || 
> ('i_color = honeydew))) && ((('i_units = Bunch) || ('i_units = Ton)) && 
> (('i_size = N/A) || ('i_size = small) || 'i_category = Men) && 
> (('i_color = floral) || ('i_color = deep))) && ((('i_units = N/A) || 
> ('i_units = Dozen)) && (('i_size = petite) || ('i_size = large || 
> ((('i_category = Men) && (('i_color = light) || ('i_color = cornflower))) && 
> ((('i_units = Box) || ('i_units = Pound)) && (('i_size = medium) || ('i_size 
> = extra large))) || (('i_manufact = 'i1.i_manufact) && ('i_category = 
> Women) && (('i_color = midnight) || ('i_color = snow))) && ((('i_units = 
> Pallet) || ('i_units = Gross)) && (('i_size = medium) || ('i_size = extra 
> large || ((('i_category = Women) && (('i_color = cyan) || ('i_color = 
> papaya))) && ((('i_units = Cup) || ('i_units = Dram)) && (('i_size = N/A) || 
> ('i_size = small) || 'i_category = Men) && (('i_color = orange) || 
> ('i_color = frosted))) && ((('i_units = Each) || ('i_units = Tbl)) && 
> (('i_size = petite) || ('i_size = large || ((('i_category = Men) && 
> (('i_color = forest) || ('i_color = ghost))) && ((('i_units = Lb) || 
> ('i_units = Bundle)) && (('i_size = medium) || ('i_size = extra large
>:   +- 'UnresolvedRelation `item`, None
>+- 'UnresolvedRelation `item`, Some(i1)
> == Analyzed Logical Plan ==
> i_product_name: string
> GlobalLimit 100
> +- LocalLimit 100
>+- Sort [i_product_name#24 ASC], true
>   +- Distinct
>  +- Project [i_product_name#24]
> +- Filter (((i_manufact_id#16L >= cast(738 as bigint)) && 
> (i_manufact_id#16L <= cast((738 + 40) as bigint))) && (scalar-subquery#1 
> [(((i_manufact#39 = 

[jira] [Updated] (SPARK-15122) TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates

2016-05-04 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-15122:
---
Description: 
The official TPC-DS query 41 fails with the following error:

{noformat}
Error in query: The correlated scalar subquery can only contain equality 
predicates: (((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) && 
((i_color#41 = powder) || (i_color#41 = khaki))) && (((i_units#42 = Ounce) || 
(i_units#42 = Oz)) && ((i_size#39 = medium) || (i_size#39 = extra large || 
(((i_category#36 = Women) && ((i_color#41 = brown) || (i_color#41 = honeydew))) 
&& (((i_units#42 = Bunch) || (i_units#42 = Ton)) && ((i_size#39 = N/A) || 
(i_size#39 = small) || i_category#36 = Men) && ((i_color#41 = floral) 
|| (i_color#41 = deep))) && (((i_units#42 = N/A) || (i_units#42 = Dozen)) && 
((i_size#39 = petite) || (i_size#39 = large || (((i_category#36 = Men) && 
((i_color#41 = light) || (i_color#41 = cornflower))) && (((i_units#42 = Box) || 
(i_units#42 = Pound)) && ((i_size#39 = medium) || (i_size#39 = extra 
large))) || ((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) 
&& ((i_color#41 = midnight) || (i_color#41 = snow))) && (((i_units#42 = Pallet) 
|| (i_units#42 = Gross)) && ((i_size#39 = medium) || (i_size#39 = extra 
large || (((i_category#36 = Women) && ((i_color#41 = cyan) || (i_color#41 = 
papaya))) && (((i_units#42 = Cup) || (i_units#42 = Dram)) && ((i_size#39 = N/A) 
|| (i_size#39 = small) || i_category#36 = Men) && ((i_color#41 = 
orange) || (i_color#41 = frosted))) && (((i_units#42 = Each) || (i_units#42 = 
Tbl)) && ((i_size#39 = petite) || (i_size#39 = large || (((i_category#36 = 
Men) && ((i_color#41 = forest) || (i_color#41 = ghost))) && (((i_units#42 = Lb) 
|| (i_units#42 = Bundle)) && ((i_size#39 = medium) || (i_size#39 = extra 
large;
{noformat}

The output plans showed the following errors
{noformat}
== Parsed Logical Plan ==
'GlobalLimit 100
+- 'LocalLimit 100
   +- 'Sort ['i_product_name ASC], true
  +- 'Distinct
 +- 'Project ['i_product_name]
+- 'Filter ((('i_manufact_id >= 738) && ('i_manufact_id <= (738 + 
40))) && (scalar-subquery#1 [] > 0))
   :  +- 'SubqueryAlias scalar-subquery#1 []
   : +- 'Project ['count(1) AS item_cnt#0]
   :+- 'Filter ((('i_manufact = 'i1.i_manufact) && 
('i_category = Women) && (('i_color = powder) || ('i_color = khaki))) && 
((('i_units = Ounce) || ('i_units = Oz)) && (('i_size = medium) || ('i_size = 
extra large || ((('i_category = Women) && (('i_color = brown) || ('i_color 
= honeydew))) && ((('i_units = Bunch) || ('i_units = Ton)) && (('i_size = N/A) 
|| ('i_size = small) || 'i_category = Men) && (('i_color = floral) || 
('i_color = deep))) && ((('i_units = N/A) || ('i_units = Dozen)) && (('i_size = 
petite) || ('i_size = large || ((('i_category = Men) && (('i_color = light) 
|| ('i_color = cornflower))) && ((('i_units = Box) || ('i_units = Pound)) && 
(('i_size = medium) || ('i_size = extra large))) || (('i_manufact = 
'i1.i_manufact) && ('i_category = Women) && (('i_color = midnight) || 
('i_color = snow))) && ((('i_units = Pallet) || ('i_units = Gross)) && 
(('i_size = medium) || ('i_size = extra large || ((('i_category = Women) && 
(('i_color = cyan) || ('i_color = papaya))) && ((('i_units = Cup) || ('i_units 
= Dram)) && (('i_size = N/A) || ('i_size = small) || 'i_category = Men) 
&& (('i_color = orange) || ('i_color = frosted))) && ((('i_units = Each) || 
('i_units = Tbl)) && (('i_size = petite) || ('i_size = large || 
((('i_category = Men) && (('i_color = forest) || ('i_color = ghost))) && 
((('i_units = Lb) || ('i_units = Bundle)) && (('i_size = medium) || ('i_size = 
extra large
   :   +- 'UnresolvedRelation `item`, None
   +- 'UnresolvedRelation `item`, Some(i1)

== Analyzed Logical Plan ==
i_product_name: string
GlobalLimit 100
+- LocalLimit 100
   +- Sort [i_product_name#24 ASC], true
  +- Distinct
 +- Project [i_product_name#24]
+- Filter (((i_manufact_id#16L >= cast(738 as bigint)) && 
(i_manufact_id#16L <= cast((738 + 40) as bigint))) && (scalar-subquery#1 
[(((i_manufact#39 = i_manufact#17) && (i_category#37 = Women) && 
((i_color#42 = powder) || (i_color#42 = khaki))) && (((i_units#43 = Ounce) || 
(i_units#43 = Oz)) && ((i_size#40 = medium) || (i_size#40 = extra large || 
(((i_category#37 = Women) && ((i_color#42 = brown) || (i_color#42 = honeydew))) 
&& (((i_units#43 = Bunch) || (i_units#43 = Ton)) && ((i_size#40 = N/A) || 
(i_size#40 = small) || i_category#37 = Men) && ((i_color#42 = floral) 
|| (i_color#42 = deep))) && (((i_units#43 = N/A) || (i_units#43 = Dozen)) && 
((i_size#40 = petite) || (i_size#40 = large || (((i_category#37 = Men) && 
((i_color#42 = light) || (i_color#42 = 

[jira] [Created] (SPARK-15122) TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates

2016-05-04 Thread JESSE CHEN (JIRA)
JESSE CHEN created SPARK-15122:
--

 Summary: TPC-DS Query 41 fails with The correlated scalar subquery 
can only contain equality predicates
 Key: SPARK-15122
 URL: https://issues.apache.org/jira/browse/SPARK-15122
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1
Reporter: JESSE CHEN


Hi, I am testing on Spark 2.0 but don't see an option to select it yet. 

TPC-DS query 23 fails with the following compile error:
Error in query: NoViableAltException(-1@[237:51: ( KW_AS )?])
line 4:33 cannot recognize input near '' '' '' in subquery source
; line 4 pos 33

I could narrow the error to an aggregation on a subquery.

{noformat}
select max(csales) tpcds_cmax
  from (select sum(ss_quantity*ss_sales_price) csales
          from store_sales
         group by ss_customer_sk);
{noformat}
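
A plausible workaround (not verified against this build; the alias name "sub" is ours) is to give the derived table an explicit alias, which the parser appears to expect at the '( KW_AS )?' position:
{noformat}
select max(csales) as tpcds_cmax
  from (select sum(ss_quantity * ss_sales_price) as csales
          from store_sales
         group by ss_customer_sk) sub;
{noformat}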







[jira] [Updated] (SPARK-15122) TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates

2016-05-04 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-15122:
---
Affects Version/s: 2.0.0  (was: 1.6.1)

> TPC-DS Query 41 fails with The correlated scalar subquery can only contain 
> equality predicates
> -
>
> Key: SPARK-15122
> URL: https://issues.apache.org/jira/browse/SPARK-15122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>
> Hi, I am testing on Spark 2.0 but don't see an option to select it yet. 
> TPC-DS query 23 fails with the following compile error:
> Error in query: NoViableAltException(-1@[237:51: ( KW_AS )?])
> line 4:33 cannot recognize input near '' '' '' in subquery 
> source
> ; line 4 pos 33
> I could narrow the error to an aggregation on a subquery.
> select max(csales) tpcds_cmax
>   from (select sum(ss_quantity*ss_sales_price) csales
> from store_sales
> group by ss_customer_sk) ;






[jira] [Updated] (SPARK-14968) TPC-DS query 1 resolved attribute(s) missing

2016-04-29 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14968:
---
Description: 
This is a regression from a week ago. Failed to generate plan for query 1 in 
TPCDS using 0427 build from 
people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/.

Was working in build from 0421.

The error is:
{noformat}
16/04/27 07:00:59 INFO spark.SparkContext: Created broadcast 3 from processCmd 
at CliDriver.java:376
16/04/27 07:00:59 INFO datasources.FileSourceStrategy: Planning scan with bin 
packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 
bytes.
Error in query: resolved attribute(s) ctr_store_sk#2#535 missing from 
ctr_store_sk#2,ctr_total_return#3 in operator !Filter (ctr_store_sk#2#535 = 
ctr_store_sk#2);
16/04/27 07:00:59 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/static/sql,null}
16/04/27 07:00:59 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/SQL/execution/json,null}

{noformat}

The query is:
{noformat}
with customer_total_return as
(select sr_customer_sk as ctr_customer_sk
,sr_store_sk as ctr_store_sk
,sum(SR_RETURN_AMT) as ctr_total_return
from store_returns
,date_dim
where sr_returned_date_sk = d_date_sk
and d_year =2000
group by sr_customer_sk
,sr_store_sk)
 select  c_customer_id
from customer_total_return ctr1
,store
,customer
where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2
from customer_total_return ctr2
where ctr1.ctr_store_sk = ctr2.ctr_store_sk)
and s_store_sk = ctr1.ctr_store_sk
and s_state = 'TN'
and ctr1.ctr_customer_sk = c_customer_sk
order by c_customer_id
 limit 100

{noformat}



  was:
This is a regression from a week ago. Failed to generate plan for query 1 in 
TPCDS using 0427 build from 
people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/.

Was working in build from 0421.

The error is:
{noformat}
16/04/27 07:00:59 INFO spark.SparkContext: Created broadcast 3 from processCmd 
at CliDriver.java:376
16/04/27 07:00:59 INFO datasources.FileSourceStrategy: Planning scan with bin 
packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 
bytes.
Error in query: resolved attribute(s) ctr_store_sk#2#535 missing from 
ctr_store_sk#2,ctr_total_return#3 in operator !Filter (ctr_store_sk#2#535 = 
ctr_store_sk#2);
16/04/27 07:00:59 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/static/sql,null}
16/04/27 07:00:59 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/SQL/execution/json,null}

{noformat}

The query is:
{noformat}
(select sr_customer_sk as ctr_customer_sk
,sr_store_sk as ctr_store_sk
,sum(SR_RETURN_AMT) as ctr_total_return
from store_returns
,date_dim
where sr_returned_date_sk = d_date_sk
and d_year =2000
group by sr_customer_sk
,sr_store_sk)
 select  c_customer_id
from customer_total_return ctr1
,store
,customer
where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2
from customer_total_return ctr2
where ctr1.ctr_store_sk = ctr2.ctr_store_sk)
and s_store_sk = ctr1.ctr_store_sk
and s_state = 'TN'
and ctr1.ctr_customer_sk = c_customer_sk
order by c_customer_id
 limit 100

{noformat}




> TPC-DS query 1 resolved attribute(s) missing
> 
>
> Key: SPARK-14968
> URL: https://issues.apache.org/jira/browse/SPARK-14968
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>Priority: Critical
>
> This is a regression from a week ago. Failed to generate plan for query 1 in 
> TPCDS using 0427 build from 
> people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/.
> Was working in build from 0421.
> The error is:
> {noformat}
> 16/04/27 07:00:59 INFO spark.SparkContext: Created broadcast 3 from 
> processCmd at CliDriver.java:376
> 16/04/27 07:00:59 INFO datasources.FileSourceStrategy: Planning scan with bin 
> packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 
> bytes.
> Error in query: resolved attribute(s) ctr_store_sk#2#535 missing from 
> ctr_store_sk#2,ctr_total_return#3 in operator !Filter (ctr_store_sk#2#535 = 
> ctr_store_sk#2);
> 16/04/27 07:00:59 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/static/sql,null}
> 16/04/27 07:00:59 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/SQL/execution/json,null}
> {noformat}
> The query is:
> {noformat}
> with customer_total_return as
> (select sr_customer_sk as ctr_customer_sk
> ,sr_store_sk as ctr_store_sk
> ,sum(SR_RETURN_AMT) as ctr_total_return
> from store_returns
> ,date_dim
> where sr_returned_date_sk = d_date_sk
> and d_year =2000
> group by sr_customer_sk
> ,sr_store_sk)
>  select  c_customer_id
> from customer_total_return ctr1
> ,store
> ,customer
> where ctr1.ctr_total_return > (select avg(ctr_tota

[jira] [Updated] (SPARK-14968) TPC-DS query 1 resolved attribute(s) missing

2016-04-27 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14968:
---
Summary: TPC-DS query 1 resolved attribute(s) missing  (was: TPC-DS query 1 
fails to generate plan)

> TPC-DS query 1 resolved attribute(s) missing
> 
>
> Key: SPARK-14968
> URL: https://issues.apache.org/jira/browse/SPARK-14968
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>Priority: Critical
>
> This is a regression from a week ago. Failed to generate plan for query 1 in 
> TPCDS using 0427 build from 
> people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/.
> Was working in build from 0421.
> The error is:
> {noformat}
> 16/04/27 07:00:59 INFO spark.SparkContext: Created broadcast 3 from 
> processCmd at CliDriver.java:376
> 16/04/27 07:00:59 INFO datasources.FileSourceStrategy: Planning scan with bin 
> packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 
> bytes.
> Error in query: resolved attribute(s) ctr_store_sk#2#535 missing from 
> ctr_store_sk#2,ctr_total_return#3 in operator !Filter (ctr_store_sk#2#535 = 
> ctr_store_sk#2);
> 16/04/27 07:00:59 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/static/sql,null}
> 16/04/27 07:00:59 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/SQL/execution/json,null}
> {noformat}
> The query is:
> {noformat}
> (select sr_customer_sk as ctr_customer_sk
> ,sr_store_sk as ctr_store_sk
> ,sum(SR_RETURN_AMT) as ctr_total_return
> from store_returns
> ,date_dim
> where sr_returned_date_sk = d_date_sk
> and d_year =2000
> group by sr_customer_sk
> ,sr_store_sk)
>  select  c_customer_id
> from customer_total_return ctr1
> ,store
> ,customer
> where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2
> from customer_total_return ctr2
> where ctr1.ctr_store_sk = ctr2.ctr_store_sk)
> and s_store_sk = ctr1.ctr_store_sk
> and s_state = 'TN'
> and ctr1.ctr_customer_sk = c_customer_sk
> order by c_customer_id
>  limit 100
> {noformat}






[jira] [Updated] (SPARK-14968) TPC-DS query 1 fails to generate plan

2016-04-27 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14968:
---
Description: 
This is a regression from a week ago. Failed to generate plan for query 1 in 
TPCDS using 0427 build from 
people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/.

Was working in build from 0421.

The error is:
{noformat}
16/04/27 07:00:59 INFO spark.SparkContext: Created broadcast 3 from processCmd 
at CliDriver.java:376
16/04/27 07:00:59 INFO datasources.FileSourceStrategy: Planning scan with bin 
packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 
bytes.
Error in query: resolved attribute(s) ctr_store_sk#2#535 missing from 
ctr_store_sk#2,ctr_total_return#3 in operator !Filter (ctr_store_sk#2#535 = 
ctr_store_sk#2);
16/04/27 07:00:59 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/static/sql,null}
16/04/27 07:00:59 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/SQL/execution/json,null}

{noformat}

The query is:
{noformat}
with customer_total_return as
(select sr_customer_sk as ctr_customer_sk
,sr_store_sk as ctr_store_sk
,sum(SR_RETURN_AMT) as ctr_total_return
from store_returns
,date_dim
where sr_returned_date_sk = d_date_sk
and d_year =2000
group by sr_customer_sk
,sr_store_sk)
 select  c_customer_id
from customer_total_return ctr1
,store
,customer
where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2
from customer_total_return ctr2
where ctr1.ctr_store_sk = ctr2.ctr_store_sk)
and s_store_sk = ctr1.ctr_store_sk
and s_state = 'TN'
and ctr1.ctr_customer_sk = c_customer_sk
order by c_customer_id
 limit 100

{noformat}



  was:
This is a regression from a week ago. Failed to generate plan for query 1 in 
TPCDS using 0427 build from 
people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/.

Was working in 


> TPC-DS query 1 fails to generate plan
> -
>
> Key: SPARK-14968
> URL: https://issues.apache.org/jira/browse/SPARK-14968
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>Priority: Critical
>
> This is a regression from a week ago. Failed to generate plan for query 1 in 
> TPCDS using 0427 build from 
> people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/.
> Was working in build from 0421.
> The error is:
> {noformat}
> 16/04/27 07:00:59 INFO spark.SparkContext: Created broadcast 3 from 
> processCmd at CliDriver.java:376
> 16/04/27 07:00:59 INFO datasources.FileSourceStrategy: Planning scan with bin 
> packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 
> bytes.
> Error in query: resolved attribute(s) ctr_store_sk#2#535 missing from 
> ctr_store_sk#2,ctr_total_return#3 in operator !Filter (ctr_store_sk#2#535 = 
> ctr_store_sk#2);
> 16/04/27 07:00:59 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/static/sql,null}
> 16/04/27 07:00:59 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/SQL/execution/json,null}
> {noformat}
> The query is:
> {noformat}
> with customer_total_return as
> (select sr_customer_sk as ctr_customer_sk
> ,sr_store_sk as ctr_store_sk
> ,sum(SR_RETURN_AMT) as ctr_total_return
> from store_returns
> ,date_dim
> where sr_returned_date_sk = d_date_sk
> and d_year =2000
> group by sr_customer_sk
> ,sr_store_sk)
>  select  c_customer_id
> from customer_total_return ctr1
> ,store
> ,customer
> where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2
> from customer_total_return ctr2
> where ctr1.ctr_store_sk = ctr2.ctr_store_sk)
> and s_store_sk = ctr1.ctr_store_sk
> and s_state = 'TN'
> and ctr1.ctr_customer_sk = c_customer_sk
> order by c_customer_id
>  limit 100
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14968) TPC-DS query 1 fails to generate plan

2016-04-27 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14968:
---
Affects Version/s: (was: 1.6.1)
   2.0.0

> TPC-DS query 1 fails to generate plan
> -
>
> Key: SPARK-14968
> URL: https://issues.apache.org/jira/browse/SPARK-14968
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>
> This is a regression from a week ago. Failed to generate plan for query 1 in 
> TPCDS using 0427 build from 
> people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/.
> Was working in 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14968) TPC-DS query 1 fails to generate plan

2016-04-27 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14968:
---
Priority: Critical  (was: Major)

> TPC-DS query 1 fails to generate plan
> -
>
> Key: SPARK-14968
> URL: https://issues.apache.org/jira/browse/SPARK-14968
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>Priority: Critical
>
> This is a regression from a week ago. Failed to generate plan for query 1 in 
> TPCDS using 0427 build from 
> people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/.
> Was working in 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14968) TPC-DS query 1 fails to generate plan

2016-04-27 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14968:
---
Description: 
This is a regression from a week ago. Failed to generate plan for query 1 in 
TPCDS using 0427 build from 
people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/.

Was working in 

  was:
Hi, I am testing on Spark 2.0 but don't see an option to select it yet. 

TPC-DS query 23 fails with the compile error:
Error in query: NoViableAltException(-1@[237:51: ( KW_AS )?])
line 4:33 cannot recognize input near '<EOF>' '<EOF>' '<EOF>' in subquery source
; line 4 pos 33

I was able to narrow the error down to an aggregation on a subquery.

select max(csales) tpcds_cmax
  from (select sum(ss_quantity*ss_sales_price) csales
from store_sales
group by ss_customer_sk) ;



> TPC-DS query 1 fails to generate plan
> -
>
> Key: SPARK-14968
> URL: https://issues.apache.org/jira/browse/SPARK-14968
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>
> This is a regression from a week ago. Failed to generate plan for query 1 in 
> TPCDS using 0427 build from 
> people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/.
> Was working in 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14968) TPC-DS query 1 fails to generate plan

2016-04-27 Thread JESSE CHEN (JIRA)
JESSE CHEN created SPARK-14968:
--

 Summary: TPC-DS query 1 fails to generate plan
 Key: SPARK-14968
 URL: https://issues.apache.org/jira/browse/SPARK-14968
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.1
Reporter: JESSE CHEN


Hi, I am testing on Spark 2.0 but don't see an option to select it yet. 

TPC-DS query 23 fails with the compile error:
Error in query: NoViableAltException(-1@[237:51: ( KW_AS )?])
line 4:33 cannot recognize input near '<EOF>' '<EOF>' '<EOF>' in subquery source
; line 4 pos 33

I was able to narrow the error down to an aggregation on a subquery.

select max(csales) tpcds_cmax
  from (select sum(ss_quantity*ss_sales_price) csales
from store_sales
group by ss_customer_sk) ;
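
A hypothetical sketch of driving that narrowed-down statement through PySpark,
assuming store_sales is already registered as a table. Note the subquery
deliberately carries no alias; the "( KW_AS )?" fragment in the error message
suggests the parser trips right at the optional alias after the subquery:
{code:title=subquery-agg-repro.py}
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

# Expected to hit the NoViableAltException above on the affected build;
# giving the subquery an alias makes the statement parse.
sqlc.sql("""
select max(csales) tpcds_cmax
  from (select sum(ss_quantity*ss_sales_price) csales
        from store_sales
        group by ss_customer_sk)
""")
{code}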




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS

2016-04-25 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256836#comment-15256836
 ] 

JESSE CHEN commented on SPARK-14521:


This fix will allow us to use Kryo again (in the spark-sql shell and 
spark-submit). Somehow the workaround described above did not work for me in 
Spark 2.0. Did I miss something else? My workaround for now is actually to use 
the Java serializer, which takes a performance hit. 
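
For reference, a minimal sketch of that Java-serializer fallback; the app name
is made up, and the property has to be set before the SparkContext is created,
since spark.serializer cannot be changed on a running context:
{code:title=java-serializer-fallback.py}
from pyspark import SparkConf, SparkContext

# Fall back from Kryo to the JDK serializer; slower, but avoids the
# StackOverflowError until the fix lands.
conf = (SparkConf()
        .setAppName("tpcds-run")
        .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer"))
sc = SparkContext(conf=conf)
{code}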

> StackOverflowError in Kryo when executing TPC-DS
> 
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Rajesh Balamohan
>Priority: Blocker
>
> Build details:  Spark build from master branch (Apr-10)
> DataSet:TPC-DS at 200 GB scale in Parq format stored in hive.
> Client: $SPARK_HOME/bin/beeline 
> Query:  TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyways)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14096) SPARK-SQL CLI returns NPE

2016-04-21 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN closed SPARK-14096.
--
Resolution: Duplicate

SPARK-14521

> SPARK-SQL CLI returns NPE
> -
>
> Key: SPARK-14096
> URL: https://issues.apache.org/jira/browse/SPARK-14096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>
> Trying to run TPCDS query 06 in spark-sql shell received the following error 
> in the middle of a stage; but running another query 38 succeeded:
> NPE:
> {noformat}
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 65.0 in stage 
> 10.0 (TID 622) in 171 ms on localhost (30/200)
> 16/03/22 15:12:56 ERROR scheduler.TaskResultGetter: Exception while getting 
> task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1790)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:645)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:344)
>   at java.util.PriorityQueue.add(PriorityQueue.java:321)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
>   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
>   ... 15 more
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 66.0 in stage 
> 10.0 (TID 623) in 171 ms on localhost (31/200)
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> {noformat}
> query 06 (caused the above NPE):
> {noformat}
>  select  a.ca_state state, count(*) cnt
>  from customer_address a
>  join customer c on a.ca_address_sk = c.c_current_addr_sk
>  join store_sales s on c.c_customer_sk = s.ss_customer_sk
>  join date_dim d on s.ss_sold_date_sk = d.d_date_sk
>  join item i on s.ss_item_sk = i.i_item_sk
>  join (select distinct d_month_seq
> from date_dim
>where d_year = 2001
>   and d_moy = 1 ) tmp1 ON d.d_month_seq = tmp1.d_month_seq
>  join
>   (select j.i_category, avg(j.i_current_price) as avg_i_current_price
>from item j group by j.i_category) tmp2 on tmp2.i_category = 
> i.i_category
>  where  
>   i.i_current_price > 1.2 * tmp2.avg_i_current_price
>  group by a.ca_state
>  having count(*) >= 10
>  order by cnt 
>limit 100;
> {noformat}
> query 38 (succeeded)
> {noform

[jira] [Commented] (SPARK-14096) SPARK-SQL CLI returns NPE

2016-04-21 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252597#comment-15252597
 ] 

JESSE CHEN commented on SPARK-14096:


But the simplest workaround for now is to set 
spark.serializer org.apache.spark.serializer.JavaSerializer 
in the job's Spark configuration.
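
The same setting can be passed on the command line; a sketch, with the rest of
the invocation left as whatever the job already uses:
{noformat}
bin/spark-sql --conf spark.serializer=org.apache.spark.serializer.JavaSerializer ...
{noformat}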

> SPARK-SQL CLI returns NPE
> -
>
> Key: SPARK-14096
> URL: https://issues.apache.org/jira/browse/SPARK-14096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>
> Trying to run TPCDS query 06 in spark-sql shell received the following error 
> in the middle of a stage; but running another query 38 succeeded:
> NPE:
> {noformat}
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 65.0 in stage 
> 10.0 (TID 622) in 171 ms on localhost (30/200)
> 16/03/22 15:12:56 ERROR scheduler.TaskResultGetter: Exception while getting 
> task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1790)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:645)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:344)
>   at java.util.PriorityQueue.add(PriorityQueue.java:321)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
>   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
>   ... 15 more
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 66.0 in stage 
> 10.0 (TID 623) in 171 ms on localhost (31/200)
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> {noformat}
> query 06 (caused the above NPE):
> {noformat}
>  select  a.ca_state state, count(*) cnt
>  from customer_address a
>  join customer c on a.ca_address_sk = c.c_current_addr_sk
>  join store_sales s on c.c_customer_sk = s.ss_customer_sk
>  join date_dim d on s.ss_sold_date_sk = d.d_date_sk
>  join item i on s.ss_item_sk = i.i_item_sk
>  join (select distinct d_month_seq
> from date_dim
>where d_year = 2001
>   and d_moy = 1 ) tmp1 ON d.d_month_seq = tmp1.d_month_seq
>  join
>   (select j.i_category, avg(j.i_current_price) as avg_i_current_price
>from item j group by j.i_category) tmp2 on tmp2.i_category = 
> i.i_category
>  where  
>   i.i_current_price > 1.2 * tmp2.avg_i_curren

[jira] [Commented] (SPARK-14096) SPARK-SQL CLI returns NPE

2016-04-21 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252592#comment-15252592
 ] 

JESSE CHEN commented on SPARK-14096:


duplicate of SPARK-14521

> SPARK-SQL CLI returns NPE
> -
>
> Key: SPARK-14096
> URL: https://issues.apache.org/jira/browse/SPARK-14096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>
> Trying to run TPCDS query 06 in spark-sql shell received the following error 
> in the middle of a stage; but running another query 38 succeeded:
> NPE:
> {noformat}
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 65.0 in stage 
> 10.0 (TID 622) in 171 ms on localhost (30/200)
> 16/03/22 15:12:56 ERROR scheduler.TaskResultGetter: Exception while getting 
> task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1790)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:645)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:344)
>   at java.util.PriorityQueue.add(PriorityQueue.java:321)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
>   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
>   ... 15 more
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 66.0 in stage 
> 10.0 (TID 623) in 171 ms on localhost (31/200)
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> {noformat}
> query 06 (caused the above NPE):
> {noformat}
>  select  a.ca_state state, count(*) cnt
>  from customer_address a
>  join customer c on a.ca_address_sk = c.c_current_addr_sk
>  join store_sales s on c.c_customer_sk = s.ss_customer_sk
>  join date_dim d on s.ss_sold_date_sk = d.d_date_sk
>  join item i on s.ss_item_sk = i.i_item_sk
>  join (select distinct d_month_seq
> from date_dim
>where d_year = 2001
>   and d_moy = 1 ) tmp1 ON d.d_month_seq = tmp1.d_month_seq
>  join
>   (select j.i_category, avg(j.i_current_price) as avg_i_current_price
>from item j group by j.i_category) tmp2 on tmp2.i_category = 
> i.i_category
>  where  
>   i.i_current_price > 1.2 * tmp2.avg_i_current_price
>  group by a.ca_state
>  having count(*) >= 10
>  order by cnt 
>limit 100;

[jira] [Closed] (SPARK-14616) TreeNodeException running Q44 and 58 on Parquet tables

2016-04-14 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN closed SPARK-14616.
--
Resolution: Not A Problem

> TreeNodeException running Q44 and 58 on Parquet tables
> --
>
> Key: SPARK-14616
> URL: https://issues.apache.org/jira/browse/SPARK-14616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>
> {code:title=tpcds q44}
>  select  asceding.rnk, i1.i_product_name best_performing, i2.i_product_name 
> worst_performing
> from(select *
>  from (select item_sk,rank() over (order by rank_col asc) rnk
>from (select ss_item_sk item_sk,avg(ss_net_profit) rank_col
>  from store_sales ss1
>  where ss_store_sk = 4
>  group by ss_item_sk
>  having avg(ss_net_profit) > 0.9*(select avg(ss_net_profit) 
> rank_col
>   from store_sales
>   where ss_store_sk = 4
> and ss_addr_sk is null
>   group by ss_store_sk))V1)V11
>  where rnk  < 11) asceding,
> (select *
>  from (select item_sk,rank() over (order by rank_col desc) rnk
>from (select ss_item_sk item_sk,avg(ss_net_profit) rank_col
>  from store_sales ss1
>  where ss_store_sk = 4
>  group by ss_item_sk
>  having avg(ss_net_profit) > 0.9*(select avg(ss_net_profit) 
> rank_col
>   from store_sales
>   where ss_store_sk = 4
> and ss_addr_sk is null
>   group by ss_store_sk))V2)V21
>  where rnk  < 11) descending,
> item i1,
> item i2
> where asceding.rnk = descending.rnk
>   and i1.i_item_sk=asceding.item_sk
>   and i2.i_item_sk=descending.item_sk
> order by asceding.rnk
>  limit 100;
> {code}
> {noformat}
> bin/spark-sql  --driver-memory 10g --verbose --master yarn-client  --packages 
> com.databricks:spark-csv_2.10:1.3.0 --executor-memory 4g --num-executors 80 
> --executor-cores 2 --database hadoopds1g  -f q44.sql
> {noformat}
> {noformat}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange SinglePartition, None
> +- WholeStageCodegen
>:  +- Project [item_sk#0,rank_col#1]
>: +- Filter havingCondition#219: boolean
>:+- TungstenAggregate(key=[ss_item_sk#12], 
> functions=[(avg(ss_net_profit#32),mode=Final,isDistinct=false)], 
> output=[havingCondition#219,item_sk#0,rank_col#1])
>:   +- INPUT
>+- Exchange hashpartitioning(ss_item_sk#12,200), None
>   +- WholeStageCodegen
>  :  +- TungstenAggregate(key=[ss_item_sk#12], 
> functions=[(avg(ss_net_profit#32),mode=Partial,isDistinct=false)], 
> output=[ss_item_sk#12,sum#612,count#613L])
>  : +- Project [ss_item_sk#12,ss_net_profit#32]
>  :+- Filter (ss_store_sk#17 = 4)
>  :   +- INPUT
>  +- Scan ParquetRelation: 
> hadoopds1g.store_sales[ss_item_sk#12,ss_net_profit#32,ss_store_sk#17] 
> InputPaths: 
> hdfs://bigaperf116.svl.ibm.com:8020/apps/hive/warehouse/hadoopds1g.db/store_sales,
>  PushedFilters: [EqualTo(ss_store_sk,4)]
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
> at 
> org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:105)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
> at org.apache.spark.sql.execution.Sort.doExecute(Sort.scala:60)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
> at org.apache.spark.sql.execution.Window.doExecute(Window.scala:288)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116)
> at 
> org.apache.spark.rdd.

[jira] [Commented] (SPARK-14616) TreeNodeException running Q44 and 58 on Parquet tables

2016-04-14 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241860#comment-15241860
 ] 

JESSE CHEN commented on SPARK-14616:


The build from yesterday did not have this problem. Closing.

> TreeNodeException running Q44 and 58 on Parquet tables
> --
>
> Key: SPARK-14616
> URL: https://issues.apache.org/jira/browse/SPARK-14616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>
> {code:title=tpcds q44}
>  select  asceding.rnk, i1.i_product_name best_performing, i2.i_product_name 
> worst_performing
> from(select *
>  from (select item_sk,rank() over (order by rank_col asc) rnk
>from (select ss_item_sk item_sk,avg(ss_net_profit) rank_col
>  from store_sales ss1
>  where ss_store_sk = 4
>  group by ss_item_sk
>  having avg(ss_net_profit) > 0.9*(select avg(ss_net_profit) 
> rank_col
>   from store_sales
>   where ss_store_sk = 4
> and ss_addr_sk is null
>   group by ss_store_sk))V1)V11
>  where rnk  < 11) asceding,
> (select *
>  from (select item_sk,rank() over (order by rank_col desc) rnk
>from (select ss_item_sk item_sk,avg(ss_net_profit) rank_col
>  from store_sales ss1
>  where ss_store_sk = 4
>  group by ss_item_sk
>  having avg(ss_net_profit) > 0.9*(select avg(ss_net_profit) 
> rank_col
>   from store_sales
>   where ss_store_sk = 4
> and ss_addr_sk is null
>   group by ss_store_sk))V2)V21
>  where rnk  < 11) descending,
> item i1,
> item i2
> where asceding.rnk = descending.rnk
>   and i1.i_item_sk=asceding.item_sk
>   and i2.i_item_sk=descending.item_sk
> order by asceding.rnk
>  limit 100;
> {code}
> {noformat}
> bin/spark-sql  --driver-memory 10g --verbose --master yarn-client  --packages 
> com.databricks:spark-csv_2.10:1.3.0 --executor-memory 4g --num-executors 80 
> --executor-cores 2 --database hadoopds1g  -f q44.sql
> {noformat}
> {noformat}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange SinglePartition, None
> +- WholeStageCodegen
>:  +- Project [item_sk#0,rank_col#1]
>: +- Filter havingCondition#219: boolean
>:+- TungstenAggregate(key=[ss_item_sk#12], 
> functions=[(avg(ss_net_profit#32),mode=Final,isDistinct=false)], 
> output=[havingCondition#219,item_sk#0,rank_col#1])
>:   +- INPUT
>+- Exchange hashpartitioning(ss_item_sk#12,200), None
>   +- WholeStageCodegen
>  :  +- TungstenAggregate(key=[ss_item_sk#12], 
> functions=[(avg(ss_net_profit#32),mode=Partial,isDistinct=false)], 
> output=[ss_item_sk#12,sum#612,count#613L])
>  : +- Project [ss_item_sk#12,ss_net_profit#32]
>  :+- Filter (ss_store_sk#17 = 4)
>  :   +- INPUT
>  +- Scan ParquetRelation: 
> hadoopds1g.store_sales[ss_item_sk#12,ss_net_profit#32,ss_store_sk#17] 
> InputPaths: 
> hdfs://bigaperf116.svl.ibm.com:8020/apps/hive/warehouse/hadoopds1g.db/store_sales,
>  PushedFilters: [EqualTo(ss_store_sk,4)]
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
> at 
> org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:105)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
> at org.apache.spark.sql.execution.Sort.doExecute(Sort.scala:60)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
> at org.apache.spark.sql.execution.Window.doExecute(Window.scala:288)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
> at 
> org.apache.spark.sql.execution.SparkP

[jira] [Updated] (SPARK-14616) TreeNodeException running Q44 and 58 on Parquet tables

2016-04-13 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14616:
---
Description: 
{code:title=tpcds q44}
 select  asceding.rnk, i1.i_product_name best_performing, i2.i_product_name 
worst_performing
from(select *
 from (select item_sk,rank() over (order by rank_col asc) rnk
   from (select ss_item_sk item_sk,avg(ss_net_profit) rank_col
 from store_sales ss1
 where ss_store_sk = 4
 group by ss_item_sk
 having avg(ss_net_profit) > 0.9*(select avg(ss_net_profit) 
rank_col
  from store_sales
  where ss_store_sk = 4
and ss_addr_sk is null
  group by ss_store_sk))V1)V11
 where rnk  < 11) asceding,
(select *
 from (select item_sk,rank() over (order by rank_col desc) rnk
   from (select ss_item_sk item_sk,avg(ss_net_profit) rank_col
 from store_sales ss1
 where ss_store_sk = 4
 group by ss_item_sk
 having avg(ss_net_profit) > 0.9*(select avg(ss_net_profit) 
rank_col
  from store_sales
  where ss_store_sk = 4
and ss_addr_sk is null
  group by ss_store_sk))V2)V21
 where rnk  < 11) descending,
item i1,
item i2
where asceding.rnk = descending.rnk
  and i1.i_item_sk=asceding.item_sk
  and i2.i_item_sk=descending.item_sk
order by asceding.rnk
 limit 100;

{code}

{noformat}
bin/spark-sql  --driver-memory 10g --verbose --master yarn-client  --packages 
com.databricks:spark-csv_2.10:1.3.0 --executor-memory 4g --num-executors 80 
--executor-cores 2 --database hadoopds1g  -f q44.sql
{noformat}

{noformat}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition, None
+- WholeStageCodegen
   :  +- Project [item_sk#0,rank_col#1]
   : +- Filter havingCondition#219: boolean
   :+- TungstenAggregate(key=[ss_item_sk#12], 
functions=[(avg(ss_net_profit#32),mode=Final,isDistinct=false)], 
output=[havingCondition#219,item_sk#0,rank_col#1])
   :   +- INPUT
   +- Exchange hashpartitioning(ss_item_sk#12,200), None
  +- WholeStageCodegen
 :  +- TungstenAggregate(key=[ss_item_sk#12], 
functions=[(avg(ss_net_profit#32),mode=Partial,isDistinct=false)], 
output=[ss_item_sk#12,sum#612,count#613L])
 : +- Project [ss_item_sk#12,ss_net_profit#32]
 :+- Filter (ss_store_sk#17 = 4)
 :   +- INPUT
 +- Scan ParquetRelation: 
hadoopds1g.store_sales[ss_item_sk#12,ss_net_profit#32,ss_store_sk#17] 
InputPaths: 
hdfs://bigaperf116.svl.ibm.com:8020/apps/hive/warehouse/hadoopds1g.db/store_sales,
 PushedFilters: [EqualTo(ss_store_sk,4)]

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:105)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.Sort.doExecute(Sort.scala:60)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.Window.doExecute(Window.scala:288)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at 
org.apache.spark.sql.execution.InputAdapter.upstream(WholeStageCodegen.scala:176)
at 
org.apache.spark.sql.execution.Filter.upstream(basicOperators.scala:73)
at 
org.apache.spark.sql.execution.Project.upstream(basicOperators.scala:35)
at 
org.apache.spark.sql.execution.WholeStageCodegen.doExecute(WholeStageCodegen.scala:279)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$exec

[jira] [Updated] (SPARK-14616) TreeNodeException running Q44 and 58 on Parquet tables

2016-04-13 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14616:
---
Environment: (was: spark 1.5.1 (official binary distribution) running 
on hadoop yarn 2.6 with parquet 1.5.0 (both from cdh5.4.8))

> TreeNodeException running Q44 and 58 on Parquet tables
> --
>
> Key: SPARK-14616
> URL: https://issues.apache.org/jira/browse/SPARK-14616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>
> {code:title=/tmp/bug.py}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext, Row
> sc = SparkContext()
> sqlc = SQLContext(sc)
> R = Row('id', 'foo')
> r = sqlc.createDataFrame(sc.parallelize([R('abc', 'foo')]))
> q = sqlc.createDataFrame(sc.parallelize([R('a' * 24, 'bar')]))  # 24-char id; original literal lost in formatting
> q.write.parquet('/tmp/1.parq')
> q = sqlc.read.parquet('/tmp/1.parq')
> j = r.join(q, r.id == q.id)
> print j.count()
> {code}
> {noformat}
> [user@sandbox test]$ spark-submit --executor-memory=32g /tmp/bug.py
> [user@sandbox test]$ hadoop fs -rmr /tmp/1.parq
> {noformat}
> {noformat}
> 15/11/04 04:28:38 INFO codegen.GenerateUnsafeProjection: Code generated in 
> 119.90324 ms
> Traceback (most recent call last):
>   File "/tmp/bug.py", line 13, in 
> print j.count()
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 
> 268, in count
>   File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", 
> line 538, in __call__
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, 
> in deco
>   File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", 
> line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o148.count.
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, 
> tree:
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[count#10L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#13L])
>TungstenProject
> BroadcastHashJoin [id#0], [id#8], BuildRight
>  TungstenProject [id#0]
>   Scan PhysicalRDD[id#0,foo#1]
>  ConvertToUnsafe
>   Scan ParquetRelation[hdfs:///tmp/1.parq][id#8]
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:174)
> at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
> at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
> at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Note this happens only under the following conditions:
> # executor memory >= 32GB (doesn't fail with up to 31 GB)
> # the ID in the q dataframe has exactly 24 chars (doesn't fail with fewer or 
> more than 24 chars)
> # q is read from parquet



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-14616) TreeNodeException running Q44 and 58 on Parquet tables

2016-04-13 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14616:
---
Affects Version/s: (was: 1.5.1)
   2.0.0

> TreeNodeException running Q44 and 58 on Parquet tables
> --
>
> Key: SPARK-14616
> URL: https://issues.apache.org/jira/browse/SPARK-14616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: spark 1.5.1 (official binary distribution) running on 
> hadoop yarn 2.6 with parquet 1.5.0 (both from cdh5.4.8)
>Reporter: JESSE CHEN
>
> {code:title=/tmp/bug.py}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext, Row
> sc = SparkContext()
> sqlc = SQLContext(sc)
> R = Row('id', 'foo')
> r = sqlc.createDataFrame(sc.parallelize([R('abc', 'foo')]))
> q = sqlc.createDataFrame(sc.parallelize([R('a' * 24, 'bar')]))  # 24-char id; original literal lost in formatting
> q.write.parquet('/tmp/1.parq')
> q = sqlc.read.parquet('/tmp/1.parq')
> j = r.join(q, r.id == q.id)
> print j.count()
> {code}
> {noformat}
> [user@sandbox test]$ spark-submit --executor-memory=32g /tmp/bug.py
> [user@sandbox test]$ hadoop fs -rmr /tmp/1.parq
> {noformat}
> {noformat}
> 15/11/04 04:28:38 INFO codegen.GenerateUnsafeProjection: Code generated in 
> 119.90324 ms
> Traceback (most recent call last):
>   File "/tmp/bug.py", line 13, in 
> print j.count()
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 
> 268, in count
>   File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", 
> line 538, in __call__
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, 
> in deco
>   File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", 
> line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o148.count.
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, 
> tree:
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[count#10L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#13L])
>TungstenProject
> BroadcastHashJoin [id#0], [id#8], BuildRight
>  TungstenProject [id#0]
>   Scan PhysicalRDD[id#0,foo#1]
>  ConvertToUnsafe
>   Scan ParquetRelation[hdfs:///tmp/1.parq][id#8]
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:174)
> at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
> at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
> at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Note this happens only under the following conditions:
> # executor memory >= 32GB (doesn't fail with up to 31 GB)
> # the ID in the q dataframe has exactly 24 chars (doesn't fail with fewer or 
> more than 24 chars)
> # q is read from parquet



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---

[jira] [Created] (SPARK-14616) TreeNodeException running Q44 and 58 on Parquet tables

2016-04-13 Thread JESSE CHEN (JIRA)
JESSE CHEN created SPARK-14616:
--

 Summary: TreeNodeException running Q44 and 58 on Parquet tables
 Key: SPARK-14616
 URL: https://issues.apache.org/jira/browse/SPARK-14616
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
 Environment: spark 1.5.1 (official binary distribution) running on 
hadoop yarn 2.6 with parquet 1.5.0 (both from cdh5.4.8)
Reporter: JESSE CHEN


{code:title=/tmp/bug.py}
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext()
sqlc = SQLContext(sc)

R = Row('id', 'foo')
r = sqlc.createDataFrame(sc.parallelize([R('abc', 'foo')]))
q = sqlc.createDataFrame(sc.parallelize([R('a' * 24, 'bar')]))  # 24-char id; original literal lost in formatting
q.write.parquet('/tmp/1.parq')
q = sqlc.read.parquet('/tmp/1.parq')
j = r.join(q, r.id == q.id)
print j.count()
{code}

{noformat}
[user@sandbox test]$ spark-submit --executor-memory=32g /tmp/bug.py
[user@sandbox test]$ hadoop fs -rmr /tmp/1.parq
{noformat}

{noformat}
15/11/04 04:28:38 INFO codegen.GenerateUnsafeProjection: Code generated in 
119.90324 ms
Traceback (most recent call last):
  File "/tmp/bug.py", line 13, in 
print j.count()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 
268, in count
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", 
line 538, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, 
in deco
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 
300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o148.count.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
output=[count#10L])
 TungstenExchange SinglePartition
  TungstenAggregate(key=[], 
functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#13L])
   TungstenProject
BroadcastHashJoin [id#0], [id#8], BuildRight
 TungstenProject [id#0]
  Scan PhysicalRDD[id#0,foo#1]
 ConvertToUnsafe
  Scan ParquetRelation[hdfs:///tmp/1.parq][id#8]

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:69)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:174)
at 
org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
at 
org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at 
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
{noformat}

Note this happens only under the following conditions:
# executor memory >= 32GB (doesn't fail with up to 31 GB; see the control run 
sketched below)
# the ID in the q dataframe has exactly 24 chars (doesn't fail with fewer or 
more than 24 chars)
# q is read from parquet
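
Condition 1 suggests a simple control experiment. A sketch of that run
(hypothetical invocation: the same script with executor memory capped at 31 GB,
which per the conditions above should pass):
{noformat}
[user@sandbox test]$ spark-submit --executor-memory=31g /tmp/bug.py
{noformat}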



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13860) TPCDS query 39 returns wrong results compared to TPC official result set

2016-04-11 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235649#comment-15235649
 ] 

JESSE CHEN commented on SPARK-13860:


[~tsuresh]  Could you advise on the correct course of action here, please?

> TPCDS query 39 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13860
> URL: https://issues.apache.org/jira/browse/SPARK-13860
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 39 returns wrong results compared 
> to the official result set. This is at 1GB SF (validation run).
> q39a - 3 extra rows in SparkSQL output (e.g. 
> [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733]); q39b 
> - 3 extra rows in SparkSQL output (e.g. 
> [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733])
> Actual results 39a:
> {noformat}
> [1,265,1,324.75,1.2438391781531353,1,265,2,329.0,1.0151581328149208]
> [1,363,1,499.5,1.031941572270649,1,363,2,321.0,1.1411766752007977]
> [1,679,1,373.75,1.0955498064867504,1,679,2,417.5,1.042970994259454]
> [1,695,1,450.75,1.0835888283564505,1,695,2,368.75,1.1356494125569416]
> [1,789,1,357.25,1.03450938027956,1,789,2,410.0,1.0284221852702604]
> [1,815,1,216.5,1.1702270938111008,1,815,2,150.5,1.3057281471249382]
> [1,827,1,271.75,1.1046890134130438,1,827,2,424.75,1.1653198631238286]
> [1,1041,1,382.5,1.284808399803008,1,1041,2,424.75,1.000577271456812]
> [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733]
> [1,1569,1,212.0,1.630213519639535,1,1569,2,239.25,1.2641513267800557]
> [1,1623,1,338.25,1.1285483279713715,1,1623,2,261.3,1.2717809002195564]
> [1,2581,1,448.5,1.060429041250449,1,2581,2,476.25,1.0362984739390064]
> [1,2705,1,246.25,1.0120308357959693,1,2705,2,294.7,1.0742134101583702]
> [1,3131,1,393.75,1.0037613982687346,1,3131,2,480.5,1.0669144981482768]
> [1,3291,1,374.5,1.195189833087008,1,3291,2,265.25,1.572972106948466]
> [1,3687,1,279.75,1.4260909081999698,1,3687,2,157.25,1.4534340882531784]
> [1,4955,1,495.25,1.0318296151625301,1,4955,2,322.5,1.1693842343776149]
> [1,5627,1,282.75,1.5657032366359889,1,5627,2,297.5,1.2084286841430678]
> [1,7017,1,175.5,1.0427454215644427,1,7017,2,321.3,1.0183356932936254]
> [1,7317,1,366.3,1.025466403613547,1,7317,2,378.0,1.2172513189920555]
> [1,7569,1,430.5,1.0874396852180854,1,7569,2,360.25,1.047005559314515]
> [1,7999,1,166.25,1.7924231710846223,1,7999,2,375.3,1.008092263550718]
> [1,8319,1,306.75,1.1615378040478215,1,8319,2,276.0,1.1420996385609428]
> [1,8443,1,327.75,1.256718374192724,1,8443,2,332.5,1.0044167259988928]
> [1,8583,1,319.5,1.024108893111539,1,8583,2,310.25,1.2358813775861328]
> [1,8591,1,398.0,1.1478168692042447,1,8591,2,355.75,1.0024472149348966]
> [1,8611,1,300.5,1.5191545184147954,1,8611,2,243.75,1.2342122780960432]
> [1,9081,1,367.0,1.0878932141280895,1,9081,2,435.0,1.0330530776324107]
> [1,9357,1,351.7,1.1902922622025887,1,9357,2,427.0,1.0438583026358363]
> [1,9449,1,406.25,1.0183183104803557,1,9449,2,175.0,1.0544779796296408]
> [1,9713,1,242.5,1.1035044355064203,1,9713,2,393.0,1.208474608738988]
> [1,9809,1,479.0,1.0189602512117633,1,9809,2,317.5,1.0614142074924882]
> [1,9993,1,417.75,1.0099832672435247,1,9993,2,204.5,1.552870745350107]
> [1,10127,1,239.75,1.0561770587198123,1,10127,2,359.25,1.1857980403742183]
> [1,11159,1,407.25,1.0785507154337637,1,11159,2,250.0,1.334757905639321]
> [1,11277,1,211.25,1.2615858275316627,1,11277,2,330.75,1.0808767951625093]
> [1,11937,1,344.5,1.085804026843784,1,11937,2,200.34,1.0638527063883725]
> [1,12373,1,387.75,1.1014904822941258,1,12373,2,306.0,1.0761744390394028]
> [1,12471,1,365.25,1.0607570183728479,1,12471,2,327.25,1.0547560580567852]
> [1,12625,1,279.0,1.3016560542373208,1,12625,2,443.25,1.0604958838068959]
> [1,12751,1,280.75,1.10833057888089,1,12751,2,369.3,1.3416504398884601]
> [1,12779,1,331.0,1.041690207320035,1,12779,2,359.0,1.028978056175258]
> [1,13077,1,367.7,1.345523904195734,1,13077,2,358.7,1.5132429058096555]
> [1,13191,1,260.25,1.063569632291568,1,13191,2,405.0,1.0197999172180061]
> [1,13561,1,335.25,1.2609616961776389,1,13561,2,240.0,1.0513604502245155]
> [1,13935,1,311.75,1.0399289695412326,1,13935,2,275.0,1.0367527180321774]
> [1,14687,1,358.0,1.4369356919381713,1,14687,2,187.0,1.5493631531474956]
> [1,14719,1,209.0,1.0411509639707628,1,14719,2,489.0,1.376616882800804]
> [1,15345,1,148.5,1.5295784035794024,1,15345,2,246.5,1.5087987747231526]
> [1,15427,1,482.75,1.0124238928335043,1,15427,2,333.25,1.2724770126308678]
> [1,15647,1,201.66,1.2857931876095743
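
The giveaway in the extra rows is the NaN in the fifth column, which in q39 is
the coefficient of variation (stdev/mean) over inventory quantities. A group
that effectively contributes a single sample makes the sample standard
deviation 0/0, which Spark 1.6 surfaces as NaN. A hedged probe for such groups
(a sketch only, using TPC-DS column names; the isnan() filter and the grouping
are assumptions, not the full q39):

{noformat}
-- hypothetical check: which warehouse/item/month groups produce a NaN cov
select w_warehouse_sk, i_item_sk, d_moy,
       stddev_samp(inv_quantity_on_hand) / avg(inv_quantity_on_hand) as cov
from inventory
join warehouse on inv_warehouse_sk = w_warehouse_sk
join item on inv_item_sk = i_item_sk
join date_dim on inv_date_sk = d_date_sk
group by w_warehouse_sk, i_item_sk, d_moy
having isnan(stddev_samp(inv_quantity_on_hand) / avg(inv_quantity_on_hand))
{noformat}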

[jira] [Closed] (SPARK-13307) TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1

2016-04-11 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN closed SPARK-13307.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Thanks.

> TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
> -
>
> Key: SPARK-13307
> URL: https://issues.apache.org/jira/browse/SPARK-13307
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
> Fix For: 2.0.0
>
>
> The majority of the TPCDS queries ran faster in 1.6.0 than in 1.4.1, on
> average about 9% faster. A few degraded, and one that is definitely outside
> the error margin is query 66.
> Query 66 in 1.4.1: 699 seconds
> Query 66 in 1.6.0: 918 seconds
> About 30% worse.
> I collected the physical plans from both versions; the drastic difference may
> come partially from using Tungsten in 1.6, but is anything else at play here?
> Please see plans here:
> https://ibm.box.com/spark-sql-q66-debug-160plan
> https://ibm.box.com/spark-sql-q66-debug-141plan
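
For anyone re-running the comparison: plans like the ones linked above can be
captured by prefixing the statement with EXPLAIN in the spark-sql shell on each
version and diffing the two outputs (a sketch; the query body is elided):

{noformat}
-- 'explain extended' prints the logical and physical plans instead of
-- executing; substitute the full TPC-DS query 66 text for the placeholder
explain extended
select ... ;  -- body of query 66 goes here
{noformat}

For reference, 918/699 is roughly a 31% slowdown, consistent with the ~30%
figure above.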



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13307) TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1

2016-04-11 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235635#comment-15235635
 ] 

JESSE CHEN commented on SPARK-13307:


Performance is back on track in Spark 2.0. Closing this.

> TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
> -
>
> Key: SPARK-13307
> URL: https://issues.apache.org/jira/browse/SPARK-13307
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>
> The majority of the TPCDS queries ran faster in 1.6.0 than in 1.4.1, on
> average about 9% faster. A few degraded, and one that is definitely outside
> the error margin is query 66.
> Query 66 in 1.4.1: 699 seconds
> Query 66 in 1.6.0: 918 seconds
> About 30% worse.
> I collected the physical plans from both versions; the drastic difference may
> come partially from using Tungsten in 1.6, but is anything else at play here?
> Please see plans here:
> https://ibm.box.com/spark-sql-q66-debug-160plan
> https://ibm.box.com/spark-sql-q66-debug-141plan



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang

2016-04-01 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN closed SPARK-14318.
--
Resolution: Not A Problem

> TPCDS query 14 causes Spark SQL to hang
> ---
>
> Key: SPARK-14318
> URL: https://issues.apache.org/jira/browse/SPARK-14318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: JESSE CHEN
>  Labels: hangs
> Attachments: threaddump-1459461915668.tdump
>
>
> TPCDS Q14 parses successfully, and plans are created successfully. Spark
> tries to run it (I used only a 1GB text file) but "hangs": tasks are
> extremely slow to process AND all CPUs are used 100% by the executor JVMs.
> It is very easy to reproduce:
> 1. Use the spark-sql CLI to run query 14 (TPCDS) against a 1GB text-file
> database (assuming you know how to generate the CSV data). My command is
> like this:
> {noformat}
> /TestAutomation/downloads/spark-master/bin/spark-sql  --driver-memory 10g 
> --verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 
> --executor-memory 8g --num-executors 4 --executor-cores 4 --conf 
> spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f > q14.out
> {noformat}
> The Spark console output:
> {noformat}
> 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Starting task 26.0 in stage 
> 17.0 (TID 65, bigaperf138.svl.ibm.com, partition 26,RACK_LOCAL, 4515 bytes)
> 16/03/31 15:45:37 INFO cluster.YarnClientSchedulerBackend: Launching task 65 
> on executor id: 4 hostname: bigaperf138.svl.ibm.com.
> 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Finished task 23.0 in stage 
> 17.0 (TID 62) in 829687 ms on bigaperf138.svl.ibm.com (15/200)
> 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Starting task 27.0 in stage 
> 17.0 (TID 66, bigaperf138.svl.ibm.com, partition 27,RACK_LOCAL, 4515 bytes)
> 16/03/31 15:45:52 INFO cluster.YarnClientSchedulerBackend: Launching task 66 
> on executor id: 4 hostname: bigaperf138.svl.ibm.com.
> 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Finished task 26.0 in stage 
> 17.0 (TID 65) in 15505 ms on bigaperf138.svl.ibm.com (16/200)
> 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Starting task 28.0 in stage 
> 17.0 (TID 67, bigaperf138.svl.ibm.com, partition 28,RACK_LOCAL, 4515 bytes)
> 16/03/31 15:46:17 INFO cluster.YarnClientSchedulerBackend: Launching task 67 
> on executor id: 4 hostname: bigaperf138.svl.ibm.com.
> 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Finished task 27.0 in stage 
> 17.0 (TID 66) in 24929 ms on bigaperf138.svl.ibm.com (17/200)
> 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Starting task 29.0 in stage 
> 17.0 (TID 68, bigaperf137.svl.ibm.com, partition 29,NODE_LOCAL, 4515 bytes)
> 16/03/31 15:51:53 INFO cluster.YarnClientSchedulerBackend: Launching task 68 
> on executor id: 2 hostname: bigaperf137.svl.ibm.com.
> 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 
> 17.0 (TID 47) in 3775585 ms on bigaperf137.svl.ibm.com (18/200)
> {noformat}
> Notice that time durations between tasks are unusually long: 2~5 minutes.
> In the Linux 'perf' tool, the two top CPU consumers are:
> 86.48%  java  [unknown]
> 12.41%  libjvm.so
> Using the Java hotspot profiling tools, I am able to show what the hottest
> methods are (top 5):
> {noformat}
> org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten()  46.845276  9,654,179 ms (46.8%)  9,654,179 ms  9,654,179 ms  9,654,179 ms
> org.apache.spark.unsafe.Platform.copyMemory()  18.631157  3,848,442 ms (18.6%)  3,848,442 ms  3,848,442 ms  3,848,442 ms
> org.apache.spark.util.collection.CompactBuffer.$plus$eq()  6.8570185  1,418,411 ms (6.9%)  1,418,411 ms  1,517,960 ms  1,517,960 ms
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeValue()  4.6126328  955,495 ms (4.6%)  955,495 ms  2,153,910 ms  2,153,910 ms
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write()  4.581077  949,930 ms (4.6%)  949,930 ms  19,967,510 ms  19,967,510 ms
> {noformat}
> So as you can see, the test has been running for 1.5 hours...with 46% CPU 
> spent in the 
> org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() method. 
> The stacks for top two are:
> {noformat}
> Marshalling
> I
> java/io/DataOutputStream.writeInt() line 197
> org.apache.spark.sql
> I
> org/apache/spark/sql/execution/UnsafeRowSerializerInstance$$anon$2.writeValue() line 60
> org.apache.spark.storage
> I
> org/apache/spark/storage/DiskBlockObjectWriter.write() line 185
> org.apache.spark.shuffle
> I
> org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.write() line 150
> org.apache.spark.sc

[jira] [Commented] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang

2016-04-01 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221943#comment-15221943
 ] 

JESSE CHEN commented on SPARK-14318:


INTERSECT should be used here instead of the join rewrite. Investigating further. 
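
For context: the TPC-DS specification writes the cross_items block of q14 with
INTERSECT, while the variant under test (posted in full below) rewrites it as a
three-way join. A minimal sketch of the spec form, not the full query:

{noformat}
select iss.i_brand_id, iss.i_class_id, iss.i_category_id
  from store_sales, item iss, date_dim d1
 where ss_item_sk = iss.i_item_sk and ss_sold_date_sk = d1.d_date_sk
   and d1.d_year between 1999 and 1999 + 2
intersect
select ics.i_brand_id, ics.i_class_id, ics.i_category_id
  from catalog_sales, item ics, date_dim d2
 where cs_item_sk = ics.i_item_sk and cs_sold_date_sk = d2.d_date_sk
   and d2.d_year between 1999 and 1999 + 2
intersect
select iws.i_brand_id, iws.i_class_id, iws.i_category_id
  from web_sales, item iws, date_dim d3
 where ws_item_sk = iws.i_item_sk and ws_sold_date_sk = d3.d_date_sk
   and d3.d_year between 1999 and 1999 + 2
{noformat}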

> TPCDS query 14 causes Spark SQL to hang
> ---
>
> Key: SPARK-14318
> URL: https://issues.apache.org/jira/browse/SPARK-14318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: JESSE CHEN
>  Labels: hangs
> Attachments: threaddump-1459461915668.tdump
>
>
> TPCDS Q14 parses successfully, and plans are created successfully. Spark
> tries to run it (I used only a 1GB text file) but "hangs": tasks are
> extremely slow to process AND all CPUs are used 100% by the executor JVMs.
> It is very easy to reproduce:
> 1. Use the spark-sql CLI to run query 14 (TPCDS) against a 1GB text-file
> database (assuming you know how to generate the CSV data). My command is
> like this:
> {noformat}
> /TestAutomation/downloads/spark-master/bin/spark-sql  --driver-memory 10g 
> --verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 
> --executor-memory 8g --num-executors 4 --executor-cores 4 --conf 
> spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f > q14.out
> {noformat}
> The Spark console output:
> {noformat}
> 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Starting task 26.0 in stage 
> 17.0 (TID 65, bigaperf138.svl.ibm.com, partition 26,RACK_LOCAL, 4515 bytes)
> 16/03/31 15:45:37 INFO cluster.YarnClientSchedulerBackend: Launching task 65 
> on executor id: 4 hostname: bigaperf138.svl.ibm.com.
> 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Finished task 23.0 in stage 
> 17.0 (TID 62) in 829687 ms on bigaperf138.svl.ibm.com (15/200)
> 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Starting task 27.0 in stage 
> 17.0 (TID 66, bigaperf138.svl.ibm.com, partition 27,RACK_LOCAL, 4515 bytes)
> 16/03/31 15:45:52 INFO cluster.YarnClientSchedulerBackend: Launching task 66 
> on executor id: 4 hostname: bigaperf138.svl.ibm.com.
> 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Finished task 26.0 in stage 
> 17.0 (TID 65) in 15505 ms on bigaperf138.svl.ibm.com (16/200)
> 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Starting task 28.0 in stage 
> 17.0 (TID 67, bigaperf138.svl.ibm.com, partition 28,RACK_LOCAL, 4515 bytes)
> 16/03/31 15:46:17 INFO cluster.YarnClientSchedulerBackend: Launching task 67 
> on executor id: 4 hostname: bigaperf138.svl.ibm.com.
> 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Finished task 27.0 in stage 
> 17.0 (TID 66) in 24929 ms on bigaperf138.svl.ibm.com (17/200)
> 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Starting task 29.0 in stage 
> 17.0 (TID 68, bigaperf137.svl.ibm.com, partition 29,NODE_LOCAL, 4515 bytes)
> 16/03/31 15:51:53 INFO cluster.YarnClientSchedulerBackend: Launching task 68 
> on executor id: 2 hostname: bigaperf137.svl.ibm.com.
> 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 
> 17.0 (TID 47) in 3775585 ms on bigaperf137.svl.ibm.com (18/200)
> {noformat}
> Notice that time durations between tasks are unusually long: 2~5 minutes.
> In the Linux 'perf' tool, the two top CPU consumers are:
> 86.48%  java  [unknown]
> 12.41%  libjvm.so
> Using the Java hotspot profiling tools, I am able to show what the hottest
> methods are (top 5):
> {noformat}
> org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten()  46.845276  9,654,179 ms (46.8%)  9,654,179 ms  9,654,179 ms  9,654,179 ms
> org.apache.spark.unsafe.Platform.copyMemory()  18.631157  3,848,442 ms (18.6%)  3,848,442 ms  3,848,442 ms  3,848,442 ms
> org.apache.spark.util.collection.CompactBuffer.$plus$eq()  6.8570185  1,418,411 ms (6.9%)  1,418,411 ms  1,517,960 ms  1,517,960 ms
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeValue()  4.6126328  955,495 ms (4.6%)  955,495 ms  2,153,910 ms  2,153,910 ms
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write()  4.581077  949,930 ms (4.6%)  949,930 ms  19,967,510 ms  19,967,510 ms
> {noformat}
> So as you can see, the test has been running for 1.5 hours...with 46% CPU 
> spent in the 
> org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() method. 
> The stacks for top two are:
> {noformat}
> Marshalling
> I
> java/io/DataOutputStream.writeInt() line 197
> org.apache.spark.sql
> I
> org/apache/spark/sql/execution/UnsafeRowSerializerInstance$$anon$2.writeValue() line 60
> org.apache.spark.storage
> I
> org/apache/spark/storage/DiskBlockObjectWriter.write() line 185
> org.apache.spark.shuffle
> I
> org/apache/spark/shu

[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang

2016-03-31 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14318:
---
Attachment: threaddump-1459461915668.tdump

Here is the thread dump taken during the high CPU usage on the executor.

> TPCDS query 14 causes Spark SQL to hang
> ---
>
> Key: SPARK-14318
> URL: https://issues.apache.org/jira/browse/SPARK-14318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: JESSE CHEN
>  Labels: hangs
> Attachments: threaddump-1459461915668.tdump
>
>
> TPCDS Q14 parses successfully, and plans are created successfully. Spark
> tries to run it (I used only a 1GB text file) but "hangs": tasks are
> extremely slow to process AND all CPUs are used 100% by the executor JVMs.
> It is very easy to reproduce:
> 1. Use the spark-sql CLI to run query 14 (TPCDS) against a 1GB text-file
> database (assuming you know how to generate the CSV data). My command is
> like this:
> {noformat}
> /TestAutomation/downloads/spark-master/bin/spark-sql  --driver-memory 10g 
> --verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 
> --executor-memory 8g --num-executors 4 --executor-cores 4 --conf 
> spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f > q14.out
> {noformat}
> The Spark console output:
> {noformat}
> 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Starting task 26.0 in stage 
> 17.0 (TID 65, bigaperf138.svl.ibm.com, partition 26,RACK_LOCAL, 4515 bytes)
> 16/03/31 15:45:37 INFO cluster.YarnClientSchedulerBackend: Launching task 65 
> on executor id: 4 hostname: bigaperf138.svl.ibm.com.
> 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Finished task 23.0 in stage 
> 17.0 (TID 62) in 829687 ms on bigaperf138.svl.ibm.com (15/200)
> 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Starting task 27.0 in stage 
> 17.0 (TID 66, bigaperf138.svl.ibm.com, partition 27,RACK_LOCAL, 4515 bytes)
> 16/03/31 15:45:52 INFO cluster.YarnClientSchedulerBackend: Launching task 66 
> on executor id: 4 hostname: bigaperf138.svl.ibm.com.
> 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Finished task 26.0 in stage 
> 17.0 (TID 65) in 15505 ms on bigaperf138.svl.ibm.com (16/200)
> 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Starting task 28.0 in stage 
> 17.0 (TID 67, bigaperf138.svl.ibm.com, partition 28,RACK_LOCAL, 4515 bytes)
> 16/03/31 15:46:17 INFO cluster.YarnClientSchedulerBackend: Launching task 67 
> on executor id: 4 hostname: bigaperf138.svl.ibm.com.
> 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Finished task 27.0 in stage 
> 17.0 (TID 66) in 24929 ms on bigaperf138.svl.ibm.com (17/200)
> 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Starting task 29.0 in stage 
> 17.0 (TID 68, bigaperf137.svl.ibm.com, partition 29,NODE_LOCAL, 4515 bytes)
> 16/03/31 15:51:53 INFO cluster.YarnClientSchedulerBackend: Launching task 68 
> on executor id: 2 hostname: bigaperf137.svl.ibm.com.
> 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 
> 17.0 (TID 47) in 3775585 ms on bigaperf137.svl.ibm.com (18/200)
> {noformat}
> Notice that time durations between tasks are unusually long: 2~5 minutes.
> In the Linux 'perf' tool, the two top CPU consumers are:
> 86.48%  java  [unknown]
> 12.41%  libjvm.so
> Using the Java hotspot profiling tools, I am able to show what the hottest
> methods are (top 5):
> {noformat}
> org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten()  46.845276  9,654,179 ms (46.8%)  9,654,179 ms  9,654,179 ms  9,654,179 ms
> org.apache.spark.unsafe.Platform.copyMemory()  18.631157  3,848,442 ms (18.6%)  3,848,442 ms  3,848,442 ms  3,848,442 ms
> org.apache.spark.util.collection.CompactBuffer.$plus$eq()  6.8570185  1,418,411 ms (6.9%)  1,418,411 ms  1,517,960 ms  1,517,960 ms
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeValue()  4.6126328  955,495 ms (4.6%)  955,495 ms  2,153,910 ms  2,153,910 ms
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write()  4.581077  949,930 ms (4.6%)  949,930 ms  19,967,510 ms  19,967,510 ms
> {noformat}
> So as you can see, the test has been running for 1.5 hours...with 46% CPU 
> spent in the 
> org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() method. 
> The stacks for top two are:
> {noformat}
> Marshalling
> I
> java/io/DataOutputStream.writeInt() line 197
> org.apache.spark.sql
> I
> org/apache/spark/sql/execution/UnsafeRowSerializerInstance$$anon$2.writeValue() line 60
> org.apache.spark.storage
> I
> org/apache/spark/storage/DiskBlockObjectWriter.write() line 185
> org.apache.spark.shuffle
> I
> org/apa

[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang

2016-03-31 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14318:
---
Description: 
TPCDS Q14 parses successfully, and plans are created successfully. Spark tries
to run it (I used only a 1GB text file) but "hangs": tasks are extremely slow to
process AND all CPUs are used 100% by the executor JVMs.

It is very easy to reproduce:
1. Use the spark-sql CLI to run query 14 (TPCDS) against a 1GB text-file
database (assuming you know how to generate the CSV data). My command is like
this:

{noformat}
/TestAutomation/downloads/spark-master/bin/spark-sql  --driver-memory 10g 
--verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 
--executor-memory 8g --num-executors 4 --executor-cores 4 --conf 
spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f > q14.out
{noformat}

The Spark console output:
{noformat}
16/03/31 15:45:37 INFO scheduler.TaskSetManager: Starting task 26.0 in stage 
17.0 (TID 65, bigaperf138.svl.ibm.com, partition 26,RACK_LOCAL, 4515 bytes)
16/03/31 15:45:37 INFO cluster.YarnClientSchedulerBackend: Launching task 65 on 
executor id: 4 hostname: bigaperf138.svl.ibm.com.
16/03/31 15:45:37 INFO scheduler.TaskSetManager: Finished task 23.0 in stage 
17.0 (TID 62) in 829687 ms on bigaperf138.svl.ibm.com (15/200)
16/03/31 15:45:52 INFO scheduler.TaskSetManager: Starting task 27.0 in stage 
17.0 (TID 66, bigaperf138.svl.ibm.com, partition 27,RACK_LOCAL, 4515 bytes)
16/03/31 15:45:52 INFO cluster.YarnClientSchedulerBackend: Launching task 66 on 
executor id: 4 hostname: bigaperf138.svl.ibm.com.
16/03/31 15:45:52 INFO scheduler.TaskSetManager: Finished task 26.0 in stage 
17.0 (TID 65) in 15505 ms on bigaperf138.svl.ibm.com (16/200)
16/03/31 15:46:17 INFO scheduler.TaskSetManager: Starting task 28.0 in stage 
17.0 (TID 67, bigaperf138.svl.ibm.com, partition 28,RACK_LOCAL, 4515 bytes)
16/03/31 15:46:17 INFO cluster.YarnClientSchedulerBackend: Launching task 67 on 
executor id: 4 hostname: bigaperf138.svl.ibm.com.
16/03/31 15:46:17 INFO scheduler.TaskSetManager: Finished task 27.0 in stage 
17.0 (TID 66) in 24929 ms on bigaperf138.svl.ibm.com (17/200)
16/03/31 15:51:53 INFO scheduler.TaskSetManager: Starting task 29.0 in stage 
17.0 (TID 68, bigaperf137.svl.ibm.com, partition 29,NODE_LOCAL, 4515 bytes)
16/03/31 15:51:53 INFO cluster.YarnClientSchedulerBackend: Launching task 68 on 
executor id: 2 hostname: bigaperf137.svl.ibm.com.
16/03/31 15:51:53 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 
17.0 (TID 47) in 3775585 ms on bigaperf137.svl.ibm.com (18/200)
{noformat}

Notice that time durations between tasks are unusually long: 2~5 minutes.
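
As a quick sanity check on scale, the per-task times in the log convert to
minutes as follows (plain arithmetic; runnable in any SQL shell that accepts
constant selects):

{noformat}
-- task 23.0 (TID 62) and task 10.0 (TID 47) from the log above
select 829687 / 60000.0, 3775585 / 60000.0  -- roughly 13.8 and 62.9 minutes
{noformat}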

When looking at the Linux 'perf' tool, the two top CPU consumers are:
86.48%  java  [unknown]
12.41%  libjvm.so

Using the Java hotspot profiling tools, I am able to show what the hottest
methods are (top 5):
{noformat}
org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten()  46.845276  9,654,179 ms (46.8%)  9,654,179 ms  9,654,179 ms  9,654,179 ms
org.apache.spark.unsafe.Platform.copyMemory()  18.631157  3,848,442 ms (18.6%)  3,848,442 ms  3,848,442 ms  3,848,442 ms
org.apache.spark.util.collection.CompactBuffer.$plus$eq()  6.8570185  1,418,411 ms (6.9%)  1,418,411 ms  1,517,960 ms  1,517,960 ms
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeValue()  4.6126328  955,495 ms (4.6%)  955,495 ms  2,153,910 ms  2,153,910 ms
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write()  4.581077  949,930 ms (4.6%)  949,930 ms  19,967,510 ms  19,967,510 ms
{noformat}
So as you can see, the test has been running for 1.5 hours...with 46% CPU spent 
in the 
org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() method. 

The stacks for top two are:
{noformat}
Marshalling
I
java/io/DataOutputStream.writeInt() line 197
org.apache.spark.sql
I
org/apache/spark/sql/execution/UnsafeRowSerializerInstance$$anon$2.writeValue() line 60
org.apache.spark.storage
I
org/apache/spark/storage/DiskBlockObjectWriter.write() line 185
org.apache.spark.shuffle
I
org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.write() line 150
org.apache.spark.scheduler
I
org/apache/spark/scheduler/ShuffleMapTask.runTask() line 78
I
org/apache/spark/scheduler/ShuffleMapTask.runTask() line 46
I
org/apache/spark/scheduler/Task.run() line 82
org.apache.spark.executor
I
org/apache/spark/executor/Executor$TaskRunner.run() line 231
Dispatching Overhead, Standard Library Worker Dispatching
I
java/util/concurrent/ThreadPoolExecutor.runWorker() line 1142
I
java/util/concurrent/ThreadPoolExecutor$Worker.run() line 617
I
java/lang/Thread.run() line 745
{noformat}

and 

{noformat}
org.apache.spark.unsafe
I
org/apache/spark/u
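
The first stack bottoms out in the shuffle write path, and the (n/200) counters
in the task log correspond to the default spark.sql.shuffle.partitions=200. One
hedged experiment for a 1GB-scale run is to shrink that setting before
submitting the query:

{noformat}
-- sketch: fewer shuffle partitions means fewer, larger shuffle files per map
-- task for BypassMergeSortShuffleWriter; 16 is an arbitrary small value
set spark.sql.shuffle.partitions=16;
{noformat}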

[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang

2016-03-31 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14318:
---
Description: 
TPCDS Q14 parses successfully, and plans are created successfully. Spark tries
to run it (I used only a 1GB text file) but "hangs": tasks are extremely slow to
process AND all CPUs are used 100% by the executor JVMs.

It is very easy to reproduce:
1. Use the spark-sql CLI to run query 14 (TPCDS) against a 1GB text-file
database (assuming you know how to generate the CSV data). My command is like
this:

{noformat}
/TestAutomation/downloads/spark-master/bin/spark-sql  --driver-memory 10g 
--verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 
--executor-memory 8g --num-executors 4 --executor-cores 4 --conf 
spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f > q14.out
{noformat}

The Spark console output:
{noformat}
16/03/31 15:45:37 INFO scheduler.TaskSetManager: Starting task 26.0 in stage 
17.0 (TID 65, bigaperf138.svl.ibm.com, partition 26,RACK_LOCAL, 4515 bytes)
16/03/31 15:45:37 INFO cluster.YarnClientSchedulerBackend: Launching task 65 on 
executor id: 4 hostname: bigaperf138.svl.ibm.com.
16/03/31 15:45:37 INFO scheduler.TaskSetManager: Finished task 23.0 in stage 
17.0 (TID 62) in 829687 ms on bigaperf138.svl.ibm.com (15/200)
16/03/31 15:45:52 INFO scheduler.TaskSetManager: Starting task 27.0 in stage 
17.0 (TID 66, bigaperf138.svl.ibm.com, partition 27,RACK_LOCAL, 4515 bytes)
16/03/31 15:45:52 INFO cluster.YarnClientSchedulerBackend: Launching task 66 on 
executor id: 4 hostname: bigaperf138.svl.ibm.com.
16/03/31 15:45:52 INFO scheduler.TaskSetManager: Finished task 26.0 in stage 
17.0 (TID 65) in 15505 ms on bigaperf138.svl.ibm.com (16/200)
16/03/31 15:46:17 INFO scheduler.TaskSetManager: Starting task 28.0 in stage 
17.0 (TID 67, bigaperf138.svl.ibm.com, partition 28,RACK_LOCAL, 4515 bytes)
16/03/31 15:46:17 INFO cluster.YarnClientSchedulerBackend: Launching task 67 on 
executor id: 4 hostname: bigaperf138.svl.ibm.com.
16/03/31 15:46:17 INFO scheduler.TaskSetManager: Finished task 27.0 in stage 
17.0 (TID 66) in 24929 ms on bigaperf138.svl.ibm.com (17/200)
16/03/31 15:51:53 INFO scheduler.TaskSetManager: Starting task 29.0 in stage 
17.0 (TID 68, bigaperf137.svl.ibm.com, partition 29,NODE_LOCAL, 4515 bytes)
16/03/31 15:51:53 INFO cluster.YarnClientSchedulerBackend: Launching task 68 on 
executor id: 2 hostname: bigaperf137.svl.ibm.com.
16/03/31 15:51:53 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 
17.0 (TID 47) in 3775585 ms on bigaperf137.svl.ibm.com (18/200)
{noformat}

Notice that time durations between tasks are unusually long: 2~5 minutes.

When looking at the Linux 'perf' tool, the two top CPU consumers are:
86.48%  java  [unknown]
12.41%  libjvm.so

Using the Java hotspot profiling tools, I am able to show what the hottest
methods are (top 5):
{noformat}
org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten()  46.845276  9,654,179 ms (46.8%)  9,654,179 ms  9,654,179 ms  9,654,179 ms
org.apache.spark.unsafe.Platform.copyMemory()  18.631157  3,848,442 ms (18.6%)  3,848,442 ms  3,848,442 ms  3,848,442 ms
org.apache.spark.util.collection.CompactBuffer.$plus$eq()  6.8570185  1,418,411 ms (6.9%)  1,418,411 ms  1,517,960 ms  1,517,960 ms
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeValue()  4.6126328  955,495 ms (4.6%)  955,495 ms  2,153,910 ms  2,153,910 ms
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write()  4.581077  949,930 ms (4.6%)  949,930 ms  19,967,510 ms  19,967,510 ms
{noformat}
So as you can see, the test has been running for 1.5 hours...with 46% CPU spent 
in the 
org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() method. 

The stacks for top two are:
{noformat}
Marshalling
I
java/io/DataOutputStream.writeInt() line 197
org.apache.spark.sql
I
org/apache/spark/sql/execution/UnsafeRowSerializerInstance$$anon$2.writeValue() line 60
org.apache.spark.storage
I
org/apache/spark/storage/DiskBlockObjectWriter.write() line 185
org.apache.spark.shuffle
I
org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.write() line 150
org.apache.spark.scheduler
I
org/apache/spark/scheduler/ShuffleMapTask.runTask() line 78
I
org/apache/spark/scheduler/ShuffleMapTask.runTask() line 46
I
org/apache/spark/scheduler/Task.run() line 82
org.apache.spark.executor
I
org/apache/spark/executor/Executor$TaskRunner.run() line 231
Dispatching Overhead, Standard Library Worker Dispatching
I
java/util/concurrent/ThreadPoolExecutor.runWorker() line 1142
I
java/util/concurrent/ThreadPoolExecutor$Worker.run() line 617
I
java/lang/Thread.run() line 745
{noformat}

and 

{noformat}
org.apache.spark.unsafe
I
org/apache/spark/unsafe/Pl

[jira] [Commented] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang

2016-03-31 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220864#comment-15220864
 ] 

JESSE CHEN commented on SPARK-14318:


Q14 is as follows:
{noformat}
with  cross_items as
 (select i_item_sk ss_item_sk
 from item
 JOIN
 (select brand_id, class_id, category_id from
 (select iss.i_brand_id brand_id
 ,iss.i_class_id class_id
 ,iss.i_category_id category_id
 from store_sales
 ,item iss
 ,date_dim d1
 where ss_item_sk = iss.i_item_sk
   and ss_sold_date_sk = d1.d_date_sk
   and d1.d_year between 1999 AND 1999 + 2) x1
 JOIN
 (select ics.i_brand_id
 ,ics.i_class_id
 ,ics.i_category_id
 from catalog_sales
 ,item ics
 ,date_dim d2
 where cs_item_sk = ics.i_item_sk
   and cs_sold_date_sk = d2.d_date_sk
   and d2.d_year between 1999 AND 1999 + 2) x2
   ON x1.brand_id = x2.i_brand_id and
  x1.class_id = x2.i_class_id and
  x1.category_id = x2.i_category_id
 JOIN
 (select iws.i_brand_id
 ,iws.i_class_id
 ,iws.i_category_id
 from web_sales
 ,item iws
 ,date_dim d3
 where ws_item_sk = iws.i_item_sk
   and ws_sold_date_sk = d3.d_date_sk
   and d3.d_year between 1999 AND 1999 + 2) x3
   ON x1.brand_id = x3.i_brand_id and
  x1.class_id = x3.i_class_id and
  x1.category_id = x3.i_category_id
 ) x4
 where i_brand_id = x4.brand_id
  and i_class_id = x4.class_id
  and i_category_id = x4.category_id
),
 avg_sales as
 (select avg(quantity*list_price) average_sales
  from (select ss_quantity quantity
 ,ss_list_price list_price
   from store_sales
   ,date_dim
   where ss_sold_date_sk = d_date_sk
 and d_year between 1999 and 1999 + 2
   union all
   select cs_quantity quantity
 ,cs_list_price list_price
   from catalog_sales
   ,date_dim
   where cs_sold_date_sk = d_date_sk
 and d_year between 1999 and 1999 + 2
   union all
   select ws_quantity quantity
 ,ws_list_price list_price
   from web_sales
   ,date_dim
   where ws_sold_date_sk = d_date_sk
 and d_year between 1999 and 1999 + 2) x)
  select  * from
 (select 'store' channel, i_brand_id,i_class_id,i_category_id
,sum(ss1.ss_quantity*ss1.ss_list_price) sales, count(*) number_sales
 from store_sales ss1
 JOIN item ON ss1.ss_item_sk = i_item_sk
 JOIN date_dim dd1 ON ss1.ss_sold_date_sk = dd1.d_date_sk
 JOIN cross_items ON ss1.ss_item_sk = cross_items.ss_item_sk
 JOIN avg_sales
 JOIN date_dim dd2 ON dd1.d_week_seq = dd2.d_week_seq
 where dd2.d_year = 1999 + 1
   and dd2.d_moy = 12
   and dd2.d_dom = 11
 group by average_sales,i_brand_id,i_class_id,i_category_id
 having sum(ss1.ss_quantity*ss1.ss_list_price) > avg_sales.average_sales) 
this_year,
 (select 'store' channel, i_brand_id,i_class_id
,i_category_id, sum(ss1.ss_quantity*ss1.ss_list_price) sales, count(*) 
number_sales
 from store_sales ss1
 JOIN item ON ss1.ss_item_sk = i_item_sk
 JOIN date_dim dd1 ON ss1.ss_sold_date_sk = dd1.d_date_sk
 JOIN cross_items ON ss1.ss_item_sk = cross_items.ss_item_sk
 JOIN avg_sales
 JOIN date_dim dd2 ON dd1.d_week_seq = dd2.d_week_seq
 where dd2.d_year = 1999
   and dd2.d_moy = 12
   and dd2.d_dom = 11
 group by average_sales, i_brand_id,i_class_id,i_category_id
 having sum(ss1.ss_quantity*ss1.ss_list_price) > avg_sales.average_sales) 
last_year
 where this_year.i_brand_id= last_year.i_brand_id
   and this_year.i_class_id = last_year.i_class_id
   and this_year.i_category_id = last_year.i_category_id
 order by this_year.channel, this_year.i_brand_id, this_year.i_class_id, 
this_year.i_category_id
   limit 100
{noformat}
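
Note that "JOIN avg_sales" appears twice above with no ON clause: avg_sales is a
single-row aggregate, so the join is effectively a cross join whose only job is
to make average_sales visible to the HAVING clause. A hedged way to see which
physical strategy the planner picks for such a condition-less join (a toy query,
not the full Q14):

{noformat}
explain
with avg_sales as
  (select avg(ss_quantity * ss_list_price) average_sales from store_sales)
select * from store_sales join avg_sales
{noformat}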



> TPCDS query 14 causes Spark SQL to hang
> ---
>
> Key: SPARK-14318
> URL: https://issues.apache.org/jira/browse/SPARK-14318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: JESSE CHEN
>  Labels: hangs
>
> TPCDS Q14 parses successfully, and plans are created successfully. Spark
> tries to run it (I used only a 1GB text file) but "hangs": tasks are
> extremely slow to process AND all CPUs are used 100% by the executor JVMs.
> It is very easy to reproduce:
> 1. Use the spark-sql CLI to run query 14 (TPCDS) against a 1GB text-file
> database (assuming you know how to generate the CSV data). My command is
> like this:
> {noformat}
> /TestAutomation/downloads/spark-master/bin/spark-sql  --driver-memory 10g 
> --verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 
> --executor-memory 8g --num-executors 4 --executor-cores 4 --conf 
> spark.sql.join.preferSortMergeJoin=true --database hadoopds

[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang

2016-03-31 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14318:
---
Description: 
TPCDS Q14 parses successfully, and plans are created successfully. Spark tries
to run it (I used only a 1GB text file) but "hangs": tasks are extremely slow to
process AND all CPUs are used 100% by the executor JVMs.

It is very easy to reproduce:
1. Use the spark-sql CLI to run query 14 (TPCDS) against a 1GB text-file
database (assuming you know how to generate the CSV data). My command is like
this:

{noformat}
/TestAutomation/downloads/spark-master/bin/spark-sql  --driver-memory 10g 
--verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 
--executor-memory 8g --num-executors 4 --executor-cores 4 --conf 
spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f > q14.out
{noformat}

The Spark console output:
{noformat}
16/03/31 15:45:37 INFO scheduler.TaskSetManager: Starting task 26.0 in stage 
17.0 (TID 65, bigaperf138.svl.ibm.com, partition 26,RACK_LOCAL, 4515 bytes)
16/03/31 15:45:37 INFO cluster.YarnClientSchedulerBackend: Launching task 65 on 
executor id: 4 hostname: bigaperf138.svl.ibm.com.
16/03/31 15:45:37 INFO scheduler.TaskSetManager: Finished task 23.0 in stage 
17.0 (TID 62) in 829687 ms on bigaperf138.svl.ibm.com (15/200)
16/03/31 15:45:52 INFO scheduler.TaskSetManager: Starting task 27.0 in stage 
17.0 (TID 66, bigaperf138.svl.ibm.com, partition 27,RACK_LOCAL, 4515 bytes)
16/03/31 15:45:52 INFO cluster.YarnClientSchedulerBackend: Launching task 66 on 
executor id: 4 hostname: bigaperf138.svl.ibm.com.
16/03/31 15:45:52 INFO scheduler.TaskSetManager: Finished task 26.0 in stage 
17.0 (TID 65) in 15505 ms on bigaperf138.svl.ibm.com (16/200)
16/03/31 15:46:17 INFO scheduler.TaskSetManager: Starting task 28.0 in stage 
17.0 (TID 67, bigaperf138.svl.ibm.com, partition 28,RACK_LOCAL, 4515 bytes)
16/03/31 15:46:17 INFO cluster.YarnClientSchedulerBackend: Launching task 67 on 
executor id: 4 hostname: bigaperf138.svl.ibm.com.
16/03/31 15:46:17 INFO scheduler.TaskSetManager: Finished task 27.0 in stage 
17.0 (TID 66) in 24929 ms on bigaperf138.svl.ibm.com (17/200)
16/03/31 15:51:53 INFO scheduler.TaskSetManager: Starting task 29.0 in stage 
17.0 (TID 68, bigaperf137.svl.ibm.com, partition 29,NODE_LOCAL, 4515 bytes)
16/03/31 15:51:53 INFO cluster.YarnClientSchedulerBackend: Launching task 68 on 
executor id: 2 hostname: bigaperf137.svl.ibm.com.
16/03/31 15:51:53 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 
17.0 (TID 47) in 3775585 ms on bigaperf137.svl.ibm.com (18/200)
{noformat}

Notice that time durations between tasks are unusually long: 2~5 minutes.

When looking at the Linux 'perf' tool, the two top CPU consumers are:
86.48%  java  [unknown]
12.41%  libjvm.so

Using the Java hotspot profiling tools, I am able to show what the hottest
methods are (top 5):
{noformat}
org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten()  46.845276  9,654,179 ms (46.8%)  9,654,179 ms  9,654,179 ms  9,654,179 ms
org.apache.spark.unsafe.Platform.copyMemory()  18.631157  3,848,442 ms (18.6%)  3,848,442 ms  3,848,442 ms  3,848,442 ms
org.apache.spark.util.collection.CompactBuffer.$plus$eq()  6.8570185  1,418,411 ms (6.9%)  1,418,411 ms  1,517,960 ms  1,517,960 ms
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeValue()  4.6126328  955,495 ms (4.6%)  955,495 ms  2,153,910 ms  2,153,910 ms
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write()  4.581077  949,930 ms (4.6%)  949,930 ms  19,967,510 ms  19,967,510 ms
{noformat}
So as you can see, the test has been running for 1.5 hours...with 46% CPU spent 
in the 
org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() method. 

The stacks for top two are:
{noformat}
Marshalling
I
java/io/DataOutputStream.writeInt() line 197
org.apache.spark.sql
I
org/apache/spark/sql/execution/UnsafeRowSerializerInstance$$anon$2.writeValue() line 60
org.apache.spark.storage
I
org/apache/spark/storage/DiskBlockObjectWriter.write() line 185
org.apache.spark.shuffle
I
org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.write() line 150
org.apache.spark.scheduler
I
org/apache/spark/scheduler/ShuffleMapTask.runTask() line 78
I
org/apache/spark/scheduler/ShuffleMapTask.runTask() line 46
I
org/apache/spark/scheduler/Task.run() line 82
org.apache.spark.executor
I
org/apache/spark/executor/Executor$TaskRunner.run() line 231
Dispatching Overhead, Standard Library Worker Dispatching
I
java/util/concurrent/ThreadPoolExecutor.runWorker() line 1142
I
java/util/concurrent/ThreadPoolExecutor$Worker.run() line 617
I
java/lang/Thread.run() line 745
{noformat}

and 

{noformat}
org.apache.spark.unsafe
I
org/apache/spark/unsafe/Pl

[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang

2016-03-31 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14318:
---
Labels: hangs  (was: tpcds-result-mismatch)

> TPCDS query 14 causes Spark SQL to hang
> ---
>
> Key: SPARK-14318
> URL: https://issues.apache.org/jira/browse/SPARK-14318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: JESSE CHEN
>  Labels: hangs
>
> Testing Spark SQL using TPC queries. Query 21 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL is missing at least one row (grep for ABDA); I believe 2
> other rows are missing as well.
> Actual results:
> {noformat}
> [null,AABD,2565,1922]
> [null,AAHD,2956,2052]
> [null,AALA,2042,1793]
> [null,ACGC,2373,1771]
> [null,ACKC,2321,1856]
> [null,ACOB,1504,1397]
> [null,ADKB,1820,2163]
> [null,AEAD,2631,1965]
> [null,AEOC,1659,1798]
> [null,AFAC,1965,1705]
> [null,AFAD,1769,1313]
> [null,AHDE,2700,1985]
> [null,AHHA,1578,1082]
> [null,AIEC,1756,1804]
> [null,AIMC,3603,2951]
> [null,AJAC,2109,1989]
> [null,AJKB,2573,3540]
> [null,ALBE,3458,2992]
> [null,ALCE,1720,1810]
> [null,ALEC,2569,1946]
> [null,ALNB,2552,1750]
> [null,ANFE,2022,2269]
> [null,AOIB,2982,2540]
> [null,APJB,2344,2593]
> [null,BAPD,2182,2787]
> [null,BDCE,2844,2069]
> [null,BDDD,2417,2537]
> [null,BDJA,1584,1666]
> [null,BEOD,2141,2649]
> [null,BFCC,2745,2020]
> [null,BFMB,1642,1364]
> [null,BHPC,1923,1780]
> [null,BIDB,1956,2836]
> [null,BIGB,2023,2344]
> [null,BIJB,1977,2728]
> [null,BJFE,1891,2390]
> [null,BLDE,1983,1797]
> [null,BNID,2485,2324]
> [null,BNLD,2385,2786]
> [null,BOMB,2291,2092]
> [null,CAAA,2233,2560]
> [null,CBCD,1540,2012]
> [null,CBIA,2394,2122]
> [null,CBPB,1790,1661]
> [null,CCMD,2654,2691]
> [null,CDBC,1804,2072]
> [null,CFEA,1941,1567]
> [null,CGFD,2123,2265]
> [null,CHPC,2933,2174]
> [null,CIGD,2618,2399]
> [null,CJCB,2728,2367]
> [null,CJLA,1350,1732]
> [null,CLAE,2578,2329]
> [null,CLGA,1842,1588]
> [null,CLLB,3418,2657]
> [null,CLOB,3115,2560]
> [null,CMAD,1991,2243]
> [null,CMJA,1261,1855]
> [null,CMLA,3288,2753]
> [null,CMPD,1320,1676]
> [null,CNGB,2340,2118]
> [null,CNHD,3519,3348]
> [null,CNPC,2561,1948]
> [null,DCPC,2664,2627]
> [null,DDHA,1313,1926]
> [null,DDND,1109,835]
> [null,DEAA,2141,1847]
> [null,DEJA,3142,2723]
> [null,DFKB,1470,1650]
> [null,DGCC,2113,2331]
> [null,DGFC,2201,2928]
> [null,DHPA,2467,2133]
> [null,DMBA,3085,2087]
> [null,DPAB,3494,3081]
> [null,EAEC,2133,2148]
> [null,EAPA,1560,1275]
> [null,ECGC,2815,3307]
> [null,EDPD,2731,1883]
> [null,EEEC,2024,1902]
> [null,EEMC,2624,2387]
> [null,EFFA,2047,1878]
> [null,EGJA,2403,2633]
> [null,EGMA,2784,2772]
> [null,EGOC,2389,1753]
> [null,EHFD,1940,1420]
> [null,EHLB,2320,2057]
> [null,EHPA,1898,1853]
> [null,EIPB,2930,2326]
> [null,EJAE,2582,1836]
> [null,EJIB,2257,1681]
> [null,EJJA,2791,1941]
> [null,EJJD,3410,2405]
> [null,EJNC,2472,2067]
> [null,EJPD,1219,1229]
> [null,EKEB,2047,1713]
> [null,EMEA,2502,1897]
> [null,EMKC,2362,2042]
> [null,ENAC,2011,1909]
> [null,ENFB,2507,2162]
> [null,ENOD,3371,2709]
> {noformat}
> Expected results:
> {noformat}
> +--+--++---+
> | W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER |
> +--+--++---+
> | Bad cards must make. | AACD |   1889 |  2168 |
> | Bad cards must make. | AAHD |   2739 |  2039 |
> | Bad cards must make. | ABDA |   1717 |  1782 |
> | Bad cards must mak

[jira] [Created] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang

2016-03-31 Thread JESSE CHEN (JIRA)
JESSE CHEN created SPARK-14318:
--

 Summary: TPCDS query 14 causes Spark SQL to hang
 Key: SPARK-14318
 URL: https://issues.apache.org/jira/browse/SPARK-14318
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: JESSE CHEN


Testing Spark SQL using TPC queries. Query 21 returns wrong results compared to 
official result set. This is at 1GB SF (validation run).

SparkSQL is missing at least one row (grep for ABDA); I believe 2
other rows are missing as well.

Actual results:
{noformat}
[null,AABD,2565,1922]
[null,AAHD,2956,2052]
[null,AALA,2042,1793]
[null,ACGC,2373,1771]
[null,ACKC,2321,1856]
[null,ACOB,1504,1397]
[null,ADKB,1820,2163]
[null,AEAD,2631,1965]
[null,AEOC,1659,1798]
[null,AFAC,1965,1705]
[null,AFAD,1769,1313]
[null,AHDE,2700,1985]
[null,AHHA,1578,1082]
[null,AIEC,1756,1804]
[null,AIMC,3603,2951]
[null,AJAC,2109,1989]
[null,AJKB,2573,3540]
[null,ALBE,3458,2992]
[null,ALCE,1720,1810]
[null,ALEC,2569,1946]
[null,ALNB,2552,1750]
[null,ANFE,2022,2269]
[null,AOIB,2982,2540]
[null,APJB,2344,2593]
[null,BAPD,2182,2787]
[null,BDCE,2844,2069]
[null,BDDD,2417,2537]
[null,BDJA,1584,1666]
[null,BEOD,2141,2649]
[null,BFCC,2745,2020]
[null,BFMB,1642,1364]
[null,BHPC,1923,1780]
[null,BIDB,1956,2836]
[null,BIGB,2023,2344]
[null,BIJB,1977,2728]
[null,BJFE,1891,2390]
[null,BLDE,1983,1797]
[null,BNID,2485,2324]
[null,BNLD,2385,2786]
[null,BOMB,2291,2092]
[null,CAAA,2233,2560]
[null,CBCD,1540,2012]
[null,CBIA,2394,2122]
[null,CBPB,1790,1661]
[null,CCMD,2654,2691]
[null,CDBC,1804,2072]
[null,CFEA,1941,1567]
[null,CGFD,2123,2265]
[null,CHPC,2933,2174]
[null,CIGD,2618,2399]
[null,CJCB,2728,2367]
[null,CJLA,1350,1732]
[null,CLAE,2578,2329]
[null,CLGA,1842,1588]
[null,CLLB,3418,2657]
[null,CLOB,3115,2560]
[null,CMAD,1991,2243]
[null,CMJA,1261,1855]
[null,CMLA,3288,2753]
[null,CMPD,1320,1676]
[null,CNGB,2340,2118]
[null,CNHD,3519,3348]
[null,CNPC,2561,1948]
[null,DCPC,2664,2627]
[null,DDHA,1313,1926]
[null,DDND,1109,835]
[null,DEAA,2141,1847]
[null,DEJA,3142,2723]
[null,DFKB,1470,1650]
[null,DGCC,2113,2331]
[null,DGFC,2201,2928]
[null,DHPA,2467,2133]
[null,DMBA,3085,2087]
[null,DPAB,3494,3081]
[null,EAEC,2133,2148]
[null,EAPA,1560,1275]
[null,ECGC,2815,3307]
[null,EDPD,2731,1883]
[null,EEEC,2024,1902]
[null,EEMC,2624,2387]
[null,EFFA,2047,1878]
[null,EGJA,2403,2633]
[null,EGMA,2784,2772]
[null,EGOC,2389,1753]
[null,EHFD,1940,1420]
[null,EHLB,2320,2057]
[null,EHPA,1898,1853]
[null,EIPB,2930,2326]
[null,EJAE,2582,1836]
[null,EJIB,2257,1681]
[null,EJJA,2791,1941]
[null,EJJD,3410,2405]
[null,EJNC,2472,2067]
[null,EJPD,1219,1229]
[null,EKEB,2047,1713]
[null,EMEA,2502,1897]
[null,EMKC,2362,2042]
[null,ENAC,2011,1909]
[null,ENFB,2507,2162]
[null,ENOD,3371,2709]
{noformat}


Expected results:
{noformat}
+--+--++---+
| W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER |
+--+--++---+
| Bad cards must make. | AACD |   1889 |  2168 |
| Bad cards must make. | AAHD |   2739 |  2039 |
| Bad cards must make. | ABDA |   1717 |  1782 |
| Bad cards must make. | ACGC |   2296 |  2276 |
| Bad cards must make. | ACKC |   2443 |  1878 |
| Bad cards must make. | ACOB |   2705 |  2428 |
| Bad cards must make. | ADGB |   2242 |  2759 |
| Bad cards must make. | ADKB |   2138 |  2456 |
| Bad cards must make. | AEAD |   2914 |  2237 |
| Bad cards must make. | AEOC |   1797 |  2073 |
| Bad 

[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang

2016-03-31 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14318:
---
Affects Version/s: 2.0.0

> TPCDS query 14 causes Spark SQL to hang
> ---
>
> Key: SPARK-14318
> URL: https://issues.apache.org/jira/browse/SPARK-14318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: JESSE CHEN
>  Labels: hangs
>
> Testing Spark SQL using TPC queries. Query 21 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL is missing at least one row (grep for ABDA); I believe 2
> other rows are missing as well.
> Actual results:
> {noformat}
> [null,AABD,2565,1922]
> [null,AAHD,2956,2052]
> [null,AALA,2042,1793]
> [null,ACGC,2373,1771]
> [null,ACKC,2321,1856]
> [null,ACOB,1504,1397]
> [null,ADKB,1820,2163]
> [null,AEAD,2631,1965]
> [null,AEOC,1659,1798]
> [null,AFAC,1965,1705]
> [null,AFAD,1769,1313]
> [null,AHDE,2700,1985]
> [null,AHHA,1578,1082]
> [null,AIEC,1756,1804]
> [null,AIMC,3603,2951]
> [null,AJAC,2109,1989]
> [null,AJKB,2573,3540]
> [null,ALBE,3458,2992]
> [null,ALCE,1720,1810]
> [null,ALEC,2569,1946]
> [null,ALNB,2552,1750]
> [null,ANFE,2022,2269]
> [null,AOIB,2982,2540]
> [null,APJB,2344,2593]
> [null,BAPD,2182,2787]
> [null,BDCE,2844,2069]
> [null,BDDD,2417,2537]
> [null,BDJA,1584,1666]
> [null,BEOD,2141,2649]
> [null,BFCC,2745,2020]
> [null,BFMB,1642,1364]
> [null,BHPC,1923,1780]
> [null,BIDB,1956,2836]
> [null,BIGB,2023,2344]
> [null,BIJB,1977,2728]
> [null,BJFE,1891,2390]
> [null,BLDE,1983,1797]
> [null,BNID,2485,2324]
> [null,BNLD,2385,2786]
> [null,BOMB,2291,2092]
> [null,CAAA,2233,2560]
> [null,CBCD,1540,2012]
> [null,CBIA,2394,2122]
> [null,CBPB,1790,1661]
> [null,CCMD,2654,2691]
> [null,CDBC,1804,2072]
> [null,CFEA,1941,1567]
> [null,CGFD,2123,2265]
> [null,CHPC,2933,2174]
> [null,CIGD,2618,2399]
> [null,CJCB,2728,2367]
> [null,CJLA,1350,1732]
> [null,CLAE,2578,2329]
> [null,CLGA,1842,1588]
> [null,CLLB,3418,2657]
> [null,CLOB,3115,2560]
> [null,CMAD,1991,2243]
> [null,CMJA,1261,1855]
> [null,CMLA,3288,2753]
> [null,CMPD,1320,1676]
> [null,CNGB,2340,2118]
> [null,CNHD,3519,3348]
> [null,CNPC,2561,1948]
> [null,DCPC,2664,2627]
> [null,DDHA,1313,1926]
> [null,DDND,1109,835]
> [null,DEAA,2141,1847]
> [null,DEJA,3142,2723]
> [null,DFKB,1470,1650]
> [null,DGCC,2113,2331]
> [null,DGFC,2201,2928]
> [null,DHPA,2467,2133]
> [null,DMBA,3085,2087]
> [null,DPAB,3494,3081]
> [null,EAEC,2133,2148]
> [null,EAPA,1560,1275]
> [null,ECGC,2815,3307]
> [null,EDPD,2731,1883]
> [null,EEEC,2024,1902]
> [null,EEMC,2624,2387]
> [null,EFFA,2047,1878]
> [null,EGJA,2403,2633]
> [null,EGMA,2784,2772]
> [null,EGOC,2389,1753]
> [null,EHFD,1940,1420]
> [null,EHLB,2320,2057]
> [null,EHPA,1898,1853]
> [null,EIPB,2930,2326]
> [null,EJAE,2582,1836]
> [null,EJIB,2257,1681]
> [null,EJJA,2791,1941]
> [null,EJJD,3410,2405]
> [null,EJNC,2472,2067]
> [null,EJPD,1219,1229]
> [null,EKEB,2047,1713]
> [null,EMEA,2502,1897]
> [null,EMKC,2362,2042]
> [null,ENAC,2011,1909]
> [null,ENFB,2507,2162]
> [null,ENOD,3371,2709]
> {noformat}
> Expected results:
> {noformat}
> +--+--++---+
> | W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER |
> +--+--++---+
> | Bad cards must make. | AACD |   1889 |  2168 |
> | Bad cards must make. | AAHD |   2739 |  2039 |
> | Bad cards must make. | ABDA |   1717 |  1782 |
> | Bad cards must make. | ACGCAA

[jira] [Commented] (SPARK-13820) TPC-DS Query 10 fails to compile

2016-03-30 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218937#comment-15218937
 ] 

JESSE CHEN commented on SPARK-13820:


We are able to run 93 of the queries now. We should shoot for all 99, and this
JIRA will fix 2 more :)

> TPC-DS Query 10 fails to compile
> 
>
> Key: SPARK-13820
> URL: https://issues.apache.org/jira/browse/SPARK-13820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS Query 10 fails to compile with the following error.
> Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( 
> TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177)
> Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( 
> TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177)
> The query is pasted here for easy reproduction:
>  select
>   cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   count(*) cnt1,
>   cd_purchase_estimate,
>   count(*) cnt2,
>   cd_credit_rating,
>   count(*) cnt3,
>   cd_dep_count,
>   count(*) cnt4,
>   cd_dep_employed_count,
>   count(*) cnt5,
>   cd_dep_college_count,
>   count(*) cnt6
>  from
>   customer c
>   JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk
>   JOIN customer_demographics ON cd_demo_sk = c.c_current_cdemo_sk
>   LEFT SEMI JOIN (select ss_customer_sk
>   from store_sales
>JOIN date_dim ON ss_sold_date_sk = d_date_sk
>   where
> d_year = 2002 and
> d_moy between 1 and 1+3) ss_wh1 ON c.c_customer_sk = 
> ss_wh1.ss_customer_sk
>  where
>   ca_county in ('Rush County','Toole County','Jefferson County','Dona Ana 
> County','La Porte County') and
>exists (
> select tmp.customer_sk from (
> select ws_bill_customer_sk as customer_sk
> from web_sales,date_dim
> where
>   web_sales.ws_sold_date_sk = date_dim.d_date_sk and
>   d_year = 2002 and
>   d_moy between 1 and 1+3
> UNION ALL
> select cs_ship_customer_sk as customer_sk
> from catalog_sales,date_dim
> where
>   catalog_sales.cs_sold_date_sk = date_dim.d_date_sk and
>   d_year = 2002 and
>   d_moy between 1 and 1+3
>   ) tmp where c.c_customer_sk = tmp.customer_sk
> )
>  group by cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   cd_purchase_estimate,
>   cd_credit_rating,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>  order by cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   cd_purchase_estimate,
>   cd_credit_rating,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>   limit 100;
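
The parse failure is on the 'exists (...)' predicate, which the Hive-derived
parser in 1.6.x could not handle. A common workaround (a sketch only, mirroring
the LEFT SEMI JOIN already used for store_sales in this same query) is to
rewrite the EXISTS block as another LEFT SEMI JOIN:

{noformat}
  LEFT SEMI JOIN (select ws_bill_customer_sk as customer_sk
  from web_sales, date_dim
  where
    web_sales.ws_sold_date_sk = date_dim.d_date_sk and
    d_year = 2002 and
    d_moy between 1 and 1+3
  UNION ALL
  select cs_ship_customer_sk as customer_sk
  from catalog_sales, date_dim
  where
    catalog_sales.cs_sold_date_sk = date_dim.d_date_sk and
    d_year = 2002 and
    d_moy between 1 and 1+3
  ) tmp ON c.c_customer_sk = tmp.customer_sk
{noformat}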



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13862) TPCDS query 49 returns wrong results compared to TPC official result set

2016-03-29 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216500#comment-15216500
 ] 

JESSE CHEN commented on SPARK-13862:


The PR fixed the issue. The new result is ordered correctly.

{noformat}
catalog 17543   0.57142857142857142857  1   1
catalog 14513   0.63541667              2   2
catalog 12577   0.65591397849462365591  3   3
catalog 3411    0.71641791044776119403  4   4
catalog 361     0.74647887323943661972  5   5
catalog 8189    0.74698795180722891566  6   6
catalog 8929    0.7625                  7   7
catalog 14869   0.7717391304347826087   8   8
catalog 9295    0.77894736842105263158  9   9
catalog 16215   0.79069767441860465116  10  10
store   9471    0.775                   1   1
store   9797    0.8                     2   2
store   12641   0.81609195402298850575  3   3
store   15839   0.81632653061224489796  4   4
store   1171    0.82417582417582417582  5   5
store   11589   0.82653061224489795918  6   6
store   6661    0.92207792207792207792  7   7
store   13013   0.94202898550724637681  8   8
store   14925   0.96470588235294117647  9   9
store   9029    1                       10  10
store   4063    1                       10  10
web     7539    0.59                    1   1
web     3337    0.62650602409638554217  2   2
web     15597   0.66197183098591549296  3   3
web     2915    0.69863013698630136986  4   4
web     11933   0.71717171717171717172  5   5
web     3305    0.7375                  6   16
web     483     0.8                     7   6
web     85      0.85714285714285714286  8   7
web     97      0.9036144578313253012   9   8
web     117     0.925                   10  9
web     5299    0.92708333              11  10
{noformat}

> TPCDS query 49 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13862
> URL: https://issues.apache.org/jira/browse/SPARK-13862
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 49 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL has the right answer but in the wrong order (and there is an 'order
> by' in the query).
> Actual results:
> {noformat}
> [store,9797,0.8000,2,2]
> [store,12641,0.81609195402298850575,3,3]
> [store,6661,0.92207792207792207792,7,7]
> [store,13013,0.94202898550724637681,8,8]
> [store,9029,1.,10,10]
> [web,15597,0.66197183098591549296,3,3]
> [store,14925,0.96470588235294117647,9,9]
> [store,4063,1.,10,10]
> [catalog,8929,0.7625,7,7]
> [store,11589,0.82653061224489795918,6,6]
> [store,1171,0.82417582417582417582,5,5]
> [store,9471,0.7750,1,1]
> [catalog,12577,0.65591397849462365591,3,3]
> [web,97,0.90361445783132530120,9,8]
> [web,85,0.85714285714285714286,8,7]
> [catalog,361,0.74647887323943661972,5,5]
> [web,2915,0.69863013698630136986,4,4]
> [web,117,0.9250,10,9]
> [catalog,9295,0.77894736842105263158,9,9]
> [web,3305,0.7375,6,16]
> [catalog,16215,0.79069767441860465116,10,10]
> [web,7539,0.5900,1,1]
> [catalog,17543,0.57142857142857142857,1,1]
> [catalog,3411,0.71641791044776119403,4,4]
> [web,11933,0.71717171717171717172,5,5]
> [catalog,14513,0.63541667,2,2]
> [store,15839,0.81632653061224489796,4,4]
> [web,3337,0.62650602409638554217,2,2]
> [web,5299,0.92708333,11,10]
> [catalog,8189,0.74698795180722891566,6,6]
> [catalog,14869,0.77173913043478260870,8,8]
> [web,483,0.8000,7,6]
> {noformat}
> Expected results:
> {noformat}
> +-+---++-+---+
> | CHANNEL |  ITEM |   RETURN_RATIO | RETURN_RANK | CURRENCY_RANK |
> +-+---++-+---+
> | catalog | 17543 |  .5714285714285714 |   1 | 1 |
> | catalog | 14513 |  .63541666 |   2 | 2 |
> | catalog | 12577 |  .6559139784946236 |   3 | 3 |
> | catalog |  3411 |  .7164179104477611 |   4 | 4 |
> | catalog |   361 |  .7464788732394366 |   5 | 5 |
> | catalog |  8189 |  .7469879518072289 |   6 | 6 |
> | catalog |  8929 |  .7625 |   7 | 7 |
> | catalog | 14869 |  .7717391304347826 |   8 | 8 |
> | catalog |  9295 |  .7789473684210526 |   9 | 9 |
> | catalog | 16215 |  .7906976744186046 |  10 |10 |
> | store   |  9471 |  .7750 |   1 | 1 |
> | store   |  9797 |  .8000 |   2 |  

[jira] [Closed] (SPARK-13862) TPCDS query 49 returns wrong results compared to TPC official result set

2016-03-29 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN closed SPARK-13862.
--

The PR fixed this issue. Thanks, [~smilegator]

> TPCDS query 49 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13862
> URL: https://issues.apache.org/jira/browse/SPARK-13862
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 49 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL has the right answer but in the wrong order (and there is an 'order
> by' in the query).
> Actual results:
> {noformat}
> [store,9797,0.8000,2,2]
> [store,12641,0.81609195402298850575,3,3]
> [store,6661,0.92207792207792207792,7,7]
> [store,13013,0.94202898550724637681,8,8]
> [store,9029,1.,10,10]
> [web,15597,0.66197183098591549296,3,3]
> [store,14925,0.96470588235294117647,9,9]
> [store,4063,1.,10,10]
> [catalog,8929,0.7625,7,7]
> [store,11589,0.82653061224489795918,6,6]
> [store,1171,0.82417582417582417582,5,5]
> [store,9471,0.7750,1,1]
> [catalog,12577,0.65591397849462365591,3,3]
> [web,97,0.90361445783132530120,9,8]
> [web,85,0.85714285714285714286,8,7]
> [catalog,361,0.74647887323943661972,5,5]
> [web,2915,0.69863013698630136986,4,4]
> [web,117,0.9250,10,9]
> [catalog,9295,0.77894736842105263158,9,9]
> [web,3305,0.7375,6,16]
> [catalog,16215,0.79069767441860465116,10,10]
> [web,7539,0.5900,1,1]
> [catalog,17543,0.57142857142857142857,1,1]
> [catalog,3411,0.71641791044776119403,4,4]
> [web,11933,0.71717171717171717172,5,5]
> [catalog,14513,0.63541667,2,2]
> [store,15839,0.81632653061224489796,4,4]
> [web,3337,0.62650602409638554217,2,2]
> [web,5299,0.92708333,11,10]
> [catalog,8189,0.74698795180722891566,6,6]
> [catalog,14869,0.77173913043478260870,8,8]
> [web,483,0.8000,7,6]
> {noformat}
> Expected results:
> {noformat}
> +-+---++-+---+
> | CHANNEL |  ITEM |   RETURN_RATIO | RETURN_RANK | CURRENCY_RANK |
> +-+---++-+---+
> | catalog | 17543 |  .5714285714285714 |   1 | 1 |
> | catalog | 14513 |  .63541666 |   2 | 2 |
> | catalog | 12577 |  .6559139784946236 |   3 | 3 |
> | catalog |  3411 |  .7164179104477611 |   4 | 4 |
> | catalog |   361 |  .7464788732394366 |   5 | 5 |
> | catalog |  8189 |  .7469879518072289 |   6 | 6 |
> | catalog |  8929 |  .7625 |   7 | 7 |
> | catalog | 14869 |  .7717391304347826 |   8 | 8 |
> | catalog |  9295 |  .7789473684210526 |   9 | 9 |
> | catalog | 16215 |  .7906976744186046 |  10 |10 |
> | store   |  9471 |  .7750 |   1 | 1 |
> | store   |  9797 |  .8000 |   2 | 2 |
> | store   | 12641 |  .8160919540229885 |   3 | 3 |
> | store   | 15839 |  .8163265306122448 |   4 | 4 |
> | store   |  1171 |  .8241758241758241 |   5 | 5 |
> | store   | 11589 |  .8265306122448979 |   6 | 6 |
> | store   |  6661 |  .9220779220779220 |   7 | 7 |
> | store   | 13013 |  .9420289855072463 |   8 | 8 |
> | store   | 14925 |  .9647058823529411 |   9 | 9 |
> | store   |  4063 | 1. |  10 |10 |
> | store   |  9029 | 1. |  10 |10 |
> | web |  7539 |  .5900 |   1 | 1 |
> | web |  3337 |  .6265060240963855 |   2 | 2 |
> | web | 15597 |  .6619718309859154 |   3 | 3 |
> | web |  2915 |  .6986301369863013 |   4 | 4 |
> | web | 11933 |  .7171717171717171 |   5 | 5 |
> | web |  3305 |  .7375 |   6 |16 |
> | web |   483 |  .8000 |   7 | 6 |
> | web |85 |  .8571428571428571 |   8 | 7 |
> | web |97 |  .9036144578313253 |   9 | 8 |
> | web |   117 |  .9250 |  10 | 9 |
> | web |  5299 |  .92708333 |  11 |10 |
> +-+---++-+---+
> {noformat}

[jira] [Closed] (SPARK-13864) TPCDS query 74 returns wrong results compared to TPC official result set

2016-03-29 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN closed SPARK-13864.
--

PR fixed the issue. Nice work, [~smilegator]

> TPCDS query 74 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13864
> URL: https://issues.apache.org/jira/browse/SPARK-13864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 74 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> Spark SQL has right answer but in wrong order (and there is an 'order by' in 
> the query).
> Actual results:
> {noformat}
> [BLEIBAAA,Paula,Wakefield]
> [DFIEBAAA,John,Gray]
> [OCLBBAAA,null,null]
> [PKBCBAAA,Andrea,White]
> [EJDL,Alice,Wright]
> [FACE,Priscilla,Miller]
> [LFKK,Ignacio,Miller]
> [LJNCBAAA,George,Gamez]
> [LIOP,Derek,Allen]
> [EADJ,Ruth,Carroll]
> [JGMM,Richard,Larson]
> [PKIK,Wendy,Horvath]
> [FJHF,Larissa,Roy]
> [EPOG,Felisha,Mendes]
> [EKJL,Aisha,Carlson]
> [HNFH,Rebecca,Wilson]
> [IBFCBAAA,Ruth,Grantham]
> [OPDL,Ann,Pence]
> [NIPL,Eric,Lawrence]
> [OCIC,Zachary,Pennington]
> [OFLC,James,Taylor]
> [GEHI,Tyler,Miller]
> [CADP,Cristobal,Thomas]
> [JIAL,Santos,Gutierrez]
> [PMMBBAAA,Paul,Jordan]
> [DIIO,David,Carroll]
> [DFKABAAA,Latoya,Craft]
> [HMOI,Grace,Henderson]
> [PPIBBAAA,Candice,Lee]
> [JONHBAAA,Warren,Orozco]
> [GNDA,Terry,Mcdowell]
> [CIJM,Elizabeth,Thomas]
> [DIJGBAAA,Ruth,Sanders]
> [NFBDBAAA,Vernice,Fernandez]
> [IDKF,Michael,Mack]
> [IMHB,Kathy,Knowles]
> [LHMC,Brooke,Nelson]
> [CFCGBAAA,Marcus,Sanders]
> [NJHCBAAA,Christopher,Schreiber]
> [PDFB,Terrance,Banks]
> [ANFA,Philip,Banks]
> [IADEBAAA,Diane,Aldridge]
> [ICHF,Linda,Mccoy]
> [CFEN,Christopher,Dawson]
> [KOJJ,Gracie,Mendoza]
> [FOJA,Don,Castillo]
> [FGPG,Albert,Wadsworth]
> [KJBK,Georgia,Scott]
> [EKFP,Annika,Chin]
> [IBAEBAAA,Sandra,Wilson]
> [MFFL,Margret,Gray]
> [KNAK,Gladys,Banks]
> [CJDI,James,Kerr]
> [OBADBAAA,Elizabeth,Burnham]
> [AMGD,Kenneth,Harlan]
> [HJLA,Audrey,Beltran]
> [AOPFBAAA,Jerry,Fields]
> [CNAGBAAA,Virginia,May]
> [HGOABAAA,Sonia,White]
> [KBCABAAA,Debra,Bell]
> [NJAG,Allen,Hood]
> [MMOBBAAA,Margaret,Smith]
> [NGDBBAAA,Carlos,Jewell]
> [FOGI,Michelle,Greene]
> [JEKFBAAA,Norma,Burkholder]
> [OCAJ,Jenna,Staton]
> [PFCL,Felicia,Neville]
> [DLHBBAAA,Henry,Bertrand]
> [DBEFBAAA,Bennie,Bowers]
> [DCKO,Robert,Gonzalez]
> [KKGE,Katie,Dunbar]
> [GFMDBAAA,Kathleen,Gibson]
> [IJEM,Charlie,Cummings]
> [KJBL,Kerry,Davis]
> [JKBN,Julie,Kern]
> [MDCA,Louann,Hamel]
> [EOAK,Molly,Benjamin]
> [IBHH,Jennifer,Ballard]
> [PJEN,Ashley,Norton]
> [KLHHBAAA,Manuel,Castaneda]
> [IMHHBAAA,Lillian,Davidson]
> [GHPBBAAA,Nick,Mendez]
> [BNBB,Irma,Smith]
> [FBAH,Michael,Williams]
> [PEHEBAAA,Edith,Molina]
> [FMHI,Emilio,Darling]
> [KAEC,Milton,Mackey]
> [OCDJ,Nina,Sanchez]
> [FGIG,Eduardo,Miller]
> [FHACBAAA,null,null]
> [HMJN,Ryan,Baptiste]
> [HHCABAAA,William,Stewart]
> {noformat}
> Expected results:
> {noformat}
> +--+-++
> | CUSTOMER_ID  | CUSTOMER_FIRST_NAME | CUSTOMER_LAST_NAME |
> +--+-++
> | AMGD | Kenneth | Harlan |
> | ANFA | Philip  | Banks  |
> | AOPFBAAA | Jerry   | Fields |
> | BLEIBAAA | Paula   | Wakefield  |
> | BNBB | Irma| Smith  |
> | CADP | Cristobal   | Thomas |
> | CFCGBAAA | Marcus  | Sanders|
> | CFEN | Christopher | Dawson |
> | CIJM | Eliz

[jira] [Commented] (SPARK-13864) TPCDS query 74 returns wrong results compared to TPC official result set

2016-03-29 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216483#comment-15216483
 ] 

JESSE CHEN commented on SPARK-13864:


Validated successfully. Returned the correct result set in order:

{noformat}
AMGD Kenneth Harlan
ANFA Philip Banks
AOPFBAAA Jerry Fields
BLEIBAAA Paula Wakefield
BNBB Irma Smith
CADP Cristobal Thomas
CFCGBAAA Marcus Sanders
CFEN Christopher Dawson
CIJM Elizabeth Thomas
CJDI James Kerr
CNAGBAAA Virginia May
DBEFBAAA Bennie Bowers
DCKO Robert Gonzalez
DFIEBAAA John Gray
DFKABAAA Latoya Craft
DIIO David Carroll
DIJGBAAA Ruth Sanders
DLHBBAAA Henry Bertrand
EADJ Ruth Carroll
EJDL Alice Wright
EKFP Annika Chin
EKJL Aisha Carlson
EOAK Molly Benjamin
EPOG Felisha Mendes
FACE Priscilla Miller
FBAH Michael Williams
FGIG Eduardo Miller
FGPG Albert Wadsworth
FHACBAAA
FJHF Larissa Roy
FMHI Emilio Darling
FOGI Michelle Greene
FOJA Don Castillo
GEHI Tyler Miller
GFMDBAAA Kathleen Gibson
GHPBBAAA Nick Mendez
GNDA Terry Mcdowell
HGOABAAA Sonia White
HHCABAAA William Stewart
HJLA Audrey Beltran
HMJN Ryan Baptiste
HMOI Grace Henderson
HNFH Rebecca Wilson
IADEBAAA Diane Aldridge
IBAEBAAA Sandra Wilson
IBFCBAAA Ruth Grantham
IBHH Jennifer Ballard
ICHF Linda Mccoy
IDKF Michael Mack
IJEM Charlie Cummings
IMHB Kathy Knowles
IMHHBAAA Lillian Davidson
JEKFBAAA Norma Burkholder
JGMM Richard Larson
JIAL Santos Gutierrez
JKBN Julie Kern
JONHBAAA Warren Orozco
KAEC Milton Mackey
KBCABAAA Debra Bell
KJBK Georgia Scott
KJBL Kerry Davis
KKGE Katie Dunbar
KLHHBAAA Manuel Castaneda
KNAK Gladys Banks
KOJJ Gracie Mendoza
LFKK Ignacio Miller
LHMC Brooke Nelson
LIOP Derek Allen
LJNCBAAA George Gamez
MDCA Louann Hamel
MFFL Margret Gray
MMOBBAAA Margaret Smith
NFBDBAAA Vernice Fernandez
NGDBBAAA Carlos Jewell
NIPL Eric Lawrence
NJAG Allen Hood
NJHCBAAA Christopher Schreiber
OBADBAAA Elizabeth Burnham
OCAJ Jenna Staton
OCDJ Nina Sanchez
OCIC Zachary Pennington
OCLBBAAA
OFLC James Taylor
OPDL Ann Pence
PDFB Terrance Banks
PEHEBAAA Edith Molina
PFCL Felicia Neville
PJEN Ashley Norton
PKBCBAAA Andrea White
PKIK Wendy Horvath
PMMBBAAA Paul Jordan
PPIBBAAA Candice Lee
{noformat}

> TPCDS query 74 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13864
> URL: https://issues.apache.org/jira/browse/SPARK-13864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 74 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> Spark SQL has right answer but in wrong order (and there is an 'order by' in 
> the query).
> Actual results:
> {noformat}
> [BLEIBAAA,Paula,Wakefield]
> [DFIEBAAA,John,Gray]
> [OCLBBAAA,null,null]
> [PKBCBAAA,Andrea,White]
> [EJDL,Alice,Wright]
> [FACE,Priscilla,Miller]
> [LFKK,Ignacio,Mil

[jira] [Commented] (SPARK-13831) TPC-DS Query 35 fails with the following compile error

2016-03-28 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214834#comment-15214834
 ] 

JESSE CHEN commented on SPARK-13831:


Same in Spark 2.0. Query 41 also returns the same error.
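
The underlying gap is that the HiveQL parser has no rules for EXISTS subquery expressions, which is why queries 10, 35, and 41 all fail the same way. Until that is supported, one possible workaround (a sketch, not the eventual fix) is to express the correlated EXISTS as a second LEFT SEMI JOIN, the same device query 35 already uses for ss_wh1:

{noformat}
-- sketch: in query 35 (quoted below), drop the "where exists (...)" clause and
-- add this alongside the ss_wh1 join; a LEFT SEMI JOIN keeps a customer row
-- when at least one match exists, which is exactly what EXISTS asks
-- (ss_wh2 is a made-up alias)
LEFT SEMI JOIN
  (select ws_bill_customer_sk as customer_sk
     from web_sales JOIN date_dim ON ws_sold_date_sk = d_date_sk
    where d_year = 2002 and d_qoy < 4
   UNION ALL
   select cs_ship_customer_sk as customer_sk
     from catalog_sales JOIN date_dim ON cs_sold_date_sk = d_date_sk
    where d_year = 2002 and d_qoy < 4) ss_wh2
  ON c.c_customer_sk = ss_wh2.customer_sk
{noformat}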

> TPC-DS Query 35 fails with the following compile error
> --
>
> Key: SPARK-13831
> URL: https://issues.apache.org/jira/browse/SPARK-13831
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Roy Cecil
>
> TPC-DS Query 35 fails with the following compile error.
> Scala.NotImplementedError: 
> scala.NotImplementedError: No parse rules for ASTNode type: 864, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR 1, 439,797, 1370
>   TOK_SUBQUERY_OP 1, 439,439, 1370
> exists 1, 439,439, 1370
>   TOK_QUERY 1, 441,797, 1508
> Pasting Query 35 for easy reference.
> select
>   ca_state,
>   cd_gender,
>   cd_marital_status,
>   cd_dep_count,
>   count(*) cnt1,
>   min(cd_dep_count) cd_dep_count1,
>   max(cd_dep_count) cd_dep_count2,
>   avg(cd_dep_count) cd_dep_count3,
>   cd_dep_employed_count,
>   count(*) cnt2,
>   min(cd_dep_employed_count) cd_dep_employed_count1,
>   max(cd_dep_employed_count) cd_dep_employed_count2,
>   avg(cd_dep_employed_count) cd_dep_employed_count3,
>   cd_dep_college_count,
>   count(*) cnt3,
>   min(cd_dep_college_count) cd_dep_college_count1,
>   max(cd_dep_college_count) cd_dep_college_count2,
>   avg(cd_dep_college_count) cd_dep_college_count3
>  from
>   customer c
>   JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk
>   JOIN customer_demographics ON cd_demo_sk = c.c_current_cdemo_sk
>   LEFT SEMI JOIN
>   (select ss_customer_sk
>   from store_sales
>JOIN date_dim ON ss_sold_date_sk = d_date_sk
>   where
> d_year = 2002 and
> d_qoy < 4) ss_wh1
>   ON c.c_customer_sk = ss_wh1.ss_customer_sk
>  where
>exists (
> select tmp.customer_sk from (
> select ws_bill_customer_sk  as customer_sk
> from web_sales,date_dim
> where
>   ws_sold_date_sk = d_date_sk and
>   d_year = 2002 and
>   d_qoy < 4
>UNION ALL
> select cs_ship_customer_sk  as customer_sk
> from catalog_sales,date_dim
> where
>   cs_sold_date_sk = d_date_sk and
>   d_year = 2002 and
>   d_qoy < 4
>   ) tmp where c.c_customer_sk = tmp.customer_sk
> )
>  group by ca_state,
>   cd_gender,
>   cd_marital_status,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>  order by ca_state,
>   cd_gender,
>   cd_marital_status,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>  limit 100;






[jira] [Commented] (SPARK-13820) TPC-DS Query 10 fails to compile

2016-03-28 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214795#comment-15214795
 ] 

JESSE CHEN commented on SPARK-13820:


Happens in Spark 2.0 as well; in Spark 2.0 the error is
{noformat}
== Error ==
scala.NotImplementedError: [Expression]: No parse rules for ASTNode type: 918, 
tree:
TOK_SUBQUERY_EXPR 22, 172, 292, 2 
:- TOK_SUBQUERY_OP 22, 172, 172, 2 
:  +- exists 22, 172, 172, 2 
+- TOK_QUERY 23, 174, 292, 15 

{noformat}

The missing EXISTS support affects a few TPCDS queries (so far 93 of the 99 
queries work, so we are getting really close).


> TPC-DS Query 10 fails to compile
> 
>
> Key: SPARK-13820
> URL: https://issues.apache.org/jira/browse/SPARK-13820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS Query 10 fails to compile with the following error.
> Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( 
> TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177)
> Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( 
> TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177)
> Query is pasted here for easy reproduction
>  select
>   cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   count(*) cnt1,
>   cd_purchase_estimate,
>   count(*) cnt2,
>   cd_credit_rating,
>   count(*) cnt3,
>   cd_dep_count,
>   count(*) cnt4,
>   cd_dep_employed_count,
>   count(*) cnt5,
>   cd_dep_college_count,
>   count(*) cnt6
>  from
>   customer c
>   JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk
>   JOIN customer_demographics ON cd_demo_sk = c.c_current_cdemo_sk
>   LEFT SEMI JOIN (select ss_customer_sk
>   from store_sales
>JOIN date_dim ON ss_sold_date_sk = d_date_sk
>   where
> d_year = 2002 and
> d_moy between 1 and 1+3) ss_wh1 ON c.c_customer_sk = 
> ss_wh1.ss_customer_sk
>  where
>   ca_county in ('Rush County','Toole County','Jefferson County','Dona Ana 
> County','La Porte County') and
>exists (
> select tmp.customer_sk from (
> select ws_bill_customer_sk as customer_sk
> from web_sales,date_dim
> where
>   web_sales.ws_sold_date_sk = date_dim.d_date_sk and
>   d_year = 2002 and
>   d_moy between 1 and 1+3
> UNION ALL
> select cs_ship_customer_sk as customer_sk
> from catalog_sales,date_dim
> where
>   catalog_sales.cs_sold_date_sk = date_dim.d_date_sk and
>   d_year = 2002 and
>   d_moy between 1 and 1+3
>   ) tmp where c.c_customer_sk = tmp.customer_sk
> )
>  group by cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   cd_purchase_estimate,
>   cd_credit_rating,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>  order by cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   cd_purchase_estimate,
>   cd_credit_rating,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>   limit 100;






[jira] [Updated] (SPARK-14096) SPARK-SQL CLI returns NPE

2016-03-23 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14096:
---
Labels:   (was: tpcds-result-mismatch)

> SPARK-SQL CLI returns NPE
> -
>
> Key: SPARK-14096
> URL: https://issues.apache.org/jira/browse/SPARK-14096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>
> Trying to run TPCDS query 06 in spark-sql shell received the following error 
> in the middle of a stage; but running another query 38 succeeded:
> NPE:
> {noformat}
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 65.0 in stage 
> 10.0 (TID 622) in 171 ms on localhost (30/200)
> 16/03/22 15:12:56 ERROR scheduler.TaskResultGetter: Exception while getting 
> task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1790)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:645)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:344)
>   at java.util.PriorityQueue.add(PriorityQueue.java:321)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
>   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
>   ... 15 more
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 66.0 in stage 
> 10.0 (TID 623) in 171 ms on localhost (31/200)
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> {noformat}
> query 06 (caused the above NPE):
> {noformat}
>  select  a.ca_state state, count(*) cnt
>  from customer_address a
>  join customer c on a.ca_address_sk = c.c_current_addr_sk
>  join store_sales s on c.c_customer_sk = s.ss_customer_sk
>  join date_dim d on s.ss_sold_date_sk = d.d_date_sk
>  join item i on s.ss_item_sk = i.i_item_sk
>  join (select distinct d_month_seq
> from date_dim
>where d_year = 2001
>   and d_moy = 1 ) tmp1 ON d.d_month_seq = tmp1.d_month_seq
>  join
>   (select j.i_category, avg(j.i_current_price) as avg_i_current_price
>from item j group by j.i_category) tmp2 on tmp2.i_category = 
> i.i_category
>  where  
>   i.i_current_price > 1.2 * tmp2.avg_i_current_price
>  group by a.ca_state
>  having count(*) >= 10
>  order by cnt 
>limit 100;
> {noformat}
> query 38 (succeeded)
> {

[jira] [Updated] (SPARK-14096) SPARK-SQL CLI returns NPE

2016-03-23 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14096:
---
Affects Version/s: (was: 1.6.0)
   2.0.0
  Description: 
Trying to run TPCDS query 06 in spark-sql shell received the following error in 
the middle of a stage; but running another query 38 succeeded:

NPE:
{noformat}
16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, whose 
tasks have all completed, from pool 
16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 65.0 in stage 
10.0 (TID 622) in 171 ms on localhost (30/200)
16/03/22 15:12:56 ERROR scheduler.TaskResultGetter: Exception while getting 
task result
com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
underlying (org.apache.spark.util.BoundedPriorityQueue)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
at 
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
at 
org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1790)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
at 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669)
at java.util.PriorityQueue.siftUp(PriorityQueue.java:645)
at java.util.PriorityQueue.offer(PriorityQueue.java:344)
at java.util.PriorityQueue.add(PriorityQueue.java:321)
at 
com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
at 
com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
... 15 more
16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, whose 
tasks have all completed, from pool 
16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 66.0 in stage 
10.0 (TID 623) in 171 ms on localhost (31/200)
16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, whose 
tasks have all completed, from pool 
{noformat}

query 06 (caused the above NPE):
{noformat}
 select  a.ca_state state, count(*) cnt
 from customer_address a
 join customer c on a.ca_address_sk = c.c_current_addr_sk
 join store_sales s on c.c_customer_sk = s.ss_customer_sk
 join date_dim d on s.ss_sold_date_sk = d.d_date_sk
 join item i on s.ss_item_sk = i.i_item_sk
 join (select distinct d_month_seq
  from date_dim
   where d_year = 2001
and d_moy = 1 ) tmp1 ON d.d_month_seq = tmp1.d_month_seq
 join
  (select j.i_category, avg(j.i_current_price) as avg_i_current_price
 from item j group by j.i_category) tmp2 on tmp2.i_category = 
i.i_category
 where  
i.i_current_price > 1.2 * tmp2.avg_i_current_price
 group by a.ca_state
 having count(*) >= 10
 order by cnt 
   limit 100;

{noformat}
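
The trace implicates the ORDER BY ... LIMIT path: the NPE fires while Kryo rebuilds the BoundedPriorityQueue behind the take-ordered operator using the generated ordering. A quick way to corner it (a diagnostic sketch, not a fix) is the same aggregation with that tail removed; if this completes, the failure is confined to the take-ordered path:

{noformat}
-- query 06 minus "order by cnt limit 100"
select a.ca_state state, count(*) cnt
from customer_address a
join customer c on a.ca_address_sk = c.c_current_addr_sk
join store_sales s on c.c_customer_sk = s.ss_customer_sk
join date_dim d on s.ss_sold_date_sk = d.d_date_sk
join item i on s.ss_item_sk = i.i_item_sk
join (select distinct d_month_seq from date_dim
      where d_year = 2001 and d_moy = 1) tmp1 on d.d_month_seq = tmp1.d_month_seq
join (select j.i_category, avg(j.i_current_price) as avg_i_current_price
      from item j group by j.i_category) tmp2 on tmp2.i_category = i.i_category
where i.i_current_price > 1.2 * tmp2.avg_i_current_price
group by a.ca_state
having count(*) >= 10;
{noformat}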

query 38 (succeeded)
{noformat}
select  count(*) from (
select distinct c_last_name, c_first_name, d_date
from store_sales, date_dim, customer
  where store_sales.ss_sold_date_sk = date_dim.d_date_sk
  and store_sales.ss_customer_sk = customer.c_customer_sk
  and d_month_seq between 1200 and 1200 + 11
  intersect
select distinct c_last_name, c_first_name, d_date

[jira] [Created] (SPARK-14096) SPARK-SQL CLI returns NPE

2016-03-23 Thread JESSE CHEN (JIRA)
JESSE CHEN created SPARK-14096:
--

 Summary: SPARK-SQL CLI returns NPE
 Key: SPARK-14096
 URL: https://issues.apache.org/jira/browse/SPARK-14096
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: JESSE CHEN


Testing Spark SQL using TPC queries. Query 49 returns wrong results compared to 
official result set. This is at 1GB SF (validation run).

SparkSQL has right answer but in wrong order (and there is an 'order by' in the 
query).

Actual results:
{noformat}
store,9797,0.8000,2,2]
[store,12641,0.81609195402298850575,3,3]
[store,6661,0.92207792207792207792,7,7]
[store,13013,0.94202898550724637681,8,8]
[store,9029,1.,10,10]
[web,15597,0.66197183098591549296,3,3]
[store,14925,0.96470588235294117647,9,9]
[store,4063,1.,10,10]
[catalog,8929,0.7625,7,7]
[store,11589,0.82653061224489795918,6,6]
[store,1171,0.82417582417582417582,5,5]
[store,9471,0.7750,1,1]
[catalog,12577,0.65591397849462365591,3,3]
[web,97,0.90361445783132530120,9,8]
[web,85,0.85714285714285714286,8,7]
[catalog,361,0.74647887323943661972,5,5]
[web,2915,0.69863013698630136986,4,4]
[web,117,0.9250,10,9]
[catalog,9295,0.77894736842105263158,9,9]
[web,3305,0.7375,6,16]
[catalog,16215,0.79069767441860465116,10,10]
[web,7539,0.5900,1,1]
[catalog,17543,0.57142857142857142857,1,1]
[catalog,3411,0.71641791044776119403,4,4]
[web,11933,0.71717171717171717172,5,5]
[catalog,14513,0.63541667,2,2]
[store,15839,0.81632653061224489796,4,4]
[web,3337,0.62650602409638554217,2,2]
[web,5299,0.92708333,11,10]
[catalog,8189,0.74698795180722891566,6,6]
[catalog,14869,0.77173913043478260870,8,8]
[web,483,0.8000,7,6]
{noformat}


Expected results:
{noformat}
+-+---++-+---+
| CHANNEL |  ITEM |   RETURN_RATIO | RETURN_RANK | CURRENCY_RANK |
+-+---++-+---+
| catalog | 17543 |  .5714285714285714 |   1 | 1 |
| catalog | 14513 |  .63541666 |   2 | 2 |
| catalog | 12577 |  .6559139784946236 |   3 | 3 |
| catalog |  3411 |  .7164179104477611 |   4 | 4 |
| catalog |   361 |  .7464788732394366 |   5 | 5 |
| catalog |  8189 |  .7469879518072289 |   6 | 6 |
| catalog |  8929 |  .7625 |   7 | 7 |
| catalog | 14869 |  .7717391304347826 |   8 | 8 |
| catalog |  9295 |  .7789473684210526 |   9 | 9 |
| catalog | 16215 |  .7906976744186046 |  10 |10 |
| store   |  9471 |  .7750 |   1 | 1 |
| store   |  9797 |  .8000 |   2 | 2 |
| store   | 12641 |  .8160919540229885 |   3 | 3 |
| store   | 15839 |  .8163265306122448 |   4 | 4 |
| store   |  1171 |  .8241758241758241 |   5 | 5 |
| store   | 11589 |  .8265306122448979 |   6 | 6 |
| store   |  6661 |  .9220779220779220 |   7 | 7 |
| store   | 13013 |  .9420289855072463 |   8 | 8 |
| store   | 14925 |  .9647058823529411 |   9 | 9 |
| store   |  4063 | 1. |  10 |10 |
| store   |  9029 | 1. |  10 |10 |
| web |  7539 |  .5900 |   1 | 1 |
| web |  3337 |  .6265060240963855 |   2 | 2 |
| web | 15597 |  .6619718309859154 |   3 | 3 |
| web |  2915 |  .6986301369863013 |   4 | 4 |
| web | 11933 |  .7171717171717171 |   5 | 5 |
| web |  3305 |  .7375 |   6 |16 |
| web |   483 |  .8000 |   7 | 6 |
| web |85 |  .8571428571428571 |   8 | 7 |
| web |97 |  .9036144578313253 |   9 | 8 |
| web |   117 |  .9250 |  10 | 9 |
| web |  5299 |  .92708333 |  11 |10 |
+-+---++-+---+
{noformat}

Query used:
{noformat}
-- start query 49 in stream 0 using template query49.tpl and seed QUALIFICATION
  select  
 'web' as channel
 ,web.item
 ,web.return_ratio
 ,web.return_rank
 ,web.currency_rank
 from (
select 
 item
,return_ratio
,currency_ratio
,rank() over (order by return_ratio) as return_rank
,rank() over (order by currency_ratio) as currency_rank
from
(   select ws.ws_item_sk as item
  

[jira] [Commented] (SPARK-13864) TPCDS query 74 returns wrong results compared to TPC official result set

2016-03-22 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207239#comment-15207239
 ] 

JESSE CHEN commented on SPARK-13864:


Tried two recent builds; both had issues running to completion. Something is 
broken. Looking into why...

> TPCDS query 74 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13864
> URL: https://issues.apache.org/jira/browse/SPARK-13864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 74 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> Spark SQL has right answer but in wrong order (and there is an 'order by' in 
> the query).
> Actual results:
> {noformat}
> [BLEIBAAA,Paula,Wakefield]
> [DFIEBAAA,John,Gray]
> [OCLBBAAA,null,null]
> [PKBCBAAA,Andrea,White]
> [EJDL,Alice,Wright]
> [FACE,Priscilla,Miller]
> [LFKK,Ignacio,Miller]
> [LJNCBAAA,George,Gamez]
> [LIOP,Derek,Allen]
> [EADJ,Ruth,Carroll]
> [JGMM,Richard,Larson]
> [PKIK,Wendy,Horvath]
> [FJHF,Larissa,Roy]
> [EPOG,Felisha,Mendes]
> [EKJL,Aisha,Carlson]
> [HNFH,Rebecca,Wilson]
> [IBFCBAAA,Ruth,Grantham]
> [OPDL,Ann,Pence]
> [NIPL,Eric,Lawrence]
> [OCIC,Zachary,Pennington]
> [OFLC,James,Taylor]
> [GEHI,Tyler,Miller]
> [CADP,Cristobal,Thomas]
> [JIAL,Santos,Gutierrez]
> [PMMBBAAA,Paul,Jordan]
> [DIIO,David,Carroll]
> [DFKABAAA,Latoya,Craft]
> [HMOI,Grace,Henderson]
> [PPIBBAAA,Candice,Lee]
> [JONHBAAA,Warren,Orozco]
> [GNDA,Terry,Mcdowell]
> [CIJM,Elizabeth,Thomas]
> [DIJGBAAA,Ruth,Sanders]
> [NFBDBAAA,Vernice,Fernandez]
> [IDKF,Michael,Mack]
> [IMHB,Kathy,Knowles]
> [LHMC,Brooke,Nelson]
> [CFCGBAAA,Marcus,Sanders]
> [NJHCBAAA,Christopher,Schreiber]
> [PDFB,Terrance,Banks]
> [ANFA,Philip,Banks]
> [IADEBAAA,Diane,Aldridge]
> [ICHF,Linda,Mccoy]
> [CFEN,Christopher,Dawson]
> [KOJJ,Gracie,Mendoza]
> [FOJA,Don,Castillo]
> [FGPG,Albert,Wadsworth]
> [KJBK,Georgia,Scott]
> [EKFP,Annika,Chin]
> [IBAEBAAA,Sandra,Wilson]
> [MFFL,Margret,Gray]
> [KNAK,Gladys,Banks]
> [CJDI,James,Kerr]
> [OBADBAAA,Elizabeth,Burnham]
> [AMGD,Kenneth,Harlan]
> [HJLA,Audrey,Beltran]
> [AOPFBAAA,Jerry,Fields]
> [CNAGBAAA,Virginia,May]
> [HGOABAAA,Sonia,White]
> [KBCABAAA,Debra,Bell]
> [NJAG,Allen,Hood]
> [MMOBBAAA,Margaret,Smith]
> [NGDBBAAA,Carlos,Jewell]
> [FOGI,Michelle,Greene]
> [JEKFBAAA,Norma,Burkholder]
> [OCAJ,Jenna,Staton]
> [PFCL,Felicia,Neville]
> [DLHBBAAA,Henry,Bertrand]
> [DBEFBAAA,Bennie,Bowers]
> [DCKO,Robert,Gonzalez]
> [KKGE,Katie,Dunbar]
> [GFMDBAAA,Kathleen,Gibson]
> [IJEM,Charlie,Cummings]
> [KJBL,Kerry,Davis]
> [JKBN,Julie,Kern]
> [MDCA,Louann,Hamel]
> [EOAK,Molly,Benjamin]
> [IBHH,Jennifer,Ballard]
> [PJEN,Ashley,Norton]
> [KLHHBAAA,Manuel,Castaneda]
> [IMHHBAAA,Lillian,Davidson]
> [GHPBBAAA,Nick,Mendez]
> [BNBB,Irma,Smith]
> [FBAH,Michael,Williams]
> [PEHEBAAA,Edith,Molina]
> [FMHI,Emilio,Darling]
> [KAEC,Milton,Mackey]
> [OCDJ,Nina,Sanchez]
> [FGIG,Eduardo,Miller]
> [FHACBAAA,null,null]
> [HMJN,Ryan,Baptiste]
> [HHCABAAA,William,Stewart]
> {noformat}
> Expected results:
> {noformat}
> +--+-++
> | CUSTOMER_ID  | CUSTOMER_FIRST_NAME | CUSTOMER_LAST_NAME |
> +--+-++
> | AMGD | Kenneth | Harlan |
> | ANFA | Philip  | Banks  |
> | AOPFBAAA | Jerry   | Fields |
> | BLEIBAAA | Paula   | Wakefield  |
> | BNBB | Irma| Smith  |
> | CADP | Cristobal   | Thomas |
> | CFCGBAAA | Marcus  

[jira] [Closed] (SPARK-13858) TPCDS query 21 returns wrong results compared to TPC official result set

2016-03-22 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN closed SPARK-13858.
--
Resolution: Not A Bug

Schema updates generated correct results in both Spark 1.6 and 2.0. Good to 
close.
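
For reference, the actual results had NULL warehouse names where the answer set shows real ones, so the fix was in the table DDL rather than the engine. A hypothetical sketch of the kind of schema update meant above (the column and type are assumptions, not the committed change):

{noformat}
-- hypothetical: if w_warehouse_name were declared too narrow and long values
-- loaded as NULL, widening it to the spec's varchar(20) would restore them
alter table warehouse change w_warehouse_name w_warehouse_name varchar(20);
{noformat}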

> TPCDS query 21 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13858
> URL: https://issues.apache.org/jira/browse/SPARK-13858
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 21 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL missing at least one row (grep for ABDA) ; I believe 2 
> other rows are missing as well.
> Actual results:
> {noformat}
> [null,AABD,2565,1922]
> [null,AAHD,2956,2052]
> [null,AALA,2042,1793]
> [null,ACGC,2373,1771]
> [null,ACKC,2321,1856]
> [null,ACOB,1504,1397]
> [null,ADKB,1820,2163]
> [null,AEAD,2631,1965]
> [null,AEOC,1659,1798]
> [null,AFAC,1965,1705]
> [null,AFAD,1769,1313]
> [null,AHDE,2700,1985]
> [null,AHHA,1578,1082]
> [null,AIEC,1756,1804]
> [null,AIMC,3603,2951]
> [null,AJAC,2109,1989]
> [null,AJKB,2573,3540]
> [null,ALBE,3458,2992]
> [null,ALCE,1720,1810]
> [null,ALEC,2569,1946]
> [null,ALNB,2552,1750]
> [null,ANFE,2022,2269]
> [null,AOIB,2982,2540]
> [null,APJB,2344,2593]
> [null,BAPD,2182,2787]
> [null,BDCE,2844,2069]
> [null,BDDD,2417,2537]
> [null,BDJA,1584,1666]
> [null,BEOD,2141,2649]
> [null,BFCC,2745,2020]
> [null,BFMB,1642,1364]
> [null,BHPC,1923,1780]
> [null,BIDB,1956,2836]
> [null,BIGB,2023,2344]
> [null,BIJB,1977,2728]
> [null,BJFE,1891,2390]
> [null,BLDE,1983,1797]
> [null,BNID,2485,2324]
> [null,BNLD,2385,2786]
> [null,BOMB,2291,2092]
> [null,CAAA,2233,2560]
> [null,CBCD,1540,2012]
> [null,CBIA,2394,2122]
> [null,CBPB,1790,1661]
> [null,CCMD,2654,2691]
> [null,CDBC,1804,2072]
> [null,CFEA,1941,1567]
> [null,CGFD,2123,2265]
> [null,CHPC,2933,2174]
> [null,CIGD,2618,2399]
> [null,CJCB,2728,2367]
> [null,CJLA,1350,1732]
> [null,CLAE,2578,2329]
> [null,CLGA,1842,1588]
> [null,CLLB,3418,2657]
> [null,CLOB,3115,2560]
> [null,CMAD,1991,2243]
> [null,CMJA,1261,1855]
> [null,CMLA,3288,2753]
> [null,CMPD,1320,1676]
> [null,CNGB,2340,2118]
> [null,CNHD,3519,3348]
> [null,CNPC,2561,1948]
> [null,DCPC,2664,2627]
> [null,DDHA,1313,1926]
> [null,DDND,1109,835]
> [null,DEAA,2141,1847]
> [null,DEJA,3142,2723]
> [null,DFKB,1470,1650]
> [null,DGCC,2113,2331]
> [null,DGFC,2201,2928]
> [null,DHPA,2467,2133]
> [null,DMBA,3085,2087]
> [null,DPAB,3494,3081]
> [null,EAEC,2133,2148]
> [null,EAPA,1560,1275]
> [null,ECGC,2815,3307]
> [null,EDPD,2731,1883]
> [null,EEEC,2024,1902]
> [null,EEMC,2624,2387]
> [null,EFFA,2047,1878]
> [null,EGJA,2403,2633]
> [null,EGMA,2784,2772]
> [null,EGOC,2389,1753]
> [null,EHFD,1940,1420]
> [null,EHLB,2320,2057]
> [null,EHPA,1898,1853]
> [null,EIPB,2930,2326]
> [null,EJAE,2582,1836]
> [null,EJIB,2257,1681]
> [null,EJJA,2791,1941]
> [null,EJJD,3410,2405]
> [null,EJNC,2472,2067]
> [null,EJPD,1219,1229]
> [null,EKEB,2047,1713]
> [null,EMEA,2502,1897]
> [null,EMKC,2362,2042]
> [null,ENAC,2011,1909]
> [null,ENFB,2507,2162]
> [null,ENOD,3371,2709]
> {noformat}
> Expected results:
> {noformat}
> +--+--++---+
> | W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER |
> +--+--++---+
> | Bad cards must make. | AACD |   1889 |  2168 |
> | Bad cards must make. 

[jira] [Closed] (SPARK-13861) TPCDS query 40 returns wrong results compared to TPC official result set

2016-03-22 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN closed SPARK-13861.
--
Resolution: Duplicate

Fixed all schema discrepancies. 

> TPCDS query 40 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13861
> URL: https://issues.apache.org/jira/browse/SPARK-13861
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 40 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL missing at least one row (grep for ABBD) ; I believe 5 
> rows are missing in total.
> Actual results:
> {noformat}
> [TN,AABD,0.0,-82.060899353]
> [TN,AACD,-216.54000234603882,158.0399932861328]
> [TN,AAHD,186.54999542236328,0.0]
> [TN,AALA,0.0,48.2254223633]
> [TN,ACGC,63.67999863624573,0.0]
> [TN,ACHC,102.6830517578,51.8838964844]
> [TN,ACKC,128.9235150146,44.8169482422]
> [TN,ACLD,205.43999433517456,-948.619930267334]
> [TN,ACOB,207.32000732421875,24.88389648438]
> [TN,ACPD,87.75,53.9900016784668]
> [TN,ADGB,44.310001373291016,222.4800033569336]
> [TN,ADKB,0.0,-471.8699951171875]
> [TN,AEAD,58.2400016784668,0.0]
> [TN,AEOC,19.9084741211,214.7076293945]
> [TN,AFAC,271.8199977874756,163.1699981689453]
> [TN,AFAD,2.349046325684,28.3169482422]
> [TN,AFDC,-378.0499496459961,-303.26999282836914]
> [TN,AGID,307.6099967956543,-19.29915527344]
> [TN,AHDE,80.574468689,-476.7200012207031]
> [TN,AHHA,8.27457763672,155.1276565552]
> [TN,AHJB,39.23999857902527,0.0]
> [TN,AIEC,82.3675750732,3.910858306885]
> [TN,AIEE,20.39618530273,-151.08999633789062]
> [TN,AIMC,24.46313354492,-150.330517578]
> [TN,AJAC,49.0915258789,82.084741211]
> [TN,AJCA,121.18000221252441,63.779998779296875]
> [TN,AJKB,27.94534057617,8.97267028809]
> [TN,ALBE,88.2599983215332,30.22542236328]
> [TN,ALCE,93.5245776367,92.0198092651]
> [TN,ALEC,64.179019165,15.1584741211]
> [TN,ALNB,4.19809265137,148.27000427246094]
> [TN,AMBE,28.44534057617,0.0]
> [TN,AMPB,0.0,131.92999839782715]
> [TN,ANFE,0.0,-137.3400115966797]
> [TN,AOIB,150.40999603271484,254.288058548]
> [TN,APJB,45.2745776367,334.482015991]
> [TN,APLA,50.2076293945,29.150001049041748]
> [TN,APLD,0.0,32.3838964844]
> [TN,BAPD,93.41999816894531,145.8699951171875]
> [TN,BBID,296.774577637,30.95084472656]
> [TN,BDCE,-1771.0800704956055,-54.779998779296875]
> [TN,BDDD,111.12000274658203,280.5899963378906]
> [TN,BDJA,0.0,79.5423706055]
> [TN,BEFD,0.0,3.429475479126]
> [TN,BEOD,269.838964844,297.5800061225891]
> [TN,BFMB,110.82999801635742,-941.4000930786133]
> [TN,BFNA,47.8661035156,0.0]
> [TN,BFOC,46.3415258789,83.5245776367]
> [TN,BHPC,27.378392334,77.61999893188477]
> [TN,BIDB,196.6199951171875,5.57171661377]
> [TN,BIGB,425.3399963378906,0.0]
> [TN,BIJB,209.6300048828125,0.0]
> [TN,BJFE,7.32923706055,55.1584741211]
> [TN,BKFA,0.0,138.14000129699707]
> [TN,BKMC,27.17076293945,54.970001220703125]
> [TN,BLDE,170.28999400138855,0.0]
> [TN,BNHB,58.0594277954,-337.8899841308594]
> [TN,BNID,54.41525878906,35.01504089355]
> [TN,BNLA,0.0,168.37999629974365]
> [TN,BNLD,0.0,96.4084741211]
> [TN,BNMC,202.40999698638916,49.52999830245972]
> [TN,BOCC,4.73019073486,69.83999633789062]
> [TN,BOMB,63.66999816894531,163.49000668525696]
> [TN,CAAA,121.91000366210938,0.0]
> [TN,CAAD,-1107.6099338531494,0.0]
> [TN,CAJC,115.8046594238,173.0519073486]
> [TN,CBCD,18.94534057617,226.38000106811523]
> [TN,CBFA,0.0,97.41000366210938]
> [TN,CBIA,2.14104904175,84.66000366210938]
> [TN,CBPB,95.44000244140625,26.6830517578]
> [TN,CCAB,160.43000602722168,135.8661035156]
> [TN,CCHD,0.0,12

[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set

2016-03-19 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200637#comment-15200637
 ] 

JESSE CHEN commented on SPARK-13865:


This may be a TPC toolkit issue. Will be looking into this with John on my team, 
who is one of the TPC board members.

> TPCDS query 87 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13865
> URL: https://issues.apache.org/jira/browse/SPARK-13865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 87 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 47555, answer set expects 47298.
> Actual results:
> {noformat}
> [47555]
> {noformat}
> {noformat}
> Expected:
> +---+
> | 1 |
> +---+
> | 47298 |
> +---+
> {noformat}
> Query used:
> {noformat}
> -- start query 87 in stream 0 using template query87.tpl and seed 
> QUALIFICATION
> select count(*) 
> from 
>  (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as 
> ddate1, 1 as notnull1
>from store_sales
> JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
>where
>  d_month_seq between 1200 and 1200+11
>) tmp1
>left outer join
>   (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as 
> ddate2, 1 as notnull2
>from catalog_sales
> JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp2 
>   on (tmp1.cln1 = tmp2.cln2)
>   and (tmp1.cfn1 = tmp2.cfn2)
>   and (tmp1.ddate1= tmp2.ddate2)
>left outer join
>   (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as 
> ddate3, 1 as notnull3
>from web_sales
> JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp3 
>   on (tmp1.cln1 = tmp3.cln3)
>   and (tmp1.cfn1 = tmp3.cfn3)
>   and (tmp1.ddate1= tmp3.ddate3)
> where  
> notnull2 is null and notnull3 is null  
> ;
> -- end query 87 in stream 0 using template query87.tpl
> {noformat}






[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set

2016-03-19 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200886#comment-15200886
 ] 

JESSE CHEN commented on SPARK-13865:


You rock!

> TPCDS query 87 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13865
> URL: https://issues.apache.org/jira/browse/SPARK-13865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 87 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 47555, answer set expects 47298.
> Actual results:
> {noformat}
> [47555]
> {noformat}
> {noformat}
> Expected:
> +---+
> | 1 |
> +---+
> | 47298 |
> +---+
> {noformat}
> Query used:
> {noformat}
> -- start query 87 in stream 0 using template query87.tpl and seed 
> QUALIFICATION
> select count(*) 
> from 
>  (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as 
> ddate1, 1 as notnull1
>from store_sales
> JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
>where
>  d_month_seq between 1200 and 1200+11
>) tmp1
>left outer join
>   (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as 
> ddate2, 1 as notnull2
>from catalog_sales
> JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp2 
>   on (tmp1.cln1 = tmp2.cln2)
>   and (tmp1.cfn1 = tmp2.cfn2)
>   and (tmp1.ddate1= tmp2.ddate2)
>left outer join
>   (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as 
> ddate3, 1 as notnull3
>from web_sales
> JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp3 
>   on (tmp1.cln1 = tmp3.cln3)
>   and (tmp1.cfn1 = tmp3.cfn3)
>   and (tmp1.ddate1= tmp3.ddate3)
> where  
> notnull2 is null and notnull3 is null  
> ;
> -- end query 87 in stream 0 using template query87.tpl
> {noformat}






[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set

2016-03-19 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200869#comment-15200869
 ] 

JESSE CHEN commented on SPARK-13865:


I am onto that. Thanks.

Also, good to know the parsing error is gone in 2.0. Can't wait to get my hands 
on that soon.

> TPCDS query 87 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13865
> URL: https://issues.apache.org/jira/browse/SPARK-13865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 87 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 47555, answer set expects 47298.
> Actual results:
> {noformat}
> [47555]
> {noformat}
> {noformat}
> Expected:
> +---+
> | 1 |
> +---+
> | 47298 |
> +---+
> {noformat}
> Query used:
> {noformat}
> -- start query 87 in stream 0 using template query87.tpl and seed 
> QUALIFICATION
> select count(*) 
> from 
>  (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as 
> ddate1, 1 as notnull1
>from store_sales
> JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
>where
>  d_month_seq between 1200 and 1200+11
>) tmp1
>left outer join
>   (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as 
> ddate2, 1 as notnull2
>from catalog_sales
> JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp2 
>   on (tmp1.cln1 = tmp2.cln2)
>   and (tmp1.cfn1 = tmp2.cfn2)
>   and (tmp1.ddate1= tmp2.ddate2)
>left outer join
>   (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as 
> ddate3, 1 as notnull3
>from web_sales
> JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp3 
>   on (tmp1.cln1 = tmp3.cln3)
>   and (tmp1.cfn1 = tmp3.cfn3)
>   and (tmp1.ddate1= tmp3.ddate3)
> where  
> notnull2 is null and notnull3 is null  
> ;
> -- end query 87 in stream 0 using template query87.tpl
> {noformat}






[jira] [Commented] (SPARK-13858) TPCDS query 21 returns wrong results compared to TPC official result set

2016-03-19 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198558#comment-15198558
 ] 

JESSE CHEN commented on SPARK-13858:


Good job, Bo! I would like to test this on my cluster if you have a fix.

> TPCDS query 21 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13858
> URL: https://issues.apache.org/jira/browse/SPARK-13858
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 21 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL missing at least one row (grep for ABDA) ; I believe 2 
> other rows are missing as well.
> Actual results:
> {noformat}
> [null,AABD,2565,1922]
> [null,AAHD,2956,2052]
> [null,AALA,2042,1793]
> [null,ACGC,2373,1771]
> [null,ACKC,2321,1856]
> [null,ACOB,1504,1397]
> [null,ADKB,1820,2163]
> [null,AEAD,2631,1965]
> [null,AEOC,1659,1798]
> [null,AFAC,1965,1705]
> [null,AFAD,1769,1313]
> [null,AHDE,2700,1985]
> [null,AHHA,1578,1082]
> [null,AIEC,1756,1804]
> [null,AIMC,3603,2951]
> [null,AJAC,2109,1989]
> [null,AJKB,2573,3540]
> [null,ALBE,3458,2992]
> [null,ALCE,1720,1810]
> [null,ALEC,2569,1946]
> [null,ALNB,2552,1750]
> [null,ANFE,2022,2269]
> [null,AOIB,2982,2540]
> [null,APJB,2344,2593]
> [null,BAPD,2182,2787]
> [null,BDCE,2844,2069]
> [null,BDDD,2417,2537]
> [null,BDJA,1584,1666]
> [null,BEOD,2141,2649]
> [null,BFCC,2745,2020]
> [null,BFMB,1642,1364]
> [null,BHPC,1923,1780]
> [null,BIDB,1956,2836]
> [null,BIGB,2023,2344]
> [null,BIJB,1977,2728]
> [null,BJFE,1891,2390]
> [null,BLDE,1983,1797]
> [null,BNID,2485,2324]
> [null,BNLD,2385,2786]
> [null,BOMB,2291,2092]
> [null,CAAA,2233,2560]
> [null,CBCD,1540,2012]
> [null,CBIA,2394,2122]
> [null,CBPB,1790,1661]
> [null,CCMD,2654,2691]
> [null,CDBC,1804,2072]
> [null,CFEA,1941,1567]
> [null,CGFD,2123,2265]
> [null,CHPC,2933,2174]
> [null,CIGD,2618,2399]
> [null,CJCB,2728,2367]
> [null,CJLA,1350,1732]
> [null,CLAE,2578,2329]
> [null,CLGA,1842,1588]
> [null,CLLB,3418,2657]
> [null,CLOB,3115,2560]
> [null,CMAD,1991,2243]
> [null,CMJA,1261,1855]
> [null,CMLA,3288,2753]
> [null,CMPD,1320,1676]
> [null,CNGB,2340,2118]
> [null,CNHD,3519,3348]
> [null,CNPC,2561,1948]
> [null,DCPC,2664,2627]
> [null,DDHA,1313,1926]
> [null,DDND,1109,835]
> [null,DEAA,2141,1847]
> [null,DEJA,3142,2723]
> [null,DFKB,1470,1650]
> [null,DGCC,2113,2331]
> [null,DGFC,2201,2928]
> [null,DHPA,2467,2133]
> [null,DMBA,3085,2087]
> [null,DPAB,3494,3081]
> [null,EAEC,2133,2148]
> [null,EAPA,1560,1275]
> [null,ECGC,2815,3307]
> [null,EDPD,2731,1883]
> [null,EEEC,2024,1902]
> [null,EEMC,2624,2387]
> [null,EFFA,2047,1878]
> [null,EGJA,2403,2633]
> [null,EGMA,2784,2772]
> [null,EGOC,2389,1753]
> [null,EHFD,1940,1420]
> [null,EHLB,2320,2057]
> [null,EHPA,1898,1853]
> [null,EIPB,2930,2326]
> [null,EJAE,2582,1836]
> [null,EJIB,2257,1681]
> [null,EJJA,2791,1941]
> [null,EJJD,3410,2405]
> [null,EJNC,2472,2067]
> [null,EJPD,1219,1229]
> [null,EKEB,2047,1713]
> [null,EMEA,2502,1897]
> [null,EMKC,2362,2042]
> [null,ENAC,2011,1909]
> [null,ENFB,2507,2162]
> [null,ENOD,3371,2709]
> {noformat}
> Expected results:
> {noformat}
> +--+--++---+
> | W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER |
> +--+--++---+
> | Bad cards must make. | AACD |   1889 |  2168 |
> | 

[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set

2016-03-19 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200850#comment-15200850
 ] 

JESSE CHEN commented on SPARK-13865:


The Hive, Big SQL, and DB2 queries are all generated from their corresponding 
query templates; Hive apparently generated the one I listed in the initial 
report (the JOIN variant). I am asking TPC why that variant exists in the 
templates. A likely source of the count gap: in the JOIN variant, rows whose 
c_last_name or c_first_name is NULL never satisfy the equality join conditions, 
so the notnull2/notnull3 anti-join filters never eliminate them, whereas EXCEPT 
treats two NULLs as equal when it removes rows.
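
A minimal illustration of that gap, using hypothetical one-row inline tables rather than TPC-DS data:

{noformat}
-- NULL never equals NULL in a join condition, so the left row matches nothing
-- and flag comes back NULL; under the JOIN variant's "notnull2 is null" filter
-- such a customer is kept and counted, while EXCEPT treats the two NULL names
-- as the same value and removes the row
select t2.flag
from (select cast(null as string) as name) t1
left outer join (select cast(null as string) as name, 1 as flag) t2
  on t1.name = t2.name;
{noformat}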

Meanwhile, I tested the query, and as expected, Spark SQL isn't able to parse 
it with the following errors:

{noformat}
16/03/17 19:17:57 INFO parse.ParseDriver: Parsing command: explain select 
count(*) from ((select distinct c_last_name, c_first_name, d_date from 
store_sales, date_dim, customer where store_sales.ss_sold_date_sk = 
date_dim.d_date_sk and store_sales.ss_customer_sk = 
customer.c_customer_sk and d_month_seq between 1200 and 1200+11) 
except (select distinct c_last_name, c_first_name, d_date from 
catalog_sales, date_dim, customer where catalog_sales.cs_sold_date_sk = 
date_dim.d_date_sk and catalog_sales.cs_bill_customer_sk = 
customer.c_customer_sk and d_month_seq between 1200 and 1200+11) 
except (select distinct c_last_name, c_first_name, d_date from 
web_sales, date_dim, customer where web_sales.ws_sold_date_sk = 
date_dim.d_date_sk and web_sales.ws_bill_customer_sk = 
customer.c_customer_sk and d_month_seq between 1200 and 1200+11)) 
cool_cust
NoViableAltException(296@[150:5: ( ( Identifier LPAREN )=> 
partitionedTableFunction | tableSource | subQuerySource | virtualTableSource )])
at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
at org.antlr.runtime.DFA.predict(DFA.java:144)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3711)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1873)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1518)
{noformat}



> TPCDS query 87 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13865
> URL: https://issues.apache.org/jira/browse/SPARK-13865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 87 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 47555, answer set expects 47298.
> Actual results:
> {noformat}
> [47555]
> {noformat}
> {noformat}
> Expected:
> +---+
> | 1 |
> +---+
> | 47298 |
> +---+
> {noformat}
> Query used:
> {noformat}
> -- start query 87 in stream 0 using template query87.tpl and seed 
> QUALIFICATION
> select count(*) 
> from 
>  (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as 
> ddate1, 1 as notnull1
>from store_sales
> JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
>where
>  d_month_seq between 1200 and 1200+11
>) tmp1
>left outer join
>   (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as 
> ddate2, 1 as notnull2
>from catalog_sales
> JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp2 
>   on (tmp1.cln1 = tmp2.cln2)
>   and (tmp1.cfn1 = tmp2.cfn2)
>   and (tmp1.ddate1= tmp2.ddate2)
>left outer join
>   (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as 
> ddate3, 1 as notnull3
>from web_sales
> JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp3 
>   on (tmp1.cln1 = tmp3.cln3)
>   and (tmp1.cfn1 = tmp3.cfn3)
>   and (tmp1.ddate1= tmp3.ddate3)
> where  
> notnull2 is null and notnull3 is null  
> ;
> -- end query 87 in stream 0 using template query87.tpl
> {noformat}






[jira] [Comment Edited] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set

2016-03-19 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200850#comment-15200850
 ] 

JESSE CHEN edited comment on SPARK-13865 at 3/18/16 2:24 AM:
-

Hive, Big SQL, and DB2 queries are all generated from their corresponding query 
templates. Hive apparently generated the one I listed in the initial report 
(with JOINs), so I am asking TPC why this variant exists in the templates.

Meanwhile, I tested the query you found, and as expected, Spark SQL is unable 
to parse it. The query and the resulting error follow.
Query:
{noformat}
select count(*)
from ((select distinct c_last_name, c_first_name, d_date
   from store_sales, date_dim, customer
   where store_sales.ss_sold_date_sk = date_dim.d_date_sk
 and store_sales.ss_customer_sk = customer.c_customer_sk
 and d_month_seq between 1200 and 1200+11)
   except
  (select distinct c_last_name, c_first_name, d_date
   from catalog_sales, date_dim, customer
   where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
 and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
 and d_month_seq between 1200 and 1200+11)
   except
  (select distinct c_last_name, c_first_name, d_date
   from web_sales, date_dim, customer
   where web_sales.ws_sold_date_sk = date_dim.d_date_sk
 and web_sales.ws_bill_customer_sk = customer.c_customer_sk
 and d_month_seq between 1200 and 1200+11)
) cool_cust
;

{noformat}
Error:
{noformat}
16/03/17 19:17:57 INFO parse.ParseDriver: Parsing command: explain select 
count(*)  from ((select distinct c_last_name, c_first_name, d_date from 
store_sales, date_dim, customer where store_sales.ss_sold_date_sk = 
date_dim.d_date_sk  and store_sales.ss_customer_sk = 
customer.c_customer_sk  and d_month_seq between 1200 and 1200+11)   
 except   (select distinct c_last_name, c_first_name, d_date from 
catalog_sales, date_dim, customer where catalog_sales.cs_sold_date_sk = 
date_dim.d_date_sk  and catalog_sales.cs_bill_customer_sk = 
customer.c_customer_sk  and d_month_seq between 1200 and 1200+11)   
 except   (select distinct c_last_name, c_first_name, d_date from 
web_sales, date_dim, customer where web_sales.ws_sold_date_sk = 
date_dim.d_date_sk  and web_sales.ws_bill_customer_sk = 
customer.c_customer_sk  and d_month_seq between 1200 and 1200+11) ) 
cool_cust
NoViableAltException(296@[150:5: ( ( Identifier LPAREN )=> 
partitionedTableFunction | tableSource | subQuerySource | virtualTableSource )])
at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
at org.antlr.runtime.DFA.predict(DFA.java:144)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3711)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1873)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1518)
{noformat}





[jira] [Commented] (SPARK-13859) TPCDS query 38 returns wrong results compared to TPC official result set

2016-03-19 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200177#comment-15200177
 ] 

JESSE CHEN commented on SPARK-13859:


Tested both q87 and q38 on the lab's cluster. 

With this modification (i.e., null-safe equals), both q87 and q38 returned 
correct results (per TPC) on both text and parquet.
Without this modification, both queries returned the wrong results.

Per TPC rules on vendor-specific syntax:

4.2.3.4 The following query modifications are minor: 
c) Operators
2. Relational operators - Relational operators used in queries such as "<", 
">", "<>", "<=", and "=", may be replaced by equivalent vendor-specific 
operators, for example ".LT.", ".GT.", "!=" or "^=", ".LE.", and "==", 
respectively. 

This proposed modification, however, seems to fall outside the allowed 
modifications, because it is a workaround for an issue where 
"Spark does not deal with nulls correctly under certain conditions." If you 
look at the other TPC queries (72 of which returned correct results), this 
type of equality is used all over.

So there is an inherent null-handling issue in Spark that is **not related** 
to a) a wrong table definition, b) wrong query syntax, or c) the file format. 
Spark should handle this "=" correctly and automatically.

These two queries provide excellent test cases for finding that bug and fixing 
it.
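
To make the null behavior concrete, here is a minimal illustration of my own 
(not taken from the reports above), assuming Spark SQL's null-safe equality 
operator <=>:
{noformat}
-- Plain "=" yields NULL (not true) when either side is NULL,
-- so rows with NULL join keys never match:
select cast(null as string) = cast(null as string);    -- NULL

-- The null-safe operator treats two NULLs as equal:
select cast(null as string) <=> cast(null as string);  -- true
{noformat}
In the rewritten q38/q87, customers with a NULL c_last_name or c_first_name 
can never match under "=", which would explain why the counts drift from the 
official answer set.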

Jesse 









> TPCDS query 38 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13859
> URL: https://issues.apache.org/jira/browse/SPARK-13859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 38 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 0, answer set reports 107.
> Actual results:
> {noformat}
> [0]
> {noformat}
> Expected:
> {noformat}
> +-+
> |   1 |
> +-+
> | 107 |
> +-+
> {noformat}
> query used:
> {noformat}
> -- start query 38 in stream 0 using template query38.tpl and seed 
> QUALIFICATION
>  select  count(*) from (
> select distinct c_last_name, c_first_name, d_date
> from store_sales
>  JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp1
>   JOIN
> (select distinct c_last_name, c_first_name, d_date
> from catalog_sales
>  JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp2 ON (tmp1.c_last_name = 
> tmp2.c_last_name) and (tmp1.c_first_name = tmp2.c_first_name) and 
> (tmp1.d_date = tmp2.d_date) 
>   JOIN
> (
> select distinct c_last_name, c_first_name, d_date
> from web_sales
>  JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp3 ON (tmp1.c_last_name = 
> tmp3.c_last_name) and (tmp1.c_first_name = tmp3.c_first_name) and 
> (tmp1.d_date = tmp3.d_date) 
>   limit 100
>  ;
> -- end query 38 in stream 0 using template query38.tpl
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13859) TPCDS query 38 returns wrong results compared to TPC official result set

2016-03-19 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN closed SPARK-13859.
--
   Resolution: Not A Bug
Fix Version/s: 2.0.0

The solution is to revert to the original TPC query with INTERSECT & EXCEPT, 
which was validated to return correct results in Spark 2.0. The null-safe 
version will remain a variant of this query (for Hive). Internal toolkit 
defect opened: RTC 124749. 
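
Worth noting why the set-operator form behaves differently: SQL set operations 
treat two NULL rows as duplicates of each other, unlike "=". A minimal sketch 
(my own illustration, not from this issue):
{noformat}
-- EXCEPT removes a NULL-name row present on both sides,
-- whereas a join on "=" would never match it:
select cast(null as string) as c_last_name
except
select cast(null as string) as c_last_name;
-- (no rows)
{noformat}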

> TPCDS query 38 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13859
> URL: https://issues.apache.org/jira/browse/SPARK-13859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
> Fix For: 2.0.0
>
>
> Testing Spark SQL using TPC queries. Query 38 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 0, answer set reports 107.
> Actual results:
> {noformat}
> [0]
> {noformat}
> Expected:
> {noformat}
> +-+
> |   1 |
> +-+
> | 107 |
> +-+
> {noformat}
> query used:
> {noformat}
> -- start query 38 in stream 0 using template query38.tpl and seed 
> QUALIFICATION
>  select  count(*) from (
> select distinct c_last_name, c_first_name, d_date
> from store_sales
>  JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp1
>   JOIN
> (select distinct c_last_name, c_first_name, d_date
> from catalog_sales
>  JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp2 ON (tmp1.c_last_name = 
> tmp2.c_last_name) and (tmp1.c_first_name = tmp2.c_first_name) and 
> (tmp1.d_date = tmp2.d_date) 
>   JOIN
> (
> select distinct c_last_name, c_first_name, d_date
> from web_sales
>  JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp3 ON (tmp1.c_last_name = 
> tmp3.c_last_name) and (tmp1.c_first_name = tmp3.c_first_name) and 
> (tmp1.d_date = tmp3.d_date) 
>   limit 100
>  ;
> -- end query 38 in stream 0 using template query38.tpl
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set

2016-03-19 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200877#comment-15200877
 ] 

JESSE CHEN commented on SPARK-13865:


I will open a bug against the TPCDS toolkit for this and will add the bug 
report number here. 

> TPCDS query 87 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13865
> URL: https://issues.apache.org/jira/browse/SPARK-13865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 87 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 47555, answer set expects 47298.
> Actual results:
> {noformat}
> [47555]
> {noformat}
> {noformat}
> Expected:
> +---+
> | 1 |
> +---+
> | 47298 |
> +---+
> {noformat}
> Query used:
> {noformat}
> -- start query 87 in stream 0 using template query87.tpl and seed 
> QUALIFICATION
> select count(*) 
> from 
>  (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as 
> ddate1, 1 as notnull1
>from store_sales
> JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
>where
>  d_month_seq between 1200 and 1200+11
>) tmp1
>left outer join
>   (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as 
> ddate2, 1 as notnull2
>from catalog_sales
> JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp2 
>   on (tmp1.cln1 = tmp2.cln2)
>   and (tmp1.cfn1 = tmp2.cfn2)
>   and (tmp1.ddate1= tmp2.ddate2)
>left outer join
>   (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as 
> ddate3, 1 as notnull3
>from web_sales
> JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp3 
>   on (tmp1.cln1 = tmp3.cln3)
>   and (tmp1.cfn1 = tmp3.cfn3)
>   and (tmp1.ddate1= tmp3.ddate3)
> where  
> notnull2 is null and notnull3 is null  
> ;
> -- end query 87 in stream 0 using template query87.tpl
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13863) TPCDS query 66 returns wrong results compared to TPC official result set

2016-03-19 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200149#comment-15200149
 ] 

JESSE CHEN commented on SPARK-13863:


Going to validate this also on my cluster. Nice find.

> TPCDS query 66 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13863
> URL: https://issues.apache.org/jira/browse/SPARK-13863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 66 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> Aggregations are slightly off -- e.g., in the JAN_SALES column of the "Doors 
> canno" row, SparkSQL returns 6355232.185385704 where 6355232.31 is expected.
> Actual results:
> {noformat}
> [null,null,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,9597806.850651741,1.1121820530080795E7,8670867.81564045,8994785.945689201,1.088724806326294E7,1.4187671518377304E7,9732598.460139751,1.9798897020946026E7,2.1007842467959404E7,2.149551364927292E7,3.479566905774999E7,3.3122997954660416E7,null,null,null,null,null,null,null,null,null,null,null,null,2.191359469742E7,3.2518476414670944E7,2.48856624883976E7,2.5698343830046654E7,3.373591080598068E7,3.552703167087555E7,2.5465193481492043E7,5.362323870799959E7,5.1409986978201866E7,5.415917383586836E7,9.222704311805725E7,8.343539111531019E7]
> [Bad cards must make.,621234,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,9506753.593884468,8008140.429557085,6116769.711647987,1.1973045160133362E7,7756254.925520897,5352978.574095726,1.373399613500309E7,1.6418794411203384E7,1.7212743279764652E7,1.704270732417488E7,3.43049358570323E7,3.532416421229005E7,15.30301560102066,12.890698882477594,9.846160563729589,19.273003667109915,12.485238936569628,8.61668642427125,22.107605403121994,26.429323590150222,27.707342611261865,27.433635834765774,55.22063482847413,56.86128610521969,3.0534943928382874E7,2.4481686250203133E7,2.217871080008793E7,2.569579825610423E7,2.995490355044937E7,1.8084140250833035E7,3.0805576178061485E7,4.7156887432252884E7,5.115858869637826E7,5.5759943171424866E7,8.625354428184557E7,8.345155532035494E7]
> [Conventional childr,977787,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,8860645.460736752,1.441581376543355E7,6761497.232810497,1.1820654735879421E7,8246260.600341797,6636877.482845306,1.1434492123092413E7,2.5673812070380323E7,2.307420611785E7,2.1834582007320404E7,2.6894900596512794E7,3.357509177109933E7,9.061938296108202,14.743306840276613,6.9151024024767125,12.08919195681618,8.43359606984118,6.787651587559771,11.694256645969329,26.257060147435304,23.598398219562938,22.330611889215547,27.505888906799534,34.337838170377935,2.3836085704864502E7,3.20733132298584E7,2.503790437837982E7,2.2659895963564873E7,2.175740087420273E7,2.4451608012176514E7,2.1933001734852314E7,5.59967034604629E7,5.737188052299309E7,6.208721474336243E7,8.284991027382469E7,8.897031933202875E7]
> [Doors canno,294242,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,6355232.185385704,1.0198920296742141E7,1.0246200903741479E7,1.2209716492156029E7,8566998.262890816,8806316.75278151,9789405.6993227,1.646658496404171E7,2.6443785668474197E7,2.701604788320923E7,3.366058958298761E7,2.7462468750599384E7,21.59865751791282,34.66167405313361,34.822360178837414,41.495491779406166,29.115484067165177,29.928823053070296,33.26991285854059,55.96272783641258,89.87087386734116,91.81574310672585,114.39763726112386,93.33293258813964,2.2645142994330406E7,2.448725452685547E7,2.4925759290207863E7,3.0503655031727314E7,2.6558160276379585E7,2.0976233452690125E7,2.9895796101181984E7,5.600219855566597E7,5.348815865275085E7,7.628723580410767E7,8.248374754962921E7,8.808826726185608E7]
> [Important issues liv,138504,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,1.1748784594717264E7,1.435130566355586E7,9896470.867572784,7990874.805492401,8879247.840401173,7362383.04259038,1.0011144724414349E7,1.7741201390372872E7,2.1346976135887742E7,1.8074978020030975E7,2.967512567988676E7,3.2545325348875403E7,84.8263197793368,103.6165429414014,71.45259969078715,57.694180713137534,64.10824120892663,53.156465102743454,72.28054586448297,128.09161750110374,154.12534032149065,130.5014874662896,214.25464737398747,234.97751219369408,2.7204167203903973E7,2.598037822457385E7,1.9943398915802002E7,2.5710421112384796E7,1.948448105346489E7,2.6346611484448195E7,2.5075158296625137E7,5.409477817043829E7,4.106673223178029E7,5.454705814340496E7,7.246596285337901E7,9.277032812079096E7]
> {noformat}
> Expected results:
> {noformat}
> +--+---+--+---+-+---

[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set

2016-03-19 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198559#comment-15198559
 ] 

JESSE CHEN commented on SPARK-13865:


yes sir.

> TPCDS query 87 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13865
> URL: https://issues.apache.org/jira/browse/SPARK-13865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 87 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 47555, answer set expects 47298.
> Actual results:
> {noformat}
> [47555]
> {noformat}
> {noformat}
> Expected:
> +---+
> | 1 |
> +---+
> | 47298 |
> +---+
> {noformat}
> Query used:
> {noformat}
> -- start query 87 in stream 0 using template query87.tpl and seed 
> QUALIFICATION
> select count(*) 
> from 
>  (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as 
> ddate1, 1 as notnull1
>from store_sales
> JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
>where
>  d_month_seq between 1200 and 1200+11
>) tmp1
>left outer join
>   (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as 
> ddate2, 1 as notnull2
>from catalog_sales
> JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp2 
>   on (tmp1.cln1 = tmp2.cln2)
>   and (tmp1.cfn1 = tmp2.cfn2)
>   and (tmp1.ddate1= tmp2.ddate2)
>left outer join
>   (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as 
> ddate3, 1 as notnull3
>from web_sales
> JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp3 
>   on (tmp1.cln1 = tmp3.cln3)
>   and (tmp1.cfn1 = tmp3.cfn3)
>   and (tmp1.ddate1= tmp3.ddate3)
> where  
> notnull2 is null and notnull3 is null  
> ;
> -- end query 87 in stream 0 using template query87.tpl
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13859) TPCDS query 38 returns wrong results compared to TPC official result set

2016-03-18 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199885#comment-15199885
 ] 

JESSE CHEN commented on SPARK-13859:


Testing both Q87 and Q38. Back shortly with results.

> TPCDS query 38 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13859
> URL: https://issues.apache.org/jira/browse/SPARK-13859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 38 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 0, answer set reports 107.
> Actual results:
> {noformat}
> [0]
> {noformat}
> Expected:
> {noformat}
> +-+
> |   1 |
> +-+
> | 107 |
> +-+
> {noformat}
> query used:
> {noformat}
> -- start query 38 in stream 0 using template query38.tpl and seed 
> QUALIFICATION
>  select  count(*) from (
> select distinct c_last_name, c_first_name, d_date
> from store_sales
>  JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp1
>   JOIN
> (select distinct c_last_name, c_first_name, d_date
> from catalog_sales
>  JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp2 ON (tmp1.c_last_name = 
> tmp2.c_last_name) and (tmp1.c_first_name = tmp2.c_first_name) and 
> (tmp1.d_date = tmp2.d_date) 
>   JOIN
> (
> select distinct c_last_name, c_first_name, d_date
> from web_sales
>  JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp3 ON (tmp1.c_last_name = 
> tmp3.c_last_name) and (tmp1.c_first_name = tmp3.c_first_name) and 
> (tmp1.d_date = tmp3.d_date) 
>   limit 100
>  ;
> -- end query 38 in stream 0 using template query38.tpl
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13832) TPC-DS Query 36 fails with Parser error

2016-03-18 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202272#comment-15202272
 ] 

JESSE CHEN commented on SPARK-13832:


This is the vanilla TPC query:
{noformat}
  select
sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
   ,i_category
   ,i_class
   ,grouping(i_category)+grouping(i_class) as lochierarchy
   ,rank() over (
partition by grouping(i_category)+grouping(i_class),
case when grouping(i_class) = 0 then i_category end
order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as 
rank_within_parent
 from
store_sales
   ,date_dim   d1
   ,item
   ,store
 where
d1.d_year = 2001
 and d1.d_date_sk = ss_sold_date_sk
 and i_item_sk  = ss_item_sk
 and s_store_sk  = ss_store_sk
 and s_state in ('TN','TN','TN','TN',
 'TN','TN','TN','TN')
 group by rollup(i_category,i_class)
 order by
   lochierarchy desc
  ,case when lochierarchy = 0 then i_category end
  ,rank_within_parent
   limit 100;
{noformat}

The query fails in Spark 2.0 with the following error:
{noformat}
16/03/18 15:09:37 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 5.0, whose 
tasks have all completed, from pool 
16/03/18 15:09:37 ERROR scheduler.TaskResultGetter: Exception while getting 
task result
com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
underlying (org.apache.spark.util.BoundedPriorityQueue)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
at 
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:311)
at 
org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:65)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:56)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:56)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1789)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:55)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

{noformat}

With grouping_id(), the query is:
{noformat}
  select
sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
   ,i_category
   ,i_class
   ,grouping_id(i_category)+grouping_id(i_class) as lochierarchy
   ,rank() over (
partition by grouping_id(i_category)+grouping_id(i_class),
case when grouping_id(i_class) = 0 then i_category end
order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as 
rank_within_parent
 from
store_sales
   ,date_dim   d1
   ,item
   ,store
 where
d1.d_year = 2001
 and d1.d_date_sk = ss_sold_date_sk
 and i_item_sk  = ss_item_sk
 and s_store_sk  = ss_store_sk
 and s_state in ('TN','TN','TN','TN',
 'TN','TN','TN','TN')
 group by rollup(i_category,i_class)
 order by
   lochierarchy desc
  ,case when lochierarchy = 0 then i_category end
  ,rank_within_parent
   limit 100;
-- end query 36 in stream 0 using template query36.tpl
{noformat}
Returned error:
{noformat}
16/03/18 15:13:01 INFO parser.ParseDriver: Parse completed.
Error in query: Columns of grouping_id (i_category#674) does not match grouping 
columns (i_category#674,i_class#672);

{noformat}

Something still fails during logical plan generation.
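
For reference, the error above reflects the contract Spark enforces: 
grouping_id() must be passed all of the grouping columns (or none), while 
grouping() takes a single column. A minimal sketch of my own, assuming a 
hypothetical table t(cat, cls, v):
{noformat}
select
  cat,
  cls,
  grouping(cat)         as g_cat,  -- 1 when cat is rolled up, else 0
  grouping_id(cat, cls) as gid     -- bit vector over ALL grouping columns
from t
group by rollup(cat, cls);
{noformat}
Calling grouping_id(i_category) by itself, as in the rewrite above, trips the 
"does not match grouping columns" check.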


> TPC-DS Query 36 fails with Parser error
> ---
>
> Key: SPARK-13832
> URL: https://issues.apache.org/jira/browse/SPARK-13832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS query 36 fails with the following error
> Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.s

[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set

2016-03-18 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202280#comment-15202280
 ] 

JESSE CHEN commented on SPARK-13865:


The solution is to revert to the original TPC query with INTERSECT & EXCEPT, 
which was validated to return correct results in Spark 2.0. The null-safe 
version will remain a variant of this query (for Hive). Internal toolkit 
defect opened: RTC 124749. 

> TPCDS query 87 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13865
> URL: https://issues.apache.org/jira/browse/SPARK-13865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
> Fix For: 2.0.0
>
>
> Testing Spark SQL using TPC queries. Query 87 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 47555, answer set expects 47298.
> Actual results:
> {noformat}
> [47555]
> {noformat}
> {noformat}
> Expected:
> +---+
> | 1 |
> +---+
> | 47298 |
> +---+
> {noformat}
> Query used:
> {noformat}
> -- start query 87 in stream 0 using template query87.tpl and seed 
> QUALIFICATION
> select count(*) 
> from 
>  (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as 
> ddate1, 1 as notnull1
>from store_sales
> JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
>where
>  d_month_seq between 1200 and 1200+11
>) tmp1
>left outer join
>   (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as 
> ddate2, 1 as notnull2
>from catalog_sales
> JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp2 
>   on (tmp1.cln1 = tmp2.cln2)
>   and (tmp1.cfn1 = tmp2.cfn2)
>   and (tmp1.ddate1= tmp2.ddate2)
>left outer join
>   (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as 
> ddate3, 1 as notnull3
>from web_sales
> JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp3 
>   on (tmp1.cln1 = tmp3.cln3)
>   and (tmp1.cfn1 = tmp3.cfn3)
>   and (tmp1.ddate1= tmp3.ddate3)
> where  
> notnull2 is null and notnull3 is null  
> ;
> -- end query 87 in stream 0 using template query87.tpl
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set

2016-03-18 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN closed SPARK-13865.
--
   Resolution: Not A Bug
Fix Version/s: 2.0.0

> TPCDS query 87 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13865
> URL: https://issues.apache.org/jira/browse/SPARK-13865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
> Fix For: 2.0.0
>
>
> Testing Spark SQL using TPC queries. Query 87 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 47555, answer set expects 47298.
> Actual results:
> {noformat}
> [47555]
> {noformat}
> {noformat}
> Expected:
> +---+
> | 1 |
> +---+
> | 47298 |
> +---+
> {noformat}
> Query used:
> {noformat}
> -- start query 87 in stream 0 using template query87.tpl and seed 
> QUALIFICATION
> select count(*) 
> from 
>  (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as 
> ddate1, 1 as notnull1
>from store_sales
> JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
>where
>  d_month_seq between 1200 and 1200+11
>) tmp1
>left outer join
>   (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as 
> ddate2, 1 as notnull2
>from catalog_sales
> JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp2 
>   on (tmp1.cln1 = tmp2.cln2)
>   and (tmp1.cfn1 = tmp2.cfn2)
>   and (tmp1.ddate1= tmp2.ddate2)
>left outer join
>   (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as 
> ddate3, 1 as notnull3
>from web_sales
> JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp3 
>   on (tmp1.cln1 = tmp3.cln3)
>   and (tmp1.cfn1 = tmp3.cfn3)
>   and (tmp1.ddate1= tmp3.ddate3)
> where  
> notnull2 is null and notnull3 is null  
> ;
> -- end query 87 in stream 0 using template query87.tpl
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13863) TPCDS query 66 returns wrong results compared to TPC official result set

2016-03-18 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN closed SPARK-13863.
--
Resolution: Workaround

Fixed the schema.

> TPCDS query 66 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13863
> URL: https://issues.apache.org/jira/browse/SPARK-13863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 66 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> Aggregations are slightly off -- e.g., in the JAN_SALES column of the "Doors 
> canno" row, SparkSQL returns 6355232.185385704 where 6355232.31 is expected.
> Actual results:
> {noformat}
> [null,null,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,9597806.850651741,1.1121820530080795E7,8670867.81564045,8994785.945689201,1.088724806326294E7,1.4187671518377304E7,9732598.460139751,1.9798897020946026E7,2.1007842467959404E7,2.149551364927292E7,3.479566905774999E7,3.3122997954660416E7,null,null,null,null,null,null,null,null,null,null,null,null,2.191359469742E7,3.2518476414670944E7,2.48856624883976E7,2.5698343830046654E7,3.373591080598068E7,3.552703167087555E7,2.5465193481492043E7,5.362323870799959E7,5.1409986978201866E7,5.415917383586836E7,9.222704311805725E7,8.343539111531019E7]
> [Bad cards must make.,621234,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,9506753.593884468,8008140.429557085,6116769.711647987,1.1973045160133362E7,7756254.925520897,5352978.574095726,1.373399613500309E7,1.6418794411203384E7,1.7212743279764652E7,1.704270732417488E7,3.43049358570323E7,3.532416421229005E7,15.30301560102066,12.890698882477594,9.846160563729589,19.273003667109915,12.485238936569628,8.61668642427125,22.107605403121994,26.429323590150222,27.707342611261865,27.433635834765774,55.22063482847413,56.86128610521969,3.0534943928382874E7,2.4481686250203133E7,2.217871080008793E7,2.569579825610423E7,2.995490355044937E7,1.8084140250833035E7,3.0805576178061485E7,4.7156887432252884E7,5.115858869637826E7,5.5759943171424866E7,8.625354428184557E7,8.345155532035494E7]
> [Conventional childr,977787,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,8860645.460736752,1.441581376543355E7,6761497.232810497,1.1820654735879421E7,8246260.600341797,6636877.482845306,1.1434492123092413E7,2.5673812070380323E7,2.307420611785E7,2.1834582007320404E7,2.6894900596512794E7,3.357509177109933E7,9.061938296108202,14.743306840276613,6.9151024024767125,12.08919195681618,8.43359606984118,6.787651587559771,11.694256645969329,26.257060147435304,23.598398219562938,22.330611889215547,27.505888906799534,34.337838170377935,2.3836085704864502E7,3.20733132298584E7,2.503790437837982E7,2.2659895963564873E7,2.175740087420273E7,2.4451608012176514E7,2.1933001734852314E7,5.59967034604629E7,5.737188052299309E7,6.208721474336243E7,8.284991027382469E7,8.897031933202875E7]
> [Doors canno,294242,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,6355232.185385704,1.0198920296742141E7,1.0246200903741479E7,1.2209716492156029E7,8566998.262890816,8806316.75278151,9789405.6993227,1.646658496404171E7,2.6443785668474197E7,2.701604788320923E7,3.366058958298761E7,2.7462468750599384E7,21.59865751791282,34.66167405313361,34.822360178837414,41.495491779406166,29.115484067165177,29.928823053070296,33.26991285854059,55.96272783641258,89.87087386734116,91.81574310672585,114.39763726112386,93.33293258813964,2.2645142994330406E7,2.448725452685547E7,2.4925759290207863E7,3.0503655031727314E7,2.6558160276379585E7,2.0976233452690125E7,2.9895796101181984E7,5.600219855566597E7,5.348815865275085E7,7.628723580410767E7,8.248374754962921E7,8.808826726185608E7]
> [Important issues liv,138504,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,1.1748784594717264E7,1.435130566355586E7,9896470.867572784,7990874.805492401,8879247.840401173,7362383.04259038,1.0011144724414349E7,1.7741201390372872E7,2.1346976135887742E7,1.8074978020030975E7,2.967512567988676E7,3.2545325348875403E7,84.8263197793368,103.6165429414014,71.45259969078715,57.694180713137534,64.10824120892663,53.156465102743454,72.28054586448297,128.09161750110374,154.12534032149065,130.5014874662896,214.25464737398747,234.97751219369408,2.7204167203903973E7,2.598037822457385E7,1.9943398915802002E7,2.5710421112384796E7,1.948448105346489E7,2.6346611484448195E7,2.5075158296625137E7,5.409477817043829E7,4.106673223178029E7,5.454705814340496E7,7.246596285337901E7,9.277032812079096E7]
> {noformat}
> Expected results:
> {noformat}
> +--+---+--+---+-+---+---+--+++--

[jira] [Updated] (SPARK-13859) TPCDS query 38 returns wrong results compared to TPC official result set

2016-03-14 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-13859:
---
Description: 
Testing Spark SQL using TPC queries. Query 38 returns wrong results compared to 
official result set. This is at 1GB SF (validation run).

SparkSQL returns count of 0, answer set reports 107.

Actual results:
[0]

Expected:
+-+
|   1 |
+-+
| 107 |
+-+

query used:
-- start query 38 in stream 0 using template query38.tpl and seed QUALIFICATION
 select  count(*) from (
select distinct c_last_name, c_first_name, d_date
from store_sales
 JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
 JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
where d_month_seq between 1200 and 1200 + 11) tmp1
  JOIN
(select distinct c_last_name, c_first_name, d_date
from catalog_sales
 JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
 JOIN customer ON catalog_sales.cs_bill_customer_sk = 
customer.c_customer_sk
where d_month_seq between 1200 and 1200 + 11) tmp2 ON (tmp1.c_last_name = 
tmp2.c_last_name) and (tmp1.c_first_name = tmp2.c_first_name) and (tmp1.d_date 
= tmp2.d_date) 
  JOIN
(
select distinct c_last_name, c_first_name, d_date
from web_sales
 JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
 JOIN customer ON web_sales.ws_bill_customer_sk = customer.c_customer_sk
where d_month_seq between 1200 and 1200 + 11) tmp3 ON (tmp1.c_last_name = 
tmp3.c_last_name) and (tmp1.c_first_name = tmp3.c_first_name) and (tmp1.d_date 
= tmp3.d_date) 
  limit 100
 ;
-- end query 38 in stream 0 using template query38.tpl


  was:
Testing Spark SQL using TPC queries. Query 38 returns wrong results compared to 
official result set. This is at 1GB SF (validation run).

SparkSQL returns count of 0, answer set reports 107.

Actual results:
[0]

Expected:
+-+
|   1 |
+-+
| 107 |
+-+


> TPCDS query 38 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13859
> URL: https://issues.apache.org/jira/browse/SPARK-13859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 38 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 0, answer set reports 107.
> Actual results:
> [0]
> Expected:
> +-+
> |   1 |
> +-+
> | 107 |
> +-+
> query used:
> -- start query 38 in stream 0 using template query38.tpl and seed 
> QUALIFICATION
>  select  count(*) from (
> select distinct c_last_name, c_first_name, d_date
> from store_sales
>  JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp1
>   JOIN
> (select distinct c_last_name, c_first_name, d_date
> from catalog_sales
>  JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp2 ON (tmp1.c_last_name = 
> tmp2.c_last_name) and (tmp1.c_first_name = tmp2.c_first_name) and 
> (tmp1.d_date = tmp2.d_date) 
>   JOIN
> (
> select distinct c_last_name, c_first_name, d_date
> from web_sales
>  JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp3 ON (tmp1.c_last_name = 
> tmp3.c_last_name) and (tmp1.c_first_name = tmp3.c_first_name) and 
> (tmp1.d_date = tmp3.d_date) 
>   limit 100
>  ;
> -- end query 38 in stream 0 using template query38.tpl



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13858) TPCDS query 21 returns wrong results compared to TPC official result set

2016-03-14 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-13858:
---
Description: 
Testing Spark SQL using TPC queries. Query 21 returns wrong results compared to 
official result set. This is at 1GB SF (validation run).

SparkSQL is missing at least one row (grep for ABDA); I believe 2 
other rows are missing as well.

Actual results:
[null,AABD,2565,1922]
[null,AAHD,2956,2052]
[null,AALA,2042,1793]
[null,ACGC,2373,1771]
[null,ACKC,2321,1856]
[null,ACOB,1504,1397]
[null,ADKB,1820,2163]
[null,AEAD,2631,1965]
[null,AEOC,1659,1798]
[null,AFAC,1965,1705]
[null,AFAD,1769,1313]
[null,AHDE,2700,1985]
[null,AHHA,1578,1082]
[null,AIEC,1756,1804]
[null,AIMC,3603,2951]
[null,AJAC,2109,1989]
[null,AJKB,2573,3540]
[null,ALBE,3458,2992]
[null,ALCE,1720,1810]
[null,ALEC,2569,1946]
[null,ALNB,2552,1750]
[null,ANFE,2022,2269]
[null,AOIB,2982,2540]
[null,APJB,2344,2593]
[null,BAPD,2182,2787]
[null,BDCE,2844,2069]
[null,BDDD,2417,2537]
[null,BDJA,1584,1666]
[null,BEOD,2141,2649]
[null,BFCC,2745,2020]
[null,BFMB,1642,1364]
[null,BHPC,1923,1780]
[null,BIDB,1956,2836]
[null,BIGB,2023,2344]
[null,BIJB,1977,2728]
[null,BJFE,1891,2390]
[null,BLDE,1983,1797]
[null,BNID,2485,2324]
[null,BNLD,2385,2786]
[null,BOMB,2291,2092]
[null,CAAA,2233,2560]
[null,CBCD,1540,2012]
[null,CBIA,2394,2122]
[null,CBPB,1790,1661]
[null,CCMD,2654,2691]
[null,CDBC,1804,2072]
[null,CFEA,1941,1567]
[null,CGFD,2123,2265]
[null,CHPC,2933,2174]
[null,CIGD,2618,2399]
[null,CJCB,2728,2367]
[null,CJLA,1350,1732]
[null,CLAE,2578,2329]
[null,CLGA,1842,1588]
[null,CLLB,3418,2657]
[null,CLOB,3115,2560]
[null,CMAD,1991,2243]
[null,CMJA,1261,1855]
[null,CMLA,3288,2753]
[null,CMPD,1320,1676]
[null,CNGB,2340,2118]
[null,CNHD,3519,3348]
[null,CNPC,2561,1948]
[null,DCPC,2664,2627]
[null,DDHA,1313,1926]
[null,DDND,1109,835]
[null,DEAA,2141,1847]
[null,DEJA,3142,2723]
[null,DFKB,1470,1650]
[null,DGCC,2113,2331]
[null,DGFC,2201,2928]
[null,DHPA,2467,2133]
[null,DMBA,3085,2087]
[null,DPAB,3494,3081]
[null,EAEC,2133,2148]
[null,EAPA,1560,1275]
[null,ECGC,2815,3307]
[null,EDPD,2731,1883]
[null,EEEC,2024,1902]
[null,EEMC,2624,2387]
[null,EFFA,2047,1878]
[null,EGJA,2403,2633]
[null,EGMA,2784,2772]
[null,EGOC,2389,1753]
[null,EHFD,1940,1420]
[null,EHLB,2320,2057]
[null,EHPA,1898,1853]
[null,EIPB,2930,2326]
[null,EJAE,2582,1836]
[null,EJIB,2257,1681]
[null,EJJA,2791,1941]
[null,EJJD,3410,2405]
[null,EJNC,2472,2067]
[null,EJPD,1219,1229]
[null,EKEB,2047,1713]
[null,EMEA,2502,1897]
[null,EMKC,2362,2042]
[null,ENAC,2011,1909]
[null,ENFB,2507,2162]
[null,ENOD,3371,2709]


Expected results:
+--+--++---+
| W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER |
+--+--++---+
| Bad cards must make. | AACD |   1889 |  2168 |
| Bad cards must make. | AAHD |   2739 |  2039 |
| Bad cards must make. | ABDA |   1717 |  1782 |
| Bad cards must make. | ACGC |   2296 |  2276 |
| Bad cards must make. | ACKC |   2443 |  1878 |
| Bad cards must make. | ACOB |   2705 |  2428 |
| Bad cards must make. | ADGB |   2242 |  2759 |
| Bad cards must make. | ADKB |   2138 |  2456 |
| Bad cards must make. | AEAD |   2914 |  2237 |
| Bad cards must make. | AEOC |   1797 |  2073 |
| Bad cards must make. | AFAC |   2058 |  2734 |
| Bad cards must make. | AFAD |   2173 |  2515 |
| Bad cards must make. | AFDC |   2309 |  2277 |

[jira] [Updated] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set

2016-03-14 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-13865:
---
Labels: tpcds-result-mismatch  (was: )

> TPCDS query 87 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13865
> URL: https://issues.apache.org/jira/browse/SPARK-13865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 87 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 47555, answer set expects 47298.
> Actual results:
> [47555]
> Expected:
> +---+
> | 1 |
> +---+
> | 47298 |
> +---+
> Query used:
> -- start query 87 in stream 0 using template query87.tpl and seed 
> QUALIFICATION
> select count(*) 
> from 
>  (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as 
> ddate1, 1 as notnull1
>from store_sales
> JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
>where
>  d_month_seq between 1200 and 1200+11
>) tmp1
>left outer join
>   (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as 
> ddate2, 1 as notnull2
>from catalog_sales
> JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp2 
>   on (tmp1.cln1 = tmp2.cln2)
>   and (tmp1.cfn1 = tmp2.cfn2)
>   and (tmp1.ddate1= tmp2.ddate2)
>left outer join
>   (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as 
> ddate3, 1 as notnull3
>from web_sales
> JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp3 
>   on (tmp1.cln1 = tmp3.cln3)
>   and (tmp1.cfn1 = tmp3.cfn3)
>   and (tmp1.ddate1= tmp3.ddate3)
> where  
> notnull2 is null and notnull3 is null  
> ;
> -- end query 87 in stream 0 using template query87.tpl



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13864) TPCDS query 74 returns wrong results compared to TPC official result set

2016-03-14 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-13864:
---
Labels: tpcds-result-mismatch  (was: )

> TPCDS query 74 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13864
> URL: https://issues.apache.org/jira/browse/SPARK-13864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 74 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> Spark SQL has the right answers but in the wrong order (and there is an 
> 'order by' in the query).
> Actual results:
> [BLEIBAAA,Paula,Wakefield]
> [DFIEBAAA,John,Gray]
> [OCLBBAAA,null,null]
> [PKBCBAAA,Andrea,White]
> [EJDL,Alice,Wright]
> [FACE,Priscilla,Miller]
> [LFKK,Ignacio,Miller]
> [LJNCBAAA,George,Gamez]
> [LIOP,Derek,Allen]
> [EADJ,Ruth,Carroll]
> [JGMM,Richard,Larson]
> [PKIK,Wendy,Horvath]
> [FJHF,Larissa,Roy]
> [EPOG,Felisha,Mendes]
> [EKJL,Aisha,Carlson]
> [HNFH,Rebecca,Wilson]
> [IBFCBAAA,Ruth,Grantham]
> [OPDL,Ann,Pence]
> [NIPL,Eric,Lawrence]
> [OCIC,Zachary,Pennington]
> [OFLC,James,Taylor]
> [GEHI,Tyler,Miller]
> [CADP,Cristobal,Thomas]
> [JIAL,Santos,Gutierrez]
> [PMMBBAAA,Paul,Jordan]
> [DIIO,David,Carroll]
> [DFKABAAA,Latoya,Craft]
> [HMOI,Grace,Henderson]
> [PPIBBAAA,Candice,Lee]
> [JONHBAAA,Warren,Orozco]
> [GNDA,Terry,Mcdowell]
> [CIJM,Elizabeth,Thomas]
> [DIJGBAAA,Ruth,Sanders]
> [NFBDBAAA,Vernice,Fernandez]
> [IDKF,Michael,Mack]
> [IMHB,Kathy,Knowles]
> [LHMC,Brooke,Nelson]
> [CFCGBAAA,Marcus,Sanders]
> [NJHCBAAA,Christopher,Schreiber]
> [PDFB,Terrance,Banks]
> [ANFA,Philip,Banks]
> [IADEBAAA,Diane,Aldridge]
> [ICHF,Linda,Mccoy]
> [CFEN,Christopher,Dawson]
> [KOJJ,Gracie,Mendoza]
> [FOJA,Don,Castillo]
> [FGPG,Albert,Wadsworth]
> [KJBK,Georgia,Scott]
> [EKFP,Annika,Chin]
> [IBAEBAAA,Sandra,Wilson]
> [MFFL,Margret,Gray]
> [KNAK,Gladys,Banks]
> [CJDI,James,Kerr]
> [OBADBAAA,Elizabeth,Burnham]
> [AMGD,Kenneth,Harlan]
> [HJLA,Audrey,Beltran]
> [AOPFBAAA,Jerry,Fields]
> [CNAGBAAA,Virginia,May]
> [HGOABAAA,Sonia,White]
> [KBCABAAA,Debra,Bell]
> [NJAG,Allen,Hood]
> [MMOBBAAA,Margaret,Smith]
> [NGDBBAAA,Carlos,Jewell]
> [FOGI,Michelle,Greene]
> [JEKFBAAA,Norma,Burkholder]
> [OCAJ,Jenna,Staton]
> [PFCL,Felicia,Neville]
> [DLHBBAAA,Henry,Bertrand]
> [DBEFBAAA,Bennie,Bowers]
> [DCKO,Robert,Gonzalez]
> [KKGE,Katie,Dunbar]
> [GFMDBAAA,Kathleen,Gibson]
> [IJEM,Charlie,Cummings]
> [KJBL,Kerry,Davis]
> [JKBN,Julie,Kern]
> [MDCA,Louann,Hamel]
> [EOAK,Molly,Benjamin]
> [IBHH,Jennifer,Ballard]
> [PJEN,Ashley,Norton]
> [KLHHBAAA,Manuel,Castaneda]
> [IMHHBAAA,Lillian,Davidson]
> [GHPBBAAA,Nick,Mendez]
> [BNBB,Irma,Smith]
> [FBAH,Michael,Williams]
> [PEHEBAAA,Edith,Molina]
> [FMHI,Emilio,Darling]
> [KAEC,Milton,Mackey]
> [OCDJ,Nina,Sanchez]
> [FGIG,Eduardo,Miller]
> [FHACBAAA,null,null]
> [HMJN,Ryan,Baptiste]
> [HHCABAAA,William,Stewart]
> Expected results:
> +--+-++
> | CUSTOMER_ID  | CUSTOMER_FIRST_NAME | CUSTOMER_LAST_NAME |
> +--+-++
> | AMGD | Kenneth | Harlan |
> | ANFA | Philip  | Banks  |
> | AOPFBAAA | Jerry   | Fields |
> | BLEIBAAA | Paula   | Wakefield  |
> | BNBB | Irma| Smith  |
> | CADP | Cristobal   | Thomas |
> | CFCGBAAA | Marcus  | Sanders|
> | CFEN | Christopher | Dawson |
> | CIJM | Elizabeth   | Thomas |
>

[jira] [Updated] (SPARK-13862) TPCDS query 49 returns wrong results compared to TPC official result set

2016-03-14 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-13862:
---
Labels: tpcds-result-mismatch  (was: )

> TPCDS query 49 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13862
> URL: https://issues.apache.org/jira/browse/SPARK-13862
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 49 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL has the right answers but in the wrong order (and there is an 
> 'order by' in the query).
> Actual results:
> [store,9797,0.8000,2,2]
> [store,12641,0.81609195402298850575,3,3]
> [store,6661,0.92207792207792207792,7,7]
> [store,13013,0.94202898550724637681,8,8]
> [store,9029,1.,10,10]
> [web,15597,0.66197183098591549296,3,3]
> [store,14925,0.96470588235294117647,9,9]
> [store,4063,1.,10,10]
> [catalog,8929,0.7625,7,7]
> [store,11589,0.82653061224489795918,6,6]
> [store,1171,0.82417582417582417582,5,5]
> [store,9471,0.7750,1,1]
> [catalog,12577,0.65591397849462365591,3,3]
> [web,97,0.90361445783132530120,9,8]
> [web,85,0.85714285714285714286,8,7]
> [catalog,361,0.74647887323943661972,5,5]
> [web,2915,0.69863013698630136986,4,4]
> [web,117,0.9250,10,9]
> [catalog,9295,0.77894736842105263158,9,9]
> [web,3305,0.7375,6,16]
> [catalog,16215,0.79069767441860465116,10,10]
> [web,7539,0.5900,1,1]
> [catalog,17543,0.57142857142857142857,1,1]
> [catalog,3411,0.71641791044776119403,4,4]
> [web,11933,0.71717171717171717172,5,5]
> [catalog,14513,0.63541667,2,2]
> [store,15839,0.81632653061224489796,4,4]
> [web,3337,0.62650602409638554217,2,2]
> [web,5299,0.92708333,11,10]
> [catalog,8189,0.74698795180722891566,6,6]
> [catalog,14869,0.77173913043478260870,8,8]
> [web,483,0.8000,7,6]
> Expected results:
> +-+---++-+---+
> | CHANNEL |  ITEM |   RETURN_RATIO | RETURN_RANK | CURRENCY_RANK |
> +-+---++-+---+
> | catalog | 17543 |  .5714285714285714 |   1 | 1 |
> | catalog | 14513 |  .63541666 |   2 | 2 |
> | catalog | 12577 |  .6559139784946236 |   3 | 3 |
> | catalog |  3411 |  .7164179104477611 |   4 | 4 |
> | catalog |   361 |  .7464788732394366 |   5 | 5 |
> | catalog |  8189 |  .7469879518072289 |   6 | 6 |
> | catalog |  8929 |  .7625 |   7 | 7 |
> | catalog | 14869 |  .7717391304347826 |   8 | 8 |
> | catalog |  9295 |  .7789473684210526 |   9 | 9 |
> | catalog | 16215 |  .7906976744186046 |  10 |10 |
> | store   |  9471 |  .7750 |   1 | 1 |
> | store   |  9797 |  .8000 |   2 | 2 |
> | store   | 12641 |  .8160919540229885 |   3 | 3 |
> | store   | 15839 |  .8163265306122448 |   4 | 4 |
> | store   |  1171 |  .8241758241758241 |   5 | 5 |
> | store   | 11589 |  .8265306122448979 |   6 | 6 |
> | store   |  6661 |  .9220779220779220 |   7 | 7 |
> | store   | 13013 |  .9420289855072463 |   8 | 8 |
> | store   | 14925 |  .9647058823529411 |   9 | 9 |
> | store   |  4063 | 1. |  10 |10 |
> | store   |  9029 | 1. |  10 |10 |
> | web |  7539 |  .5900 |   1 | 1 |
> | web |  3337 |  .6265060240963855 |   2 | 2 |
> | web | 15597 |  .6619718309859154 |   3 | 3 |
> | web |  2915 |  .6986301369863013 |   4 | 4 |
> | web | 11933 |  .7171717171717171 |   5 | 5 |
> | web |  3305 |  .7375 |   6 |16 |
> | web |   483 |  .8000 |   7 | 6 |
> | web |85 |  .8571428571428571 |   8 | 7 |
> | web |97 |  .9036144578313253 |   9 | 8 |
> | web |   117 |  .9250 |  10 | 9 |
> | web |  5299 |  .92708333 |  11 |10 |
> +-+---++-+---+
> Query used:
> -- start query 49 in stream 0 usin

[jira] [Updated] (SPARK-13861) TPCDS query 40 returns wrong results compared to TPC official result set

2016-03-14 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-13861:
---
Labels: tpcds-result-mismatch  (was: )

> TPCDS query 40 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13861
> URL: https://issues.apache.org/jira/browse/SPARK-13861
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 40 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL is missing at least one row (grep for ABBD); I believe 5 
> rows are missing in total.
> Actual results:
> [TN,AABD,0.0,-82.060899353]
> [TN,AACD,-216.54000234603882,158.0399932861328]
> [TN,AAHD,186.54999542236328,0.0]
> [TN,AALA,0.0,48.2254223633]
> [TN,ACGC,63.67999863624573,0.0]
> [TN,ACHC,102.6830517578,51.8838964844]
> [TN,ACKC,128.9235150146,44.8169482422]
> [TN,ACLD,205.43999433517456,-948.619930267334]
> [TN,ACOB,207.32000732421875,24.88389648438]
> [TN,ACPD,87.75,53.9900016784668]
> [TN,ADGB,44.310001373291016,222.4800033569336]
> [TN,ADKB,0.0,-471.8699951171875]
> [TN,AEAD,58.2400016784668,0.0]
> [TN,AEOC,19.9084741211,214.7076293945]
> [TN,AFAC,271.8199977874756,163.1699981689453]
> [TN,AFAD,2.349046325684,28.3169482422]
> [TN,AFDC,-378.0499496459961,-303.26999282836914]
> [TN,AGID,307.6099967956543,-19.29915527344]
> [TN,AHDE,80.574468689,-476.7200012207031]
> [TN,AHHA,8.27457763672,155.1276565552]
> [TN,AHJB,39.23999857902527,0.0]
> [TN,AIEC,82.3675750732,3.910858306885]
> [TN,AIEE,20.39618530273,-151.08999633789062]
> [TN,AIMC,24.46313354492,-150.330517578]
> [TN,AJAC,49.0915258789,82.084741211]
> [TN,AJCA,121.18000221252441,63.779998779296875]
> [TN,AJKB,27.94534057617,8.97267028809]
> [TN,ALBE,88.2599983215332,30.22542236328]
> [TN,ALCE,93.5245776367,92.0198092651]
> [TN,ALEC,64.179019165,15.1584741211]
> [TN,ALNB,4.19809265137,148.27000427246094]
> [TN,AMBE,28.44534057617,0.0]
> [TN,AMPB,0.0,131.92999839782715]
> [TN,ANFE,0.0,-137.3400115966797]
> [TN,AOIB,150.40999603271484,254.288058548]
> [TN,APJB,45.2745776367,334.482015991]
> [TN,APLA,50.2076293945,29.150001049041748]
> [TN,APLD,0.0,32.3838964844]
> [TN,BAPD,93.41999816894531,145.8699951171875]
> [TN,BBID,296.774577637,30.95084472656]
> [TN,BDCE,-1771.0800704956055,-54.779998779296875]
> [TN,BDDD,111.12000274658203,280.5899963378906]
> [TN,BDJA,0.0,79.5423706055]
> [TN,BEFD,0.0,3.429475479126]
> [TN,BEOD,269.838964844,297.5800061225891]
> [TN,BFMB,110.82999801635742,-941.4000930786133]
> [TN,BFNA,47.8661035156,0.0]
> [TN,BFOC,46.3415258789,83.5245776367]
> [TN,BHPC,27.378392334,77.61999893188477]
> [TN,BIDB,196.6199951171875,5.57171661377]
> [TN,BIGB,425.3399963378906,0.0]
> [TN,BIJB,209.6300048828125,0.0]
> [TN,BJFE,7.32923706055,55.1584741211]
> [TN,BKFA,0.0,138.14000129699707]
> [TN,BKMC,27.17076293945,54.970001220703125]
> [TN,BLDE,170.28999400138855,0.0]
> [TN,BNHB,58.0594277954,-337.8899841308594]
> [TN,BNID,54.41525878906,35.01504089355]
> [TN,BNLA,0.0,168.37999629974365]
> [TN,BNLD,0.0,96.4084741211]
> [TN,BNMC,202.40999698638916,49.52999830245972]
> [TN,BOCC,4.73019073486,69.83999633789062]
> [TN,BOMB,63.66999816894531,163.49000668525696]
> [TN,CAAA,121.91000366210938,0.0]
> [TN,CAAD,-1107.6099338531494,0.0]
> [TN,CAJC,115.8046594238,173.0519073486]
> [TN,CBCD,18.94534057617,226.38000106811523]
> [TN,CBFA,0.0,97.41000366210938]
> [TN,CBIA,2.14104904175,84.66000366210938]
> [TN,CBPB,95.44000244140625,26.6830517578]
> [TN,CCAB,160.43000602722168,135.8661035156]
> [TN,CCHD,0.0,121.62000274658203]
> [TN,

[jira] [Updated] (SPARK-13863) TPCDS query 66 returns wrong results compared to TPC official result set

2016-03-14 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-13863:
---
Labels: tpcds-result-mismatch  (was: )

> TPCDS query 66 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13863
> URL: https://issues.apache.org/jira/browse/SPARK-13863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 66 returns wrong results compared
> to the official result set. This is at 1GB SF (validation run).
> Aggregations are slightly off -- e.g., in the JAN_SALES column of the
> "Doors canno" row, SparkSQL returns 6355232.185385704 where 6355232.31 is
> expected.
> Actual results:
> [null,null,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,9597806.850651741,1.1121820530080795E7,8670867.81564045,8994785.945689201,1.088724806326294E7,1.4187671518377304E7,9732598.460139751,1.9798897020946026E7,2.1007842467959404E7,2.149551364927292E7,3.479566905774999E7,3.3122997954660416E7,null,null,null,null,null,null,null,null,null,null,null,null,2.191359469742E7,3.2518476414670944E7,2.48856624883976E7,2.5698343830046654E7,3.373591080598068E7,3.552703167087555E7,2.5465193481492043E7,5.362323870799959E7,5.1409986978201866E7,5.415917383586836E7,9.222704311805725E7,8.343539111531019E7]
> [Bad cards must make.,621234,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,9506753.593884468,8008140.429557085,6116769.711647987,1.1973045160133362E7,7756254.925520897,5352978.574095726,1.373399613500309E7,1.6418794411203384E7,1.7212743279764652E7,1.704270732417488E7,3.43049358570323E7,3.532416421229005E7,15.30301560102066,12.890698882477594,9.846160563729589,19.273003667109915,12.485238936569628,8.61668642427125,22.107605403121994,26.429323590150222,27.707342611261865,27.433635834765774,55.22063482847413,56.86128610521969,3.0534943928382874E7,2.4481686250203133E7,2.217871080008793E7,2.569579825610423E7,2.995490355044937E7,1.8084140250833035E7,3.0805576178061485E7,4.7156887432252884E7,5.115858869637826E7,5.5759943171424866E7,8.625354428184557E7,8.345155532035494E7]
> [Conventional childr,977787,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,8860645.460736752,1.441581376543355E7,6761497.232810497,1.1820654735879421E7,8246260.600341797,6636877.482845306,1.1434492123092413E7,2.5673812070380323E7,2.307420611785E7,2.1834582007320404E7,2.6894900596512794E7,3.357509177109933E7,9.061938296108202,14.743306840276613,6.9151024024767125,12.08919195681618,8.43359606984118,6.787651587559771,11.694256645969329,26.257060147435304,23.598398219562938,22.330611889215547,27.505888906799534,34.337838170377935,2.3836085704864502E7,3.20733132298584E7,2.503790437837982E7,2.2659895963564873E7,2.175740087420273E7,2.4451608012176514E7,2.1933001734852314E7,5.59967034604629E7,5.737188052299309E7,6.208721474336243E7,8.284991027382469E7,8.897031933202875E7]
> [Doors canno,294242,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,6355232.185385704,1.0198920296742141E7,1.0246200903741479E7,1.2209716492156029E7,8566998.262890816,8806316.75278151,9789405.6993227,1.646658496404171E7,2.6443785668474197E7,2.701604788320923E7,3.366058958298761E7,2.7462468750599384E7,21.59865751791282,34.66167405313361,34.822360178837414,41.495491779406166,29.115484067165177,29.928823053070296,33.26991285854059,55.96272783641258,89.87087386734116,91.81574310672585,114.39763726112386,93.33293258813964,2.2645142994330406E7,2.448725452685547E7,2.4925759290207863E7,3.0503655031727314E7,2.6558160276379585E7,2.0976233452690125E7,2.9895796101181984E7,5.600219855566597E7,5.348815865275085E7,7.628723580410767E7,8.248374754962921E7,8.808826726185608E7]
> [Important issues liv,138504,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,1.1748784594717264E7,1.435130566355586E7,9896470.867572784,7990874.805492401,8879247.840401173,7362383.04259038,1.0011144724414349E7,1.7741201390372872E7,2.1346976135887742E7,1.8074978020030975E7,2.967512567988676E7,3.2545325348875403E7,84.8263197793368,103.6165429414014,71.45259969078715,57.694180713137534,64.10824120892663,53.156465102743454,72.28054586448297,128.09161750110374,154.12534032149065,130.5014874662896,214.25464737398747,234.97751219369408,2.7204167203903973E7,2.598037822457385E7,1.9943398915802002E7,2.5710421112384796E7,1.948448105346489E7,2.6346611484448195E7,2.5075158296625137E7,5.409477817043829E7,4.106673223178029E7,5.454705814340496E7,7.246596285337901E7,9.277032812079096E7]
> Expected results:
> +--+---+--+---+-+---+---+--+++++
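A drift in the tenth significant digit like this is characteristic of summing
doubles in a different order than the reference implementation, since
floating-point addition is not associative. A minimal sketch of one common
workaround (an assumption on my part, not a fix taken from this report; column
names follow the TPC-DS web_sales schema): force decimal arithmetic inside the
aggregate so the sum is exact.

-- hypothetical: cast the per-row sales amount to decimal before summing
select sum(cast(ws_ext_sales_price * ws_quantity as decimal(28,2))) as jan_sales
from web_sales;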

[jira] [Commented] (SPARK-13862) TPCDS query 49 returns wrong results compared to TPC official result set

2016-03-14 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193800#comment-15193800
 ] 

JESSE CHEN commented on SPARK-13862:


tpcds-result-mismatch

> TPCDS query 49 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13862
> URL: https://issues.apache.org/jira/browse/SPARK-13862
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>
> Testing Spark SQL using TPC queries. Query 49 returns wrong results compared
> to the official result set. This is at 1GB SF (validation run).
> SparkSQL has the right rows but in the wrong order (and there is an
> 'order by' in the query).
> Actual results:
> [store,9797,0.8000,2,2]
> [store,12641,0.81609195402298850575,3,3]
> [store,6661,0.92207792207792207792,7,7]
> [store,13013,0.94202898550724637681,8,8]
> [store,9029,1.,10,10]
> [web,15597,0.66197183098591549296,3,3]
> [store,14925,0.96470588235294117647,9,9]
> [store,4063,1.,10,10]
> [catalog,8929,0.7625,7,7]
> [store,11589,0.82653061224489795918,6,6]
> [store,1171,0.82417582417582417582,5,5]
> [store,9471,0.7750,1,1]
> [catalog,12577,0.65591397849462365591,3,3]
> [web,97,0.90361445783132530120,9,8]
> [web,85,0.85714285714285714286,8,7]
> [catalog,361,0.74647887323943661972,5,5]
> [web,2915,0.69863013698630136986,4,4]
> [web,117,0.9250,10,9]
> [catalog,9295,0.77894736842105263158,9,9]
> [web,3305,0.7375,6,16]
> [catalog,16215,0.79069767441860465116,10,10]
> [web,7539,0.5900,1,1]
> [catalog,17543,0.57142857142857142857,1,1]
> [catalog,3411,0.71641791044776119403,4,4]
> [web,11933,0.71717171717171717172,5,5]
> [catalog,14513,0.63541667,2,2]
> [store,15839,0.81632653061224489796,4,4]
> [web,3337,0.62650602409638554217,2,2]
> [web,5299,0.92708333,11,10]
> [catalog,8189,0.74698795180722891566,6,6]
> [catalog,14869,0.77173913043478260870,8,8]
> [web,483,0.8000,7,6]
> Expected results:
> +---------+-------+-------------------+-------------+---------------+
> | CHANNEL |  ITEM |      RETURN_RATIO | RETURN_RANK | CURRENCY_RANK |
> +---------+-------+-------------------+-------------+---------------+
> | catalog | 17543 | .5714285714285714 |           1 |             1 |
> | catalog | 14513 |         .63541666 |           2 |             2 |
> | catalog | 12577 | .6559139784946236 |           3 |             3 |
> | catalog |  3411 | .7164179104477611 |           4 |             4 |
> | catalog |   361 | .7464788732394366 |           5 |             5 |
> | catalog |  8189 | .7469879518072289 |           6 |             6 |
> | catalog |  8929 |             .7625 |           7 |             7 |
> | catalog | 14869 | .7717391304347826 |           8 |             8 |
> | catalog |  9295 | .7789473684210526 |           9 |             9 |
> | catalog | 16215 | .7906976744186046 |          10 |            10 |
> | store   |  9471 |             .7750 |           1 |             1 |
> | store   |  9797 |             .8000 |           2 |             2 |
> | store   | 12641 | .8160919540229885 |           3 |             3 |
> | store   | 15839 | .8163265306122448 |           4 |             4 |
> | store   |  1171 | .8241758241758241 |           5 |             5 |
> | store   | 11589 | .8265306122448979 |           6 |             6 |
> | store   |  6661 | .9220779220779220 |           7 |             7 |
> | store   | 13013 | .9420289855072463 |           8 |             8 |
> | store   | 14925 | .9647058823529411 |           9 |             9 |
> | store   |  4063 |                1. |          10 |            10 |
> | store   |  9029 |                1. |          10 |            10 |
> | web     |  7539 |             .5900 |           1 |             1 |
> | web     |  3337 | .6265060240963855 |           2 |             2 |
> | web     | 15597 | .6619718309859154 |           3 |             3 |
> | web     |  2915 | .6986301369863013 |           4 |             4 |
> | web     | 11933 | .7171717171717171 |           5 |             5 |
> | web     |  3305 |             .7375 |           6 |            16 |
> | web     |   483 |             .8000 |           7 |             6 |
> | web     |    85 | .8571428571428571 |           8 |             7 |
> | web     |    97 | .9036144578313253 |           9 |             8 |
> | web     |   117 |             .9250 |          10 |             9 |
> | web     |  5299 |         .92708333 |          11 |            10 |
> +---------+-------+-------------------+-------------+---------------+
> Query used:
> -- start query 49 in stream 0 using templa
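Since the rows themselves match and only the ordering differs, a quick way to
isolate genuine value mismatches is to impose a deterministic sort on both
result sets before diffing. A minimal sketch, assuming the SparkSQL output has
been registered as a temporary view named q49_actual (a hypothetical name, not
part of this report):

-- re-sort to the official ordering (channel, then the rank columns)
-- before diffing; q49_actual is a hypothetical view over the Spark output
select channel, item, return_ratio, return_rank, currency_rank
from q49_actual
order by channel, return_rank, currency_rank, item;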

[jira] [Updated] (SPARK-13860) TPCDS query 39 returns wrong results compared to TPC official result set

2016-03-14 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-13860:
---
Labels: tpcds-result-mismatch  (was: )

> TPCDS query 39 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13860
> URL: https://issues.apache.org/jira/browse/SPARK-13860
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 39 returns wrong results compared
> to the official result set. This is at 1GB SF (validation run).
> q39a has 3 extra rows in the SparkSQL output (e.g.
> [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733]); q39b likewise has
> 3 extra rows in its output (e.g.
> [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733])
> Actual results 39a:
> [1,265,1,324.75,1.2438391781531353,1,265,2,329.0,1.0151581328149208]
> [1,363,1,499.5,1.031941572270649,1,363,2,321.0,1.1411766752007977]
> [1,679,1,373.75,1.0955498064867504,1,679,2,417.5,1.042970994259454]
> [1,695,1,450.75,1.0835888283564505,1,695,2,368.75,1.1356494125569416]
> [1,789,1,357.25,1.03450938027956,1,789,2,410.0,1.0284221852702604]
> [1,815,1,216.5,1.1702270938111008,1,815,2,150.5,1.3057281471249382]
> [1,827,1,271.75,1.1046890134130438,1,827,2,424.75,1.1653198631238286]
> [1,1041,1,382.5,1.284808399803008,1,1041,2,424.75,1.000577271456812]
> [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733]
> [1,1569,1,212.0,1.630213519639535,1,1569,2,239.25,1.2641513267800557]
> [1,1623,1,338.25,1.1285483279713715,1,1623,2,261.3,1.2717809002195564]
> [1,2581,1,448.5,1.060429041250449,1,2581,2,476.25,1.0362984739390064]
> [1,2705,1,246.25,1.0120308357959693,1,2705,2,294.7,1.0742134101583702]
> [1,3131,1,393.75,1.0037613982687346,1,3131,2,480.5,1.0669144981482768]
> [1,3291,1,374.5,1.195189833087008,1,3291,2,265.25,1.572972106948466]
> [1,3687,1,279.75,1.4260909081999698,1,3687,2,157.25,1.4534340882531784]
> [1,4955,1,495.25,1.0318296151625301,1,4955,2,322.5,1.1693842343776149]
> [1,5627,1,282.75,1.5657032366359889,1,5627,2,297.5,1.2084286841430678]
> [1,7017,1,175.5,1.0427454215644427,1,7017,2,321.3,1.0183356932936254]
> [1,7317,1,366.3,1.025466403613547,1,7317,2,378.0,1.2172513189920555]
> [1,7569,1,430.5,1.0874396852180854,1,7569,2,360.25,1.047005559314515]
> [1,7999,1,166.25,1.7924231710846223,1,7999,2,375.3,1.008092263550718]
> [1,8319,1,306.75,1.1615378040478215,1,8319,2,276.0,1.1420996385609428]
> [1,8443,1,327.75,1.256718374192724,1,8443,2,332.5,1.0044167259988928]
> [1,8583,1,319.5,1.024108893111539,1,8583,2,310.25,1.2358813775861328]
> [1,8591,1,398.0,1.1478168692042447,1,8591,2,355.75,1.0024472149348966]
> [1,8611,1,300.5,1.5191545184147954,1,8611,2,243.75,1.2342122780960432]
> [1,9081,1,367.0,1.0878932141280895,1,9081,2,435.0,1.0330530776324107]
> [1,9357,1,351.7,1.1902922622025887,1,9357,2,427.0,1.0438583026358363]
> [1,9449,1,406.25,1.0183183104803557,1,9449,2,175.0,1.0544779796296408]
> [1,9713,1,242.5,1.1035044355064203,1,9713,2,393.0,1.208474608738988]
> [1,9809,1,479.0,1.0189602512117633,1,9809,2,317.5,1.0614142074924882]
> [1,9993,1,417.75,1.0099832672435247,1,9993,2,204.5,1.552870745350107]
> [1,10127,1,239.75,1.0561770587198123,1,10127,2,359.25,1.1857980403742183]
> [1,11159,1,407.25,1.0785507154337637,1,11159,2,250.0,1.334757905639321]
> [1,11277,1,211.25,1.2615858275316627,1,11277,2,330.75,1.0808767951625093]
> [1,11937,1,344.5,1.085804026843784,1,11937,2,200.34,1.0638527063883725]
> [1,12373,1,387.75,1.1014904822941258,1,12373,2,306.0,1.0761744390394028]
> [1,12471,1,365.25,1.0607570183728479,1,12471,2,327.25,1.0547560580567852]
> [1,12625,1,279.0,1.3016560542373208,1,12625,2,443.25,1.0604958838068959]
> [1,12751,1,280.75,1.10833057888089,1,12751,2,369.3,1.3416504398884601]
> [1,12779,1,331.0,1.041690207320035,1,12779,2,359.0,1.028978056175258]
> [1,13077,1,367.7,1.345523904195734,1,13077,2,358.7,1.5132429058096555]
> [1,13191,1,260.25,1.063569632291568,1,13191,2,405.0,1.0197999172180061]
> [1,13561,1,335.25,1.2609616961776389,1,13561,2,240.0,1.0513604502245155]
> [1,13935,1,311.75,1.0399289695412326,1,13935,2,275.0,1.0367527180321774]
> [1,14687,1,358.0,1.4369356919381713,1,14687,2,187.0,1.5493631531474956]
> [1,14719,1,209.0,1.0411509639707628,1,14719,2,489.0,1.376616882800804]
> [1,15345,1,148.5,1.5295784035794024,1,15345,2,246.5,1.5087987747231526]
> [1,15427,1,482.75,1.0124238928335043,1,15427,2,333.25,1.2724770126308678]
> [1,15647,1,201.66,1.2857931876095743,1,15647,2,249.25,1.3648172990142162]
> [1,16079,1,280.5,1.2444757416128578,1,16079,2,361.25,1.0737
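The giveaway in the extra rows is the NaN in the stdev/mean (coefficient of
variation) column. One plausible mechanism (an assumption; the report itself
does not name a cause): Spark SQL's stddev yields NaN for single-sample
groups, and Spark treats NaN as larger than any other double in comparisons,
so a "cov > 1"-style filter keeps such rows, while an engine that returns NULL
for the same stddev would drop them. A minimal sketch of the semantics:

-- in Spark SQL, NaN compares greater than any other double value,
-- so a NaN coefficient of variation survives a 'cov > 1' filter
select cast('NaN' as double) > 1.0;   -- returns true in Spark SQL
select cast('NaN' as double) is null; -- returns false: NaN is not NULL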

[jira] [Updated] (SPARK-13859) TPCDS query 38 returns wrong results compared to TPC official result set

2016-03-14 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-13859:
---
Labels: tpcds-result-mismatch  (was: )

> TPCDS query 38 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13859
> URL: https://issues.apache.org/jira/browse/SPARK-13859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 38 returns wrong results compared
> to the official result set. This is at 1GB SF (validation run).
> SparkSQL returns a count of 0; the answer set reports 107.
> Actual results:
> [0]
> Expected:
> +-----+
> |   1 |
> +-----+
> | 107 |
> +-----+
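For context: query 38 counts customers who appear in all three sales channels
within the same period, and the official template expresses this as an
INTERSECT of three DISTINCT sets, roughly as sketched below (table and column
names follow the TPC-DS schema; the exact template text is not included in
this report). A count of 0 from a join-based rewrite is often a NULL-handling
symptom, since NULL = NULL is not true in a join predicate, whereas INTERSECT
treats NULLs as equal.

select count(*)
from (
      select distinct c_last_name, c_first_name, d_date
      from store_sales, date_dim, customer
      where store_sales.ss_sold_date_sk = date_dim.d_date_sk
        and store_sales.ss_customer_sk = customer.c_customer_sk
        and d_month_seq between 1200 and 1200+11
      intersect
      select distinct c_last_name, c_first_name, d_date
      from catalog_sales, date_dim, customer
      where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
        and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
        and d_month_seq between 1200 and 1200+11
      intersect
      select distinct c_last_name, c_first_name, d_date
      from web_sales, date_dim, customer
      where web_sales.ws_sold_date_sk = date_dim.d_date_sk
        and web_sales.ws_bill_customer_sk = customer.c_customer_sk
        and d_month_seq between 1200 and 1200+11
     ) hot_cust;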






[jira] [Updated] (SPARK-13858) TPCDS query 21 returns wrong results compared to TPC official result set

2016-03-14 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-13858:
---
Labels: tpcds-result-mismatch  (was: )

> TPCDS query 21 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13858
> URL: https://issues.apache.org/jira/browse/SPARK-13858
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 21 returns wrong results compared
> to the official result set. This is at 1GB SF (validation run).
> SparkSQL is missing at least one row (grep for ABDA); I believe 2
> other rows are missing as well.
> Actual results:
> [null,AABD,2565,1922]
> [null,AAHD,2956,2052]
> [null,AALA,2042,1793]
> [null,ACGC,2373,1771]
> [null,ACKC,2321,1856]
> [null,ACOB,1504,1397]
> [null,ADKB,1820,2163]
> [null,AEAD,2631,1965]
> [null,AEOC,1659,1798]
> [null,AFAC,1965,1705]
> [null,AFAD,1769,1313]
> [null,AHDE,2700,1985]
> [null,AHHA,1578,1082]
> [null,AIEC,1756,1804]
> [null,AIMC,3603,2951]
> [null,AJAC,2109,1989]
> [null,AJKB,2573,3540]
> [null,ALBE,3458,2992]
> [null,ALCE,1720,1810]
> [null,ALEC,2569,1946]
> [null,ALNB,2552,1750]
> [null,ANFE,2022,2269]
> [null,AOIB,2982,2540]
> [null,APJB,2344,2593]
> [null,BAPD,2182,2787]
> [null,BDCE,2844,2069]
> [null,BDDD,2417,2537]
> [null,BDJA,1584,1666]
> [null,BEOD,2141,2649]
> [null,BFCC,2745,2020]
> [null,BFMB,1642,1364]
> [null,BHPC,1923,1780]
> [null,BIDB,1956,2836]
> [null,BIGB,2023,2344]
> [null,BIJB,1977,2728]
> [null,BJFE,1891,2390]
> [null,BLDE,1983,1797]
> [null,BNID,2485,2324]
> [null,BNLD,2385,2786]
> [null,BOMB,2291,2092]
> [null,CAAA,2233,2560]
> [null,CBCD,1540,2012]
> [null,CBIA,2394,2122]
> [null,CBPB,1790,1661]
> [null,CCMD,2654,2691]
> [null,CDBC,1804,2072]
> [null,CFEA,1941,1567]
> [null,CGFD,2123,2265]
> [null,CHPC,2933,2174]
> [null,CIGD,2618,2399]
> [null,CJCB,2728,2367]
> [null,CJLA,1350,1732]
> [null,CLAE,2578,2329]
> [null,CLGA,1842,1588]
> [null,CLLB,3418,2657]
> [null,CLOB,3115,2560]
> [null,CMAD,1991,2243]
> [null,CMJA,1261,1855]
> [null,CMLA,3288,2753]
> [null,CMPD,1320,1676]
> [null,CNGB,2340,2118]
> [null,CNHD,3519,3348]
> [null,CNPC,2561,1948]
> [null,DCPC,2664,2627]
> [null,DDHA,1313,1926]
> [null,DDND,1109,835]
> [null,DEAA,2141,1847]
> [null,DEJA,3142,2723]
> [null,DFKB,1470,1650]
> [null,DGCC,2113,2331]
> [null,DGFC,2201,2928]
> [null,DHPA,2467,2133]
> [null,DMBA,3085,2087]
> [null,DPAB,3494,3081]
> [null,EAEC,2133,2148]
> [null,EAPA,1560,1275]
> [null,ECGC,2815,3307]
> [null,EDPD,2731,1883]
> [null,EEEC,2024,1902]
> [null,EEMC,2624,2387]
> [null,EFFA,2047,1878]
> [null,EGJA,2403,2633]
> [null,EGMA,2784,2772]
> [null,EGOC,2389,1753]
> [null,EHFD,1940,1420]
> [null,EHLB,2320,2057]
> [null,EHPA,1898,1853]
> [null,EIPB,2930,2326]
> [null,EJAE,2582,1836]
> [null,EJIB,2257,1681]
> [null,EJJA,2791,1941]
> [null,EJJD,3410,2405]
> [null,EJNC,2472,2067]
> [null,EJPD,1219,1229]
> [null,EKEB,2047,1713]
> [null,EMEA,2502,1897]
> [null,EMKC,2362,2042]
> [null,ENAC,2011,1909]
> [null,ENFB,2507,2162]
> [null,ENOD,3371,2709]
> Expected results:
> +----------------------+-----------+------------+-----------+
> | W_WAREHOUSE_NAME     | I_ITEM_ID | INV_BEFORE | INV_AFTER |
> +----------------------+-----------+------------+-----------+
> | Bad cards must make. | AACD      |       1889 |      2168 |
> | Bad cards must make. | AAHD      |       2739 |      2039 |
> | Bad cards must make. | ABDA      |       1717 |  
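Note that W_WAREHOUSE_NAME is null in every SparkSQL row while the official
set carries warehouse names, so the warehouse join itself is worth checking
before comparing row counts. A minimal probe, a hypothetical check rather than
anything from the report (column names per the TPC-DS schema):

-- confirm the warehouse dimension joins at all; a zero or reduced count here
-- would explain the null names in the query 21 output
select count(*)
from inventory
join warehouse on inventory.inv_warehouse_sk = warehouse.w_warehouse_sk;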

[jira] [Updated] (SPARK-13863) TPCDS query 66 returns wrong results compared to TPC official result set

2016-03-14 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-13863:
---
Description: 
Testing Spark SQL using TPC queries. Query 66 returns wrong results compared to
the official result set. This is at 1GB SF (validation run).

Aggregations are slightly off -- e.g., in the JAN_SALES column of the
"Doors canno" row, SparkSQL returns 6355232.185385704 where 6355232.31 is
expected.

Actual results:
[null,null,Fairview,Williamson County,TN,United 
States,DHL,BARIAN,2001,9597806.850651741,1.1121820530080795E7,8670867.81564045,8994785.945689201,1.088724806326294E7,1.4187671518377304E7,9732598.460139751,1.9798897020946026E7,2.1007842467959404E7,2.149551364927292E7,3.479566905774999E7,3.3122997954660416E7,null,null,null,null,null,null,null,null,null,null,null,null,2.191359469742E7,3.2518476414670944E7,2.48856624883976E7,2.5698343830046654E7,3.373591080598068E7,3.552703167087555E7,2.5465193481492043E7,5.362323870799959E7,5.1409986978201866E7,5.415917383586836E7,9.222704311805725E7,8.343539111531019E7]
[Bad cards must make.,621234,Fairview,Williamson County,TN,United 
States,DHL,BARIAN,2001,9506753.593884468,8008140.429557085,6116769.711647987,1.1973045160133362E7,7756254.925520897,5352978.574095726,1.373399613500309E7,1.6418794411203384E7,1.7212743279764652E7,1.704270732417488E7,3.43049358570323E7,3.532416421229005E7,15.30301560102066,12.890698882477594,9.846160563729589,19.273003667109915,12.485238936569628,8.61668642427125,22.107605403121994,26.429323590150222,27.707342611261865,27.433635834765774,55.22063482847413,56.86128610521969,3.0534943928382874E7,2.4481686250203133E7,2.217871080008793E7,2.569579825610423E7,2.995490355044937E7,1.8084140250833035E7,3.0805576178061485E7,4.7156887432252884E7,5.115858869637826E7,5.5759943171424866E7,8.625354428184557E7,8.345155532035494E7]
[Conventional childr,977787,Fairview,Williamson County,TN,United 
States,DHL,BARIAN,2001,8860645.460736752,1.441581376543355E7,6761497.232810497,1.1820654735879421E7,8246260.600341797,6636877.482845306,1.1434492123092413E7,2.5673812070380323E7,2.307420611785E7,2.1834582007320404E7,2.6894900596512794E7,3.357509177109933E7,9.061938296108202,14.743306840276613,6.9151024024767125,12.08919195681618,8.43359606984118,6.787651587559771,11.694256645969329,26.257060147435304,23.598398219562938,22.330611889215547,27.505888906799534,34.337838170377935,2.3836085704864502E7,3.20733132298584E7,2.503790437837982E7,2.2659895963564873E7,2.175740087420273E7,2.4451608012176514E7,2.1933001734852314E7,5.59967034604629E7,5.737188052299309E7,6.208721474336243E7,8.284991027382469E7,8.897031933202875E7]
[Doors canno,294242,Fairview,Williamson County,TN,United 
States,DHL,BARIAN,2001,6355232.185385704,1.0198920296742141E7,1.0246200903741479E7,1.2209716492156029E7,8566998.262890816,8806316.75278151,9789405.6993227,1.646658496404171E7,2.6443785668474197E7,2.701604788320923E7,3.366058958298761E7,2.7462468750599384E7,21.59865751791282,34.66167405313361,34.822360178837414,41.495491779406166,29.115484067165177,29.928823053070296,33.26991285854059,55.96272783641258,89.87087386734116,91.81574310672585,114.39763726112386,93.33293258813964,2.2645142994330406E7,2.448725452685547E7,2.4925759290207863E7,3.0503655031727314E7,2.6558160276379585E7,2.0976233452690125E7,2.9895796101181984E7,5.600219855566597E7,5.348815865275085E7,7.628723580410767E7,8.248374754962921E7,8.808826726185608E7]
[Important issues liv,138504,Fairview,Williamson County,TN,United 
States,DHL,BARIAN,2001,1.1748784594717264E7,1.435130566355586E7,9896470.867572784,7990874.805492401,8879247.840401173,7362383.04259038,1.0011144724414349E7,1.7741201390372872E7,2.1346976135887742E7,1.8074978020030975E7,2.967512567988676E7,3.2545325348875403E7,84.8263197793368,103.6165429414014,71.45259969078715,57.694180713137534,64.10824120892663,53.156465102743454,72.28054586448297,128.09161750110374,154.12534032149065,130.5014874662896,214.25464737398747,234.97751219369408,2.7204167203903973E7,2.598037822457385E7,1.9943398915802002E7,2.5710421112384796E7,1.948448105346489E7,2.6346611484448195E7,2.5075158296625137E7,5.409477817043829E7,4.106673223178029E7,5.454705814340496E7,7.246596285337901E7,9.277032812079096E7]

Expected results:
+--+---+--+---+-+---+---+--+++++++++++++---+---+---+---+---+---+---+---+---+---+---+---++++++---

[jira] [Updated] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set

2016-03-14 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-13865:
---
Description: 
Testing Spark SQL using TPC queries. Query 87 returns wrong results compared to
the official result set. This is at 1GB SF (validation run).

SparkSQL returns a count of 47555; the answer set expects 47298.

Actual results:
[47555]


Expected:
+-------+
|     1 |
+-------+
| 47298 |
+-------+

Query used:
-- start query 87 in stream 0 using template query87.tpl and seed QUALIFICATION
select count(*)
from
   (select distinct c_last_name as cln1, c_first_name as cfn1,
           d_date as ddate1, 1 as notnull1
    from store_sales
    JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
    JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
    where d_month_seq between 1200 and 1200+11
   ) tmp1
   left outer join
   (select distinct c_last_name as cln2, c_first_name as cfn2,
           d_date as ddate2, 1 as notnull2
    from catalog_sales
    JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
    JOIN customer ON catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
    where d_month_seq between 1200 and 1200+11
   ) tmp2
     on (tmp1.cln1 = tmp2.cln2)
    and (tmp1.cfn1 = tmp2.cfn2)
    and (tmp1.ddate1 = tmp2.ddate2)
   left outer join
   (select distinct c_last_name as cln3, c_first_name as cfn3,
           d_date as ddate3, 1 as notnull3
    from web_sales
    JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
    JOIN customer ON web_sales.ws_bill_customer_sk = customer.c_customer_sk
    where d_month_seq between 1200 and 1200+11
   ) tmp3
     on (tmp1.cln1 = tmp3.cln3)
    and (tmp1.cfn1 = tmp3.cfn3)
    and (tmp1.ddate1 = tmp3.ddate3)
where
  notnull2 is null and notnull3 is null
;
-- end query 87 in stream 0 using template query87.tpl
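One plausible source of the 47555-vs-47298 gap (an assumption; the report does
not identify a cause): the official query 87 template is written with EXCEPT,
under which NULL names compare as equal, while the left-outer-join rewrite
above can never match a row whose c_last_name or c_first_name is NULL
(NULL = NULL is not true in a join predicate), so such rows survive both
anti-joins and inflate the count. A sketch of the set-based form for
comparison:

select count(*)
from ((select distinct c_last_name, c_first_name, d_date
       from store_sales, date_dim, customer
       where ss_sold_date_sk = d_date_sk
         and ss_customer_sk = c_customer_sk
         and d_month_seq between 1200 and 1200+11)
      except
      (select distinct c_last_name, c_first_name, d_date
       from catalog_sales, date_dim, customer
       where cs_sold_date_sk = d_date_sk
         and cs_bill_customer_sk = c_customer_sk
         and d_month_seq between 1200 and 1200+11)
      except
      (select distinct c_last_name, c_first_name, d_date
       from web_sales, date_dim, customer
       where ws_sold_date_sk = d_date_sk
         and ws_bill_customer_sk = c_customer_sk
         and d_month_seq between 1200 and 1200+11)
     ) cool_cust;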



  was:
Testing Spark SQL using TPC queries. Query 74 returns wrong results compared to
the official result set. This is at 1GB SF (validation run).

Spark SQL has the right rows but in the wrong order (and there is an 'order by'
in the query).

Actual results:
[BLEIBAAA,Paula,Wakefield]
[DFIEBAAA,John,Gray]
[OCLBBAAA,null,null]
[PKBCBAAA,Andrea,White]
[EJDL,Alice,Wright]
[FACE,Priscilla,Miller]
[LFKK,Ignacio,Miller]
[LJNCBAAA,George,Gamez]
[LIOP,Derek,Allen]
[EADJ,Ruth,Carroll]
[JGMM,Richard,Larson]
[PKIK,Wendy,Horvath]
[FJHF,Larissa,Roy]
[EPOG,Felisha,Mendes]
[EKJL,Aisha,Carlson]
[HNFH,Rebecca,Wilson]
[IBFCBAAA,Ruth,Grantham]
[OPDL,Ann,Pence]
[NIPL,Eric,Lawrence]
[OCIC,Zachary,Pennington]
[OFLC,James,Taylor]
[GEHI,Tyler,Miller]
[CADP,Cristobal,Thomas]
[JIAL,Santos,Gutierrez]
[PMMBBAAA,Paul,Jordan]
[DIIO,David,Carroll]
[DFKABAAA,Latoya,Craft]
[HMOI,Grace,Henderson]
[PPIBBAAA,Candice,Lee]
[JONHBAAA,Warren,Orozco]
[GNDA,Terry,Mcdowell]
[CIJM,Elizabeth,Thomas]
[DIJGBAAA,Ruth,Sanders]
[NFBDBAAA,Vernice,Fernandez]
[IDKF,Michael,Mack]
[IMHB,Kathy,Knowles]
[LHMC,Brooke,Nelson]
[CFCGBAAA,Marcus,Sanders]
[NJHCBAAA,Christopher,Schreiber]
[PDFB,Terrance,Banks]
[ANFA,Philip,Banks]
[IADEBAAA,Diane,Aldridge]
[ICHF,Linda,Mccoy]
[CFEN,Christopher,Dawson]
[KOJJ,Gracie,Mendoza]
[FOJA,Don,Castillo]
[FGPG,Albert,Wadsworth]
[KJBK,Georgia,Scott]
[EKFP,Annika,Chin]
[IBAEBAAA,Sandra,Wilson]
[MFFL,Margret,Gray]
[KNAK,Gladys,Banks]
[CJDI,James,Kerr]
[OBADBAAA,Elizabeth,Burnham]
[AMGD,Kenneth,Harlan]
[HJLA,Audrey,Beltran]
[AOPFBAAA,Jerry,Fields]
[CNAGBAAA,Virginia,May]
[HGOABAAA,Sonia,White]
[KBCABAAA,Debra,Bell]
[NJAG,Allen,Hood]
[MMOBBAAA,Margaret,Smith]
[NGDBBAAA,Carlos,Jewell]
[FOGI,Michelle,Greene]
[JEKFBAAA,Norma,Burkholder]
[OCAJ,Jenna,Staton]
[PFCL,Felicia,Neville]
[DLHBBAAA,Henry,Bertrand]
[DBEFBAAA,Bennie,Bowers]
[DCKO,Robert,Gonzalez]
[KKGE,Katie,Dunbar]
[GFMDBAAA,Kathleen,Gibson]
[IJEM,Charlie,Cummings]
[KJBL,Kerry,Davis]
[JKBN,Julie,Kern]
[MDCA,Louann,Hamel]
[EOAK,Molly,Benjamin]
[IBHH,Jennifer,Ballard]
[PJEN,Ashley,Norton]
[KLHHBAAA,Manuel,Castaneda]
[IMHHBAAA,Lillian,Davidson]
[GHPBBAAA,N

[jira] [Created] (SPARK-13863) TPCDS query 66 returns wrong results compared to TPC official result set

2016-03-14 Thread JESSE CHEN (JIRA)
JESSE CHEN created SPARK-13863:
--

 Summary: TPCDS query 66 returns wrong results compared to TPC 
official result set 
 Key: SPARK-13863
 URL: https://issues.apache.org/jira/browse/SPARK-13863
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: JESSE CHEN


Testing Spark SQL using TPC queries. Query 49 returns wrong results compared to
the official result set. This is at 1GB SF (validation run).

SparkSQL has the right rows but in the wrong order (and there is an 'order by'
in the query).

Actual results:
[store,9797,0.8000,2,2]
[store,12641,0.81609195402298850575,3,3]
[store,6661,0.92207792207792207792,7,7]
[store,13013,0.94202898550724637681,8,8]
[store,9029,1.,10,10]
[web,15597,0.66197183098591549296,3,3]
[store,14925,0.96470588235294117647,9,9]
[store,4063,1.,10,10]
[catalog,8929,0.7625,7,7]
[store,11589,0.82653061224489795918,6,6]
[store,1171,0.82417582417582417582,5,5]
[store,9471,0.7750,1,1]
[catalog,12577,0.65591397849462365591,3,3]
[web,97,0.90361445783132530120,9,8]
[web,85,0.85714285714285714286,8,7]
[catalog,361,0.74647887323943661972,5,5]
[web,2915,0.69863013698630136986,4,4]
[web,117,0.9250,10,9]
[catalog,9295,0.77894736842105263158,9,9]
[web,3305,0.7375,6,16]
[catalog,16215,0.79069767441860465116,10,10]
[web,7539,0.5900,1,1]
[catalog,17543,0.57142857142857142857,1,1]
[catalog,3411,0.71641791044776119403,4,4]
[web,11933,0.71717171717171717172,5,5]
[catalog,14513,0.63541667,2,2]
[store,15839,0.81632653061224489796,4,4]
[web,3337,0.62650602409638554217,2,2]
[web,5299,0.92708333,11,10]
[catalog,8189,0.74698795180722891566,6,6]
[catalog,14869,0.77173913043478260870,8,8]
[web,483,0.8000,7,6]


Expected results:
+---------+-------+-------------------+-------------+---------------+
| CHANNEL |  ITEM |      RETURN_RATIO | RETURN_RANK | CURRENCY_RANK |
+---------+-------+-------------------+-------------+---------------+
| catalog | 17543 | .5714285714285714 |           1 |             1 |
| catalog | 14513 |         .63541666 |           2 |             2 |
| catalog | 12577 | .6559139784946236 |           3 |             3 |
| catalog |  3411 | .7164179104477611 |           4 |             4 |
| catalog |   361 | .7464788732394366 |           5 |             5 |
| catalog |  8189 | .7469879518072289 |           6 |             6 |
| catalog |  8929 |             .7625 |           7 |             7 |
| catalog | 14869 | .7717391304347826 |           8 |             8 |
| catalog |  9295 | .7789473684210526 |           9 |             9 |
| catalog | 16215 | .7906976744186046 |          10 |            10 |
| store   |  9471 |             .7750 |           1 |             1 |
| store   |  9797 |             .8000 |           2 |             2 |
| store   | 12641 | .8160919540229885 |           3 |             3 |
| store   | 15839 | .8163265306122448 |           4 |             4 |
| store   |  1171 | .8241758241758241 |           5 |             5 |
| store   | 11589 | .8265306122448979 |           6 |             6 |
| store   |  6661 | .9220779220779220 |           7 |             7 |
| store   | 13013 | .9420289855072463 |           8 |             8 |
| store   | 14925 | .9647058823529411 |           9 |             9 |
| store   |  4063 |                1. |          10 |            10 |
| store   |  9029 |                1. |          10 |            10 |
| web     |  7539 |             .5900 |           1 |             1 |
| web     |  3337 | .6265060240963855 |           2 |             2 |
| web     | 15597 | .6619718309859154 |           3 |             3 |
| web     |  2915 | .6986301369863013 |           4 |             4 |
| web     | 11933 | .7171717171717171 |           5 |             5 |
| web     |  3305 |             .7375 |           6 |            16 |
| web     |   483 |             .8000 |           7 |             6 |
| web     |    85 | .8571428571428571 |           8 |             7 |
| web     |    97 | .9036144578313253 |           9 |             8 |
| web     |   117 |             .9250 |          10 |             9 |
| web     |  5299 |         .92708333 |          11 |            10 |
+---------+-------+-------------------+-------------+---------------+

Query used:
-- start query 49 in stream 0 using template query49.tpl and seed QUALIFICATION
  select  
 'web' as channel
 ,web.item
 ,web.return_ratio
 ,web.return_rank
 ,web.currency_rank
 from (
select 
 item
,return_ratio
,currency_ratio
,rank() over (order by return_ratio) as return_rank
,rank() over (order by currency_ratio) as currency_rank
from
(   select ws.ws_item_sk as item
