[jira] [Commented] (SPARK-17122) Failed to drop database when use the database in Spark 2.0

2016-08-18 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15425995#comment-15425995
 ] 

Yi Zhou commented on SPARK-17122:
-

Thanks [~dongjoon] for the clarification.

> Failed to drop database when use the database in Spark 2.0
> --
>
> Key: SPARK-17122
> URL: https://issues.apache.org/jira/browse/SPARK-17122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>
> The case below is broken in Spark 2.0 but runs successfully in Spark 1.6; it 
> also runs successfully in Hive. Please see the reproduction below for details.
> Spark-SQL CLI:
> /usr/lib/spark/bin/spark-sql
> spark-sql> use test_db;
> spark-sql> DROP DATABASE IF EXISTS test_db CASCADE;
> 16/08/10 15:13:35 INFO execution.SparkSqlParser: Parsing command: DROP 
> DATABASE IF EXISTS test_db CASCADE
> Error in query: Can not drop current database `test_db`;
> Hive CLI:
> /usr/bin/hive
> hive> use test_db;
> OK
> hive> DROP DATABASE IF EXISTS test_db CASCADE;
> OK
> Time taken: 0.116 seconds






[jira] [Commented] (SPARK-17122) Failed to drop database when use the database in Spark 2.0

2016-08-17 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15425926#comment-15425926
 ] 

Yi Zhou commented on SPARK-17122:
-

Thanks [~dongjoon] for your quick response. The problem is that this breaks the 
original behavior and confuses users upgrading from 1.x to 2.0. Also, how do we 
keep compatibility with Hive? This operation passes in Hive but fails in 
Spark 2.0.

> Failed to drop database when use the database in Spark 2.0
> --
>
> Key: SPARK-17122
> URL: https://issues.apache.org/jira/browse/SPARK-17122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>
> The case below is broken in Spark 2.0 but runs successfully in Spark 1.6; it 
> also runs successfully in Hive. Please see the reproduction below for details.
> Spark-SQL CLI:
> /usr/lib/spark/bin/spark-sql
> spark-sql> use test_db;
> spark-sql> DROP DATABASE IF EXISTS test_db CASCADE;
> 16/08/10 15:13:35 INFO execution.SparkSqlParser: Parsing command: DROP 
> DATABASE IF EXISTS test_db CASCADE
> Error in query: Can not drop current database `test_db`;
> Hive CLI:
> /usr/bin/hive
> hive> use test_db;
> OK
> hive> DROP DATABASE IF EXISTS test_db CASCADE;
> OK
> Time taken: 0.116 seconds






[jira] [Created] (SPARK-17122) Failed to drop database when use the database in Spark 2.0

2016-08-17 Thread Yi Zhou (JIRA)
Yi Zhou created SPARK-17122:
---

 Summary: Failed to drop database when use the database in Spark 2.0
 Key: SPARK-17122
 URL: https://issues.apache.org/jira/browse/SPARK-17122
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Yi Zhou


The case below is broken in Spark 2.0 but runs successfully in Spark 1.6; it 
also runs successfully in Hive. Please see the reproduction below for details.

Spark-SQL CLI:
/usr/lib/spark/bin/spark-sql
spark-sql> use test_db;
spark-sql> DROP DATABASE IF EXISTS test_db CASCADE;
16/08/10 15:13:35 INFO execution.SparkSqlParser: Parsing command: DROP DATABASE 
IF EXISTS test_db CASCADE
Error in query: Can not drop current database `test_db`;

Hive CLI:
/usr/bin/hive
hive> use test_db;
OK
hive> DROP DATABASE IF EXISTS test_db CASCADE;
OK
Time taken: 0.116 seconds
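
A possible interim workaround (a minimal PySpark sketch, assuming a Spark 2.0 
session built with Hive support; not an official recommendation) is to switch 
the current database to another one, for example 'default', before issuing the 
DROP:

{code}
# Hypothetical sketch: Spark 2.0 refuses to drop the current database, so step
# away from it first. Session and database names follow the reproduction above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("USE test_db")                              # test_db becomes the current database
spark.sql("USE default")                              # switch away before dropping
spark.sql("DROP DATABASE IF EXISTS test_db CASCADE")  # the drop should now be allowed
{code}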






[jira] [Commented] (SPARK-16515) [SPARK][SQL] transformation script got failure for python script

2016-07-12 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374234#comment-15374234
 ] 

Yi Zhou commented on SPARK-16515:
-

Thank you for your quick attention to this issue.

> [SPARK][SQL] transformation script got failure for python script
> 
>
> Key: SPARK-16515
> URL: https://issues.apache.org/jira/browse/SPARK-16515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> Running the SQL below fails in the Python transformation script with the 
> error message shown below.
> Query SQL:
> {code}
> CREATE VIEW q02_spark_sql_engine_validation_power_test_0_temp AS
> SELECT DISTINCT
>   sessionid,
>   wcs_item_sk
> FROM
> (
>   FROM
>   (
> SELECT
>   wcs_user_sk,
>   wcs_item_sk,
>   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
> FROM web_clickstreams
> WHERE wcs_item_sk IS NOT NULL
> AND   wcs_user_sk IS NOT NULL
> DISTRIBUTE BY wcs_user_sk
> SORT BY
>   wcs_user_sk,
>   tstamp_inSec -- "sessionize" reducer script requires the cluster by uid 
> and sort by tstamp
>   ) clicksAnWebPageType
>   REDUCE
> wcs_user_sk,
> tstamp_inSec,
> wcs_item_sk
>   USING 'python q2-sessionize.py 3600'
>   AS (
> wcs_item_sk BIGINT,
> sessionid STRING)
> ) q02_tmp_sessionize
> CLUSTER BY sessionid
> {code}
> Error Message:
> {code}
> 16/07/06 16:59:02 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 157.0 
> (TID 171, hw-node5): org.apache.spark.SparkException: Subprocess exited with 
> status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:192)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Subprocess exited with status 1. 
> Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:181)
>   ... 14 more
> 16/07/06 16:59:02 INFO scheduler.TaskSetManager: Lost task 7.0 in stage 157.0 
> (TID 173) on executor hw-node5: org.apache.spark.SparkException (Subprocess 
> exited with status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
> ) [duplicate 1]
> {code}






[jira] [Created] (SPARK-16515) [SPARK][SQL] transformation script got failure for python script

2016-07-12 Thread Yi Zhou (JIRA)
Yi Zhou created SPARK-16515:
---

 Summary: [SPARK][SQL] transformation script got failure for python 
script
 Key: SPARK-16515
 URL: https://issues.apache.org/jira/browse/SPARK-16515
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Yi Zhou
Priority: Critical


Running the SQL below fails in the Python transformation script with the error 
message shown below.
Query SQL:
{code}
CREATE VIEW q02_spark_sql_engine_validation_power_test_0_temp AS
SELECT DISTINCT
  sessionid,
  wcs_item_sk
FROM
(
  FROM
  (
SELECT
  wcs_user_sk,
  wcs_item_sk,
  (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
FROM web_clickstreams
WHERE wcs_item_sk IS NOT NULL
AND   wcs_user_sk IS NOT NULL
DISTRIBUTE BY wcs_user_sk
SORT BY
  wcs_user_sk,
  tstamp_inSec -- "sessionize" reducer script requires the cluster by uid 
and sort by tstamp
  ) clicksAnWebPageType
  REDUCE
wcs_user_sk,
tstamp_inSec,
wcs_item_sk
  USING 'python q2-sessionize.py 3600'
  AS (
wcs_item_sk BIGINT,
sessionid STRING)
) q02_tmp_sessionize
CLUSTER BY sessionid
{code}

Error Message:
{code}
16/07/06 16:59:02 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 157.0 
(TID 171, hw-node5): org.apache.spark.SparkException: Subprocess exited with 
status 1. Error: Traceback (most recent call last):
  File "q2-sessionize.py", line 49, in 
user_sk, tstamp_str, item_sk  = line.strip().split("\t")
ValueError: too many values to unpack
at 
org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
at 
org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:192)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Subprocess exited with status 1. 
Error: Traceback (most recent call last):
  File "q2-sessionize.py", line 49, in 
user_sk, tstamp_str, item_sk  = line.strip().split("\t")
ValueError: too many values to unpack
at 
org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
at 
org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:181)
... 14 more

16/07/06 16:59:02 INFO scheduler.TaskSetManager: Lost task 7.0 in stage 157.0 
(TID 173) on executor hw-node5: org.apache.spark.SparkException (Subprocess 
exited with status 1. Error: Traceback (most recent call last):
  File "q2-sessionize.py", line 49, in 
user_sk, tstamp_str, item_sk  = line.strip().split("\t")
ValueError: too many values to unpack
) [duplicate 1]
{code}
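
The traceback shows the reducer assumes exactly three tab-separated fields per 
input row. While the Spark-side difference is investigated, a script can parse 
its input defensively; the sketch below is hypothetical (it is not the actual 
q2-sessionize.py) and only works around the symptom, not the underlying issue:

{code}
# Hypothetical defensive-parsing sketch for a TRANSFORM/REDUCE script.
# It skips rows that do not carry at least three tab-separated fields and
# ignores unexpected extra fields instead of raising
# "ValueError: too many values to unpack". The emitted session id format is
# a placeholder, not the benchmark's real sessionization logic.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        continue  # skip malformed rows
    user_sk, tstamp_str, item_sk = fields[:3]
    # ... real sessionization logic would go here ...
    print("\t".join([item_sk, "%s_%s" % (user_sk, tstamp_str)]))
{code}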






[jira] [Created] (SPARK-16415) [Spark][SQL] - Failed to create table due to catalog string error

2016-07-07 Thread Yi Zhou (JIRA)
Yi Zhou created SPARK-16415:
---

 Summary: [Spark][SQL] - Failed to create table due to catalog 
string error
 Key: SPARK-16415
 URL: https://issues.apache.org/jira/browse/SPARK-16415
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Yi Zhou
Priority: Critical


When creating tables with the schema below, Spark SQL errors out on the struct 
type string.

SQL:
{code}
CREATE EXTERNAL TABLE date_dim_temporary
  ( d_date_sk             bigint  --not null
  , d_date_id             string  --not null
  , d_date                string
  , d_month_seq           int
  , d_week_seq            int
  , d_quarter_seq         int
  , d_year                int
  , d_dow                 int
  , d_moy                 int
  , d_dom                 int
  , d_qoy                 int
  , d_fy_year             int
  , d_fy_quarter_seq      int
  , d_fy_week_seq         int
  , d_day_name            string
  , d_quarter_name        string
  , d_holiday             string
  , d_weekend             string
  , d_following_holiday   string
  , d_first_dom           int
  , d_last_dom            int
  , d_same_day_ly         int
  , d_same_day_lq         int
  , d_current_day         string
  , d_current_week        string
  , d_current_month       string
  , d_current_quarter     string
  , d_current_year        string
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
  STORED AS TEXTFILE LOCATION '/user/root/benchmarks/test/data/date_dim'

CREATE TABLE date_dim
STORED AS ORC
AS
SELECT * FROM date_dim_temporary
{code}

Error Message:
{code}
16/07/05 23:38:43 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 198.0 
(TID 677, hw-node5): java.lang.IllegalArgumentException: Error: : expected at 
the position 400 of 
'struct' but ' ' is found.
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:360)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:483)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfoFromTypeString(TypeInfoUtils.java:770)
at 
org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:184)
at 
org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:220)
at 
org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:93)
at 
org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:130)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:246)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}
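
For comparison, the failing CTAS can also be written with the DataFrame writer 
API; this is only a sketch (whether it goes through the same ORC serializer 
path, and therefore hits the same type-string parsing error, has not been 
verified here):

{code}
# Hypothetical PySpark sketch: CREATE TABLE date_dim STORED AS ORC AS SELECT
# expressed via the DataFrame writer, shown for comparison only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

(spark.table("date_dim_temporary")
      .write
      .format("orc")
      .mode("overwrite")
      .saveAsTable("date_dim"))
{code}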






[jira] [Commented] (SPARK-15730) [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take effect in spark-sql session

2016-07-05 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363691#comment-15363691
 ] 

Yi Zhou commented on SPARK-15730:
-

Thanks a lot [~chenghao] and [~yhuai] !

> [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take 
> effect in spark-sql session
> -
>
> Key: SPARK-15730
> URL: https://issues.apache.org/jira/browse/SPARK-15730
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Assignee: Cheng Hao
>Priority: Critical
> Fix For: 2.0.0
>
>
> {noformat}
> /usr/lib/spark/bin/spark-sql -v --driver-memory 4g --executor-memory 7g 
> --executor-cores 5 --num-executors 31 --master yarn-client --conf 
> spark.yarn.executor.memoryOverhead=1024 --hiveconf RESULT_TABLE=test_result01
> spark-sql> use test;
> 16/06/02 21:36:15 INFO execution.SparkSqlParser: Parsing command: use test
> 16/06/02 21:36:15 INFO spark.SparkContext: Starting job: processCmd at 
> CliDriver.java:376
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Got job 2 (processCmd at 
> CliDriver.java:376) with 1 output partitions
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 
> (processCmd at CliDriver.java:376)
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Parents of final stage: List()
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Missing parents: List()
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting ResultStage 2 
> (MapPartitionsRDD[8] at processCmd at CliDriver.java:376), which has no 
> missing parents
> 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2 stored as values 
> in memory (estimated size 3.2 KB, free 2.4 GB)
> 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as 
> bytes in memory (estimated size 1964.0 B, free 2.4 GB)
> 16/06/02 21:36:15 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on 192.168.3.11:36189 (size: 1964.0 B, free: 2.4 GB)
> 16/06/02 21:36:15 INFO spark.SparkContext: Created broadcast 2 from broadcast 
> at DAGScheduler.scala:1012
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting 1 missing tasks 
> from ResultStage 2 (MapPartitionsRDD[8] at processCmd at CliDriver.java:376)
> 16/06/02 21:36:15 INFO cluster.YarnScheduler: Adding task set 2.0 with 1 tasks
> 16/06/02 21:36:15 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 
> 2.0 (TID 2, 192.168.3.13, partition 0, PROCESS_LOCAL, 5362 bytes)
> 16/06/02 21:36:15 INFO cluster.YarnClientSchedulerBackend: Launching task 2 
> on executor id: 10 hostname: 192.168.3.13.
> 16/06/02 21:36:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on hw-node3:45924 (size: 1964.0 B, free: 4.4 GB)
> 16/06/02 21:36:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 
> 2.0 (TID 2) in 1934 ms on 192.168.3.13 (1/1)
> 16/06/02 21:36:17 INFO cluster.YarnScheduler: Removed TaskSet 2.0, whose 
> tasks have all completed, from pool
> 16/06/02 21:36:17 INFO scheduler.DAGScheduler: ResultStage 2 (processCmd at 
> CliDriver.java:376) finished in 1.937 s
> 16/06/02 21:36:17 INFO scheduler.DAGScheduler: Job 2 finished: processCmd at 
> CliDriver.java:376, took 1.962631 s
> Time taken: 2.027 seconds
> 16/06/02 21:36:17 INFO CliDriver: Time taken: 2.027 seconds
> spark-sql> DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
> 16/06/02 21:36:36 INFO execution.SparkSqlParser: Parsing command: DROP TABLE 
> IF EXISTS ${hiveconf:RESULT_TABLE}
> Error in query:
> mismatched input '$' expecting {'ADD', 'AS', 'ALL', 'GROUP', 'BY', 
> 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'LIMIT', 'AT', 'IN', 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'OUTER', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 
> 'RANGE', 'ROWS', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 
> 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 
> 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'TO', 
> 'TABLESAMPLE', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 
> 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'IF', 
> 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 
> 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 
> 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 
> 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'EXTENDED', 
> 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 
> 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 
> 'STORED', 'DIRECTORIES', 'LOCATION', 

[jira] [Commented] (SPARK-15730) [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take effect in spark-sql session

2016-06-14 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15329408#comment-15329408
 ] 

Yi Zhou commented on SPARK-15730:
-

Hi [~chenghao]
I tested this PR and it resolves my case.

> [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take 
> effect in spark-sql session
> -
>
> Key: SPARK-15730
> URL: https://issues.apache.org/jira/browse/SPARK-15730
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> /usr/lib/spark/bin/spark-sql -v --driver-memory 4g --executor-memory 7g 
> --executor-cores 5 --num-executors 31 --master yarn-client --conf 
> spark.yarn.executor.memoryOverhead=1024 --hiveconf RESULT_TABLE=test_result01
> spark-sql> use test;
> 16/06/02 21:36:15 INFO execution.SparkSqlParser: Parsing command: use test
> 16/06/02 21:36:15 INFO spark.SparkContext: Starting job: processCmd at 
> CliDriver.java:376
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Got job 2 (processCmd at 
> CliDriver.java:376) with 1 output partitions
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 
> (processCmd at CliDriver.java:376)
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Parents of final stage: List()
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Missing parents: List()
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting ResultStage 2 
> (MapPartitionsRDD[8] at processCmd at CliDriver.java:376), which has no 
> missing parents
> 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2 stored as values 
> in memory (estimated size 3.2 KB, free 2.4 GB)
> 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as 
> bytes in memory (estimated size 1964.0 B, free 2.4 GB)
> 16/06/02 21:36:15 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on 192.168.3.11:36189 (size: 1964.0 B, free: 2.4 GB)
> 16/06/02 21:36:15 INFO spark.SparkContext: Created broadcast 2 from broadcast 
> at DAGScheduler.scala:1012
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting 1 missing tasks 
> from ResultStage 2 (MapPartitionsRDD[8] at processCmd at CliDriver.java:376)
> 16/06/02 21:36:15 INFO cluster.YarnScheduler: Adding task set 2.0 with 1 tasks
> 16/06/02 21:36:15 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 
> 2.0 (TID 2, 192.168.3.13, partition 0, PROCESS_LOCAL, 5362 bytes)
> 16/06/02 21:36:15 INFO cluster.YarnClientSchedulerBackend: Launching task 2 
> on executor id: 10 hostname: 192.168.3.13.
> 16/06/02 21:36:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on hw-node3:45924 (size: 1964.0 B, free: 4.4 GB)
> 16/06/02 21:36:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 
> 2.0 (TID 2) in 1934 ms on 192.168.3.13 (1/1)
> 16/06/02 21:36:17 INFO cluster.YarnScheduler: Removed TaskSet 2.0, whose 
> tasks have all completed, from pool
> 16/06/02 21:36:17 INFO scheduler.DAGScheduler: ResultStage 2 (processCmd at 
> CliDriver.java:376) finished in 1.937 s
> 16/06/02 21:36:17 INFO scheduler.DAGScheduler: Job 2 finished: processCmd at 
> CliDriver.java:376, took 1.962631 s
> Time taken: 2.027 seconds
> 16/06/02 21:36:17 INFO CliDriver: Time taken: 2.027 seconds
> spark-sql> DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
> 16/06/02 21:36:36 INFO execution.SparkSqlParser: Parsing command: DROP TABLE 
> IF EXISTS ${hiveconf:RESULT_TABLE}
> Error in query:
> mismatched input '$' expecting {'ADD', 'AS', 'ALL', 'GROUP', 'BY', 
> 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'LIMIT', 'AT', 'IN', 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'OUTER', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 
> 'RANGE', 'ROWS', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 
> 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 
> 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'TO', 
> 'TABLESAMPLE', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 
> 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'IF', 
> 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 
> 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 
> 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 
> 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'EXTENDED', 
> 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 
> 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 
> 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 
> 'FILEFORMAT', 'TOUCH', 

[jira] [Created] (SPARK-15730) [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take effect in spark-sql session

2016-06-02 Thread Yi Zhou (JIRA)
Yi Zhou created SPARK-15730:
---

 Summary: [Spark SQL] the value of 'hiveconf' parameter in 
Spark-sql CLI don't take effect in spark-sql session
 Key: SPARK-15730
 URL: https://issues.apache.org/jira/browse/SPARK-15730
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Yi Zhou
Priority: Critical


/usr/lib/spark/bin/spark-sql -v --driver-memory 4g --executor-memory 7g 
--executor-cores 5 --num-executors 31 --master yarn-client --conf 
spark.yarn.executor.memoryOverhead=1024 --hiveconf RESULT_TABLE=test_result01

spark-sql> use test;
16/06/02 21:36:15 INFO execution.SparkSqlParser: Parsing command: use test
16/06/02 21:36:15 INFO spark.SparkContext: Starting job: processCmd at 
CliDriver.java:376
16/06/02 21:36:15 INFO scheduler.DAGScheduler: Got job 2 (processCmd at 
CliDriver.java:376) with 1 output partitions
16/06/02 21:36:15 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 
(processCmd at CliDriver.java:376)
16/06/02 21:36:15 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/06/02 21:36:15 INFO scheduler.DAGScheduler: Missing parents: List()
16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting ResultStage 2 
(MapPartitionsRDD[8] at processCmd at CliDriver.java:376), which has no missing 
parents
16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2 stored as values 
in memory (estimated size 3.2 KB, free 2.4 GB)
16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as 
bytes in memory (estimated size 1964.0 B, free 2.4 GB)
16/06/02 21:36:15 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
memory on 192.168.3.11:36189 (size: 1964.0 B, free: 2.4 GB)
16/06/02 21:36:15 INFO spark.SparkContext: Created broadcast 2 from broadcast 
at DAGScheduler.scala:1012
16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from 
ResultStage 2 (MapPartitionsRDD[8] at processCmd at CliDriver.java:376)
16/06/02 21:36:15 INFO cluster.YarnScheduler: Adding task set 2.0 with 1 tasks
16/06/02 21:36:15 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 
(TID 2, 192.168.3.13, partition 0, PROCESS_LOCAL, 5362 bytes)
16/06/02 21:36:15 INFO cluster.YarnClientSchedulerBackend: Launching task 2 on 
executor id: 10 hostname: 192.168.3.13.
16/06/02 21:36:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
memory on hw-node3:45924 (size: 1964.0 B, free: 4.4 GB)
16/06/02 21:36:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 
(TID 2) in 1934 ms on 192.168.3.13 (1/1)
16/06/02 21:36:17 INFO cluster.YarnScheduler: Removed TaskSet 2.0, whose tasks 
have all completed, from pool
16/06/02 21:36:17 INFO scheduler.DAGScheduler: ResultStage 2 (processCmd at 
CliDriver.java:376) finished in 1.937 s
16/06/02 21:36:17 INFO scheduler.DAGScheduler: Job 2 finished: processCmd at 
CliDriver.java:376, took 1.962631 s
Time taken: 2.027 seconds
16/06/02 21:36:17 INFO CliDriver: Time taken: 2.027 seconds
spark-sql> DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
16/06/02 21:36:36 INFO execution.SparkSqlParser: Parsing command: DROP TABLE IF 
EXISTS ${hiveconf:RESULT_TABLE}
Error in query:
mismatched input '$' expecting {'ADD', 'AS', 'ALL', 'GROUP', 'BY', 'GROUPING', 
'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'LIMIT', 'AT', 'IN', 'NO', 'EXISTS', 
'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
'DESC', 'FOR', 'OUTER', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 
'ROWS', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 
'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 
'FORMAT', 'LOGICAL', 'CODEGEN', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 
'PARTITIONS', 'FUNCTIONS', 'DROP', 'TO', 'TABLESAMPLE', 'ALTER', 'RENAME', 
'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
'TRANSACTION', 'COMMIT', 'ROLLBACK', 'IF', 'PERCENT', 'BUCKET', 'OUT', 'OF', 
'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 
'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 
'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 
'SEPARATED', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 
'FORMATTED', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 
'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 
'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 
'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 
'OUTPUTFORMAT', DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 
'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 
'UNLOCK', 'MSCK', 'REPAIR', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 
'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 
'OPTION', 'LOCAL', 'INPATH', IDENTIFIER, 
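
Until the parser handles --hiveconf substitution again, one client-side 
workaround is to expand the ${hiveconf:NAME} placeholders before the statement 
reaches Spark. The following is a minimal, hypothetical Python sketch of that 
idea; the variable name and value are only illustrative:

{code}
# Hypothetical workaround sketch: substitute ${hiveconf:NAME} placeholders on
# the client side, since the Spark 2.0 parser rejects the raw '$' token.
import re
from pyspark.sql import SparkSession

def substitute_hiveconf(sql, conf):
    """Replace each ${hiveconf:NAME} with conf['NAME']."""
    return re.sub(r"\$\{hiveconf:(\w+)\}", lambda m: conf[m.group(1)], sql)

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
hiveconf = {"RESULT_TABLE": "test_result01"}
spark.sql(substitute_hiveconf(
    "DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE}", hiveconf))
{code}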

[jira] [Commented] (SPARK-15034) Use the value of spark.sql.warehouse.dir as the warehouse location instead of using hive.metastore.warehouse.dir

2016-05-26 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15301988#comment-15301988
 ] 

Yi Zhou commented on SPARK-15034:
-

Hi [~yhuai]
I can integrate Spark SQL with the Hive metastore in Spark 1.6, but I am 
confused about how the integration works in Spark 2.0. Could you share the 
correct steps or required configurations for Spark 2.0? In my case, Spark 2.0 
connects to an existing Hive metastore database, but the spark-sql CLI never 
shows the existing database, while the Hive CLI can see it. Please see my 
experiment below. Thanks in advance!

Build package command:
{code}
./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 
-Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests
{code}

key items in spark-defaults.conf
{code}
spark.sql.hive.metastore.version=1.1.0
spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
spark.executor.extraClassPath=/etc/hive/conf
spark.driver.extraClassPath=/etc/hive/conf
spark.yarn.jars=/usr/lib/spark/jars/*
spark.sql.warehouse.dir=/user/hive/warehouse
{code}

{code}
/usr/lib/spark/bin/spark-sql
spark-sql> show databases;
16/05/26 20:06:04 INFO execution.SparkSqlParser: Parsing command: show databases
16/05/26 20:06:04 INFO log.PerfLogger: 
16/05/26 20:06:04 INFO metastore.HiveMetaStore: 0: create_database: 
Database(name:default, description:default database, 
locationUri:hdfs://hw-node2:8020/user/hive/warehouse, parameters:{})
16/05/26 20:06:04 INFO HiveMetaStore.audit: ugi=root  ip=unknown-ip-addr  
cmd=create_database: Database(name:default, description:default database, 
locationUri:hdfs://hw-node2:8020/user/hive/warehouse, parameters:{})
16/05/26 20:06:04 ERROR metastore.RetryingHMSHandler: 
AlreadyExistsException(message:Database default already exists)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:944)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:138)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:99)
at com.sun.proxy.$Proxy34.create_database(Unknown Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:646)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:105)
at com.sun.proxy.$Proxy35.createDatabase(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.createDatabase(Hive.java:345)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply$mcV$sp(HiveClientImpl.scala:289)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:289)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:289)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:260)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:207)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:206)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:249)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.createDatabase(HiveClientImpl.scala:288)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply$mcV$sp(HiveExternalCatalog.scala:94)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:94)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:94)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:68)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:93)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:142)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:84)
at 

[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext

2016-05-25 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15301468#comment-15301468
 ] 

Yi Zhou commented on SPARK-15345:
-

I'm confused because this worked well with Apache Spark 1.6 and stopped 
working after simply replacing it with 2.0.

> SparkSession's conf doesn't take effect when there's already an existing 
> SparkContext
> -
>
> Key: SPARK-15345
> URL: https://issues.apache.org/jira/browse/SPARK-15345
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Piotr Milanowski
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I am working with branch-2.0, spark is compiled with hive support (-Phive and 
> -Phive-thriftserver).
> I am trying to access databases using this snippet:
> {code}
> from pyspark.sql import HiveContext
> hc = HiveContext(sc)
> hc.sql("show databases").collect()
> [Row(result='default')]
> {code}
> This means that spark doesn't find any databases specified in configuration.
> Using the same configuration (i.e. hive-site.xml and core-site.xml) in spark 
> 1.6, and launching above snippet, I can print out existing databases.
> When run in DEBUG mode this is what spark (2.0) prints out:
> {code}
> 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases
> 16/05/16 12:17:47 DEBUG SimpleAnalyzer: 
> === Result of Batch Resolution ===
> !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, 
> string])) null else input[0, string].toString, 
> StructField(result,StringType,false)), result#2) AS #3]   Project 
> [createexternalrow(if (isnull(result#2)) null else result#2.toString, 
> StructField(result,StringType,false)) AS #3]
>  +- LocalRelation [result#2]  
>   
>  +- LocalRelation [result#2]
> 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  private final 
> org.apache.spark.sql.types.StructType 
> org.apache.spark.sql.Dataset$$anonfun$53.structType$1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1)
>  +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> 

[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext

2016-05-25 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15301456#comment-15301456
 ] 

Yi Zhou commented on SPARK-15345:
-

I issued 'show databases;', 'use XXX', and 'show tables;' and found that the 
result is empty; there are no tables to show. However, I can see the tables 
with 'show tables' in the Hive CLI.

{code}
spark-sql> show databases;
16/05/26 11:11:47 INFO execution.SparkSqlParser: Parsing command: show databases
16/05/26 11:11:47 INFO log.PerfLogger: 
16/05/26 11:11:47 INFO metastore.HiveMetaStore: 0: create_database: 
Database(name:default, description:default database, 
locationUri:hdfs://hw-node2:8020/user/hive/warehouse, parameters:{})
16/05/26 11:11:47 INFO HiveMetaStore.audit: ugi=root  ip=unknown-ip-addr  
cmd=create_database: Database(name:default, description:default database, 
locationUri:hdfs://hw-node2:8020/user/hive/warehouse, parameters:{})
16/05/26 11:11:47 ERROR metastore.RetryingHMSHandler: 
AlreadyExistsException(message:Database default already exists)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:944)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:138)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:99)
at com.sun.proxy.$Proxy34.create_database(Unknown Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:646)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:105)
at com.sun.proxy.$Proxy35.createDatabase(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.createDatabase(Hive.java:345)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply$mcV$sp(HiveClientImpl.scala:289)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:289)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:289)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:260)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:207)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:206)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:249)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.createDatabase(HiveClientImpl.scala:288)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply$mcV$sp(HiveExternalCatalog.scala:94)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:94)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:94)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:68)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:93)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:142)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:84)
at 
org.apache.spark.sql.hive.HiveSessionCatalog.<init>(HiveSessionCatalog.scala:50)
at 
org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:49)
at 
org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
at 
org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
at 
org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
at 
org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:532)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:652)
 

[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext

2016-05-25 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15301427#comment-15301427
 ] 

Yi Zhou commented on SPARK-15345:
-

Hi [~jeffzhang]
In my test environment, I installed CDH 5.7.0 Hive (the Hive metastore 
service) and an Apache Spark 2.0 snapshot. Everything works with Apache Spark 
1.6, but after deploying Apache Spark 2.0 it cannot connect to my existing 
Hive metastore database. Besides the configurations below, what key settings 
are needed in spark-defaults.conf? Please kindly correct me. Thanks!
{code}
spark.sql.hive.metastore.version=1.1.0
spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
spark.executor.extraClassPath=/etc/hive/conf
spark.driver.extraClassPath=/etc/hive/conf
spark.yarn.jars=/usr/lib/spark/jars/*
spark.sql.warehouse.dir=/user/hive/warehouse
{code}
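
For reference, below is a hedged PySpark 2.0 sketch of how I would expect a 
session to be created so that it talks to the existing Hive metastore instead 
of spinning up a local Derby metastore_db; it assumes hive-site.xml (or the 
settings above) is visible on the driver and executor classpath:

{code}
# Hypothetical sketch: build a Spark 2.0 session with Hive support so that it
# uses the configured external metastore rather than a local metastore_db.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-metastore-check")
         .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW DATABASES").show()  # should list the databases seen in the Hive CLI
{code}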

> SparkSession's conf doesn't take effect when there's already an existing 
> SparkContext
> -
>
> Key: SPARK-15345
> URL: https://issues.apache.org/jira/browse/SPARK-15345
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Piotr Milanowski
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I am working with branch-2.0, spark is compiled with hive support (-Phive and 
> -Phive-thriftserver).
> I am trying to access databases using this snippet:
> {code}
> from pyspark.sql import HiveContext
> hc = HiveContext(sc)
> hc.sql("show databases").collect()
> [Row(result='default')]
> {code}
> This means that spark doesn't find any databases specified in configuration.
> Using the same configuration (i.e. hive-site.xml and core-site.xml) in spark 
> 1.6, and launching above snippet, I can print out existing databases.
> When run in DEBUG mode this is what spark (2.0) prints out:
> {code}
> 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases
> 16/05/16 12:17:47 DEBUG SimpleAnalyzer: 
> === Result of Batch Resolution ===
> !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, 
> string])) null else input[0, string].toString, 
> StructField(result,StringType,false)), result#2) AS #3]   Project 
> [createexternalrow(if (isnull(result#2)) null else result#2.toString, 
> StructField(result,StringType,false)) AS #3]
>  +- LocalRelation [result#2]  
>   
>  +- LocalRelation [result#2]
> 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  private final 
> org.apache.spark.sql.types.StructType 
> org.apache.spark.sql.Dataset$$anonfun$53.structType$1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1)
>  +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler 
> 

[jira] [Commented] (SPARK-13955) Spark in yarn mode fails

2016-05-25 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15301342#comment-15301342
 ] 

Yi Zhou commented on SPARK-13955:
-

Thanks a lot [~jerryshao]!

> Spark in yarn mode fails
> 
>
> Key: SPARK-13955
> URL: https://issues.apache.org/jira/browse/SPARK-13955
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Marcelo Vanzin
> Fix For: 2.0.0
>
>
> I ran spark-shell in yarn-client mode, but from the logs it seems the Spark 
> assembly jar is not uploaded to HDFS. This may be a known issue from the 
> work on SPARK-11157; creating this ticket to track it. [~vanzin]
> {noformat}
> 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory 
> including 384 MB overhead
> 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM
> 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM 
> container
> 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container
> 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> 16/03/17 17:57:48 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip
>  -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip
> 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jzhang); users 
> with modify permissions: Set(jzhang)
> 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager
> {noformat}
> message in AM container
> {noformat}
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ExecutorLauncher
> {noformat}






[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext

2016-05-25 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300088#comment-15300088
 ] 

Yi Zhou commented on SPARK-15345:
-

Hi [~jeffzhang]
I rebuilt Spark 2.0 with https://github.com/apache/spark/pull/13160, but Spark 
SQL still can't find the existing Hive metastore database. I still see a local 
metastore_db being created.

> SparkSession's conf doesn't take effect when there's already an existing 
> SparkContext
> -
>
> Key: SPARK-15345
> URL: https://issues.apache.org/jira/browse/SPARK-15345
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Piotr Milanowski
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I am working with branch-2.0, spark is compiled with hive support (-Phive and 
> -Phive-thriftserver).
> I am trying to access databases using this snippet:
> {code}
> from pyspark.sql import HiveContext
> hc = HiveContext(sc)
> hc.sql("show databases").collect()
> [Row(result='default')]
> {code}
> This means that spark doesn't find any databases specified in configuration.
> Using the same configuration (i.e. hive-site.xml and core-site.xml) in spark 
> 1.6, and launching above snippet, I can print out existing databases.
> When run in DEBUG mode this is what spark (2.0) prints out:
> {code}
> 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases
> 16/05/16 12:17:47 DEBUG SimpleAnalyzer: 
> === Result of Batch Resolution ===
> !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, 
> string])) null else input[0, string].toString, 
> StructField(result,StringType,false)), result#2) AS #3]   Project 
> [createexternalrow(if (isnull(result#2)) null else result#2.toString, 
> StructField(result,StringType,false)) AS #3]
>  +- LocalRelation [result#2]  
>   
>  +- LocalRelation [result#2]
> 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  private final 
> org.apache.spark.sql.types.StructType 
> org.apache.spark.sql.Dataset$$anonfun$53.structType$1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1)
>  +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 

[jira] [Commented] (SPARK-13955) Spark in yarn mode fails

2016-05-25 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300014#comment-15300014
 ] 

Yi Zhou commented on SPARK-13955:
-

Hi [~jerryshao]
I built a Spark 2.0 snapshot package that includes a 'jars' folder. So we only 
need to configure 'spark.yarn.jars' in spark-defaults.conf, e.g. 
spark.yarn.jars=local:/usr/lib/spark/jars/*, right?

Thanks
Yi
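As a follow-up sketch (not from the original exchange; the app name is hypothetical), one way to read back which spark.yarn.jars value a running yarn-client session actually picked up:
{code}
from pyspark.sql import SparkSession

# Assumes HADOOP_CONF_DIR is set so "yarn" resolves as the master.
spark = (SparkSession.builder
         .master("yarn")
         .appName("yarn-jars-check")
         .getOrCreate())

# If this prints "not set", the client falls back to uploading libraries under
# SPARK_HOME (the WARN visible in the quoted log below).
print(spark.sparkContext.getConf().get("spark.yarn.jars", "not set"))
{code}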

> Spark in yarn mode fails
> 
>
> Key: SPARK-13955
> URL: https://issues.apache.org/jira/browse/SPARK-13955
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Marcelo Vanzin
> Fix For: 2.0.0
>
>
> I ran spark-shell in yarn-client mode, but from the logs it seems the Spark 
> assembly jar is not uploaded to HDFS. This may be a known issue from the work 
> on SPARK-11157; creating this ticket to track it. [~vanzin]
> {noformat}
> 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory 
> including 384 MB overhead
> 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM
> 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM 
> container
> 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container
> 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> 16/03/17 17:57:48 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip
>  -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip
> 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jzhang); users 
> with modify permissions: Set(jzhang)
> 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager
> {noformat}
> message in AM container
> {noformat}
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ExecutorLauncher
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext

2016-05-25 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15299587#comment-15299587
 ] 

Yi Zhou commented on SPARK-15345:
-

OK, thanks!

> SparkSession's conf doesn't take effect when there's already an existing 
> SparkContext
> -
>
> Key: SPARK-15345
> URL: https://issues.apache.org/jira/browse/SPARK-15345
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Piotr Milanowski
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I am working with branch-2.0; Spark is compiled with Hive support (-Phive and 
> -Phive-thriftserver).
> I am trying to access databases using this snippet:
> {code}
> from pyspark.sql import HiveContext
> hc = HiveContext(sc)
> hc.sql("show databases").collect()
> [Row(result='default')]
> {code}
> This means that spark doesn't find any databases specified in configuration.
> Using the same configuration (i.e. hive-site.xml and core-site.xml) in spark 
> 1.6, and launching above snippet, I can print out existing databases.
> When run in DEBUG mode this is what spark (2.0) prints out:
> {code}
> 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases
> 16/05/16 12:17:47 DEBUG SimpleAnalyzer: 
> === Result of Batch Resolution ===
> !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, 
> string])) null else input[0, string].toString, 
> StructField(result,StringType,false)), result#2) AS #3]   Project 
> [createexternalrow(if (isnull(result#2)) null else result#2.toString, 
> StructField(result,StringType,false)) AS #3]
>  +- LocalRelation [result#2]  
>   
>  +- LocalRelation [result#2]
> 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  private final 
> org.apache.spark.sql.types.StructType 
> org.apache.spark.sql.Dataset$$anonfun$53.structType$1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1)
>  +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1)
>  is now cleaned +++
> 16/05/16 12:17:47 

[jira] [Comment Edited] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext

2016-05-25 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15299568#comment-15299568
 ] 

Yi Zhou edited comment on SPARK-15345 at 5/25/16 6:56 AM:
--

1) Spark SQL can't find the existing Hive metastore databases in the spark-sql 
shell when issuing 'show databases;'.
2) It always tells me that the database already exists (I saw a local Derby 
metastore_db folder in the current directory). It seems that Spark SQL can't 
read the Hive configuration (e.g. hive-site.xml).
3) Key configurations in spark-defaults.conf:
{code}
spark.sql.hive.metastore.version=1.1.0
spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
spark.executor.extraClassPath=/etc/hive/conf
spark.driver.extraClassPath=/etc/hive/conf
spark.yarn.jars=local:/usr/lib/spark/jars/*
{code}

16/05/23 09:48:24 ERROR metastore.RetryingHMSHandler: 
AlreadyExistsException(message:Database test_sparksql already exists)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:898)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:133)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:99)
at com.sun.proxy.$Proxy34.create_database(Unknown Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:645)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:91)
at com.sun.proxy.$Proxy35.createDatabase(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.createDatabase(Hive.java:341)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply$mcV$sp(HiveClientImpl.scala:289)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:289)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:289)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:260)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:207)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:206)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:249)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.createDatabase(HiveClientImpl.scala:288)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply$mcV$sp(HiveExternalCatalog.scala:94)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:94)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:94)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:68)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:93)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:142)
at 
org.apache.spark.sql.execution.command.CreateDatabaseCommand.run(ddl.scala:58)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at 

[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext

2016-05-25 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15299568#comment-15299568
 ] 

Yi Zhou commented on SPARK-15345:
-

1) Spark SQL can't find the existing Hive metastore databases in the spark-sql 
shell when issuing 'show databases;'.
2) It always tells me that the database already exists (I saw a local Derby 
metastore_db folder in the current directory). It seems that Spark SQL can't 
read the Hive configuration.
3) Key configurations in spark-defaults.conf:
{code}
spark.sql.hive.metastore.version=1.1.0
spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
spark.executor.extraClassPath=/etc/hive/conf
spark.driver.extraClassPath=/etc/hive/conf
spark.yarn.jars=local:/usr/lib/spark/jars/*
{code}

16/05/23 09:48:24 ERROR metastore.RetryingHMSHandler: 
AlreadyExistsException(message:Database test_sparksql already exists)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:898)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:133)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:99)
at com.sun.proxy.$Proxy34.create_database(Unknown Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:645)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:91)
at com.sun.proxy.$Proxy35.createDatabase(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.createDatabase(Hive.java:341)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply$mcV$sp(HiveClientImpl.scala:289)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:289)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:289)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:260)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:207)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:206)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:249)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.createDatabase(HiveClientImpl.scala:288)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply$mcV$sp(HiveExternalCatalog.scala:94)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:94)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:94)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:68)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:93)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:142)
at 
org.apache.spark.sql.execution.command.CreateDatabaseCommand.run(ddl.scala:58)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
at 

[jira] [Commented] (SPARK-15396) [Spark] [SQL] [DOC] It can't connect hive metastore database

2016-05-23 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296105#comment-15296105
 ] 

Yi Zhou commented on SPARK-15396:
-

Thanks a lot for your response!

> [Spark] [SQL] [DOC] It can't connect hive metastore database
> 
>
> Key: SPARK-15396
> URL: https://issues.apache.org/jira/browse/SPARK-15396
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> I am trying to run Spark SQL using bin/spark-sql with Spark 2.0 master 
> code (commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e) but ran across an issue 
> where it always connects to a local Derby database and can't connect to my 
> existing Hive metastore database. Could you help me check what the root cause is? 
> What's the specific configuration for integrating with the Hive metastore in 
> Spark 2.0? BTW, this case is OK in Spark 1.6. Thanks in advance!
> Build package command:
> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests
> Key configurations in spark-defaults.conf:
> {code}
> spark.sql.hive.metastore.version=1.1.0
> spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
> spark.executor.extraClassPath=/etc/hive/conf
> spark.driver.extraClassPath=/etc/hive/conf
> spark.yarn.jars=local:/usr/lib/spark/jars/*
> {code}
> There is existing hive metastore database named by "test_sparksql". I always 
> got error "metastore.ObjectStore: Failed to get database test_sparksql, 
> returning NoSuchObjectException" after issuing 'use test_sparksql'. Please 
> see below steps for details.
>  
> $ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
> is already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already 
> registered, and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar."
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> datanucleus.cache.level2 unknown - will be ignored
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 16/05/12 22:23:31 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:31 INFO metastore.ObjectStore: Setting MetaStore object pin 
> classes with 
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
> 

[jira] [Commented] (SPARK-15396) [Spark] [SQL] [DOC] It can't connect hive metastore database

2016-05-23 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296088#comment-15296088
 ] 

Yi Zhou commented on SPARK-15396:
-

BTW, I only replaced the 1.6 package with the 2.0 package.

> [Spark] [SQL] [DOC] It can't connect hive metastore database
> 
>
> Key: SPARK-15396
> URL: https://issues.apache.org/jira/browse/SPARK-15396
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> I am trying to run Spark SQL using bin/spark-sql with Spark 2.0 master 
> code (commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e) but ran across an issue 
> where it always connects to a local Derby database and can't connect to my 
> existing Hive metastore database. Could you help me check what the root cause is? 
> What's the specific configuration for integrating with the Hive metastore in 
> Spark 2.0? BTW, this case is OK in Spark 1.6. Thanks in advance!
> Build package command:
> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests
> Key configurations in spark-defaults.conf:
> {code}
> spark.sql.hive.metastore.version=1.1.0
> spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
> spark.executor.extraClassPath=/etc/hive/conf
> spark.driver.extraClassPath=/etc/hive/conf
> spark.yarn.jars=local:/usr/lib/spark/jars/*
> {code}
> There is existing hive metastore database named by "test_sparksql". I always 
> got error "metastore.ObjectStore: Failed to get database test_sparksql, 
> returning NoSuchObjectException" after issuing 'use test_sparksql'. Please 
> see below steps for details.
>  
> $ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
> is already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already 
> registered, and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar."
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> datanucleus.cache.level2 unknown - will be ignored
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 16/05/12 22:23:31 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:31 INFO metastore.ObjectStore: Setting MetaStore object pin 
> classes with 
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
> 

[jira] [Comment Edited] (SPARK-15396) [Spark] [SQL] [DOC] It can't connect hive metastore database

2016-05-23 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296086#comment-15296086
 ] 

Yi Zhou edited comment on SPARK-15396 at 5/23/16 8:22 AM:
--

hive-site.xml is placed in the right 'conf' folder under the Spark home. My 
concern is that this worked well in Spark 1.6 but does not work in 2.0.


was (Author: jameszhouyi):
hive-site.xml is placed in the right 'conf' folder under the Spark home. My 
concern is that this worked well in Spark 1.6.

> [Spark] [SQL] [DOC] It can't connect hive metastore database
> 
>
> Key: SPARK-15396
> URL: https://issues.apache.org/jira/browse/SPARK-15396
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> I am trying to run Spark SQL using bin/spark-sql with Spark 2.0 master 
> code (commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e) but ran across an issue 
> where it always connects to a local Derby database and can't connect to my 
> existing Hive metastore database. Could you help me check what the root cause is? 
> What's the specific configuration for integrating with the Hive metastore in 
> Spark 2.0? BTW, this case is OK in Spark 1.6. Thanks in advance!
> Build package command:
> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests
> Key configurations in spark-defaults.conf:
> {code}
> spark.sql.hive.metastore.version=1.1.0
> spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
> spark.executor.extraClassPath=/etc/hive/conf
> spark.driver.extraClassPath=/etc/hive/conf
> spark.yarn.jars=local:/usr/lib/spark/jars/*
> {code}
> There is existing hive metastore database named by "test_sparksql". I always 
> got error "metastore.ObjectStore: Failed to get database test_sparksql, 
> returning NoSuchObjectException" after issuing 'use test_sparksql'. Please 
> see below steps for details.
>  
> $ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
> is already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already 
> registered, and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar."
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> datanucleus.cache.level2 unknown - will be ignored
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 16/05/12 22:23:31 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:31 INFO metastore.ObjectStore: Setting MetaStore object pin 
> classes with 
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"

[jira] [Commented] (SPARK-15396) [Spark] [SQL] [DOC] It can't connect hive metastore database

2016-05-23 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296086#comment-15296086
 ] 

Yi Zhou commented on SPARK-15396:
-

hive-site.xml is placed in the right 'conf' folder under the Spark home. My 
concern is that this worked well in Spark 1.6.
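A trivial sanity-check sketch (not from the original comment; the default install path is an assumption) that hive-site.xml really is where the driver expects it:
{code}
import os

# SPARK_HOME default and the conf path are assumptions; adjust to the actual install.
spark_home = os.environ.get("SPARK_HOME", "/usr/lib/spark")
site = os.path.join(spark_home, "conf", "hive-site.xml")
print("%s exists: %s" % (site, os.path.exists(site)))
{code}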

> [Spark] [SQL] [DOC] It can't connect hive metastore database
> 
>
> Key: SPARK-15396
> URL: https://issues.apache.org/jira/browse/SPARK-15396
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> I am trying to run Spark SQL using bin/spark-sql with Spark 2.0 master 
> code (commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e) but ran across an issue 
> where it always connects to a local Derby database and can't connect to my 
> existing Hive metastore database. Could you help me check what the root cause is? 
> What's the specific configuration for integrating with the Hive metastore in 
> Spark 2.0? BTW, this case is OK in Spark 1.6. Thanks in advance!
> Build package command:
> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests
> Key configurations in spark-defaults.conf:
> {code}
> spark.sql.hive.metastore.version=1.1.0
> spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
> spark.executor.extraClassPath=/etc/hive/conf
> spark.driver.extraClassPath=/etc/hive/conf
> spark.yarn.jars=local:/usr/lib/spark/jars/*
> {code}
> There is existing hive metastore database named by "test_sparksql". I always 
> got error "metastore.ObjectStore: Failed to get database test_sparksql, 
> returning NoSuchObjectException" after issuing 'use test_sparksql'. Please 
> see below steps for details.
>  
> $ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
> is already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already 
> registered, and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar."
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> datanucleus.cache.level2 unknown - will be ignored
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 16/05/12 22:23:31 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:31 INFO metastore.ObjectStore: Setting MetaStore object pin 
> classes with 
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 16/05/12 

[jira] [Comment Edited] (SPARK-15396) [Spark] [SQL] [DOC] It can't connect hive metastore database

2016-05-23 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296010#comment-15296010
 ] 

Yi Zhou edited comment on SPARK-15396 at 5/23/16 7:02 AM:
--

Sorry, I can't access 
http://www.hiregion.com/2010/01/hive-metastore-derby-db.html. 
In my hive-site.xml the setting looks like below: 
{code}
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://test-node1:7432/hive</value>
</property>
{code}
Is there any problem with this parameter?


was (Author: jameszhouyi):
sorry i can't access the 
http://www.hiregion.com/2010/01/hive-metastore-derby-db.html, 
In my hive-site.xml, it like below: 
{code}
 
javax.jdo.option.ConnectionURL
jdbc:postgresql://test-node1:7432/hive
  
{cdoe}
Is there any problem in this parameter ?

> [Spark] [SQL] [DOC] It can't connect hive metastore database
> 
>
> Key: SPARK-15396
> URL: https://issues.apache.org/jira/browse/SPARK-15396
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> I am trying to run Spark SQL using bin/spark-sql with Spark 2.0 master 
> code (commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e) but ran across an issue 
> where it always connects to a local Derby database and can't connect to my 
> existing Hive metastore database. Could you help me check what the root cause is? 
> What's the specific configuration for integrating with the Hive metastore in 
> Spark 2.0? BTW, this case is OK in Spark 1.6. Thanks in advance!
> Build package command:
> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests
> Key configurations in spark-defaults.conf:
> {code}
> spark.sql.hive.metastore.version=1.1.0
> spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
> spark.executor.extraClassPath=/etc/hive/conf
> spark.driver.extraClassPath=/etc/hive/conf
> spark.yarn.jars=local:/usr/lib/spark/jars/*
> {code}
> There is existing hive metastore database named by "test_sparksql". I always 
> got error "metastore.ObjectStore: Failed to get database test_sparksql, 
> returning NoSuchObjectException" after issuing 'use test_sparksql'. Please 
> see below steps for details.
>  
> $ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
> is already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already 
> registered, and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar."
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> datanucleus.cache.level2 unknown - will be ignored
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 16/05/12 22:23:31 

[jira] [Commented] (SPARK-15396) [Spark] [SQL] [DOC] It can't connect hive metastore database

2016-05-23 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296010#comment-15296010
 ] 

Yi Zhou commented on SPARK-15396:
-

Sorry, I can't access 
http://www.hiregion.com/2010/01/hive-metastore-derby-db.html. 
In my hive-site.xml the setting looks like below: 
{code}
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://test-node1:7432/hive</value>
</property>
{code}
Is there any problem with this parameter?

> [Spark] [SQL] [DOC] It can't connect hive metastore database
> 
>
> Key: SPARK-15396
> URL: https://issues.apache.org/jira/browse/SPARK-15396
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> I am trying to run Spark SQL using bin/spark-sql with Spark 2.0 master 
> code (commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e) but ran across an issue 
> where it always connects to a local Derby database and can't connect to my 
> existing Hive metastore database. Could you help me check what the root cause is? 
> What's the specific configuration for integrating with the Hive metastore in 
> Spark 2.0? BTW, this case is OK in Spark 1.6. Thanks in advance!
> Build package command:
> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests
> Key configurations in spark-defaults.conf:
> {code}
> spark.sql.hive.metastore.version=1.1.0
> spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
> spark.executor.extraClassPath=/etc/hive/conf
> spark.driver.extraClassPath=/etc/hive/conf
> spark.yarn.jars=local:/usr/lib/spark/jars/*
> {code}
> There is existing hive metastore database named by "test_sparksql". I always 
> got error "metastore.ObjectStore: Failed to get database test_sparksql, 
> returning NoSuchObjectException" after issuing 'use test_sparksql'. Please 
> see below steps for details.
>  
> $ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
> is already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already 
> registered, and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar."
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> datanucleus.cache.level2 unknown - will be ignored
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 16/05/12 22:23:31 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:31 INFO metastore.ObjectStore: Setting MetaStore object pin 
> classes with 
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 16/05/12 22:23:32 INFO DataNucleus.Datastore: 

[jira] [Commented] (SPARK-15396) [Spark] [SQL] [DOC] It can't connect hive metastore database

2016-05-22 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15295920#comment-15295920
 ] 

Yi Zhou commented on SPARK-15396:
-

These parameters worked well in Spark 1.6.1, but I'm not sure why they don't 
work with the Hive metastore in 2.0.
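One thing worth ruling out (a side note, not from the thread): the Hive-aware entry point changed in 2.0, where HiveContext is deprecated in favour of a SparkSession with Hive support enabled, and a session built without it silently uses the in-memory catalog. A minimal sketch:
{code}
from pyspark.sql import SparkSession

# Spark 1.6 style:  sqlContext = HiveContext(sc)
# Spark 2.0 style:  enable Hive support on the SparkSession explicitly.
spark = (SparkSession.builder
         .appName("hive-entrypoint-check")   # name is arbitrary
         .enableHiveSupport()
         .getOrCreate())

spark.sql("show databases").show()
{code}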

> [Spark] [SQL] [DOC] It can't connect hive metastore database
> 
>
> Key: SPARK-15396
> URL: https://issues.apache.org/jira/browse/SPARK-15396
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> I am trying to run Spark SQL using bin/spark-sql with Spark 2.0 master 
> code (commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e) but ran across an issue 
> where it always connects to a local Derby database and can't connect to my 
> existing Hive metastore database. Could you help me check what the root cause is? 
> What's the specific configuration for integrating with the Hive metastore in 
> Spark 2.0? BTW, this case is OK in Spark 1.6. Thanks in advance!
> Build package command:
> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests
> Key configurations in spark-defaults.conf:
> {code}
> spark.sql.hive.metastore.version=1.1.0
> spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
> spark.executor.extraClassPath=/etc/hive/conf
> spark.driver.extraClassPath=/etc/hive/conf
> spark.yarn.jars=local:/usr/lib/spark/jars/*
> {code}
> There is existing hive metastore database named by "test_sparksql". I always 
> got error "metastore.ObjectStore: Failed to get database test_sparksql, 
> returning NoSuchObjectException" after issuing 'use test_sparksql'. Please 
> see below steps for details.
>  
> $ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
> is already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already 
> registered, and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar."
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> datanucleus.cache.level2 unknown - will be ignored
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 16/05/12 22:23:31 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:31 INFO metastore.ObjectStore: Setting MetaStore object pin 
> classes with 
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 16/05/12 22:23:32 

[jira] [Commented] (SPARK-15396) [Spark] [SQL] [DOC] It can't connect hive metastore database

2016-05-22 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15295874#comment-15295874
 ] 

Yi Zhou commented on SPARK-15396:
-

To add:
1) Spark SQL can't find the existing Hive databases in the spark-sql shell when 
issuing 'show databases;'.
2) It always tells me that the 'test_sparksql' database already exists...
{code}
16/05/23 09:48:24 ERROR metastore.RetryingHMSHandler: 
AlreadyExistsException(message:Database test_sparksql already exists)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:898)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:133)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:99)
at com.sun.proxy.$Proxy34.create_database(Unknown Source)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:645)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:91)
at com.sun.proxy.$Proxy35.createDatabase(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.createDatabase(Hive.java:341)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply$mcV$sp(HiveClientImpl.scala:289)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:289)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:289)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:260)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:207)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:206)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:249)
at org.apache.spark.sql.hive.client.HiveClientImpl.createDatabase(HiveClientImpl.scala:288)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply$mcV$sp(HiveExternalCatalog.scala:94)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:94)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:94)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:68)
at org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:93)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:142)
at org.apache.spark.sql.execution.command.CreateDatabaseCommand.run(ddl.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:187)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:168)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:529)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:649)
at 

[jira] [Comment Edited] (SPARK-15396) [Spark] [SQL] [DOC] It can't connect hive metastore database

2016-05-22 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15295836#comment-15295836
 ] 

Yi Zhou edited comment on SPARK-15396 at 5/23/16 2:06 AM:
--

Hi,
even after setting 'spark.sql.warehouse.dir' in spark-defaults.conf, it still 
can't connect to my existing Hive metastore database, and I also saw it 
create a 'metastore_db' folder in my current working directory. Could you 
please point out the potential cause? Thanks!
{code}
spark.sql.hive.metastore.version=1.1.0
spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
spark.executor.extraClassPath=/etc/hive/conf
spark.driver.extraClassPath=/etc/hive/conf
spark.yarn.jars=local:/usr/lib/spark/jars/*
spark.sql.warehouse.dir=/usr/hive/warehouse
{code}
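
As a side note (not part of the original comment): a minimal sketch of pointing Spark 2.0 at an existing Hive metastore. It assumes hive-site.xml from /etc/hive/conf is on the driver/executor classpath (as in the extraClassPath settings above); without Hive support enabled and that file visible, Spark typically falls back to a local Derby metastore_db in the working directory.
{code}
// Hedged sketch, not the reporter's code: Spark 2.0 Hive metastore integration.
// Assumes hive-site.xml (pointing at the existing metastore) is on the classpath;
// without it, Spark creates a local Derby metastore_db in the working directory.
import org.apache.spark.sql.SparkSession

object HiveMetastoreSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-metastore-sketch")
      .config("spark.sql.warehouse.dir", "/usr/hive/warehouse") // value taken from the conf above
      .enableHiveSupport()                                      // required to talk to the Hive metastore
      .getOrCreate()

    spark.sql("SHOW DATABASES").show() // should list databases from the existing metastore
  }
}
{code}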


was (Author: jameszhouyi):
Hi,
even after setting 'spark.sql.warehouse.dir' in spark-defaults.conf, it still 
can't connect to my existing Hive metastore database, and I also saw it 
create a 'metastore_db' folder in my current working directory. Could you 
please point out the potential cause? Thanks!
spark.sql.hive.metastore.version=1.1.0
spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
spark.executor.extraClassPath=/etc/hive/conf
spark.driver.extraClassPath=/etc/hive/conf
spark.yarn.jars=local:/usr/lib/spark/jars/*
spark.sql.warehouse.dir=/usr/hive/warehouse

> [Spark] [SQL] [DOC] It can't connect hive metastore database
> 
>
> Key: SPARK-15396
> URL: https://issues.apache.org/jira/browse/SPARK-15396
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> I am try to run Spark SQL using bin/spark-sql with Spark 2.0 master 
> code(commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e) but ran across an issue 
> that it always connect local derby database and can't connect my existing 
> hive metastore database. Could you help me to check what's the root cause ? 
> What's specific configuration for integration with hive metastore in Spark 
> 2.0 ? BTW, this case is OK in Spark 1.6. Thanks in advance !
> Build package command:
> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests
> Key configurations in spark-defaults.conf:
> {code}
> spark.sql.hive.metastore.version=1.1.0
> spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
> spark.executor.extraClassPath=/etc/hive/conf
> spark.driver.extraClassPath=/etc/hive/conf
> spark.yarn.jars=local:/usr/lib/spark/jars/*
> {code}
> There is existing hive metastore database named by "test_sparksql". I always 
> got error "metastore.ObjectStore: Failed to get database test_sparksql, 
> returning NoSuchObjectException" after issuing 'use test_sparksql'. Please 
> see below steps for details.
>  
> $ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
> is already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar."
> 16/05/12 22:23:30 WARN 

[jira] [Commented] (SPARK-15396) [Spark] [SQL] [DOC] It can't connect hive metastore database

2016-05-22 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15295836#comment-15295836
 ] 

Yi Zhou commented on SPARK-15396:
-

Hi,
even after setting 'spark.sql.warehouse.dir' in spark-defaults.conf, it still 
can't connect to my existing Hive metastore database, and I also saw it 
create a 'metastore_db' folder in my current working directory. Could you 
please point out the potential cause? Thanks!
spark.sql.hive.metastore.version=1.1.0
spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
spark.executor.extraClassPath=/etc/hive/conf
spark.driver.extraClassPath=/etc/hive/conf
spark.yarn.jars=local:/usr/lib/spark/jars/*
spark.sql.warehouse.dir=/usr/hive/warehouse

> [Spark] [SQL] [DOC] It can't connect hive metastore database
> 
>
> Key: SPARK-15396
> URL: https://issues.apache.org/jira/browse/SPARK-15396
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> I am try to run Spark SQL using bin/spark-sql with Spark 2.0 master 
> code(commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e) but ran across an issue 
> that it always connect local derby database and can't connect my existing 
> hive metastore database. Could you help me to check what's the root cause ? 
> What's specific configuration for integration with hive metastore in Spark 
> 2.0 ? BTW, this case is OK in Spark 1.6. Thanks in advance !
> Build package command:
> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests
> Key configurations in spark-defaults.conf:
> {code}
> spark.sql.hive.metastore.version=1.1.0
> spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
> spark.executor.extraClassPath=/etc/hive/conf
> spark.driver.extraClassPath=/etc/hive/conf
> spark.yarn.jars=local:/usr/lib/spark/jars/*
> {code}
> There is existing hive metastore database named by "test_sparksql". I always 
> got error "metastore.ObjectStore: Failed to get database test_sparksql, 
> returning NoSuchObjectException" after issuing 'use test_sparksql'. Please 
> see below steps for details.
>  
> $ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
> is already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already 
> registered, and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar."
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> datanucleus.cache.level2 unknown - will be ignored
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 16/05/12 22:23:31 WARN conf.HiveConf: HiveConf 

[jira] [Commented] (SPARK-15396) [Spark] [SQL] [DOC] It can't connect hive metastore database

2016-05-19 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292672#comment-15292672
 ] 

Yi Zhou commented on SPARK-15396:
-

It seems this is not only a doc issue; it may be a functional issue.

> [Spark] [SQL] [DOC] It can't connect hive metastore database
> 
>
> Key: SPARK-15396
> URL: https://issues.apache.org/jira/browse/SPARK-15396
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> I am try to run Spark SQL using bin/spark-sql with Spark 2.0 master 
> code(commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e) but ran across an issue 
> that it always connect local derby database and can't connect my existing 
> hive metastore database. Could you help me to check what's the root cause ? 
> What's specific configuration for integration with hive metastore in Spark 
> 2.0 ? BTW, this case is OK in Spark 1.6. Thanks in advance !
> Build package command:
> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests
> Key configurations in spark-defaults.conf:
> {code}
> spark.sql.hive.metastore.version=1.1.0
> spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
> spark.executor.extraClassPath=/etc/hive/conf
> spark.driver.extraClassPath=/etc/hive/conf
> spark.yarn.jars=local:/usr/lib/spark/jars/*
> {code}
> There is existing hive metastore database named by "test_sparksql". I always 
> got error "metastore.ObjectStore: Failed to get database test_sparksql, 
> returning NoSuchObjectException" after issuing 'use test_sparksql'. Please 
> see below steps for details.
>  
> $ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
> is already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already 
> registered, and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar."
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> datanucleus.cache.level2 unknown - will be ignored
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 16/05/12 22:23:31 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:31 INFO metastore.ObjectStore: Setting MetaStore object pin 
> classes with 
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged 

[jira] [Commented] (SPARK-15396) [Spark] [SQL] [DOC] It can't connect hive metastore database

2016-05-19 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292655#comment-15292655
 ] 

Yi Zhou commented on SPARK-15396:
-

Hi [~rxin],
I saw a bug fix, https://issues.apache.org/jira/browse/SPARK-15345. Does it also 
fix the issue reported in this JIRA?

> [Spark] [SQL] [DOC] It can't connect hive metastore database
> 
>
> Key: SPARK-15396
> URL: https://issues.apache.org/jira/browse/SPARK-15396
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> I am try to run Spark SQL using bin/spark-sql with Spark 2.0 master 
> code(commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e) but ran across an issue 
> that it always connect local derby database and can't connect my existing 
> hive metastore database. Could you help me to check what's the root cause ? 
> What's specific configuration for integration with hive metastore in Spark 
> 2.0 ? BTW, this case is OK in Spark 1.6. Thanks in advance !
> Build package command:
> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests
> Key configurations in spark-defaults.conf:
> {code}
> spark.sql.hive.metastore.version=1.1.0
> spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
> spark.executor.extraClassPath=/etc/hive/conf
> spark.driver.extraClassPath=/etc/hive/conf
> spark.yarn.jars=local:/usr/lib/spark/jars/*
> {code}
> There is existing hive metastore database named by "test_sparksql". I always 
> got error "metastore.ObjectStore: Failed to get database test_sparksql, 
> returning NoSuchObjectException" after issuing 'use test_sparksql'. Please 
> see below steps for details.
>  
> $ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
> is already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already 
> registered, and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar."
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> datanucleus.cache.level2 unknown - will be ignored
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 16/05/12 22:23:31 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:31 INFO metastore.ObjectStore: Setting MetaStore object pin 
> classes with 
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
> 

[jira] [Created] (SPARK-15397) [Spark][SQL] 'locate' UDF got different result with boundary value case compared to Hive engine

2016-05-18 Thread Yi Zhou (JIRA)
Yi Zhou created SPARK-15397:
---

 Summary: [Spark][SQL] 'locate' UDF got different result with 
boundary value case compared to Hive engine
 Key: SPARK-15397
 URL: https://issues.apache.org/jira/browse/SPARK-15397
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1, 1.6.0, 2.0.0
Reporter: Yi Zhou


Spark SQL:
{code}
select locate("abc", "abc", 1);
0
{code}

Hive:
{code}
select locate("abc", "abc", 1);
1
{code}
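
For reference, a minimal sketch (not Spark's or Hive's actual implementation) of the 1-based locate(substr, str, pos) semantics that the Hive result above reflects; edge-case handling is simplified:
{code}
// Hedged sketch of 1-based locate(substr, str, pos), matching the Hive result above;
// this is illustrative only and not the actual Spark or Hive implementation.
object LocateSketch {
  def locate(substr: String, str: String, pos: Int): Int = {
    if (pos < 1 || pos > str.length) 0     // out-of-range start position
    else str.indexOf(substr, pos - 1) + 1  // 1-based result; 0 means "not found"
  }

  def main(args: Array[String]): Unit = {
    println(locate("abc", "abc", 1)) // 1, as the Hive CLI returns above
  }
}
{code}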



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15396) [Spark] [SQL] It can't connect hive metastore database

2016-05-18 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15290306#comment-15290306
 ] 

Yi Zhou commented on SPARK-15396:
-

BTW, where can I find this parameter in the docs? I checked 
http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/sql-programming-guide.html
 but did not find it there.

> [Spark] [SQL] It can't connect hive metastore database
> --
>
> Key: SPARK-15396
> URL: https://issues.apache.org/jira/browse/SPARK-15396
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> I am try to run Spark SQL using bin/spark-sql with Spark 2.0 master 
> code(commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e) but ran across an issue 
> that it always connect local derby database and can't connect my existing 
> hive metastore database. Could you help me to check what's the root cause ? 
> What's specific configuration for integration with hive metastore in Spark 
> 2.0 ? BTW, this case is OK in Spark 1.6. Thanks in advance !
> Build package command:
> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests
> Key configurations in spark-defaults.conf:
> {code}
> spark.sql.hive.metastore.version=1.1.0
> spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
> spark.executor.extraClassPath=/etc/hive/conf
> spark.driver.extraClassPath=/etc/hive/conf
> spark.yarn.jars=local:/usr/lib/spark/jars/*
> {code}
> There is existing hive metastore database named by "test_sparksql". I always 
> got error "metastore.ObjectStore: Failed to get database test_sparksql, 
> returning NoSuchObjectException" after issuing 'use test_sparksql'. Please 
> see below steps for details.
>  
> $ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
> is already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already 
> registered, and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar."
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> datanucleus.cache.level2 unknown - will be ignored
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 16/05/12 22:23:31 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:31 INFO metastore.ObjectStore: Setting MetaStore object pin 
> classes with 
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 16/05/12 22:23:32 INFO 

[jira] [Commented] (SPARK-15396) [Spark] [SQL] It can't connect hive metastore database

2016-05-18 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15290295#comment-15290295
 ] 

Yi Zhou commented on SPARK-15396:
-

Thanks ! I will try it.

> [Spark] [SQL] It can't connect hive metastore database
> --
>
> Key: SPARK-15396
> URL: https://issues.apache.org/jira/browse/SPARK-15396
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> I am try to run Spark SQL using bin/spark-sql with Spark 2.0 master 
> code(commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e) but ran across an issue 
> that it always connect local derby database and can't connect my existing 
> hive metastore database. Could you help me to check what's the root cause ? 
> What's specific configuration for integration with hive metastore in Spark 
> 2.0 ? BTW, this case is OK in Spark 1.6. Thanks in advance !
> Build package command:
> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests
> Key configurations in spark-defaults.conf:
> {code}
> spark.sql.hive.metastore.version=1.1.0
> spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
> spark.executor.extraClassPath=/etc/hive/conf
> spark.driver.extraClassPath=/etc/hive/conf
> spark.yarn.jars=local:/usr/lib/spark/jars/*
> {code}
> There is existing hive metastore database named by "test_sparksql". I always 
> got error "metastore.ObjectStore: Failed to get database test_sparksql, 
> returning NoSuchObjectException" after issuing 'use test_sparksql'. Please 
> see below steps for details.
>  
> $ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
> is already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already 
> registered, and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar."
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> datanucleus.cache.level2 unknown - will be ignored
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 16/05/12 22:23:31 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:31 INFO metastore.ObjectStore: Setting MetaStore object pin 
> classes with 
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" 
> so does not have its own datastore table.

[jira] [Updated] (SPARK-15396) [Spark] [SQL] It can't connect hive metastore database

2016-05-18 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-15396:

Description: 
I am trying to run Spark SQL using bin/spark-sql with Spark 2.0 master code (commit 
ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e), but ran across an issue where it 
always connects to a local Derby database and can't connect to my existing Hive 
metastore database. Could you help me check the root cause? What specific 
configuration is needed for integration with the Hive metastore in Spark 2.0? BTW, 
this case is OK in Spark 1.6. Thanks in advance!

Build package command:
./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 
-Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests

Key configurations in spark-defaults.conf:
{code}
spark.sql.hive.metastore.version=1.1.0
spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
spark.executor.extraClassPath=/etc/hive/conf
spark.driver.extraClassPath=/etc/hive/conf
spark.yarn.jars=local:/usr/lib/spark/jars/*
{code}

There is an existing Hive metastore database named "test_sparksql". I always 
get the error "metastore.ObjectStore: Failed to get database test_sparksql, 
returning NoSuchObjectException" after issuing 'use test_sparksql'. Please see 
the steps below for details.
 
$ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name 
hive.enable.spark.execution.engine does not exist
16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with 
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called
16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
"org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
multiple JAR versions of the same plugin in the classpath. The URL 
"file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, and 
you are trying to register an identical plugin located at URL 
"file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar."
16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
is already registered. Ensure you dont have multiple JAR versions of the same 
plugin in the classpath. The URL 
"file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, and 
you are trying to register an identical plugin located at URL 
"file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar."
16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
"org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple 
JAR versions of the same plugin in the classpath. The URL 
"file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already registered, 
and you are trying to register an identical plugin located at URL 
"file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar."
16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
datanucleus.cache.level2 unknown - will be ignored
16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/05/12 22:23:31 WARN conf.HiveConf: HiveConf of name 
hive.enable.spark.execution.engine does not exist
16/05/12 22:23:31 INFO metastore.ObjectStore: Setting MetaStore object pin 
classes with 
hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
"embedded-only" so does not have its own datastore table.
16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so 
does not have its own datastore table.
16/05/12 22:23:33 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
"embedded-only" so does not have its own datastore table.
16/05/12 22:23:33 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so 
does not have its own datastore table.
16/05/12 22:23:33 INFO metastore.MetaStoreDirectSql: Using direct SQL, 
underlying DB is DERBY
16/05/12 22:23:33 INFO metastore.ObjectStore: Initialized ObjectStore
16/05/12 22:23:33 WARN metastore.ObjectStore: Version 

[jira] [Created] (SPARK-15396) [Spark] [SQL] It can't connect hive metastore database

2016-05-18 Thread Yi Zhou (JIRA)
Yi Zhou created SPARK-15396:
---

 Summary: [Spark] [SQL] It can't connect hive metastore database
 Key: SPARK-15396
 URL: https://issues.apache.org/jira/browse/SPARK-15396
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Yi Zhou
Priority: Critical


I am trying to run Spark SQL using bin/spark-sql with Spark 2.0 master code (commit 
ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e), but ran across an issue where it 
always connects to a local Derby database and can't connect to my existing Hive 
metastore database. Could you help me check the root cause? What specific 
configuration is needed for integration with the Hive metastore in Spark 2.0? BTW, 
this case is OK in Spark 1.6. Thanks in advance!

Build package command:
./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 
-Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests

Key configurations in spark-defaults.conf:
spark.sql.hive.metastore.version=1.1.0
spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
spark.executor.extraClassPath=/etc/hive/conf
spark.driver.extraClassPath=/etc/hive/conf
spark.yarn.jars=local:/usr/lib/spark/jars/*

There is an existing Hive metastore database named "test_sparksql". I always 
get the error "metastore.ObjectStore: Failed to get database test_sparksql, 
returning NoSuchObjectException" after issuing 'use test_sparksql'. Please see 
the steps below for details.
 
$ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name 
hive.enable.spark.execution.engine does not exist
16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with 
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called
16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
"org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
multiple JAR versions of the same plugin in the classpath. The URL 
"file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, and 
you are trying to register an identical plugin located at URL 
"file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar."
16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
is already registered. Ensure you dont have multiple JAR versions of the same 
plugin in the classpath. The URL 
"file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, and 
you are trying to register an identical plugin located at URL 
"file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar."
16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
"org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple 
JAR versions of the same plugin in the classpath. The URL 
"file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already registered, 
and you are trying to register an identical plugin located at URL 
"file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar."
16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
datanucleus.cache.level2 unknown - will be ignored
16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/05/12 22:23:31 WARN conf.HiveConf: HiveConf of name 
hive.enable.spark.execution.engine does not exist
16/05/12 22:23:31 INFO metastore.ObjectStore: Setting MetaStore object pin 
classes with 
hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
"embedded-only" so does not have its own datastore table.
16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so 
does not have its own datastore table.
16/05/12 22:23:33 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
"embedded-only" so does not have its own datastore table.
16/05/12 22:23:33 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so 
does not have its own datastore table.
16/05/12 22:23:33 INFO 

[jira] [Commented] (SPARK-15219) [Spark SQL] it don't support to detect runtime temporary table for enabling broadcast hash join optimization

2016-05-09 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276055#comment-15276055
 ] 

Yi Zhou commented on SPARK-15219:
-

Posted the core physical plan

> [Spark SQL] it don't support to detect runtime temporary table for enabling 
> broadcast hash join optimization
> 
>
> Key: SPARK-15219
> URL: https://issues.apache.org/jira/browse/SPARK-15219
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yi Zhou
>
> We observed an interesting thing about broadcast Hash join( similar to Map 
> Join in Hive) when comparing the implementation by Hive on MR engine. The 
> blew query is a multi-way join operation based on 3 tables including 
> product_reviews, 2 run-time temporary result tables(fsr and fwr) from 
> ‘select’ query operation and also there is a two-way join(1 table and 1 
> run-time temporary table) in both 'fsr' and 'fwr',which cause slower 
> performance than Hive on MR. We investigated the difference between Spark SQL 
> and Hive on MR engine and found that there are total of 5 map join tasks with 
> tuned map join parameters in Hive on MR but there are only 2 broadcast hash 
> join tasks in Spark SQL even if we set a larger threshold(e.g.,1GB) for 
> broadcast hash join. From our investigation, it seems that if there is 
> run-time temporary table in join operation in Spark SQL engine it will not 
> detect such table for enabling broadcast hash join optimization. 
> Core SQL snippet:
> {code}
> INSERT INTO TABLE q19_spark_sql_power_test_0_result
> SELECT *
> FROM
> ( --wrap in additional FROM(), because Sorting/distribute by with UDTF in 
> select clause is not allowed
>   SELECT extract_sentiment(pr.pr_item_sk, pr.pr_review_content) AS
>   (
> item_sk,
> review_sentence,
> sentiment,
> sentiment_word
>   )
>   FROM product_reviews pr,
>   (
> --store returns in week ending given date
> SELECT sr_item_sk, SUM(sr_return_quantity) sr_item_qty
> FROM store_returns sr,
> (
>   -- within the week ending a given date
>   SELECT d1.d_date_sk
>   FROM date_dim d1, date_dim d2
>   WHERE d1.d_week_seq = d2.d_week_seq
>   AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', 
> '2004-12-20' )
> ) sr_dateFilter
> WHERE sr.sr_returned_date_sk = d_date_sk
> GROUP BY sr_item_sk --across all store and web channels
> HAVING sr_item_qty > 0
>   ) fsr,
>   (
> --web returns in week ending given date
> SELECT wr_item_sk, SUM(wr_return_quantity) wr_item_qty
> FROM web_returns wr,
> (
>   -- within the week ending a given date
>   SELECT d1.d_date_sk
>   FROM date_dim d1, date_dim d2
>   WHERE d1.d_week_seq = d2.d_week_seq
>   AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', 
> '2004-12-20' )
> ) wr_dateFilter
> WHERE wr.wr_returned_date_sk = d_date_sk
> GROUP BY wr_item_sk  --across all store and web channels
> HAVING wr_item_qty > 0
>   ) fwr
>   WHERE fsr.sr_item_sk = fwr.wr_item_sk
>   AND pr.pr_item_sk = fsr.sr_item_sk --extract product_reviews for found items
>   -- equivalent across all store and web channels (within a tolerance of +/- 
> 10%)
>   AND abs( (sr_item_qty-wr_item_qty)/ ((sr_item_qty+wr_item_qty)/2)) <= 0.1
> )extractedSentiments
> WHERE sentiment= 'NEG' --if there are any major negative reviews.
> ORDER BY item_sk,review_sentence,sentiment,sentiment_word
> ;
> {code}
> Physical Plan:
> {code}
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation bigbench_3t_sparksql, 
> q19_spark_sql_run_query_0_result, None, Map(), false, false
> +- ConvertToSafe
>+- Sort [item_sk#537L ASC,review_sentence#538 ASC,sentiment#539 
> ASC,sentiment_word#540 ASC], true, 0
>   +- ConvertToUnsafe
>  +- Exchange rangepartitioning(item_sk#537L ASC,review_sentence#538 
> ASC,sentiment#539 ASC,sentiment_word#540 ASC,200), None
> +- ConvertToSafe
>+- Project 
> [item_sk#537L,review_sentence#538,sentiment#539,sentiment_word#540]
>   +- Filter (sentiment#539 = NEG)
>  +- !Generate 
> HiveGenericUDTF#io.bigdatabenchmark.v1.queries.q10.SentimentUDF(pr_item_sk#363L,pr_review_content#366),
>  false, false, 
> [item_sk#537L,review_sentence#538,sentiment#539,sentiment_word#540]
> +- ConvertToSafe
>+- Project [pr_item_sk#363L,pr_review_content#366]
>   +- Filter (abs((cast((sr_item_qty#356L - 
> wr_item_qty#357L) as double) / (cast((sr_item_qty#356L + wr_item_qty#357L) as 
> double) / 2.0))) <= 0.1)
>  +- SortMergeJoin [sr_item_sk#369L], 
> [wr_item_sk#445L]
> :- Sort 

[jira] [Updated] (SPARK-15219) [Spark SQL] it don't support to detect runtime temporary table for enabling broadcast hash join optimization

2016-05-09 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-15219:

Description: 
We observed an interesting thing about broadcast hash join (similar to map join 
in Hive) when comparing with the implementation on the Hive on MR engine. The query 
below is a multi-way join over 3 inputs: the product_reviews table plus 2 
run-time temporary result tables (fsr and fwr) produced by 'select' queries, and 
there is also a two-way join (1 table and 1 run-time temporary table) inside both 
'fsr' and 'fwr', which causes slower performance than Hive on MR. We investigated 
the difference between Spark SQL and the Hive on MR engine and found that Hive on 
MR runs a total of 5 map join tasks with tuned map join parameters, while Spark 
SQL runs only 2 broadcast hash join tasks even if we set a larger threshold 
(e.g., 1 GB) for broadcast hash join. From our investigation, it seems that when a 
run-time temporary table participates in a join, the Spark SQL engine does not 
consider it for the broadcast hash join optimization. 

Core SQL snippet:
{code}
INSERT INTO TABLE q19_spark_sql_power_test_0_result
SELECT *
FROM
( --wrap in additional FROM(), because Sorting/distribute by with UDTF in 
select clause is not allowed
  SELECT extract_sentiment(pr.pr_item_sk, pr.pr_review_content) AS
  (
item_sk,
review_sentence,
sentiment,
sentiment_word
  )
  FROM product_reviews pr,
  (
--store returns in week ending given date
SELECT sr_item_sk, SUM(sr_return_quantity) sr_item_qty
FROM store_returns sr,
(
  -- within the week ending a given date
  SELECT d1.d_date_sk
  FROM date_dim d1, date_dim d2
  WHERE d1.d_week_seq = d2.d_week_seq
  AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', '2004-12-20' )
) sr_dateFilter
WHERE sr.sr_returned_date_sk = d_date_sk
GROUP BY sr_item_sk --across all store and web channels
HAVING sr_item_qty > 0
  ) fsr,
  (
--web returns in week ending given date
SELECT wr_item_sk, SUM(wr_return_quantity) wr_item_qty
FROM web_returns wr,
(
  -- within the week ending a given date
  SELECT d1.d_date_sk
  FROM date_dim d1, date_dim d2
  WHERE d1.d_week_seq = d2.d_week_seq
  AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', '2004-12-20' )
) wr_dateFilter
WHERE wr.wr_returned_date_sk = d_date_sk
GROUP BY wr_item_sk  --across all store and web channels
HAVING wr_item_qty > 0
  ) fwr
  WHERE fsr.sr_item_sk = fwr.wr_item_sk
  AND pr.pr_item_sk = fsr.sr_item_sk --extract product_reviews for found items
  -- equivalent across all store and web channels (within a tolerance of +/- 
10%)
  AND abs( (sr_item_qty-wr_item_qty)/ ((sr_item_qty+wr_item_qty)/2)) <= 0.1
)extractedSentiments
WHERE sentiment= 'NEG' --if there are any major negative reviews.
ORDER BY item_sk,review_sentence,sentiment,sentiment_word
;
{code}

Physical Plan:
{code}
== Physical Plan ==
InsertIntoHiveTable MetastoreRelation bigbench_3t_sparksql, 
q19_spark_sql_run_query_0_result, None, Map(), false, false
+- ConvertToSafe
   +- Sort [item_sk#537L ASC,review_sentence#538 ASC,sentiment#539 
ASC,sentiment_word#540 ASC], true, 0
  +- ConvertToUnsafe
 +- Exchange rangepartitioning(item_sk#537L ASC,review_sentence#538 
ASC,sentiment#539 ASC,sentiment_word#540 ASC,200), None
+- ConvertToSafe
   +- Project 
[item_sk#537L,review_sentence#538,sentiment#539,sentiment_word#540]
  +- Filter (sentiment#539 = NEG)
 +- !Generate 
HiveGenericUDTF#io.bigdatabenchmark.v1.queries.q10.SentimentUDF(pr_item_sk#363L,pr_review_content#366),
 false, false, 
[item_sk#537L,review_sentence#538,sentiment#539,sentiment_word#540]
+- ConvertToSafe
   +- Project [pr_item_sk#363L,pr_review_content#366]
  +- Filter (abs((cast((sr_item_qty#356L - 
wr_item_qty#357L) as double) / (cast((sr_item_qty#356L + wr_item_qty#357L) as 
double) / 2.0))) <= 0.1)
 +- SortMergeJoin [sr_item_sk#369L], 
[wr_item_sk#445L]
:- Sort [sr_item_sk#369L ASC], false, 0
:  +- Project 
[pr_item_sk#363L,sr_item_qty#356L,pr_review_content#366,sr_item_sk#369L]
: +- SortMergeJoin [pr_item_sk#363L], 
[sr_item_sk#369L]
::- Sort [pr_item_sk#363L ASC], 
false, 0
::  +- TungstenExchange 
hashpartitioning(pr_item_sk#363L,200), None
:: +- ConvertToUnsafe
::+- HiveTableScan 
[pr_item_sk#363L,pr_review_content#366], MetastoreRelation 
bigbench_3t_sparksql, product_reviews, Some(pr)

[jira] [Updated] (SPARK-15219) [Spark SQL] it don't support to detect runtime temporary table for enabling broadcast hash join optimization

2016-05-09 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-15219:

Description: 
We observed an interesting thing about broadcast hash join (similar to map join 
in Hive) when comparing with the implementation on the Hive on MR engine. The query 
below is a multi-way join over 3 inputs: the product_reviews table plus 2 
run-time temporary result tables (fsr and fwr) produced by 'select' queries, and 
there is also a two-way join (1 table and 1 run-time temporary table) inside both 
'fsr' and 'fwr', which causes slower performance than Hive on MR. We investigated 
the difference between Spark SQL and the Hive on MR engine and found that Hive on 
MR runs a total of 5 map join tasks with tuned map join parameters, while Spark 
SQL runs only 2 broadcast hash join tasks even if we set a larger threshold 
(e.g., 1 GB) for broadcast hash join. From our investigation, it seems that when a 
run-time temporary table participates in a join, the Spark SQL engine does not 
consider it for the broadcast hash join optimization. 

Core SQL snippet:
{code}
INSERT INTO TABLE q19_spark_sql_power_test_0_result
SELECT *
FROM
( --wrap in additional FROM(), because Sorting/distribute by with UDTF in 
select clause is not allowed
  SELECT extract_sentiment(pr.pr_item_sk, pr.pr_review_content) AS
  (
item_sk,
review_sentence,
sentiment,
sentiment_word
  )
  FROM product_reviews pr,
  (
--store returns in week ending given date
SELECT sr_item_sk, SUM(sr_return_quantity) sr_item_qty
FROM store_returns sr,
(
  -- within the week ending a given date
  SELECT d1.d_date_sk
  FROM date_dim d1, date_dim d2
  WHERE d1.d_week_seq = d2.d_week_seq
  AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', '2004-12-20' )
) sr_dateFilter
WHERE sr.sr_returned_date_sk = d_date_sk
GROUP BY sr_item_sk --across all store and web channels
HAVING sr_item_qty > 0
  ) fsr,
  (
--web returns in week ending given date
SELECT wr_item_sk, SUM(wr_return_quantity) wr_item_qty
FROM web_returns wr,
(
  -- within the week ending a given date
  SELECT d1.d_date_sk
  FROM date_dim d1, date_dim d2
  WHERE d1.d_week_seq = d2.d_week_seq
  AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', '2004-12-20' )
) wr_dateFilter
WHERE wr.wr_returned_date_sk = d_date_sk
GROUP BY wr_item_sk  --across all store and web channels
HAVING wr_item_qty > 0
  ) fwr
  WHERE fsr.sr_item_sk = fwr.wr_item_sk
  AND pr.pr_item_sk = fsr.sr_item_sk --extract product_reviews for found items
  -- equivalent across all store and web channels (within a tolerance of +/- 
10%)
  AND abs( (sr_item_qty-wr_item_qty)/ ((sr_item_qty+wr_item_qty)/2)) <= 0.1
)extractedSentiments
WHERE sentiment= 'NEG' --if there are any major negative reviews.
ORDER BY item_sk,review_sentence,sentiment,sentiment_word
;
{code}

  was:
We observed an interesting thing about broadcast hash join (similar to map join 
in Hive) when comparing with the implementation on the Hive on MR engine. The query 
below is a multi-way join over 3 inputs: the product_reviews table plus 2 
run-time temporary result tables (fsr and fwr) produced by 'select' queries, and 
there is also a two-way join (1 table and 1 run-time temporary table) inside both 
'fsr' and 'fwr'. We investigated the difference between Spark SQL and the Hive on 
MR engine and found that Hive on MR runs a total of 5 map join tasks with tuned 
map join parameters, while Spark SQL runs only 2 broadcast hash join tasks even if 
we set a larger threshold (e.g., 1 GB) for broadcast hash join. From our 
investigation, it seems that when a run-time temporary table participates in a 
join, the Spark SQL engine does not consider it for the broadcast hash join 
optimization. 

Core SQL snippet:
{code}
INSERT INTO TABLE q19_spark_sql_power_test_0_result
SELECT *
FROM
( --wrap in additional FROM(), because Sorting/distribute by with UDTF in 
select clause is not allowed
  SELECT extract_sentiment(pr.pr_item_sk, pr.pr_review_content) AS
  (
item_sk,
review_sentence,
sentiment,
sentiment_word
  )
  FROM product_reviews pr,
  (
--store returns in week ending given date
SELECT sr_item_sk, SUM(sr_return_quantity) sr_item_qty
FROM store_returns sr,
(
  -- within the week ending a given date
  SELECT d1.d_date_sk
  FROM date_dim d1, date_dim d2
  WHERE d1.d_week_seq = d2.d_week_seq
  AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', '2004-12-20' )
) sr_dateFilter
WHERE sr.sr_returned_date_sk = d_date_sk
GROUP BY sr_item_sk --across all store and web channels
HAVING sr_item_qty > 0
  ) fsr,
  (
--web returns in week ending given date
SELECT wr_item_sk, SUM(wr_return_quantity) wr_item_qty
FROM web_returns wr,
(
  -- within the 

[jira] [Created] (SPARK-15219) [Spark SQL] it don't support to detect runtime temporary table for enabling broadcast hash join optimization

2016-05-09 Thread Yi Zhou (JIRA)
Yi Zhou created SPARK-15219:
---

 Summary: [Spark SQL] it don't support to detect runtime temporary 
table for enabling broadcast hash join optimization
 Key: SPARK-15219
 URL: https://issues.apache.org/jira/browse/SPARK-15219
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yi Zhou


We observed an interesting thing about broadcast hash join (similar to map join 
in Hive) when comparing with the implementation on the Hive on MR engine. The query 
below is a multi-way join over 3 inputs: the product_reviews table plus 2 
run-time temporary result tables (fsr and fwr) produced by 'select' queries, and 
there is also a two-way join (1 table and 1 run-time temporary table) inside both 
'fsr' and 'fwr'. We investigated the difference between Spark SQL and the Hive on 
MR engine and found that Hive on MR runs a total of 5 map join tasks with tuned 
map join parameters, while Spark SQL runs only 2 broadcast hash join tasks even if 
we set a larger threshold (e.g., 1 GB) for broadcast hash join. From our 
investigation, it seems that when a run-time temporary table participates in a 
join, the Spark SQL engine does not consider it for the broadcast hash join 
optimization (a workaround sketch using an explicit broadcast hint is shown after 
the SQL snippet below). 

Core SQL snippet:
{code}
INSERT INTO TABLE q19_spark_sql_power_test_0_result
SELECT *
FROM
( --wrap in additional FROM(), because Sorting/distribute by with UDTF in 
select clause is not allowed
  SELECT extract_sentiment(pr.pr_item_sk, pr.pr_review_content) AS
  (
item_sk,
review_sentence,
sentiment,
sentiment_word
  )
  FROM product_reviews pr,
  (
--store returns in week ending given date
SELECT sr_item_sk, SUM(sr_return_quantity) sr_item_qty
FROM store_returns sr,
(
  -- within the week ending a given date
  SELECT d1.d_date_sk
  FROM date_dim d1, date_dim d2
  WHERE d1.d_week_seq = d2.d_week_seq
  AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', '2004-12-20' )
) sr_dateFilter
WHERE sr.sr_returned_date_sk = d_date_sk
GROUP BY sr_item_sk --across all store and web channels
HAVING sr_item_qty > 0
  ) fsr,
  (
--web returns in week ending given date
SELECT wr_item_sk, SUM(wr_return_quantity) wr_item_qty
FROM web_returns wr,
(
  -- within the week ending a given date
  SELECT d1.d_date_sk
  FROM date_dim d1, date_dim d2
  WHERE d1.d_week_seq = d2.d_week_seq
  AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', '2004-12-20' )
) wr_dateFilter
WHERE wr.wr_returned_date_sk = d_date_sk
GROUP BY wr_item_sk  --across all store and web channels
HAVING wr_item_qty > 0
  ) fwr
  WHERE fsr.sr_item_sk = fwr.wr_item_sk
  AND pr.pr_item_sk = fsr.sr_item_sk --extract product_reviews for found items
  -- equivalent across all store and web channels (within a tolerance of +/- 
10%)
  AND abs( (sr_item_qty-wr_item_qty)/ ((sr_item_qty+wr_item_qty)/2)) <= 0.1
)extractedSentiments
WHERE sentiment= 'NEG' --if there are any major negative reviews.
ORDER BY item_sk,review_sentence,sentiment,sentiment_word
;
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8428) TimSort Comparison method violates its general contract with CLUSTER BY

2016-05-05 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273534#comment-15273534
 ] 

Yi Zhou commented on SPARK-8428:


We found a similar issue with Spark 1.6.1 in our larger data size test. I 
posted the details below. We then tried to increase 
spark.sql.shuffle.partitions to resolve it. 
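
A minimal sketch (not part of the original comment) of that workaround, assuming the Spark SQL CLI; the partition count of 2000 is only an illustrative value:
{code:sql}
-- increase the number of shuffle partitions before re-running the INSERT below
SET spark.sql.shuffle.partitions=2000;
{code}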

{code}
CREATE TABLE q26_spark_sql_run_query_0_temp (
  cid  BIGINT,
  id1  double,
  id2  double,
  id3  double,
  id4  double,
  id5  double,
  id6  double,
  id7  double,
  id8  double,
  id9  double,
  id10 double,
  id11 double,
  id12 double,
  id13 double,
  id14 double,
  id15 double
)

INSERT INTO TABLE q26_spark_sql_run_query_0_temp
SELECT
  ss.ss_customer_sk AS cid,
  count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
  count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
  count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
  count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
  count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
  count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
  count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
  count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
  count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
  count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
  count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
  count(CASE WHEN i.i_class_id=12 THEN 1 ELSE NULL END) AS id12,
  count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
  count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
  count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15
FROM store_sales ss
INNER JOIN item i
  ON (ss.ss_item_sk = i.i_item_sk
  AND i.i_category IN ('Books')
  AND ss.ss_customer_sk IS NOT NULL
)
GROUP BY ss.ss_customer_sk
HAVING count(ss.ss_item_sk) > 5
ORDER BY cid
{code}

{code}
16/05/05 14:50:03 WARN scheduler.TaskSetManager: Lost task 12.0 in stage 162.0 
(TID 15153, node6): java.lang.IllegalArgumentException: Comparison method 
violates its
general contract!
at 
org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:794)
at 
org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:525)
at 
org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:453)
at 
org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:325)
at org.apache.spark.util.collection.TimSort.sort(TimSort.java:153)
at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:228)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:186)
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:175)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249)
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:295)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:330)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 

[jira] [Comment Edited] (SPARK-11293) Spillable collections leak shuffle memory

2016-03-22 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15206635#comment-15206635
 ] 

Yi Zhou edited comment on SPARK-11293 at 3/22/16 3:56 PM:
--

We seem to hit this issue with Spark 1.6.1, but we are not sure whether it is related 
to this one. If yes, can it be fixed in Spark 1.6.2?
{code}
16/03/22 23:10:26 INFO memory.TaskMemoryManager: Allocate page number 16 
(67108864 bytes)
16/03/22 23:10:26 INFO sort.UnsafeExternalSorter: Thread 221 spilling sort data 
of 1472.0 MB to disk (0  time so far)
16/03/22 23:11:26 INFO memory.TaskMemoryManager: Allocate page number 1 
(1060044737 bytes)
16/03/22 23:11:26 INFO memory.TaskMemoryManager: Memory used in task 9302
16/03/22 23:11:26 INFO memory.TaskMemoryManager: Acquired by 
org.apache.spark.shuffle.sort.ShuffleExternalSorter@8bac554: 32.0 KB
16/03/22 23:11:26 INFO memory.TaskMemoryManager: Acquired by 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@7a117b4f: 
512.0 MB
16/03/22 23:11:26 INFO memory.TaskMemoryManager: 0 bytes of memory were used by 
task 9302 but are not associated with specific consumers
16/03/22 23:11:26 INFO memory.TaskMemoryManager: 14909439433 bytes of memory 
are used for execution and 1376877 bytes of memory are used for storage
16/03/22 23:11:26 WARN memory.TaskMemoryManager: leak 32.0 KB memory from 
org.apache.spark.shuffle.sort.ShuffleExternalSorter@8bac554
16/03/22 23:11:26 ERROR executor.Executor: Managed memory leak detected; size = 
32768 bytes, TID = 9302
16/03/22 23:11:26 ERROR executor.Executor: Exception in task 192.0 in stage 
153.0 (TID 9302)
java.lang.OutOfMemoryError: Unable to acquire 1073741824 bytes of memory, got 
1060044737
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:91)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:295)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:330)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}


was (Author: jameszhouyi):
We seem to hit this issue with Spark 1.6.1, but we are not sure whether it is related 
to this one. If yes, can it be fixed in Spark 1.6.2?

16/03/22 23:10:26 INFO memory.TaskMemoryManager: Allocate page number 16 
(67108864 bytes)
16/03/22 23:10:26 INFO sort.UnsafeExternalSorter: Thread 221 spilling sort data 
of 1472.0 MB to disk (0  time so far)
16/03/22 23:11:26 INFO memory.TaskMemoryManager: Allocate page number 1 
(1060044737 bytes)
16/03/22 23:11:26 INFO memory.TaskMemoryManager: Memory used in task 9302
16/03/22 23:11:26 INFO memory.TaskMemoryManager: Acquired by 

[jira] [Commented] (SPARK-11293) Spillable collections leak shuffle memory

2016-03-22 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15206635#comment-15206635
 ] 

Yi Zhou commented on SPARK-11293:
-

We seem to hit this issue with Spark 1.6.1, but we are not sure whether it is related 
to this one. If yes, can it be fixed in Spark 1.6.2?

16/03/22 23:10:26 INFO memory.TaskMemoryManager: Allocate page number 16 
(67108864 bytes)
16/03/22 23:10:26 INFO sort.UnsafeExternalSorter: Thread 221 spilling sort data 
of 1472.0 MB to disk (0  time so far)
16/03/22 23:11:26 INFO memory.TaskMemoryManager: Allocate page number 1 
(1060044737 bytes)
16/03/22 23:11:26 INFO memory.TaskMemoryManager: Memory used in task 9302
16/03/22 23:11:26 INFO memory.TaskMemoryManager: Acquired by 
org.apache.spark.shuffle.sort.ShuffleExternalSorter@8bac554: 32.0 KB
16/03/22 23:11:26 INFO memory.TaskMemoryManager: Acquired by 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@7a117b4f: 
512.0 MB
16/03/22 23:11:26 INFO memory.TaskMemoryManager: 0 bytes of memory were used by 
task 9302 but are not associated with specific consumers
16/03/22 23:11:26 INFO memory.TaskMemoryManager: 14909439433 bytes of memory 
are used for execution and 1376877 bytes of memory are used for storage
16/03/22 23:11:26 WARN memory.TaskMemoryManager: leak 32.0 KB memory from 
org.apache.spark.shuffle.sort.ShuffleExternalSorter@8bac554
16/03/22 23:11:26 ERROR executor.Executor: Managed memory leak detected; size = 
32768 bytes, TID = 9302
16/03/22 23:11:26 ERROR executor.Executor: Exception in task 192.0 in stage 
153.0 (TID 9302)
java.lang.OutOfMemoryError: Unable to acquire 1073741824 bytes of memory, got 
1060044737
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:91)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:295)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:330)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


> Spillable collections leak shuffle memory
> -
>
> Key: SPARK-11293
> URL: https://issues.apache.org/jira/browse/SPARK-11293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
>
> I discovered multiple leaks of shuffle memory while working on my memory 
> manager consolidation patch, which added the ability to do strict memory leak 
> detection for the bookkeeping that used to be performed by the 
> 

[jira] [Created] (SPARK-11972) [Spark SQL] the value of 'hiveconf' parameter in CLI can't be got after enter spark-sql session

2015-11-24 Thread Yi Zhou (JIRA)
Yi Zhou created SPARK-11972:
---

 Summary: [Spark SQL] the value of 'hiveconf' parameter in CLI 
can't be got after enter spark-sql session
 Key: SPARK-11972
 URL: https://issues.apache.org/jira/browse/SPARK-11972
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Yi Zhou


Reproduce Steps:
/usr/lib/spark/bin/spark-sql -v --driver-memory 4g --executor-memory 7g 
--executor-cores 5 --num-executors 31 --master yarn-client --conf 
spark.yarn.executor.memoryOverhead=1024 --hiveconf RESULT_TABLE=test_result01

>use test;
> DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
15/11/24 13:45:12 INFO parse.ParseDriver: Parsing command: DROP TABLE IF EXISTS 
${hiveconf:RESULT_TABLE}
NoViableAltException(16@[192:1: tableName : (db= identifier DOT tab= identifier 
-> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME $tab) );])
at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
at org.antlr.runtime.DFA.predict(DFA.java:144)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.tableName(HiveParser_FromClauseParser.java:4747)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.tableName(HiveParser.java:45918)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.dropTableStatement(HiveParser.java:7133)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2655)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1650)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1109)
at 
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
at 
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:276)
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:303)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:295)
at 
org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:65)
at 
org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:65)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:279)
at 
org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:226)
at 
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:225)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:268)
at org.apache.spark.sql.hive.HiveQLDialect.parse(HiveContext.scala:64)
at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:115)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
  

[jira] [Updated] (SPARK-11972) [Spark SQL] the value of 'hiveconf' parameter in CLI can't be got after enter spark-sql session

2015-11-24 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-11972:

Description: 
Reproduce Steps:
/usr/lib/spark/bin/spark-sql -v --driver-memory 4g --executor-memory 7g 
--executor-cores 5 --num-executors 31 --master yarn-client --conf 
spark.yarn.executor.memoryOverhead=1024 --hiveconf RESULT_TABLE=test_result01
{code}
>use test;
> DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
15/11/24 13:45:12 INFO parse.ParseDriver: Parsing command: DROP TABLE IF EXISTS 
${hiveconf:RESULT_TABLE}
NoViableAltException(16@[192:1: tableName : (db= identifier DOT tab= identifier 
-> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME $tab) );])
at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
at org.antlr.runtime.DFA.predict(DFA.java:144)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.tableName(HiveParser_FromClauseParser.java:4747)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.tableName(HiveParser.java:45918)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.dropTableStatement(HiveParser.java:7133)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2655)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1650)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1109)
at 
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
at 
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:276)
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:303)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:295)
at 
org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:65)
at 
org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:65)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:279)
at 
org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:226)
at 
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:225)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:268)
at org.apache.spark.sql.hive.HiveQLDialect.parse(HiveContext.scala:64)
at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:115)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 

[jira] [Updated] (SPARK-11972) [Spark SQL] the value of 'hiveconf' parameter in CLI can't be got after enter spark-sql session

2015-11-24 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-11972:

Description: 
Reproduce Steps:
/usr/lib/spark/bin/spark-sql -v --driver-memory 4g --executor-memory 7g 
--executor-cores 5 --num-executors 31 --master yarn-client --conf 
spark.yarn.executor.memoryOverhead=1024 --hiveconf RESULT_TABLE=test_result01
{code}
>use test;
>DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
15/11/24 13:45:12 INFO parse.ParseDriver: Parsing command: DROP TABLE IF EXISTS 
${hiveconf:RESULT_TABLE}
NoViableAltException(16@[192:1: tableName : (db= identifier DOT tab= identifier 
-> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME $tab) );])
at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
at org.antlr.runtime.DFA.predict(DFA.java:144)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.tableName(HiveParser_FromClauseParser.java:4747)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.tableName(HiveParser.java:45918)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.dropTableStatement(HiveParser.java:7133)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2655)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1650)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1109)
at 
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
at 
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:276)
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:303)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:295)
at 
org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:65)
at 
org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:65)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:279)
at 
org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:226)
at 
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:225)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:268)
at org.apache.spark.sql.hive.HiveQLDialect.parse(HiveContext.scala:64)
at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:115)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 

[jira] [Updated] (SPARK-11972) [Spark SQL] the value of 'hiveconf' parameter in CLI can't be got after enter spark-sql session

2015-11-24 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-11972:

Priority: Critical  (was: Major)

> [Spark SQL] the value of 'hiveconf' parameter in CLI can't be got after enter 
> spark-sql session
> ---
>
> Key: SPARK-11972
> URL: https://issues.apache.org/jira/browse/SPARK-11972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Yi Zhou
>Priority: Critical
>
> Reproduce Steps:
> /usr/lib/spark/bin/spark-sql -v --driver-memory 4g --executor-memory 7g 
> --executor-cores 5 --num-executors 31 --master yarn-client --conf 
> spark.yarn.executor.memoryOverhead=1024 --hiveconf RESULT_TABLE=test_result01
> {code}
> >use test;
> >DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
> 15/11/24 13:45:12 INFO parse.ParseDriver: Parsing command: DROP TABLE IF 
> EXISTS ${hiveconf:RESULT_TABLE}
> NoViableAltException(16@[192:1: tableName : (db= identifier DOT tab= 
> identifier -> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME 
> $tab) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.tableName(HiveParser_FromClauseParser.java:4747)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.tableName(HiveParser.java:45918)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.dropTableStatement(HiveParser.java:7133)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2655)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1650)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1109)
> at 
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
> at 
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
> at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:276)
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:303)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
> at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at 
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
> at 
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
> at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
> at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:295)
> at 
> org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:65)
> at 
> org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:65)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:279)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:226)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:225)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:268)
> at 

[jira] [Commented] (SPARK-11972) [Spark SQL] the value of 'hiveconf' parameter in CLI can't be got after enter spark-sql session

2015-11-24 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15026283#comment-15026283
 ] 

Yi Zhou commented on SPARK-11972:
-

Thanks [~adrian-wang]!
The case passes now after applying the patch.
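
A minimal sketch (not part of the original comment) of the expected behavior after the fix, assuming the CLI was started with --hiveconf RESULT_TABLE=test_result01 as in the reproduce steps above:
{code:sql}
-- the hiveconf variable is expected to be substituted before parsing, i.e.
DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
-- which should behave the same as
DROP TABLE IF EXISTS test_result01;
{code}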

> [Spark SQL] the value of 'hiveconf' parameter in CLI can't be got after enter 
> spark-sql session
> ---
>
> Key: SPARK-11972
> URL: https://issues.apache.org/jira/browse/SPARK-11972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Yi Zhou
>Priority: Critical
>
> Reproduce Steps:
> /usr/lib/spark/bin/spark-sql -v --driver-memory 4g --executor-memory 7g 
> --executor-cores 5 --num-executors 31 --master yarn-client --conf 
> spark.yarn.executor.memoryOverhead=1024 --hiveconf RESULT_TABLE=test_result01
> {code}
> >use test;
> >DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
> 15/11/24 13:45:12 INFO parse.ParseDriver: Parsing command: DROP TABLE IF 
> EXISTS ${hiveconf:RESULT_TABLE}
> NoViableAltException(16@[192:1: tableName : (db= identifier DOT tab= 
> identifier -> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME 
> $tab) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.tableName(HiveParser_FromClauseParser.java:4747)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.tableName(HiveParser.java:45918)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.dropTableStatement(HiveParser.java:7133)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2655)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1650)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1109)
> at 
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
> at 
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
> at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:276)
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:303)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
> at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at 
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
> at 
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
> at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
> at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:295)
> at 
> org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:65)
> at 
> org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:65)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:279)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:226)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:225)
> at 
> 

[jira] [Created] (SPARK-10865) [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type

2015-09-28 Thread Yi Zhou (JIRA)
Yi Zhou created SPARK-10865:
---

 Summary: [Spark SQL] [UDF] the ceil/ceiling function got wrong 
return value type
 Key: SPARK-10865
 URL: https://issues.apache.org/jira/browse/SPARK-10865
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yi Zhou


As per the ceil/ceiling definition, it should return a BIGINT value:
- ceil(DOUBLE a), ceiling(DOUBLE a)
- Returns the minimum BIGINT value that is equal to or greater than a.

But the current implementation returns the wrong value type, e.g.,
select ceil(2642.12) from udf_test_web_sales limit 1;
2643.0
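
A minimal sketch (not from the original report; ceil_type_check is a hypothetical helper table) to check the declared result type, assuming the udf_test_web_sales table above:
{code:sql}
CREATE TABLE ceil_type_check AS
SELECT ceil(2642.12) AS c FROM udf_test_web_sales LIMIT 1;
DESCRIBE ceil_type_check;   -- expected column type: bigint
DROP TABLE ceil_type_check;
{code}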



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10865) [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type

2015-09-28 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10865:

Description: 
As per the ceil/ceiling definition, it should return a BIGINT value:
- ceil(DOUBLE a), ceiling(DOUBLE a)
- Returns the minimum BIGINT value that is equal to or greater than a.

But the current Spark implementation returns the wrong value type, e.g.,
select ceil(2642.12) from udf_test_web_sales limit 1;
2643.0

The Hive implementation returns the correct value type, as below:
hive> select ceil(2642.12) from udf_test_web_sales limit 1;
OK
2643

  was:
As per the ceil/ceiling definition, it should return a BIGINT value:
- ceil(DOUBLE a), ceiling(DOUBLE a)
- Returns the minimum BIGINT value that is equal to or greater than a.

But the current implementation returns the wrong value type, e.g.,
select ceil(2642.12) from udf_test_web_sales limit 1;
2643.0


> [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type
> ---
>
> Key: SPARK-10865
> URL: https://issues.apache.org/jira/browse/SPARK-10865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>
> As per ceil/ceiling definition,it should get BIGINT return value
> -ceil(DOUBLE a), ceiling(DOUBLE a)
> -Returns the minimum BIGINT value that is equal to or greater than a.
> But in current Spark implementation, it got wrong value type.
> e.g., 
> select ceil(2642.12) from udf_test_web_sales limit 1;
> 2643.0
> In hive implementation, it got return value type like below:
> hive> select ceil(2642.12) from udf_test_web_sales limit 1;
> OK
> 2643



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10866) [Spark SQL] [UDF] the floor function got wrong return value type

2015-09-28 Thread Yi Zhou (JIRA)
Yi Zhou created SPARK-10866:
---

 Summary: [Spark SQL] [UDF] the floor function got wrong return 
value type
 Key: SPARK-10866
 URL: https://issues.apache.org/jira/browse/SPARK-10866
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yi Zhou




As per the floor definition, it should return a BIGINT value:
- floor(DOUBLE a)
- Returns the maximum BIGINT value that is equal to or less than a.

But the current Spark implementation returns the wrong value type, e.g.,
select floor(2642.12) from udf_test_web_sales limit 1;
2642.0

The Hive implementation returns the correct value type, as below:
hive> select floor(2642.12) from udf_test_web_sales limit 1;
OK
2642
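
A minimal sketch (not from the original report; floor_type_check is a hypothetical helper table) to check the declared result type, assuming the udf_test_web_sales table above:
{code:sql}
CREATE TABLE floor_type_check AS
SELECT floor(2642.12) AS f FROM udf_test_web_sales LIMIT 1;
DESCRIBE floor_type_check;   -- expected column type: bigint
DROP TABLE floor_type_check;
{code}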




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-24 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14906291#comment-14906291
 ] 

Yi Zhou commented on SPARK-10474:
-

Hi [~andrewor14] [~yhuai]. It works for me now and I get no errors. Thanks!

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.5.1, 1.6.0
>
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-

[jira] [Commented] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-22 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903740#comment-14903740
 ] 

Yi Zhou commented on SPARK-10733:
-

Yes, I still got the error after applying the commit.

> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This is uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-22 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902230#comment-14902230
 ] 

Yi Zhou commented on SPARK-10474:
-

See my comments in https://issues.apache.org/jira/browse/SPARK-10733

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-22 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902563#comment-14902563
 ] 

Yi Zhou commented on SPARK-10733:
-

Applying 
https://github.com/andrewor14/spark/commit/2a9cf5a1b3be4fca858fe09f17750ddb450055d8
still throws out the error message below:

15/09/22 17:02:57 WARN scheduler.TaskSetManager: Lost task 116.0 in stage 152.0 
(TID 1737, bb-node1): java.io.IOException: Unable to acquire 16777216 bytes of 
memory
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:134)
at 
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This is uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at 

[jira] [Commented] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-21 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901959#comment-14901959
 ] 

Yi Zhou commented on SPARK-10733:
-

1) I have never set `spark.task.cpus`, so the default value is used.
2) scale factor=1000 (1TB data set)
3) Spark conf is as below (a sketch of how these settings could be supplied programmatically follows):
spark.shuffle.manager=sort
spark.sql.hive.metastore.version=1.1.0
spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
spark.executor.extraClassPath=/etc/hive/conf
spark.driver.extraClassPath=/etc/hive/conf
spark.serializer=org.apache.spark.serializer.KryoSerializer
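
A minimal sketch, assuming a Spark 1.5 application launched through spark-submit, of how the properties listed above could be supplied programmatically; the application name is a placeholder, and the extraClassPath entries normally belong in spark-defaults.conf or on the spark-submit command line because they must be visible before the JVM starts.

{code:scala}
// Sketch only: the property names and values are taken verbatim from the
// comment above; the app name is a placeholder and the master is expected
// to come from spark-submit.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf()
  .setAppName("tungsten-aggregation-repro") // placeholder
  .set("spark.shuffle.manager", "sort")
  .set("spark.sql.hive.metastore.version", "1.1.0")
  .set("spark.sql.hive.metastore.jars",
    "/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*")
  // extraClassPath is shown here for completeness; it normally goes in
  // spark-defaults.conf or on the spark-submit command line.
  .set("spark.executor.extraClassPath", "/etc/hive/conf")
  .set("spark.driver.extraClassPath", "/etc/hive/conf")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
{code}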



> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This is uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-21 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901959#comment-14901959
 ] 

Yi Zhou edited comment on SPARK-10733 at 9/22/15 5:18 AM:
--

1) I have never set `spark.task.cpus`, so the default value is used.
2) scale factor=1000 (1TB data set)
3) Spark conf like below
spark.shuffle.manager=sort
spark.sql.hive.metastore.version=1.1.0
spark.sql.hive.metastore.jars=/usr/lib/hive/lib/\*:/usr/lib/hadoop/client/\*
spark.executor.extraClassPath=/etc/hive/conf
spark.driver.extraClassPath=/etc/hive/conf
spark.serializer=org.apache.spark.serializer.KryoSerializer




was (Author: jameszhouyi):
1) I have never set `spark.task.cpus` and only use  by default.
2) scale factor=1000 (1TB data set)
3) Spark conf like below
spark.shuffle.manager=sort
spark.sql.hive.metastore.version=1.1.0
spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
spark.executor.extraClassPath=/etc/hive/conf
spark.driver.extraClassPath=/etc/hive/conf
spark.serializer=org.apache.spark.serializer.KryoSerializer



> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This is uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-21 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900562#comment-14900562
 ] 

Yi Zhou commented on SPARK-10474:
-

Hi [~andrewor14]
I found a new error (below) after applying this PR. I am not sure whether someone 
has already reported it, or whether I am missing another PR related to the issue below.

15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 152.0 
(TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 bytes of 
memory
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> 

[jira] [Commented] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-21 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901951#comment-14901951
 ] 

Yi Zhou commented on SPARK-10733:
-

Key SQL Query:
INSERT INTO TABLE test_table
SELECT
  ss.ss_customer_sk AS cid,
  count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
  count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
  count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
  count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
  count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
  count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
  count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
  count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
  count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
  count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
  count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
  count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
  count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
  count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
  count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
FROM store_sales ss
INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
WHERE i.i_category IN ('Books')
AND ss.ss_customer_sk IS NOT NULL
GROUP BY ss.ss_customer_sk
HAVING count(ss.ss_item_sk) > 5

Note:
the store_sales is a big fact table and item is a small dimension table.
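
A minimal sketch, assuming a Spark 1.5 HiveContext named sqlContext and the bigbench tables registered in the metastore, of how the query above can be run and its physical plan inspected; only two of the sixteen conditional counts are repeated here to keep the sketch short.

{code:scala}
// Sketch only: abbreviated form of the aggregation above, used to look at the
// plan that contains the TungstenAggregate operators seen in the stack trace.
val df = sqlContext.sql("""
  SELECT
    ss.ss_customer_sk AS cid,
    count(CASE WHEN i.i_class_id = 1 THEN 1 ELSE NULL END) AS id1,
    count(CASE WHEN i.i_class_id = 3 THEN 1 ELSE NULL END) AS id3
  FROM store_sales ss
  INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
  WHERE i.i_category IN ('Books')
    AND ss.ss_customer_sk IS NOT NULL
  GROUP BY ss.ss_customer_sk
  HAVING count(ss.ss_item_sk) > 5
""")

// Print the logical and physical plans (the physical plan shows the
// TungstenAggregate and shuffle stages), then force execution without
// writing into test_table.
df.explain(true)
df.count()
{code}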

> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This is uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-15 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746929#comment-14746929
 ] 

Yi Zhou edited comment on SPARK-10474 at 9/16/15 5:55 AM:
--

BTW, "spark.shuffle.safetyFraction" is not a public parameter for users.
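
A minimal sketch, assuming this internal setting is still read by the build in use: since it is undocumented, it would have to be passed as a plain string property like any other conf entry; the value below is purely illustrative, not a recommendation.

{code:scala}
// Sketch only: spark.shuffle.safetyFraction is internal/undocumented here,
// but SparkConf accepts it as an ordinary string property; 0.9 is arbitrary.
import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.shuffle.safetyFraction", "0.9")
{code}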


was (Author: jameszhouyi):
BTW, the "spark.shuffle.safetyFraction" is not public ..

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, 

[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-15 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746929#comment-14746929
 ] 

Yi Zhou commented on SPARK-10474:
-

BTW, "spark.shuffle.safetyFraction" is not public.

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-15 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746881#comment-14746881
 ] 

Yi Zhou commented on SPARK-10474:
-

Thanks [~chenghao]. It would be better not to throw such an exception.

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when cross join happen

2015-09-08 Thread Yi Zhou (JIRA)
Yi Zhou created SPARK-10484:
---

 Summary: [Spark SQL]  Come across lost task(timeout) or GC OOM 
error when cross join happen
 Key: SPARK-10484
 URL: https://issues.apache.org/jira/browse/SPARK-10484
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yi Zhou
Priority: Critical


Found that tasks are lost or a GC OOM error occurs when the cross join below 
happens. The left (big) table is ~1.2G in size and the right (small) table is ~2.2K.

Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}
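
A minimal workaround sketch, assuming a Spark 1.5 HiveContext named sqlContext and the tables above: because temp_stores_with_regression is tiny, its store names can be collected to the driver and matched inside a UDF, which avoids the CartesianProduct entirely. Unlike the original query it keeps only the first matching store per review and drops s_store_sk, so it illustrates the technique rather than being a drop-in replacement.

{code:scala}
// Sketch only: broadcast the small store list and match it in a UDF so the
// planner never needs a CartesianProduct between the two tables.
val storeNames: Array[String] = sqlContext
  .sql("SELECT lower(s_store_name) FROM temp_stores_with_regression")
  .collect()
  .map(_.getString(0))

val bcStores = sqlContext.sparkContext.broadcast(storeNames)

// Returns the first store name contained in the review text, or null.
sqlContext.udf.register("matched_store", (review: String) => {
  val text = if (review == null) "" else review.toLowerCase
  bcStores.value.find(name => text.contains(name)).orNull
})

val matched = sqlContext.sql(
  """SELECT matched_store(pr_review_content) AS s_store_name,
            pr_review_date,
            pr_review_content
     FROM product_reviews
     WHERE matched_store(pr_review_content) IS NOT NULL""")
matched.show(5)
{code}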



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when cross join happen

2015-09-08 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10484:

Description: 
Found that tasks are lost or a GC OOM error occurs when the cross join below 
happens. The left (big) table is ~1.2G in size and the right (small) table is ~2.2K.

Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}

  was:
Found that it lost task or GC OOM when below cross join happen. The left big 
table is ~1.2G in size and  the right small table is ~2.2K.

Key SQL
{code sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}


> [Spark SQL]  Come across lost task(timeout) or GC OOM error when cross join 
> happen
> --
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> Found that it lost task or GC OOM when below cross join happen. The left big 
> table is ~1.2G in size and  the right small table is ~2.2K.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) 
> AS store_ID#446,pr_review_date#449,pr_review_content#455]
> Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
>   CartesianProduct
>HiveTableScan [pr_review_date#449,pr_review_content#455], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
>HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
> Code Generation: true
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when cross join happen

2015-09-08 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10484:

Description: 
Found that tasks are lost or a GC OOM error occurs when the cross join below 
happens. The left (big) table is ~1.2G in size and the right (small) table is ~2.2K.

Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}

We also found a strange behavior 

  was:
Found that it lost task or GC OOM when below cross join happen. The left big 
table is ~1.2G in size and  the right small table is ~2.2K.

Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}


> [Spark SQL]  Come across lost task(timeout) or GC OOM error when cross join 
> happen
> --
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> Found that it lost task or GC OOM when below cross join happen. The left big 
> table is ~1.2G in size and  the right small table is ~2.2K.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) 
> AS store_ID#446,pr_review_date#449,pr_review_content#455]
> Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
>   CartesianProduct
>HiveTableScan [pr_review_date#449,pr_review_content#455], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
>HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
> Code Generation: true
> {code}
> We also found a strange bebavior 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when cross join happen

2015-09-08 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10484:

Issue Type: Improvement  (was: Bug)

> [Spark SQL]  Come across lost task(timeout) or GC OOM error when cross join 
> happen
> --
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> Found that it lost task or GC OOM when below cross join happen. The left big 
> table is ~1.2G in size and  the right small table is ~2.2K.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) 
> AS store_ID#446,pr_review_date#449,pr_review_content#455]
> Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
>   CartesianProduct
>HiveTableScan [pr_review_date#449,pr_review_content#455], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
>HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
> Code Generation: true
> {code}
> We also found a strange behavior that exchanging the two table in 'From' 
> clause can pass.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM temp_stores_with_regression stores_with_regression, product_reviews pr
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) 
> AS store_ID#446,pr_review_date#451,pr_review_content#457]
> Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
>   CartesianProduct
>HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
>HiveTableScan [pr_review_date#451,pr_review_content#457], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
> Code Generation: true
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when cross join happen

2015-09-08 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10484:

Issue Type: Bug  (was: Improvement)

> [Spark SQL]  Come across lost task(timeout) or GC OOM error when cross join 
> happen
> --
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> Found that it lost task or GC OOM when below cross join happen. The left big 
> table is ~1.2G in size and  the right small table is ~2.2K.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) 
> AS store_ID#446,pr_review_date#449,pr_review_content#455]
> Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
>   CartesianProduct
>HiveTableScan [pr_review_date#449,pr_review_content#455], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
>HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
> Code Generation: true
> {code}
> We also found a strange behavior that exchanging the two table in 'From' 
> clause can pass.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM temp_stores_with_regression stores_with_regression, product_reviews pr
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) 
> AS store_ID#446,pr_review_date#451,pr_review_content#457]
> Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
>   CartesianProduct
>HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
>HiveTableScan [pr_review_date#451,pr_review_content#457], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
> Code Generation: true
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when two table do cross join

2015-09-08 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10484:

Summary: [Spark SQL]  Come across lost task(timeout) or GC OOM error when 
two table do cross join  (was: [Spark SQL]  Come across lost task(timeout) or 
GC OOM error when cross join happen)

> [Spark SQL]  Come across lost task(timeout) or GC OOM error when two table do 
> cross join
> 
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> Found that it lost task or GC OOM when below cross join happen. The left big 
> table is ~1.2G in size and  the right small table is ~2.2K.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) 
> AS store_ID#446,pr_review_date#449,pr_review_content#455]
> Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
>   CartesianProduct
>HiveTableScan [pr_review_date#449,pr_review_content#455], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
>HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
> Code Generation: true
> {code}
> We also found a strange behavior that exchanging the two table in 'From' 
> clause can pass.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM temp_stores_with_regression stores_with_regression, product_reviews pr
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) 
> AS store_ID#446,pr_review_date#451,pr_review_content#457]
> Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
>   CartesianProduct
>HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
>HiveTableScan [pr_review_date#451,pr_review_content#457], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
> Code Generation: true
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when two tables do cross join

2015-09-08 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10484:

Summary: [Spark SQL]  Come across lost task(timeout) or GC OOM error when 
two tables do cross join  (was: [Spark SQL]  Come across lost task(timeout) or 
GC OOM error when two table do cross join)

> [Spark SQL]  Come across lost task(timeout) or GC OOM error when two tables 
> do cross join
> -
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> Found that it lost task or GC OOM when below cross join happen. The left big 
> table is ~1.2G in size and  the right small table is ~2.2K.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) 
> AS store_ID#446,pr_review_date#449,pr_review_content#455]
> Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
>   CartesianProduct
>HiveTableScan [pr_review_date#449,pr_review_content#455], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
>HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
> Code Generation: true
> {code}
> We also found a strange behavior that exchanging the two table in 'From' 
> clause can pass.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM temp_stores_with_regression stores_with_regression, product_reviews pr
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) 
> AS store_ID#446,pr_review_date#451,pr_review_content#457]
> Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
>   CartesianProduct
>HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
>HiveTableScan [pr_review_date#451,pr_review_content#457], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
> Code Generation: true
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when cross join happen

2015-09-08 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10484:

Description: 
Found that tasks are lost or a GC OOM error occurs when the cross join below 
happens. The left (big) table is ~1.2G in size and the right (small) table is ~2.2K.

Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}

We also found a strange behavior: exchanging the two tables in the 'From' clause 
makes the query pass.
Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM temp_stores_with_regression stores_with_regression, product_reviews pr
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) AS 
store_ID#446,pr_review_date#451,pr_review_content#457]
Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
  CartesianProduct
   HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))
   HiveTableScan [pr_review_date#451,pr_review_content#457], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
Code Generation: true
{code}
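
For reference, the comma-separated FROM clause above is an implicit cross join; below is a minimal sketch of the same query written with an explicit CROSS JOIN (equivalent HiveQL, shown only to make the Cartesian product easier to spot when reviewing the plan; the non-equi locate() predicate rules out a hash join either way).

{code:sql}
-- explicit form of the implicit comma join above; it should produce the same
-- CartesianProduct physical plan
SELECT
  CONCAT(s_store_sk, "_", s_store_name) AS store_ID,
  pr_review_date,
  pr_review_content
FROM product_reviews pr
CROSS JOIN temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name),
             lower(pr.pr_review_content), 1) >= 1;
{code}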


  was:
Found that it lost task or GC OOM when below cross join happen. The left big 
table is ~1.2G in size and  the right small table is ~2.2K.

Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}

We also found a strange behavior 


> [Spark SQL]  Come across lost task(timeout) or GC OOM error when cross join 
> happen
> --
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> Found that it lost task or GC OOM when below cross join happen. The left big 
> table is ~1.2G in size and  the right small table is ~2.2K.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) 
> AS store_ID#446,pr_review_date#449,pr_review_content#455]
> Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
>   CartesianProduct
>HiveTableScan [pr_review_date#449,pr_review_content#455], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
>HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
> Code Generation: true
> {code}
> We also found a strange behavior that exchanging the two table in 'From' 
> clause can pass.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM temp_stores_with_regression stores_with_regression, product_reviews pr
> WHERE 

[jira] [Updated] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when two tables do cross join

2015-09-08 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10484:

Description: 
Found that tasks are lost or a GC OOM error occurs when the cross join below 
happens. The left (big) table is ~1.2G in size and the right (small) table is ~2.2K.

Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}

We also found a workaround: exchanging the two tables in the 'FROM' clause lets 
the query pass, but with poor performance.
Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM temp_stores_with_regression stores_with_regression, product_reviews pr
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) AS 
store_ID#446,pr_review_date#451,pr_review_content#457]
Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
  CartesianProduct
   HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))
   HiveTableScan [pr_review_date#451,pr_review_content#457], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
Code Generation: true
{code}
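
To double-check the size skew quoted above from the CLI, the table statistics can be refreshed first; a small sketch (noscan only records the raw on-disk sizes):

{code:sql}
-- refresh the stored size statistics for both sides of the join
ANALYZE TABLE product_reviews COMPUTE STATISTICS noscan;
ANALYZE TABLE temp_stores_with_regression COMPUTE STATISTICS noscan;
-- the totalSize table property should then reflect the ~1.2G vs ~2.2K skew
DESCRIBE EXTENDED temp_stores_with_regression;
{code}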


  was:
Found that it lost task or GC OOM when below cross join happen. The left big 
table is ~1.2G in size and  the right small table is ~2.2K.

Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}

We also found a strange behavior that exchanging the two table in 'From' clause 
can pass.
Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM temp_stores_with_regression stores_with_regression, product_reviews pr
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) AS 
store_ID#446,pr_review_date#451,pr_review_content#457]
Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
  CartesianProduct
   HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))
   HiveTableScan [pr_review_date#451,pr_review_content#457], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
Code Generation: true
{code}



> [Spark SQL]  Come across lost task(timeout) or GC OOM error when two tables 
> do cross join
> -
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> Found that it lost task or GC OOM when below cross join happen. The left big 
> table is ~1.2G in size and  the right small table is ~2.2K.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 

[jira] [Updated] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when two tables do cross join

2015-09-08 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10484:

Description: 
Found that tasks are lost or a GC OOM error occurs when the cross join below 
happens. The left (big) table is ~1.2G in size and the right (small) table is ~2.2K.

Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}


We also found a workaround: exchanging the two tables in the 'FROM' clause lets 
the query pass, but with poor performance.
Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM temp_stores_with_regression stores_with_regression, product_reviews pr
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) AS 
store_ID#446,pr_review_date#451,pr_review_content#457]
Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
  CartesianProduct
   HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))
   HiveTableScan [pr_review_date#451,pr_review_content#457], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
Code Generation: true
{code}


  was:
Found that it lost task or GC OOM when below cross join happen. The left big 
table is ~1.2G in size and  the right small table is ~2.2K.

Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}

We also found a workaround that exchanging the two table in 'From' clause can 
pass but get poor performance
Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM temp_stores_with_regression stores_with_regression, product_reviews pr
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) AS 
store_ID#446,pr_review_date#451,pr_review_content#457]
Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
  CartesianProduct
   HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))
   HiveTableScan [pr_review_date#451,pr_review_content#457], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
Code Generation: true
{code}



> [Spark SQL]  Come across lost task(timeout) or GC OOM error when two tables 
> do cross join
> -
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> Found that it lost task or GC OOM when below cross join happen. The left big 
> table is ~1.2G in size and  the right small table is ~2.2K.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> 

[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation

2015-09-08 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734721#comment-14734721
 ] 

Yi Zhou commented on SPARK-5791:


[~yhuai], Yes. Thank you !

> [Spark SQL] show poor performance when multiple table do join operation
> ---
>
> Key: SPARK-5791
> URL: https://issues.apache.org/jira/browse/SPARK-5791
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Yi Zhou
> Attachments: Physcial_Plan_Hive.txt, 
> Physcial_Plan_SparkSQL_Updated.txt, Physical_Plan.txt
>
>
> Spark SQL shows poor performance when multiple tables do join operation
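
When comparing the attached plans, the Spark side can be regenerated with EXPLAIN EXTENDED from the spark-sql CLI; a small sketch (the join below is only a placeholder, the real multi-table query is in the attachments):

{code:sql}
-- placeholder join: substitute the actual multi-table query from the attachments
EXPLAIN EXTENDED
SELECT *
FROM store_sales ss
JOIN item i  ON ss.ss_item_sk  = i.i_item_sk
JOIN store s ON ss.ss_store_sk = s.s_store_sk;
{code}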



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-07 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10474:

Description: 
In an aggregation case, a lost task happened with the error below. I am not sure 
if the root cause is the same as https://issues.apache.org/jira/browse/SPARK-10341

 java.io.IOException: Could not acquire 65536 bytes of memory
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
at 
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


  was:
In aggregation case, below error happened. I am not sure if the root cause is 
same as https://issues.apache.org/jira/browse/SPARK-10341

 java.io.IOException: Could not acquire 65536 bytes of memory
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
at 
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 

[jira] [Created] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-07 Thread Yi Zhou (JIRA)
Yi Zhou created SPARK-10474:
---

 Summary: Aggregation failed with unable to acquire memory
 Key: SPARK-10474
 URL: https://issues.apache.org/jira/browse/SPARK-10474
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yi Zhou
Priority: Critical


In an aggregation case, the error below happened. I am not sure if the root cause 
is the same as https://issues.apache.org/jira/browse/SPARK-10341

 java.io.IOException: Could not acquire 65536 bytes of memory
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
at 
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-07 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10474:

Description: 
In an aggregation case, a lost task happened with the error below.

 java.io.IOException: Could not acquire 65536 bytes of memory
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
at 
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


  was:
In aggregation case, a  Lost task happened with below error. I am not sure if 
the root cause is same as https://issues.apache.org/jira/browse/SPARK-10341

 java.io.IOException: Could not acquire 65536 bytes of memory
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
at 
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   

[jira] [Updated] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-07 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10474:

Description: 
In an aggregation case, a lost task happened with the error below.

 java.io.IOException: Could not acquire 65536 bytes of memory
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
at 
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Key SQL Query
{code:sql}
INSERT INTO TABLE test_table
SELECT
  ss.ss_customer_sk AS cid,
  count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
  count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
  count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
  count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
  count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
  count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
  count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
  count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
  count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
  count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
  count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
  count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
  count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
  count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
  count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
FROM store_sales ss
INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
WHERE i.i_category IN ('Books')
AND ss.ss_customer_sk IS NOT NULL
GROUP BY ss.ss_customer_sk
HAVING count(ss.ss_item_sk) > 5
{code}

Note:
store_sales is a big fact table and item is a small dimension table.
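
One mitigation that may help here (not verified on this cluster) is to spread the aggregation over more shuffle partitions, so each task builds a smaller hash map before it has to spill; a sketch from the spark-sql CLI:

{code:sql}
-- sketch only: the right value depends on the data volume; the default is 200
SET spark.sql.shuffle.partitions=800;
-- then re-run the INSERT INTO TABLE test_table SELECT ... statement above
{code}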


  was:
In aggregation case, a  Lost task happened with below error.

 java.io.IOException: Could not acquire 65536 bytes of memory
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
at 
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
at 

[jira] [Updated] (SPARK-10310) [Spark SQL] All result records will be popluated into ONE line during the script transform due to missing the correct line/filed delimiter

2015-09-06 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10310:

Summary: [Spark SQL] All result records will be popluated into ONE line 
during the script transform due to missing the correct line/filed delimiter  
(was: [Spark SQL] All result records will be popluated into ONE line during the 
script transform due to missing the correct line/filed delimeter)

> [Spark SQL] All result records will be popluated into ONE line during the 
> script transform due to missing the correct line/filed delimiter
> --
>
> Key: SPARK-10310
> URL: https://issues.apache.org/jira/browse/SPARK-10310
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yi Zhou
>Priority: Critical
>
> There is a real case using a Python stream script in a Spark SQL query. We found 
> that all result records were written in ONE line as input from the "select" 
> pipeline for the Python script, so the script cannot identify each record. 
> Also, the field separator in Spark SQL will be '^A' or '\001', which is 
> inconsistent/incompatible with the '\t' in the Hive implementation.
> Key query:
> {code:sql}
> CREATE VIEW temp1 AS
> SELECT *
> FROM
> (
>   FROM
>   (
> SELECT
>   c.wcs_user_sk,
>   w.wp_type,
>   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
> FROM web_clickstreams c, web_page w
> WHERE c.wcs_web_page_sk = w.wp_web_page_sk
> AND   c.wcs_web_page_sk IS NOT NULL
> AND   c.wcs_user_sk IS NOT NULL
> AND   c.wcs_sales_sk IS NULL --abandoned implies: no sale
> DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
>   ) clicksAnWebPageType
>   REDUCE
> wcs_user_sk,
> tstamp_inSec,
> wp_type
>   USING 'python sessionize.py 3600'
>   AS (
> wp_type STRING,
> tstamp BIGINT, 
> sessionid STRING)
> ) sessionized
> {code}
> Key Python script:
> {noformat}
> for line in sys.stdin:
>  user_sk,  tstamp_str, value  = line.strip().split("\t")
> {noformat}
> Sample SELECT result:
> {noformat}
> ^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
> {noformat}
> Expected result:
> {noformat}
> 31   3237764860   feedback
> 31   3237769106   dynamic
> 31   3237779027   review
> {noformat}
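
For comparison, HiveQL allows the transform step to declare its stdin/stdout delimiters explicitly; a sketch of the REDUCE step above with ROW FORMAT clauses added (assuming Spark SQL honours them the same way Hive does):

{code:sql}
  REDUCE
    wcs_user_sk,
    tstamp_inSec,
    wp_type
  -- delimiters used when feeding rows to the script on stdin
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
  USING 'python sessionize.py 3600'
  AS (
    wp_type STRING,
    tstamp BIGINT,
    sessionid STRING)
  -- delimiters expected back from the script on stdout
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
{code}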



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10310) [Spark SQL] All result records will be popluated into ONE line during the script transform due to missing the correct line/filed delimiter

2015-09-06 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14733195#comment-14733195
 ] 

Yi Zhou commented on SPARK-10310:
-

Hi [~marmbrus]
Could you please help review and evaluate this critical Spark SQL issue to see if 
it can be fixed in Spark 1.5.0 (I saw the code is ready)? The issue causes records 
to be extracted incorrectly due to the missing line/field delimiter, and it blocks 
the conformance validation. Thanks in advance!

> [Spark SQL] All result records will be popluated into ONE line during the 
> script transform due to missing the correct line/filed delimiter
> --
>
> Key: SPARK-10310
> URL: https://issues.apache.org/jira/browse/SPARK-10310
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yi Zhou
>Priority: Critical
>
> There is a real case using a Python stream script in a Spark SQL query. We found 
> that all result records were written in ONE line as input from the "select" 
> pipeline for the Python script, so the script cannot identify each record. 
> Also, the field separator in Spark SQL will be '^A' or '\001', which is 
> inconsistent/incompatible with the '\t' in the Hive implementation.
> Key query:
> {code:sql}
> CREATE VIEW temp1 AS
> SELECT *
> FROM
> (
>   FROM
>   (
> SELECT
>   c.wcs_user_sk,
>   w.wp_type,
>   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
> FROM web_clickstreams c, web_page w
> WHERE c.wcs_web_page_sk = w.wp_web_page_sk
> AND   c.wcs_web_page_sk IS NOT NULL
> AND   c.wcs_user_sk IS NOT NULL
> AND   c.wcs_sales_sk IS NULL --abandoned implies: no sale
> DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
>   ) clicksAnWebPageType
>   REDUCE
> wcs_user_sk,
> tstamp_inSec,
> wp_type
>   USING 'python sessionize.py 3600'
>   AS (
> wp_type STRING,
> tstamp BIGINT, 
> sessionid STRING)
> ) sessionized
> {code}
> Key Python script:
> {noformat}
> for line in sys.stdin:
>  user_sk,  tstamp_str, value  = line.strip().split("\t")
> {noformat}
> Sample SELECT result:
> {noformat}
> ^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
> {noformat}
> Expected result:
> {noformat}
> 31   3237764860   feedback
> 31   3237769106   dynamic
> 31   3237779027   review
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10310) [Spark SQL] All result records will be popluated in ONE line due to missing the correct line/filed delimeter

2015-08-27 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716264#comment-14716264
 ] 

Yi Zhou commented on SPARK-10310:
-

This issue was found in a real-world case. Hopefully it can be fixed in the Spark 
1.5.0 code. Thanks in advance!

 [Spark SQL] All result records will be popluated in ONE line due to missing 
 the correct line/filed delimeter
 

 Key: SPARK-10310
 URL: https://issues.apache.org/jira/browse/SPARK-10310
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yi Zhou
Priority: Blocker

 There is real case using python stream script in Spark SQL query. We found 
 that all result records were wroten in ONE line as input from select 
 pipeline for python script and so it cause script will not identify each 
 record.Other, filed separator in spark sql will be '^A' or '\001' which is 
 inconsistent/incompatible the '\t' in Hive implementation.
 #Key  Query:
 CREATE VIEW temp1 AS
 SELECT *
 FROM
 (
   FROM
   (
 SELECT
   c.wcs_user_sk,
   w.wp_type,
   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
 FROM web_clickstreams c, web_page w
 WHERE c.wcs_web_page_sk = w.wp_web_page_sk
 AND   c.wcs_web_page_sk IS NOT NULL
 AND   c.wcs_user_sk IS NOT NULL
 AND   c.wcs_sales_sk IS NULL --abandoned implies: no sale
 DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
   ) clicksAnWebPageType
   REDUCE
 wcs_user_sk,
 tstamp_inSec,
 wp_type
   USING 'python sessionize.py 3600'
   AS (
 wp_type STRING,
 tstamp BIGINT, 
 sessionid STRING)
 ) sessionized
 #Key Python Script#
 for line in sys.stdin:
  user_sk,  tstamp_str, value  = line.strip().split("\t")
 Result Records example from 'select' ##
 ^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
 Result Records example in format##
 31   3237764860   feedback
 31   3237769106   dynamic
 31   3237779027   review



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10310) [Spark SQL] All result records will be popluated in ONE line due to missing the correct line/filed delimeter

2015-08-27 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10310:

Description: 
There is a real case using a Python stream script in a Spark SQL query. We found that 
all result records were written in ONE line as input from the 'select' pipeline for 
the Python script, so the script cannot identify each record. Also, the field 
separator in Spark SQL will be '^A' or '\001', which is inconsistent/incompatible 
with the '\t' in the Hive implementation.

#Key  Query:
CREATE VIEW temp1 AS
SELECT *
FROM
(
  FROM
  (
SELECT
  c.wcs_user_sk,
  w.wp_type,
  (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
FROM web_clickstreams c, web_page w
WHERE c.wcs_web_page_sk = w.wp_web_page_sk
AND   c.wcs_web_page_sk IS NOT NULL
AND   c.wcs_user_sk IS NOT NULL
AND   c.wcs_sales_sk IS NULL --abandoned implies: no sale
DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
  ) clicksAnWebPageType
  REDUCE
wcs_user_sk,
tstamp_inSec,
wp_type
  USING 'python sessionize.py 3600'
  AS (
wp_type STRING,
tstamp BIGINT, 
sessionid STRING)
) sessionized

#Key Python Script#
for line in sys.stdin:
 user_sk,  tstamp_str, value  = line.strip().split("\t")

Result Records example from 'select' ##
^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
Result Records example in format##
31   3237764860   feedback
31   3237769106   dynamic
31   3237779027   review


  was:
There is real case using python stream script in Spark SQL query. We found that 
all result records from select  write in ONE line as input for python script 
and so it cause script will not identify each record.Other, filed separator in 
spark sql will be '^A' or '\001' which is inconsistent/incompatible the '\t' in 
Hive implementation.

#Key  Query:
CREATE VIEW temp1 AS
SELECT *
FROM
(
  FROM
  (
SELECT
  c.wcs_user_sk,
  w.wp_type,
  (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
FROM web_clickstreams c, web_page w
WHERE c.wcs_web_page_sk = w.wp_web_page_sk
AND   c.wcs_web_page_sk IS NOT NULL
AND   c.wcs_user_sk IS NOT NULL
AND   c.wcs_sales_sk IS NULL --abandoned implies: no sale
DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
  ) clicksAnWebPageType
  REDUCE
wcs_user_sk,
tstamp_inSec,
wp_type
  USING 'python sessionize.py 3600'
  AS (
wp_type STRING,
tstamp BIGINT, 
sessionid STRING)
) sessionized

#Key Python Script#
for line in sys.stdin:
 user_sk,  tstamp_str, value  = line.strip().split("\t")

Result Records example from 'select' ##
^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
Result Records example in format##
31   3237764860   feedback
31   3237769106   dynamic
31   3237779027   review



 [Spark SQL] All result records will be popluated in ONE line due to missing 
 the correct line/filed delimeter
 

 Key: SPARK-10310
 URL: https://issues.apache.org/jira/browse/SPARK-10310
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yi Zhou
Priority: Blocker

 There is real case using python stream script in Spark SQL query. We found 
 that all result records were wroten in ONE line as input from select 
 pipeline for python script and so it cause script will not identify each 
 record.Other, filed separator in spark sql will be '^A' or '\001' which is 
 inconsistent/incompatible the '\t' in Hive implementation.
 #Key  Query:
 CREATE VIEW temp1 AS
 SELECT *
 FROM
 (
   FROM
   (
 SELECT
   c.wcs_user_sk,
   w.wp_type,
   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
 FROM web_clickstreams c, web_page w
 WHERE c.wcs_web_page_sk = w.wp_web_page_sk
 AND   c.wcs_web_page_sk IS NOT NULL
 AND   c.wcs_user_sk IS NOT NULL
 AND   c.wcs_sales_sk IS NULL --abandoned implies: no sale
 DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
   ) clicksAnWebPageType
   REDUCE
 wcs_user_sk,
 tstamp_inSec,
 wp_type
   USING 'python sessionize.py 3600'
   AS (
 wp_type STRING,
 tstamp BIGINT, 
 sessionid STRING)
 ) sessionized
 #Key Python Script#
 for line in sys.stdin:
  user_sk,  tstamp_str, value  = line.strip().split("\t")
 Result Records example from 'select' ##
 

[jira] [Updated] (SPARK-10310) [Spark SQL] All result records will be popluated in ONE line due to missing the correct line/filed delimeter

2015-08-27 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10310:

Description: 
There is a real case using a Python stream script in a Spark SQL query. We found that 
all result records were written in ONE line as input from the 'select' pipeline for 
the Python script, so the script cannot identify each record. Also, the field 
separator in Spark SQL will be '^A' or '\001', which is inconsistent/incompatible 
with the '\t' in the Hive implementation.

#Key  Query:
CREATE VIEW temp1 AS
SELECT *
FROM
(
  FROM
  (
SELECT
  c.wcs_user_sk,
  w.wp_type,
  (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
FROM web_clickstreams c, web_page w
WHERE c.wcs_web_page_sk = w.wp_web_page_sk
AND   c.wcs_web_page_sk IS NOT NULL
AND   c.wcs_user_sk IS NOT NULL
AND   c.wcs_sales_sk IS NULL --abandoned implies: no sale
DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
  ) clicksAnWebPageType
  REDUCE
wcs_user_sk,
tstamp_inSec,
wp_type
  USING 'python sessionize.py 3600'
  AS (
wp_type STRING,
tstamp BIGINT, 
sessionid STRING)
) sessionized

#Key Python Script#
for line in sys.stdin:
 user_sk,  tstamp_str, value  = line.strip().split("\t")

Result Records example from 'select' ##
^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
Result Records example in format##
31   3237764860   feedback
31   3237769106   dynamic
31   3237779027   review


  was:
There is real case using python stream script in Spark SQL query. We found that 
all result records were wroten in ONE line as input from select pipeline for 
python script and so it cause script will not identify each record.Other, filed 
separator in spark sql will be '^A' or '\001' which is 
inconsistent/incompatible the '\t' in Hive implementation.

#Key  Query:
CREATE VIEW temp1 AS
SELECT *
FROM
(
  FROM
  (
SELECT
  c.wcs_user_sk,
  w.wp_type,
  (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
FROM web_clickstreams c, web_page w
WHERE c.wcs_web_page_sk = w.wp_web_page_sk
AND   c.wcs_web_page_sk IS NOT NULL
AND   c.wcs_user_sk IS NOT NULL
AND   c.wcs_sales_sk IS NULL --abandoned implies: no sale
DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
  ) clicksAnWebPageType
  REDUCE
wcs_user_sk,
tstamp_inSec,
wp_type
  USING 'python sessionize.py 3600'
  AS (
wp_type STRING,
tstamp BIGINT, 
sessionid STRING)
) sessionized

#Key Python Script#
for line in sys.stdin:
 user_sk,  tstamp_str, value  = line.strip().split("\t")

Result Records example from 'select' ##
^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
Result Records example in format##
31   3237764860   feedback
31   3237769106   dynamic
31   3237779027   review



 [Spark SQL] All result records will be popluated in ONE line due to missing 
 the correct line/filed delimeter
 

 Key: SPARK-10310
 URL: https://issues.apache.org/jira/browse/SPARK-10310
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yi Zhou
Priority: Blocker

 There is real case using python stream script in Spark SQL query. We found 
 that all result records were wroten in ONE line as input from select 
 pipeline for python script and so it caused script will not identify each 
 record.Other, filed separator in spark sql will be '^A' or '\001' which is 
 inconsistent/incompatible the '\t' in Hive implementation.
 #Key  Query:
 CREATE VIEW temp1 AS
 SELECT *
 FROM
 (
   FROM
   (
 SELECT
   c.wcs_user_sk,
   w.wp_type,
   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
 FROM web_clickstreams c, web_page w
 WHERE c.wcs_web_page_sk = w.wp_web_page_sk
 AND   c.wcs_web_page_sk IS NOT NULL
 AND   c.wcs_user_sk IS NOT NULL
 AND   c.wcs_sales_sk IS NULL --abandoned implies: no sale
 DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
   ) clicksAnWebPageType
   REDUCE
 wcs_user_sk,
 tstamp_inSec,
 wp_type
   USING 'python sessionize.py 3600'
   AS (
 wp_type STRING,
 tstamp BIGINT, 
 sessionid STRING)
 ) sessionized
 #Key Python Script#
 for line in sys.stdin:
  user_sk,  tstamp_str, value  = line.strip().split("\t")
 Result Records example from 'select' ##
 

[jira] [Commented] (SPARK-9228) Combine unsafe and codegen into a single option

2015-08-26 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14715855#comment-14715855
 ] 

Yi Zhou commented on SPARK-9228:


OK. Thanks again [~davies]

 Combine unsafe and codegen into a single option
 ---

 Key: SPARK-9228
 URL: https://issues.apache.org/jira/browse/SPARK-9228
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Davies Liu
Priority: Blocker
 Fix For: 1.5.0


 Before QA, lets flip on features and consolidate unsafe and codegen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10310) [Spark SQL] All result records will be popluated in ONE line due to missing the correct line/filed separator

2015-08-26 Thread Yi Zhou (JIRA)
Yi Zhou created SPARK-10310:
---

 Summary: [Spark SQL] All result records will be popluated in ONE 
line due to missing the correct line/filed separator
 Key: SPARK-10310
 URL: https://issues.apache.org/jira/browse/SPARK-10310
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yi Zhou
Priority: Blocker


There is a real case using a Python stream script in a Spark SQL query. We found that 
all result records from 'select' are written in ONE line as input for the Python 
script, so the script cannot identify each record. Also, the field separator in 
Spark SQL will be '^A' or '\001', which is inconsistent with the '\t' in the Hive 
implementation.

#Key  Query:
CREATE VIEW temp1 AS
SELECT *
FROM
(
  FROM
  (
SELECT
  c.wcs_user_sk,
  w.wp_type,
  (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
FROM web_clickstreams c, web_page w
WHERE c.wcs_web_page_sk = w.wp_web_page_sk
AND   c.wcs_web_page_sk IS NOT NULL
AND   c.wcs_user_sk IS NOT NULL
AND   c.wcs_sales_sk IS NULL --abandoned implies: no sale
DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
  ) clicksAnWebPageType
  REDUCE
wcs_user_sk,
tstamp_inSec,
wp_type
  USING 'python sessionize.py 3600'
  AS (
wp_type STRING,
tstamp BIGINT, 
sessionid STRING)
) sessionized

#Key Python Script#
for line in sys.stdin:
 user_sk,  tstamp_str, value  = line.strip().split("\t")

Result Records example from 'select' ##
^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
Result Records example in format##
31   3237764860   feedback
31   3237769106   dynamic
31   3237779027   review




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10215) Div of Decimal returns null

2015-08-25 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14711327#comment-14711327
 ] 

Yi Zhou commented on SPARK-10215:
-

This issue causes cases that involve the 'decimal' type to fail, so hopefully it 
can be fixed in Spark 1.5.0.
Thanks in advance!

 Div of Decimal returns null
 ---

 Key: SPARK-10215
 URL: https://issues.apache.org/jira/browse/SPARK-10215
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Blocker

 {code}
 val d = Decimal(1.12321)
 val df = Seq((d, 1)).toDF("a", "b")
 df.selectExpr("b * a / b").collect() => Array(Row(null))
 {code}
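
A rough SQL-level equivalent of that snippet, convenient for reproducing from the spark-sql CLI (a sketch; the CAST and its precision are assumptions, not taken from the snippet above):

{code:sql}
-- b * a / b with b = 1 (int) and a = 1.12321 (decimal); the expected result is
-- 1.12321, but builds hitting the same Decimal division path may return NULL
SELECT 1 * CAST(1.12321 AS DECIMAL(20, 5)) / 1;
{code}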



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9228) Combine unsafe and codegen into a single option

2015-08-25 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14712360#comment-14712360
 ] 

Yi Zhou commented on SPARK-9228:


Hi [~davies]
I also see the 'spark.unsafe.offHeap' parameter; is it related to 
'spark.sql.tungsten.enabled', which controls 'spark.sql.unsafe.enabled' and 
'spark.sql.codegen'? Does it mean that if we need to turn on 'unsafe', we have to 
set both 'spark.unsafe.offHeap' and 'spark.sql.tungsten.enabled' to true?

 Combine unsafe and codegen into a single option
 ---

 Key: SPARK-9228
 URL: https://issues.apache.org/jira/browse/SPARK-9228
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Davies Liu
Priority: Blocker
 Fix For: 1.5.0


 Before QA, lets flip on features and consolidate unsafe and codegen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10130) type coercion for IF should have children resolved first

2015-08-21 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706345#comment-14706345
 ] 

Yi Zhou commented on SPARK-10130:
-

Hopefully it can be fixed in Spark 1.5.0; it is actually a blocker issue.

 type coercion for IF should have children resolved first
 

 Key: SPARK-10130
 URL: https://issues.apache.org/jira/browse/SPARK-10130
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang

 SELECT IF(a > 0, a, 0) FROM (SELECT key a FROM src) temp;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9228) Combine unsafe and codegen into a single option

2015-08-21 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706315#comment-14706315
 ] 

Yi Zhou commented on SPARK-9228:


Thanks [~davies] !

 Combine unsafe and codegen into a single option
 ---

 Key: SPARK-9228
 URL: https://issues.apache.org/jira/browse/SPARK-9228
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Davies Liu
Priority: Blocker
 Fix For: 1.5.0


 Before QA, lets flip on features and consolidate unsafe and codegen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9228) Combine unsafe and codegen into a single option

2015-08-20 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706152#comment-14706152
 ] 

Yi Zhou commented on SPARK-9228:


Hi [~davies]
After introducing 'spark.sql.tungsten.enabled', does it mean that the previous two 
settings (spark.sql.unsafe.enabled and spark.sql.codegen) will both be deprecated 
or removed? But currently I can still show the parameters in the Spark SQL CLI, 
like below:

15/08/21 10:28:54 INFO DAGScheduler: Job 6 finished: processCmd at 
CliDriver.java:376, took 0.191960 s
spark.sql.unsafe.enabled   true
Time taken: 0.253 seconds, Fetched 1 row(s)

15/08/21 10:34:10 INFO DAGScheduler: Job 7 finished: processCmd at 
CliDriver.java:376, took 0.284666 s
spark.sql.codegen   true
Time taken: 0.336 seconds, Fetched 1 row(s)
15/08/21 10:34:10 INFO CliDriver: Time taken: 0.336 seconds, Fetched 1 row(s)
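
For what it is worth, the consolidated flag can be inspected and overridden from the same CLI session; a small sketch (flag names as discussed in this thread):

{code:sql}
-- show the current value of the consolidated switch
SET spark.sql.tungsten.enabled;
-- override it for the current session only
SET spark.sql.tungsten.enabled=false;
-- the older keys can still be inspected for comparison
SET spark.sql.unsafe.enabled;
SET spark.sql.codegen;
{code}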



 Combine unsafe and codegen into a single option
 ---

 Key: SPARK-9228
 URL: https://issues.apache.org/jira/browse/SPARK-9228
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Davies Liu
Priority: Blocker
 Fix For: 1.5.0


 Before QA, lets flip on features and consolidate unsafe and codegen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation

2015-07-28 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou closed SPARK-5791.
--
Resolution: Done

 [Spark SQL] show poor performance when multiple table do join operation
 ---

 Key: SPARK-5791
 URL: https://issues.apache.org/jira/browse/SPARK-5791
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yi Zhou
 Attachments: Physcial_Plan_Hive.txt, 
 Physcial_Plan_SparkSQL_Updated.txt, Physical_Plan.txt


 Spark SQL shows poor performance when multiple tables do join operation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7268) [Spark SQL] Throw 'Shutdown hooks cannot be modified during shutdown' on YARN

2015-07-28 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou closed SPARK-7268.
--
Resolution: Done

 [Spark SQL] Throw 'Shutdown hooks cannot be modified during shutdown' on YARN
 -

 Key: SPARK-7268
 URL: https://issues.apache.org/jira/browse/SPARK-7268
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yi Zhou

 {noformat}
 15/04/30 08:26:32 INFO cluster.YarnClientSchedulerBackend: Interrupting 
 monitor thread
 15/04/30 08:26:32 INFO cluster.YarnClientSchedulerBackend: Asking each 
 executor to shut down
 15/04/30 08:26:32 INFO cluster.YarnClientSchedulerBackend: Stopped
 15/04/30 08:26:32 INFO 
 scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
 OutputCommitCoordinator stopped!
 15/04/30 08:26:32 INFO 
 scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
 OutputCommitCoordinator stopped!
 15/04/30 08:26:32 INFO spark.MapOutputTrackerMasterEndpoint: 
 MapOutputTrackerMasterEndpoint stopped!
 15/04/30 08:26:32 ERROR util.Utils: Uncaught exception in thread Thread-0
 java.lang.IllegalStateException: Shutdown hooks cannot be modified during 
 shutdown.
 at 
 org.apache.spark.util.SparkShutdownHookManager.checkState(Utils.scala:2191)
 at 
 org.apache.spark.util.SparkShutdownHookManager.remove(Utils.scala:2185)
 at org.apache.spark.util.Utils$.removeShutdownHook(Utils.scala:2138)
 at 
 org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:151)
 at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1214)
 at org.apache.spark.SparkEnv.stop(SparkEnv.scala:94)
 at org.apache.spark.SparkContext.stop(SparkContext.scala:1511)
 at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:67)
 at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anonfun$main$1.apply$mcV$sp(SparkSQLCLIDriver.scala:105)
 at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2204)
 at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2173)
 at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2173)
 at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2173)
 at 
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1724)
 at 
 org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2173)
 at 
 org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2155)
 at 
 org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
 15/04/30 08:26:32 WARN util.ShutdownHookManager: ShutdownHook '$anon$6' 
 failed, java.lang.IllegalStateException: Shutdown hooks cannot be modified 
 during shutdown.
 java.lang.IllegalStateException: Shutdown hooks cannot be modified during 
 shutdown.
 at 
 org.apache.spark.util.SparkShutdownHookManager.checkState(Utils.scala:2191)
 at 
 org.apache.spark.util.SparkShutdownHookManager.remove(Utils.scala:2185)
 at org.apache.spark.util.Utils$.removeShutdownHook(Utils.scala:2138)
 at 
 org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:151)
 at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1214)
 at org.apache.spark.SparkEnv.stop(SparkEnv.scala:94)
 at org.apache.spark.SparkContext.stop(SparkContext.scala:1511)
 at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:67)
 at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anonfun$main$1.apply$mcV$sp(SparkSQLCLIDriver.scala:105)
 at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2204)
 at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2173)
 at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2173)
 at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2173)
 at 
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1724)
 at 
 org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2173)
 at 
 org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2155)
 at 
 org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
 {noformat}
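
For readers of the trace above, a simplified, hypothetical model of the failure mode (not Spark's actual SparkShutdownHookManager, just a sketch with made-up names): once the hooks start running, any attempt to add or remove a hook is rejected, which is what happens when a stop() path tries to deregister its own hook while the JVM is already shutting down.

{code:scala}
import scala.collection.mutable

// Simplified, hypothetical model of the race shown in the stack trace: hooks
// may not be added or removed once shutdown has started.
object ToyShutdownHookManager {
  private val hooks = mutable.ArrayBuffer.empty[() => Unit]
  @volatile private var shuttingDown = false

  private def checkState(): Unit =
    if (shuttingDown)
      throw new IllegalStateException("Shutdown hooks cannot be modified during shutdown.")

  def add(hook: () => Unit): Unit = synchronized { checkState(); hooks += hook }
  def remove(hook: () => Unit): Unit = synchronized { checkState(); hooks -= hook }

  def runAll(): Unit = {
    shuttingDown = true
    // If a hook's cleanup path calls remove() from in here (as DiskBlockManager.stop
    // does via removeShutdownHook in the trace), checkState() throws.
    hooks.reverse.foreach(h => h())
  }
}
{code}

In the report the CLI's shutdown hook drives SparkContext.stop(), whose cleanup tries to deregister another hook, hence the exception; the accompanying comment notes that later 1.5 builds no longer hit this.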



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7268) [Spark SQL] Throw 'Shutdown hooks cannot be modified during shutdown' on YARN

2015-07-28 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14644462#comment-14644462
 ] 

Yi Zhou commented on SPARK-7268:


In the latest Spark 1.5 master code the issue no longer exists, so I am closing it.

 [Spark SQL] Throw 'Shutdown hooks cannot be modified during shutdown' on YARN
 -

 Key: SPARK-7268
 URL: https://issues.apache.org/jira/browse/SPARK-7268
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yi Zhou

 {noformat}
 15/04/30 08:26:32 INFO cluster.YarnClientSchedulerBackend: Interrupting 
 monitor thread
 15/04/30 08:26:32 INFO cluster.YarnClientSchedulerBackend: Asking each 
 executor to shut down
 15/04/30 08:26:32 INFO cluster.YarnClientSchedulerBackend: Stopped
 15/04/30 08:26:32 INFO 
 scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
 OutputCommitCoordinator stopped!
 15/04/30 08:26:32 INFO 
 scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
 OutputCommitCoordinator stopped!
 15/04/30 08:26:32 INFO spark.MapOutputTrackerMasterEndpoint: 
 MapOutputTrackerMasterEndpoint stopped!
 15/04/30 08:26:32 ERROR util.Utils: Uncaught exception in thread Thread-0
 java.lang.IllegalStateException: Shutdown hooks cannot be modified during 
 shutdown.
 at 
 org.apache.spark.util.SparkShutdownHookManager.checkState(Utils.scala:2191)
 at 
 org.apache.spark.util.SparkShutdownHookManager.remove(Utils.scala:2185)
 at org.apache.spark.util.Utils$.removeShutdownHook(Utils.scala:2138)
 at 
 org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:151)
 at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1214)
 at org.apache.spark.SparkEnv.stop(SparkEnv.scala:94)
 at org.apache.spark.SparkContext.stop(SparkContext.scala:1511)
 at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:67)
 at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anonfun$main$1.apply$mcV$sp(SparkSQLCLIDriver.scala:105)
 at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2204)
 at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2173)
 at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2173)
 at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2173)
 at 
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1724)
 at 
 org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2173)
 at 
 org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2155)
 at 
 org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
 15/04/30 08:26:32 WARN util.ShutdownHookManager: ShutdownHook '$anon$6' 
 failed, java.lang.IllegalStateException: Shutdown hooks cannot be modified 
 during shutdown.
 java.lang.IllegalStateException: Shutdown hooks cannot be modified during 
 shutdown.
 at 
 org.apache.spark.util.SparkShutdownHookManager.checkState(Utils.scala:2191)
 at 
 org.apache.spark.util.SparkShutdownHookManager.remove(Utils.scala:2185)
 at org.apache.spark.util.Utils$.removeShutdownHook(Utils.scala:2138)
 at 
 org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:151)
 at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1214)
 at org.apache.spark.SparkEnv.stop(SparkEnv.scala:94)
 at org.apache.spark.SparkContext.stop(SparkContext.scala:1511)
 at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:67)
 at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anonfun$main$1.apply$mcV$sp(SparkSQLCLIDriver.scala:105)
 at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2204)
 at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2173)
 at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2173)
 at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2173)
 at 
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1724)
 at 
 org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2173)
 at 
 org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2155)
 at 
 org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-9374) [Spark SQL] Throw out error of AnalysisException: nondeterministic expressions are only allowed in Project or Filter during the spark sql parse phase

2015-07-27 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14643724#comment-14643724
 ] 

Yi Zhou commented on SPARK-9374:


 We could not find this issue in Spark 1.4; it seems it was introduced in the latest 
1.5 code.

 [Spark SQL] Throw out error of AnalysisException: nondeterministic 
 expressions are only allowed in Project or Filter during the spark sql parse 
 phase
 ---

 Key: SPARK-9374
 URL: https://issues.apache.org/jira/browse/SPARK-9374
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yi Zhou
Priority: Blocker

 #Spark SQL Query
 INSERT INTO TABLE TEST_QUERY_0_result
 SELECT w_state, i_item_id,
   SUM(
 CASE WHEN (unix_timestamp(d_date,'yyyy-MM-dd') < 
 unix_timestamp('2001-03-16','yyyy-MM-dd'))
 THEN ws_sales_price - COALESCE(wr_refunded_cash,0)
 ELSE 0.0 END
   ) AS sales_before,
   SUM(
 CASE WHEN (unix_timestamp(d_date,'yyyy-MM-dd') >= 
 unix_timestamp('2001-03-16','yyyy-MM-dd'))
 THEN ws_sales_price - coalesce(wr_refunded_cash,0)
 ELSE 0.0 END
   ) AS sales_after
 FROM (
   SELECT *
   FROM web_sales ws
   LEFT OUTER JOIN web_returns wr ON (ws.ws_order_number = wr.wr_order_number
   AND ws.ws_item_sk = wr.wr_item_sk)
 ) a1
 JOIN item i ON a1.ws_item_sk = i.i_item_sk
 JOIN warehouse w ON a1.ws_warehouse_sk = w.w_warehouse_sk
 JOIN date_dim d ON a1.ws_sold_date_sk = d.d_date_sk
 AND unix_timestamp(d.d_date, 'yyyy-MM-dd') >= unix_timestamp('2001-03-16', 
 'yyyy-MM-dd') - 30*24*60*60 --subtract 30 days in seconds
 AND unix_timestamp(d.d_date, 'yyyy-MM-dd') <= unix_timestamp('2001-03-16', 
 'yyyy-MM-dd') + 30*24*60*60 --add 30 days in seconds
 GROUP BY w_state,i_item_id
 CLUSTER BY w_state,i_item_id
 Error Message##
 org.apache.spark.sql.AnalysisException: nondeterministic expressions are only 
 allowed in Project or Filter, found:
  (((ws_sold_date_sk = d_date_sk) && 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(d_date,yyyy-MM-dd)
  >= 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2001-03-16,yyyy-MM-dd)
  - CAST((((30 * 24) * 60) * 60), LongType)))) && 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(d_date,yyyy-MM-dd)
  <= 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2001-03-16,yyyy-MM-dd)
  + CAST((((30 * 24) * 60) * 60), LongType))))
 in operator Join Inner, Some((((ws_sold_date_sk#289L = d_date_sk#383L) && 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(d_date#385,yyyy-MM-dd)
  >= 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2001-03-16,yyyy-MM-dd)
  - CAST((((30 * 24) * 60) * 60), LongType)))) && 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(d_date#385,yyyy-MM-dd)
  <= 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2001-03-16,yyyy-MM-dd)
  + CAST((((30 * 24) * 60) * 60), LongType)))))
  ;
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:148)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
   at 

[jira] [Updated] (SPARK-9374) [Spark SQL] Throw out error of AnalysisException: nondeterministic expressions are only allowed in Project or Filter during the spark sql parse phase

2015-07-27 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-9374:
---
Description: 
#Spark SQL Query
INSERT INTO TABLE TEST_QUERY_0_result
SELECT w_state, i_item_id,
  SUM(
CASE WHEN (unix_timestamp(d_date,'yyyy-MM-dd') < 
unix_timestamp('2001-03-16','yyyy-MM-dd'))
THEN ws_sales_price - COALESCE(wr_refunded_cash,0)
ELSE 0.0 END
  ) AS sales_before,
  SUM(
CASE WHEN (unix_timestamp(d_date,'yyyy-MM-dd') >= 
unix_timestamp('2001-03-16','yyyy-MM-dd'))
THEN ws_sales_price - coalesce(wr_refunded_cash,0)
ELSE 0.0 END
  ) AS sales_after
FROM (
  SELECT *
  FROM web_sales ws
  LEFT OUTER JOIN web_returns wr ON (ws.ws_order_number = wr.wr_order_number
  AND ws.ws_item_sk = wr.wr_item_sk)
) a1
JOIN item i ON a1.ws_item_sk = i.i_item_sk
JOIN warehouse w ON a1.ws_warehouse_sk = w.w_warehouse_sk
JOIN date_dim d ON a1.ws_sold_date_sk = d.d_date_sk
AND unix_timestamp(d.d_date, 'yyyy-MM-dd') >= unix_timestamp('2001-03-16', 
'yyyy-MM-dd') - 30*24*60*60 --subtract 30 days in seconds
AND unix_timestamp(d.d_date, 'yyyy-MM-dd') <= unix_timestamp('2001-03-16', 
'yyyy-MM-dd') + 30*24*60*60 --add 30 days in seconds
GROUP BY w_state,i_item_id
CLUSTER BY w_state,i_item_id

Error Message##
org.apache.spark.sql.AnalysisException: nondeterministic expressions are only 
allowed in Project or Filter, found:
 (((ws_sold_date_sk = d_date_sk) && 
(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(d_date,yyyy-MM-dd)
 >= 
(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2001-03-16,yyyy-MM-dd)
 - CAST((((30 * 24) * 60) * 60), LongType)))) && 
(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(d_date,yyyy-MM-dd)
 <= 
(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2001-03-16,yyyy-MM-dd)
 + CAST((((30 * 24) * 60) * 60), LongType))))
in operator Join Inner, Some((((ws_sold_date_sk#289L = d_date_sk#383L) && 
(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(d_date#385,yyyy-MM-dd)
 >= 
(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2001-03-16,yyyy-MM-dd)
 - CAST((((30 * 24) * 60) * 60), LongType)))) && 
(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(d_date#385,yyyy-MM-dd)
 <= 
(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2001-03-16,yyyy-MM-dd)
 + CAST((((30 * 24) * 60) * 60), LongType)))))
 ;
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:148)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:43)
at 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:976)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
at 
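
One possible workaround for the analysis error above, sketched only (it reuses the table and column names from the reported query and is not the fix that eventually landed in Spark): keep the unix_timestamp() comparisons out of the JOIN ... ON condition and express them as a WHERE predicate, so the analyzer plans them as a Filter, where nondeterministic expressions are allowed.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Sketch of a possible rewrite: the date-window predicates move from the join
// condition into WHERE, which is planned as a Filter. Table/column names follow
// the query in the report and are assumed to exist in the Hive metastore.
object DateWindowInWhere {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("spark-9374-workaround"))
    val hc = new HiveContext(sc)

    hc.sql(
      """
        |SELECT w.w_state, i.i_item_id,
        |       SUM(ws.ws_sales_price - COALESCE(wr.wr_refunded_cash, 0)) AS net_sales
        |FROM web_sales ws
        |LEFT OUTER JOIN web_returns wr
        |  ON ws.ws_order_number = wr.wr_order_number AND ws.ws_item_sk = wr.wr_item_sk
        |JOIN item i      ON ws.ws_item_sk = i.i_item_sk
        |JOIN warehouse w ON ws.ws_warehouse_sk = w.w_warehouse_sk
        |JOIN date_dim d  ON ws.ws_sold_date_sk = d.d_date_sk
        |WHERE unix_timestamp(d.d_date, 'yyyy-MM-dd')
        |        >= unix_timestamp('2001-03-16', 'yyyy-MM-dd') - 30*24*60*60
        |  AND unix_timestamp(d.d_date, 'yyyy-MM-dd')
        |        <= unix_timestamp('2001-03-16', 'yyyy-MM-dd') + 30*24*60*60
        |GROUP BY w.w_state, i.i_item_id
      """.stripMargin).show()

    sc.stop()
  }
}
{code}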

[jira] [Created] (SPARK-9374) [Spark SQL] Throw out error of AnalysisException: nondeterministic expressions are only allowed in Project or Filter during the spark sql parse phase

2015-07-27 Thread Yi Zhou (JIRA)
Yi Zhou created SPARK-9374:
--

 Summary: [Spark SQL] Throw out error of AnalysisException: 
nondeterministic expressions are only allowed in Project or Filter during the 
spark sql parse phase
 Key: SPARK-9374
 URL: https://issues.apache.org/jira/browse/SPARK-9374
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yi Zhou
Priority: Blocker






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9374) [Spark SQL] UDFUnixTimeStamp throw out error of AnalysisException: nondeterministic expressions are only allowed in Project or Filter during the spark sql parse phase

2015-07-27 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-9374:
---
Summary: [Spark SQL] UDFUnixTimeStamp throw out error of 
AnalysisException: nondeterministic expressions are only allowed in Project or 
Filter during the spark sql parse phase  (was: [Spark SQL] Throw out error of 
AnalysisException: nondeterministic expressions are only allowed in Project or 
Filter during the spark sql parse phase)

 [Spark SQL] UDFUnixTimeStamp throw out error of AnalysisException: 
 nondeterministic expressions are only allowed in Project or Filter during 
 the spark sql parse phase
 

 Key: SPARK-9374
 URL: https://issues.apache.org/jira/browse/SPARK-9374
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yi Zhou
Priority: Blocker

 #Spark SQL Query
 INSERT INTO TABLE TEST_QUERY_0_result
 SELECT w_state, i_item_id,
   SUM(
 CASE WHEN (unix_timestamp(d_date,'yyyy-MM-dd') < 
 unix_timestamp('2001-03-16','yyyy-MM-dd'))
 THEN ws_sales_price - COALESCE(wr_refunded_cash,0)
 ELSE 0.0 END
   ) AS sales_before,
   SUM(
 CASE WHEN (unix_timestamp(d_date,'yyyy-MM-dd') >= 
 unix_timestamp('2001-03-16','yyyy-MM-dd'))
 THEN ws_sales_price - coalesce(wr_refunded_cash,0)
 ELSE 0.0 END
   ) AS sales_after
 FROM (
   SELECT *
   FROM web_sales ws
   LEFT OUTER JOIN web_returns wr ON (ws.ws_order_number = wr.wr_order_number
   AND ws.ws_item_sk = wr.wr_item_sk)
 ) a1
 JOIN item i ON a1.ws_item_sk = i.i_item_sk
 JOIN warehouse w ON a1.ws_warehouse_sk = w.w_warehouse_sk
 JOIN date_dim d ON a1.ws_sold_date_sk = d.d_date_sk
 AND unix_timestamp(d.d_date, 'yyyy-MM-dd') >= unix_timestamp('2001-03-16', 
 'yyyy-MM-dd') - 30*24*60*60 --subtract 30 days in seconds
 AND unix_timestamp(d.d_date, 'yyyy-MM-dd') <= unix_timestamp('2001-03-16', 
 'yyyy-MM-dd') + 30*24*60*60 --add 30 days in seconds
 GROUP BY w_state,i_item_id
 CLUSTER BY w_state,i_item_id
 Error Message##
 org.apache.spark.sql.AnalysisException: nondeterministic expressions are only 
 allowed in Project or Filter, found:
  (((ws_sold_date_sk = d_date_sk) && 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(d_date,yyyy-MM-dd)
  >= 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2001-03-16,yyyy-MM-dd)
  - CAST((((30 * 24) * 60) * 60), LongType)))) && 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(d_date,yyyy-MM-dd)
  <= 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2001-03-16,yyyy-MM-dd)
  + CAST((((30 * 24) * 60) * 60), LongType))))
 in operator Join Inner, Some((((ws_sold_date_sk#289L = d_date_sk#383L) && 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(d_date#385,yyyy-MM-dd)
  >= 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2001-03-16,yyyy-MM-dd)
  - CAST((((30 * 24) * 60) * 60), LongType)))) && 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(d_date#385,yyyy-MM-dd)
  <= 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2001-03-16,yyyy-MM-dd)
  + CAST((((30 * 24) * 60) * 60), LongType)))))
  ;
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:148)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
   at 
 

[jira] [Updated] (SPARK-9374) [Spark SQL] Throw out error of AnalysisException: nondeterministic expressions are only allowed in Project or Filter during the spark sql parse phase

2015-07-27 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-9374:
---
Summary: [Spark SQL] Throw out error of AnalysisException: 
nondeterministic expressions are only allowed in Project or Filter during the 
spark sql parse phase  (was: [Spark SQL] UDFUnixTimeStamp throw out error of 
AnalysisException: nondeterministic expressions are only allowed in Project or 
Filter during the spark sql parse phase)

 [Spark SQL] Throw out error of AnalysisException: nondeterministic 
 expressions are only allowed in Project or Filter during the spark sql parse 
 phase
 ---

 Key: SPARK-9374
 URL: https://issues.apache.org/jira/browse/SPARK-9374
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yi Zhou
Priority: Blocker

 #Spark SQL Query
 INSERT INTO TABLE TEST_QUERY_0_result
 SELECT w_state, i_item_id,
   SUM(
 CASE WHEN (unix_timestamp(d_date,'yyyy-MM-dd') < 
 unix_timestamp('2001-03-16','yyyy-MM-dd'))
 THEN ws_sales_price - COALESCE(wr_refunded_cash,0)
 ELSE 0.0 END
   ) AS sales_before,
   SUM(
 CASE WHEN (unix_timestamp(d_date,'yyyy-MM-dd') >= 
 unix_timestamp('2001-03-16','yyyy-MM-dd'))
 THEN ws_sales_price - coalesce(wr_refunded_cash,0)
 ELSE 0.0 END
   ) AS sales_after
 FROM (
   SELECT *
   FROM web_sales ws
   LEFT OUTER JOIN web_returns wr ON (ws.ws_order_number = wr.wr_order_number
   AND ws.ws_item_sk = wr.wr_item_sk)
 ) a1
 JOIN item i ON a1.ws_item_sk = i.i_item_sk
 JOIN warehouse w ON a1.ws_warehouse_sk = w.w_warehouse_sk
 JOIN date_dim d ON a1.ws_sold_date_sk = d.d_date_sk
 AND unix_timestamp(d.d_date, 'yyyy-MM-dd') >= unix_timestamp('2001-03-16', 
 'yyyy-MM-dd') - 30*24*60*60 --subtract 30 days in seconds
 AND unix_timestamp(d.d_date, 'yyyy-MM-dd') <= unix_timestamp('2001-03-16', 
 'yyyy-MM-dd') + 30*24*60*60 --add 30 days in seconds
 GROUP BY w_state,i_item_id
 CLUSTER BY w_state,i_item_id
 Error Message##
 org.apache.spark.sql.AnalysisException: nondeterministic expressions are only 
 allowed in Project or Filter, found:
  (((ws_sold_date_sk = d_date_sk) && 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(d_date,yyyy-MM-dd)
  >= 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2001-03-16,yyyy-MM-dd)
  - CAST((((30 * 24) * 60) * 60), LongType)))) && 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(d_date,yyyy-MM-dd)
  <= 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2001-03-16,yyyy-MM-dd)
  + CAST((((30 * 24) * 60) * 60), LongType))))
 in operator Join Inner, Some((((ws_sold_date_sk#289L = d_date_sk#383L) && 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(d_date#385,yyyy-MM-dd)
  >= 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2001-03-16,yyyy-MM-dd)
  - CAST((((30 * 24) * 60) * 60), LongType)))) && 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(d_date#385,yyyy-MM-dd)
  <= 
 (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2001-03-16,yyyy-MM-dd)
  + CAST((((30 * 24) * 60) * 60), LongType)))))
  ;
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:148)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
   at 
 

[jira] [Commented] (SPARK-7119) ScriptTransform doesn't consider the output data type

2015-07-08 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619672#comment-14619672
 ] 

Yi Zhou commented on SPARK-7119:


This issue blocks Spark SQL queries that use scriptTransform, so hopefully it 
can be fixed in 1.5.0.

 ScriptTransform doesn't consider the output data type
 -

 Key: SPARK-7119
 URL: https://issues.apache.org/jira/browse/SPARK-7119
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Cheng Hao

 {code:sql}
 from (from src select transform(key, value) using 'cat' as (thing1 int, 
 thing2 string)) t select thing1 + 2;
 {code}
 {noformat}
 15/04/24 00:58:55 ERROR CliDriver: org.apache.spark.SparkException: Job 
 aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent 
 failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): 
 java.lang.ClassCastException: org.apache.spark.sql.types.UTF8String cannot be 
 cast to java.lang.Integer
   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
   at scala.math.Numeric$IntIsIntegral$.plus(Numeric.scala:57)
   at 
 org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:127)
   at 
 org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118)
   at 
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
   at 
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1618)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1618)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
   at java.lang.Thread.run(Thread.java:722)
 {noformat}
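
A possible workaround sketch for the ClassCastException above (assuming the same src table as the repro; this is not the actual fix for SPARK-7119): declare the transform output columns as strings and cast explicitly before doing arithmetic, so values coming back from the script are never treated as integers directly.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Sketch only: the script's stdout is text, so declare the columns as STRING
// and CAST before the arithmetic instead of relying on `thing1 int`.
object ScriptTransformCast {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("spark-7119-workaround"))
    val hc = new HiveContext(sc)

    hc.sql(
      """
        |FROM (
        |  FROM src
        |  SELECT TRANSFORM(key, value) USING 'cat' AS (thing1 STRING, thing2 STRING)
        |) t
        |SELECT CAST(thing1 AS INT) + 2
      """.stripMargin).show()

    sc.stop()
  }
}
{code}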



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7623) Spark prints SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use java.net.BindException: Address already in use when run 2 spark in par

2015-05-13 Thread Yi Zhou (JIRA)
Yi Zhou created SPARK-7623:
--

 Summary: Spark prints  SelectChannelConnector@0.0.0.0:4040: 
java.net.BindException: Address already in use java.net.BindException: Address 
already in use when run 2 spark in parallel
 Key: SPARK-7623
 URL: https://issues.apache.org/jira/browse/SPARK-7623
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.4.0
Reporter: Yi Zhou
Priority: Minor


When running 2 Spark SQL jobs in parallel via the Spark SQL CLI, the warning 
message below is printed:
15/05/14 09:17:26 WARN component.AbstractLifeCycle: FAILED 
SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in 
use
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:437)
at sun.nio.ch.Net.bind(Net.java:429)
at 
sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at 
org.spark-project.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
at 
org.spark-project.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
at 
org.spark-project.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
at 
org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at org.spark-project.jetty.server.Server.doStart(Server.java:293)
at 
org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at 
org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:220)
at org.apache.spark.ui.JettyUtils$$anonfun$2.apply(JettyUtils.scala:230)
at org.apache.spark.ui.JettyUtils$$anonfun$2.apply(JettyUtils.scala:230)
at 
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1943)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1934)
at 
org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:230)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:120)
at 
org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:420)
at 
org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:420)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:420)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:50)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.<init>(SparkSQLCLIDriver.scala:240)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:129)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:607)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:190)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
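
The warning is mostly noise because Spark retries the next port (4041, 4042, ...), which is presumably why the issue is filed as Minor. A sketch of how the second application can avoid the message by picking its own UI port up front (the app name and port value are illustrative only):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: give the second concurrent application an explicit UI port (or disable
// the UI) so it does not race for 4040. spark.ui.port / spark.ui.enabled are
// standard Spark settings; the values here are only examples.
object SecondSessionUiPort {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("second-spark-sql-session")
      .set("spark.ui.port", "4041")        // bind straight to 4041 instead of retrying from 4040
      // .set("spark.ui.enabled", "false") // alternative: run without the web UI at all

    val sc = new SparkContext(conf)
    // ... submit work as usual ...
    sc.stop()
  }
}
{code}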




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7268) [Spark SQL] Throw 'Shutdown hooks cannot be modified during shutdown' on YARN

2015-04-29 Thread Yi Zhou (JIRA)
Yi Zhou created SPARK-7268:
--

 Summary: [Spark SQL] Throw 'Shutdown hooks cannot be modified 
during shutdown' on YARN
 Key: SPARK-7268
 URL: https://issues.apache.org/jira/browse/SPARK-7268
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yi Zhou


15/04/30 08:26:32 INFO cluster.YarnClientSchedulerBackend: Interrupting monitor 
thread
15/04/30 08:26:32 INFO cluster.YarnClientSchedulerBackend: Asking each executor 
to shut down
15/04/30 08:26:32 INFO cluster.YarnClientSchedulerBackend: Stopped
15/04/30 08:26:32 INFO 
scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
15/04/30 08:26:32 INFO 
scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
15/04/30 08:26:32 INFO spark.MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
15/04/30 08:26:32 ERROR util.Utils: Uncaught exception in thread Thread-0
java.lang.IllegalStateException: Shutdown hooks cannot be modified during 
shutdown.
at 
org.apache.spark.util.SparkShutdownHookManager.checkState(Utils.scala:2191)
at 
org.apache.spark.util.SparkShutdownHookManager.remove(Utils.scala:2185)
at org.apache.spark.util.Utils$.removeShutdownHook(Utils.scala:2138)
at 
org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:151)
at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1214)
at org.apache.spark.SparkEnv.stop(SparkEnv.scala:94)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1511)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:67)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anonfun$main$1.apply$mcV$sp(SparkSQLCLIDriver.scala:105)
at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2204)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2173)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2173)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2173)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1724)
at 
org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2173)
at 
org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2155)
at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
15/04/30 08:26:32 WARN util.ShutdownHookManager: ShutdownHook '$anon$6' failed, 
java.lang.IllegalStateException: Shutdown hooks cannot be modified during 
shutdown.
java.lang.IllegalStateException: Shutdown hooks cannot be modified during 
shutdown.
at 
org.apache.spark.util.SparkShutdownHookManager.checkState(Utils.scala:2191)
at 
org.apache.spark.util.SparkShutdownHookManager.remove(Utils.scala:2185)
at org.apache.spark.util.Utils$.removeShutdownHook(Utils.scala:2138)
at 
org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:151)
at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1214)
at org.apache.spark.SparkEnv.stop(SparkEnv.scala:94)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1511)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:67)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anonfun$main$1.apply$mcV$sp(SparkSQLCLIDriver.scala:105)
at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2204)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2173)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2173)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2173)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1724)
at 
org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2173)
at 
org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2155)
at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


