[jira] [Commented] (SPARK-10765) use new aggregate interface for hive UDAF

2016-03-10 Thread David Ross (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15190042#comment-15190042
 ] 

David Ross commented on SPARK-10765:


As noted in the change, this is a performance regression for Hive UDAFs: 
https://github.com/apache/spark/commit/341b13f8f5eb118f1fb4d4f84418715ac4750a4d#diff-53f31aa4bbd9274f40547cd00cf0826dR526

What is the plan to resolve this?

> use new aggregate interface for hive UDAF
> -
>
> Key: SPARK-10765
> URL: https://issues.apache.org/jira/browse/SPARK-10765
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11246) [1.5] Table cache for Parquet broken in 1.5

2015-10-21 Thread David Ross (JIRA)
David Ross created SPARK-11246:
--

 Summary: [1.5] Table cache for Parquet broken in 1.5
 Key: SPARK-11246
 URL: https://issues.apache.org/jira/browse/SPARK-11246
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: David Ross


Since upgrading to 1.5.1, {{CACHE TABLE}} works great for all tables 
except Parquet tables, likely related to the native Parquet reader.

Here are the steps for a Parquet table:

{code}
create table test_parquet stored as parquet as select 1;
explain select * from test_parquet;
{code}

With output:

{code}
== Physical Plan ==
Scan 
ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#141]
{code}

And then caching:

{code}
cache table test_parquet;
explain select * from test_parquet;
{code}

With output:

{code}
== Physical Plan ==
Scan 
ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#174]
{code}

Note it isn't cached. I have included Spark log output for the {{cache table}} 
and {{explain}} statements below.

---

Here's the same for a non-Parquet table:

{code}
cache table test_no_parquet;
explain select * from test_no_parquet;
{code}

With output:

{code}
== Physical Plan ==
HiveTableScan [_c0#210], (MetastoreRelation default, test_no_parquet, None)
{code}

And then caching:

{code}
cache table test_no_parquet;
explain select * from test_no_parquet;
{code}

With output:

{code}
== Physical Plan ==
InMemoryColumnarTableScan [_c0#229], (InMemoryRelation [_c0#229], true, 1, 
StorageLevel(true, true, false, true, 1), (HiveTableScan [_c0#211], 
(MetastoreRelation default, test_no_parquet, None)), Some(test_no_parquet))
{code}

Note that the table seems to be cached.
---

Note that if the flag {{spark.sql.hive.convertMetastoreParquet}} is set to 
{{false}}, Parquet tables work the same as non-Parquet tables with caching. 
This is a reasonable workaround for us, but ideally we would like to benefit 
from the native reader.
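For anyone scripting a check, the difference is visible in the {{explain}} output itself: the cached plan contains an in-memory operator, while the broken Parquet path stays a bare relation scan. A small sketch (plain Python over plan strings shaped like the ones above; the helper name is ours, not a Spark API):

```python
def plan_shows_cache(plan: str) -> bool:
    """Heuristic: a cached table's plan contains an InMemory* operator,
    while the broken Parquet path shows a bare ParquetRelation scan."""
    return "InMemoryColumnarTableScan" in plan or "InMemoryRelation" in plan

# Plan strings abbreviated from the explain output above.
parquet_plan = "Scan ParquetRelation[hdfs://.../test_parquet][_c0#174]"
cached_plan = ("InMemoryColumnarTableScan [_c0#229], "
               "(InMemoryRelation [_c0#229], true, 1, ...)")

assert not plan_shows_cache(parquet_plan)  # Parquet table: cache ignored
assert plan_shows_cache(cached_plan)       # non-Parquet table: cached
```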

---

Spark logs for {{cache table}} for {{test_parquet}}:

{code}
15/10/21 21:22:05 INFO thriftserver.SparkExecuteStatementOperation: Running 
query 'cache table test_parquet' with 20ee2ab9-5242-4783-81cf-46115ed72610
15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default 
tbl=test_parquet
15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant ip=unknown-ip-addr  
cmd=get_table : db=default tbl=test_parquet
15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: Opening raw store with 
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/10/21 21:22:05 INFO metastore.ObjectStore: ObjectStore, initialize called
15/10/21 21:22:05 INFO DataNucleus.Query: Reading in results for query 
"org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is 
closing
15/10/21 21:22:05 INFO metastore.MetaStoreDirectSql: Using direct SQL, 
underlying DB is MYSQL
15/10/21 21:22:05 INFO metastore.ObjectStore: Initialized ObjectStore
15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called with 
curMem=4196713, maxMem=139009720
15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59 stored as values 
in memory (estimated size 210.6 KB, free 128.4 MB)
15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(20265) called with 
curMem=4412393, maxMem=139009720
15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59_piece0 stored as 
bytes in memory (estimated size 19.8 KB, free 128.3 MB)
15/10/21 21:22:05 INFO storage.BlockManagerInfo: Added broadcast_59_piece0 in 
memory on 192.168.99.9:50262 (size: 19.8 KB, free: 132.2 MB)
15/10/21 21:22:05 INFO spark.SparkContext: Created broadcast 59 from run at 
AccessController.java:-2
15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default 
tbl=test_parquet
15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant ip=unknown-ip-addr  
cmd=get_table : db=default tbl=test_parquet
15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called with 
curMem=4432658, maxMem=139009720
15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_60 stored as values 
in memory (estimated size 210.6 KB, free 128.1 MB)
15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_58_piece0 on 
192.168.99.9:50262 in memory (size: 19.8 KB, free: 132.2 MB)
15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_57_piece0 on 
192.168.99.9:50262 in memory (size: 21.1 KB, free: 132.2 MB)
15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_57_piece0 on 
slave2:46912 in memory (size: 21.1 KB, free: 534.5 MB)
15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_57_piece0 on 
slave0:46599 in memory (size: 21.1 KB, free: 534.3 MB)
15/10/21 21:22:05 INFO spark.ContextCleaner: Cleaned accumulator 86
15/10/21 21:22:05 INFO spark.ContextCleaner: Cleaned accumulator 84
15/10/21 21:22:05 INFO 

[jira] [Created] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service

2015-10-19 Thread David Ross (JIRA)
David Ross created SPARK-11191:
--

 Summary: [1.5] Can't create UDF's using hive thrift service
 Key: SPARK-11191
 URL: https://issues.apache.org/jira/browse/SPARK-11191
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1, 1.5.0
Reporter: David Ross


Since upgrading to Spark 1.5, we've been unable to create and use UDFs when we 
run in thrift server mode.

Our setup:
We start the thrift server running against YARN in client mode. (We've also 
built our own Spark from the github branch-1.5 with the following args: 
{{-Pyarn -Phive -Phive-thriftserver}}.)

If I run the following after connecting via JDBC (in this case via beeline):

{{add jar 'hdfs://path/to/jar'}}
(this command succeeds with no errors)

{{CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';}}
(this command succeeds with no errors)

{{select testUDF(col1) from table1;}}

I get the following error in the logs:

{code}
org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 pos 8
at 
org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
at 
org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57)
at 
org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53)
at scala.util.Try.getOrElse(Try.scala:77)
at 
org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
at 
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
{code}


(I've cut the bulk of the trace for brevity, but I'm happy to send the full output.)

{code}
15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive 
query:
org.apache.hive.service.cli.HiveSQLException: 
org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 pos 
100
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}


When I ran the same against 1.4 it worked.

I've also changed {{spark.sql.hive.metastore.version}} to 0.13 (similar to 
what it was in 1.4) and 0.14, but I still get the same errors.

Also, in 1.5, when you run it against the {{spark-sql}} shell, it works.
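One way to picture the failure (purely illustrative; the classes and helpers below are made up, not Spark's): {{CREATE FUNCTION}} and the later {{SELECT}} seem to resolve against different function registries in thrift server mode, whereas the {{spark-sql}} shell uses one state end to end.

```python
class SessionState:
    """Hypothetical per-session state with its own function registry."""
    def __init__(self):
        self.function_registry = {}

def create_function(session, name, class_name):
    session.function_registry[name.lower()] = class_name

def lookup_function(session, name):
    cls = session.function_registry.get(name.lower())
    if cls is None:
        # Mirrors: AnalysisException: undefined function testUDF
        raise LookupError("undefined function %s" % name)
    return cls

shell = SessionState()                         # spark-sql: one state throughout
create_function(shell, "testUDF", "com.foo.class.UDF")
assert lookup_function(shell, "testUDF") == "com.foo.class.UDF"

ddl, query = SessionState(), SessionState()    # thrift server: states diverge
create_function(ddl, "testUDF", "com.foo.class.UDF")   # succeeds, no errors
try:
    lookup_function(query, "testUDF")          # SELECT testUDF(col1) fails
except LookupError as e:
    assert "undefined function" in str(e)
```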




[jira] [Commented] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service

2015-10-19 Thread David Ross (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964048#comment-14964048
 ] 

David Ross commented on SPARK-11191:


I will add that the exact same thing happens when you don't use {{TEMPORARY}} 
i.e.:

{code}
CREATE FUNCTION testUDF AS 'com.foo.class.UDF';
{code}

> [1.5] Can't create UDF's using hive thrift service
> --
>
> Key: SPARK-11191
> URL: https://issues.apache.org/jira/browse/SPARK-11191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: David Ross
>

[jira] [Commented] (SPARK-5391) SparkSQL fails to create tables with custom JSON SerDe

2015-09-15 Thread David Ross (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746320#comment-14746320
 ] 

David Ross commented on SPARK-5391:
---

I haven't tried native JSON yet, but it looks promising, so this ticket is 
probably lower priority.

> SparkSQL fails to create tables with custom JSON SerDe
> --
>
> Key: SPARK-5391
> URL: https://issues.apache.org/jira/browse/SPARK-5391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: David Ross
>

[jira] [Commented] (SPARK-2087) Clean Multi-user semantics for thrift JDBC/ODBC server.

2015-04-14 Thread David Ross (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14494428#comment-14494428
 ] 

David Ross commented on SPARK-2087:
---

Makes sense, thanks for the response.

> Clean Multi-user semantics for thrift JDBC/ODBC server.
> ---
>
> Key: SPARK-2087
> URL: https://issues.apache.org/jira/browse/SPARK-2087
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
> Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
> Reporter: Michael Armbrust
> Assignee: Cheng Hao
> Priority: Minor
>  Fix For: 1.4.0
>
>
> Configuration and temporary tables should exist per-user. Cached tables 
> should be shared across users.
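The intended semantics quoted above can be sketched as a per-user overlay on shared server state (a toy model for illustration, not Spark's actual classes):

```python
# Toy model: configuration and temporary tables live per user; cached
# tables live in state shared across all users of the thrift server.
shared_cache = {"t1": "<cached table>"}   # shared across users

class UserSession:
    def __init__(self):
        self.conf = {}          # per-user configuration
        self.temp_tables = {}   # per-user temporary tables

    def cached(self, name):
        return shared_cache.get(name)

a, b = UserSession(), UserSession()
a.conf["spark.sql.shuffle.partitions"] = "10"

assert "spark.sql.shuffle.partitions" not in b.conf          # conf isolated
assert a.cached("t1") == b.cached("t1") == "<cached table>"  # cache shared
```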






[jira] [Commented] (SPARK-2087) Clean Multi-user semantics for thrift JDBC/ODBC server.

2015-04-14 Thread David Ross (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14494410#comment-14494410
 ] 

David Ross commented on SPARK-2087:
---

Any chance this will be back-ported to the 1.3 branch?







[jira] [Created] (SPARK-6757) spark.sql.shuffle.partitions is global, not per connection

2015-04-07 Thread David Ross (JIRA)
David Ross created SPARK-6757:
-

 Summary: spark.sql.shuffle.partitions is global, not per connection
 Key: SPARK-6757
 URL: https://issues.apache.org/jira/browse/SPARK-6757
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: David Ross


We are trying to use the {{spark.sql.shuffle.partitions}} parameter to handle 
large queries differently from smaller queries. We expected this parameter to 
be respected per connection, but it seems to be global.

For example, try this in two separate JDBC connections:

Connection 1:
{code}
SET spark.sql.shuffle.partitions=10;
SELECT * FROM some_table;
{code}

The correct number {{10}} was used.

Connection 2:
{code}
SET spark.sql.shuffle.partitions=100;
SELECT * FROM some_table;
{code}

The correct number {{100}} was used.

Back to connection 1:
{code}
SELECT * FROM some_table;
{code}

We expected the number {{10}} to be used, but {{100}} was used.
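The behavior we observed is consistent with every connection writing through to one shared conf map. A toy Python model (not Spark code; the class is ours) of the session above:

```python
# Toy model of the observed behavior: every JDBC connection writes through
# to one shared conf, so the last SET wins globally.
class SharedConf:
    def __init__(self):
        self.settings = {"spark.sql.shuffle.partitions": "200"}  # default
    def set(self, key, value):
        self.settings[key] = value
    def get(self, key):
        return self.settings[key]

conf = SharedConf()            # one instance behind all connections

# Connection 1: SET spark.sql.shuffle.partitions=10; SELECT ...
conf.set("spark.sql.shuffle.partitions", "10")
assert conf.get("spark.sql.shuffle.partitions") == "10"   # 10 used, as expected

# Connection 2: SET spark.sql.shuffle.partitions=100; SELECT ...
conf.set("spark.sql.shuffle.partitions", "100")

# Back on connection 1: we expected 10, but the shared conf now says 100.
assert conf.get("spark.sql.shuffle.partitions") == "100"
```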







[jira] [Created] (SPARK-6482) Remove synchronization of Hive Native commands

2015-03-23 Thread David Ross (JIRA)
David Ross created SPARK-6482:
-

 Summary: Remove synchronization of Hive Native commands
 Key: SPARK-6482
 URL: https://issues.apache.org/jira/browse/SPARK-6482
 Project: Spark
  Issue Type: Improvement
Reporter: David Ross


As discussed in https://issues.apache.org/jira/browse/SPARK-4908, concurrent 
Hive native commands run into thread-safety issues with 
{{org.apache.hadoop.hive.ql.Driver}}.

The quick-fix was to synchronize calls to {{runHive}}:

https://github.com/apache/spark/commit/480bd1d2edd1de06af607b0cf3ff3c0b16089add

However, if a Hive native command is long-running, this can block subsequent 
queries that have native dependencies.
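The blocking effect of the quick-fix can be sketched with a single global lock (a minimal Python illustration; the names and timings are ours, not Spark's):

```python
import threading
import time

hive_lock = threading.Lock()   # stand-in for the synchronized runHive call
order = []

def run_native_command(name, seconds):
    with hive_lock:            # every native command serializes here
        order.append(name + " start")
        time.sleep(seconds)
        order.append(name + " end")

slow = threading.Thread(target=run_native_command, args=("slow", 0.2))
fast = threading.Thread(target=run_native_command, args=("fast", 0.0))
slow.start()
time.sleep(0.05)               # the slow command holds the lock by now
fast.start()
slow.join()
fast.join()

# The fast command could not even start until the slow one finished.
assert order == ["slow start", "slow end", "fast start", "fast end"]
```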






[jira] [Created] (SPARK-5391) SparkSQL fails to create tables with custom JSON SerDe

2015-01-23 Thread David Ross (JIRA)
David Ross created SPARK-5391:
-

 Summary: SparkSQL fails to create tables with custom JSON SerDe
 Key: SPARK-5391
 URL: https://issues.apache.org/jira/browse/SPARK-5391
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: David Ross


- Using Spark built from trunk on this commit: 
https://github.com/apache/spark/commit/bc20a52b34e826895d0dcc1d783c021ebd456ebd
- Built for Hive 13
- Using this JSON SerDe: https://github.com/rcongiu/Hive-JSON-Serde

First download jar locally:
{code}
$ curl 
http://www.congiu.net/hive-json-serde/1.3/cdh5/json-serde-1.3-jar-with-dependencies.jar 
> /tmp/json-serde-1.3-jar-with-dependencies.jar
{code}

Then add it in SparkSQL session:
{code}
add jar /tmp/json-serde-1.3-jar-with-dependencies.jar
{code}

Finally create table:
{code}
create table test_json (c1 boolean) ROW FORMAT SERDE 
'org.openx.data.jsonserde.JsonSerDe';
{code}

Logs for add jar:
{code}
15/01/23 23:48:33 INFO thriftserver.SparkExecuteStatementOperation: Running 
query 'add jar /tmp/json-serde-1.3-jar-with-dependencies.jar'
15/01/23 23:48:34 INFO session.SessionState: No Tez session required at this 
point. hive.execution.engine=mr.
15/01/23 23:48:34 INFO SessionState: Added 
/tmp/json-serde-1.3-jar-with-dependencies.jar to class path
15/01/23 23:48:34 INFO SessionState: Added resource: 
/tmp/json-serde-1.3-jar-with-dependencies.jar
15/01/23 23:48:34 INFO spark.SparkContext: Added JAR 
/tmp/json-serde-1.3-jar-with-dependencies.jar at 
http://192.168.99.9:51312/jars/json-serde-1.3-jar-with-dependencies.jar with 
timestamp 1422056914776
15/01/23 23:48:34 INFO thriftserver.SparkExecuteStatementOperation: Result 
Schema: List()
15/01/23 23:48:34 INFO thriftserver.SparkExecuteStatementOperation: Result 
Schema: List()
{code}

Logs (with error) for create table:
{code}
15/01/23 23:49:00 INFO thriftserver.SparkExecuteStatementOperation: Running 
query 'create table test_json (c1 boolean) ROW FORMAT SERDE 
'org.openx.data.jsonserde.JsonSerDe''
15/01/23 23:49:00 INFO parse.ParseDriver: Parsing command: create table 
test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
15/01/23 23:49:01 INFO parse.ParseDriver: Parse Completed
15/01/23 23:49:01 INFO session.SessionState: No Tez session required at this 
point. hive.execution.engine=mr.
15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=Driver.run 
from=org.apache.hadoop.hive.ql.Driver
15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=TimeToSubmit 
from=org.apache.hadoop.hive.ql.Driver
15/01/23 23:49:01 INFO ql.Driver: Concurrency mode is disabled, not creating a 
lock manager
15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=compile 
from=org.apache.hadoop.hive.ql.Driver
15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=parse 
from=org.apache.hadoop.hive.ql.Driver
15/01/23 23:49:01 INFO parse.ParseDriver: Parsing command: create table 
test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
15/01/23 23:49:01 INFO parse.ParseDriver: Parse Completed
15/01/23 23:49:01 INFO log.PerfLogger: /PERFLOG method=parse 
start=1422056941103 end=1422056941104 duration=1 
from=org.apache.hadoop.hive.ql.Driver
15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=semanticAnalyze 
from=org.apache.hadoop.hive.ql.Driver
15/01/23 23:49:01 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
15/01/23 23:49:01 INFO parse.SemanticAnalyzer: Creating table test_json 
position=13
15/01/23 23:49:01 INFO ql.Driver: Semantic Analysis Completed
15/01/23 23:49:01 INFO log.PerfLogger: /PERFLOG method=semanticAnalyze 
start=1422056941104 end=1422056941240 duration=136 
from=org.apache.hadoop.hive.ql.Driver
15/01/23 23:49:01 INFO ql.Driver: Returning Hive schema: 
Schema(fieldSchemas:null, properties:null)
15/01/23 23:49:01 INFO log.PerfLogger: /PERFLOG method=compile 
start=1422056941071 end=1422056941252 duration=181 
from=org.apache.hadoop.hive.ql.Driver
15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=Driver.execute 
from=org.apache.hadoop.hive.ql.Driver
15/01/23 23:49:01 INFO ql.Driver: Starting command: create table test_json (c1 
boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
15/01/23 23:49:01 INFO log.PerfLogger: /PERFLOG method=TimeToSubmit 
start=1422056941067 end=1422056941258 duration=191 
from=org.apache.hadoop.hive.ql.Driver
15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=runTasks 
from=org.apache.hadoop.hive.ql.Driver
15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=task.DDL.Stage-0 
from=org.apache.hadoop.hive.ql.Driver
15/01/23 23:49:01 WARN security.ShellBasedUnixGroupsMapping: got exception 
trying to get groups for user anonymous
org.apache.hadoop.util.Shell$ExitCodeException: id: anonymous: No such user

  at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
  at org.apache.hadoop.util.Shell.run(Shell.java:418)
  at 

[jira] [Created] (SPARK-5371) SparkSQL Fails to parse Query with UNION ALL in subquery

2015-01-22 Thread David Ross (JIRA)
David Ross created SPARK-5371:
-

 Summary: SparkSQL Fails to parse Query with UNION ALL in subquery
 Key: SPARK-5371
 URL: https://issues.apache.org/jira/browse/SPARK-5371
 Project: Spark
  Issue Type: Bug
Reporter: David Ross


This SQL session:

{code}
DROP TABLE
test1;
DROP TABLE
test2;
CREATE TABLE
test1
(
c11 INT,
c12 INT,
c13 INT,
c14 INT
);
CREATE TABLE
test2
(
c21 INT,
c22 INT,
c23 INT,
c24 INT
);
SELECT
MIN(t3.c_1),
MIN(t3.c_2),
MIN(t3.c_3),
MIN(t3.c_4)
FROM
(
SELECT
SUM(t1.c11) c_1,
NULL c_2,
NULL c_3,
NULL c_4
FROM
test1 t1
UNION ALL
SELECT
NULL c_1,
SUM(t2.c22) c_2,
SUM(t2.c23) c_3,
SUM(t2.c24) c_4
FROM
test2 t2 ) t3; 
{code}
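The trace below points at the optimizer's {{UnionPushdown}} rule: pushing the outer projection to the right child of the {{UNION ALL}} builds a left-to-right attribute map and then misses a key. A rough plain-Python illustration (not the Catalyst code; the attribute IDs are made up):

```python
# Illustration of the failing remapping: pushing a projection to the right
# child of a UNION needs a map from left-side attribute IDs to right-side
# ones, and one key is missing from the map.
left_output = ["c_1#1", "c_2#2"]   # made-up IDs, in the spirit of c_2#23488
right_output = ["c_1#10"]          # right side lost c_2 somewhere upstream
rewrite = dict(zip(left_output, right_output))

caught = None
try:
    # Like AttributeMap.apply: no default, so a missing key throws.
    remapped = [rewrite[attr] for attr in left_output]
except KeyError as err:
    caught = str(err)

assert caught is not None and "c_2#2" in caught   # "key not found: ..."
```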

Produces this error:

{code}
15/01/23 00:25:21 INFO thriftserver.SparkExecuteStatementOperation: Running 
query 'SELECT
MIN(t3.c_1),
MIN(t3.c_2),
MIN(t3.c_3),
MIN(t3.c_4)
FROM
(
SELECT
SUM(t1.c11) c_1,
NULL c_2,
NULL c_3,
NULL c_4
FROM
test1 t1
UNION ALL
SELECT
NULL c_1,
SUM(t2.c22) c_2,
SUM(t2.c23) c_3,
SUM(t2.c24) c_4
FROM
test2 t2 ) t3'
15/01/23 00:25:21 INFO parse.ParseDriver: Parsing command: SELECT
MIN(t3.c_1),
MIN(t3.c_2),
MIN(t3.c_3),
MIN(t3.c_4)
FROM
(
SELECT
SUM(t1.c11) c_1,
NULL c_2,
NULL c_3,
NULL c_4
FROM
test1 t1
UNION ALL
SELECT
NULL c_1,
SUM(t2.c22) c_2,
SUM(t2.c23) c_3,
SUM(t2.c24) c_4
FROM
test2 t2 ) t3
15/01/23 00:25:21 INFO parse.ParseDriver: Parse Completed
15/01/23 00:25:21 ERROR thriftserver.SparkExecuteStatementOperation: Error 
executing query:
java.util.NoSuchElementException: key not found: c_2#23488
at scala.collection.MapLike$class.default(MapLike.scala:228)
at 
org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at 
org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29)
at 
org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$1.applyOrElse(Optimizer.scala:77)
at 
org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$1.applyOrElse(Optimizer.scala:76)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at 
org.apache.spark.sql.catalyst.optimizer.UnionPushdown$.pushToRight(Optimizer.scala:76)
at 
org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1$$anonfun$applyOrElse$6.apply(Optimizer.scala:98)
at 
org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1$$anonfun$applyOrElse$6.apply(Optimizer.scala:98)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1.applyOrElse(Optimizer.scala:98)
at 
org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1.applyOrElse(Optimizer.scala:85)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
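For reference, the shape of the failing query can be reproduced outside Spark. Below is a self-contained SQLite version of the same MIN-over-UNION-ALL pattern; the {{CAST}} on the NULL placeholder columns is only an illustration of keeping the union branches type-aligned, not a verified fix for the Catalyst {{UnionPushdown}} error above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test1 (c11 INTEGER)")
conn.execute("CREATE TABLE test2 (c22 INTEGER, c23 INTEGER, c24 INTEGER)")
conn.executemany("INSERT INTO test1 VALUES (?)", [(1,), (2,)])
conn.execute("INSERT INTO test2 VALUES (10, 20, 30)")

# Same shape as the failing Spark query: each UNION ALL branch aggregates one
# table and pads the other branch's columns with (typed) NULLs.
row = conn.execute("""
    SELECT MIN(t3.c_1), MIN(t3.c_2), MIN(t3.c_3), MIN(t3.c_4)
    FROM (
        SELECT SUM(t1.c11) c_1,
               CAST(NULL AS INTEGER) c_2,
               CAST(NULL AS INTEGER) c_3,
               CAST(NULL AS INTEGER) c_4
        FROM test1 t1
        UNION ALL
        SELECT CAST(NULL AS INTEGER) c_1,
               SUM(t2.c22) c_2,
               SUM(t2.c23) c_3,
               SUM(t2.c24) c_4
        FROM test2 t2
    ) t3
""").fetchone()
print(row)  # (3, 10, 20, 30) -- MIN skips the NULL padding in each column
```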

[jira] [Commented] (SPARK-4908) Spark SQL built for Hive 13 fails under concurrent metadata queries

2015-01-07 Thread David Ross (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268892#comment-14268892
 ] 

David Ross commented on SPARK-4908:
---

I've verified that this is fixed on trunk. Since the commit describes itself as 
just a quick fix, I will let [~marmbrus] decide whether or not to keep this JIRA open.

 Spark SQL built for Hive 13 fails under concurrent metadata queries
 ---

 Key: SPARK-4908
 URL: https://issues.apache.org/jira/browse/SPARK-4908
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: David Ross
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.3.0, 1.2.1


 We are on trunk: {{1.3.0-SNAPSHOT}}, as of this commit: 
 https://github.com/apache/spark/commit/3d0c37b8118f6057a663f959321a79b8061132b6
 We are using Spark built for Hive 13, using this option:
 {{-Phive-0.13.1}}
 In single-threaded mode, normal operations look fine. However, under 
 concurrency, with at least 2 concurrent connections, metadata queries fail.
 For example, {{USE some_db}}, {{SHOW TABLES}}, and the implicit {{USE}} 
 statement when you pass a default schema in the JDBC URL, all fail.
 {{SELECT}} queries like {{SELECT * FROM some_table}} do not have this issue.
 Here is some example code:
 {code}
 object main extends App {
   import java.sql._
   import scala.concurrent._
   import scala.concurrent.duration._
   import scala.concurrent.ExecutionContext.Implicits.global
   Class.forName("org.apache.hive.jdbc.HiveDriver")
   val host = "localhost" // update this
   val url = s"jdbc:hive2://${host}:10511/some_db" // update this
   val future = Future.traverse(1 to 3) { i =>
     Future {
       println("Starting: " + i)
       try {
         val conn = DriverManager.getConnection(url)
       } catch {
         case e: Throwable =>
           e.printStackTrace()
           println("Failed: " + i)
       }
       println("Finishing: " + i)
     }
   }
   Await.result(future, 2.minutes)
   println("done!")
 }
 {code}
 Here is the output:
 {code}
 Starting: 1
 Starting: 3
 Starting: 2
 java.sql.SQLException: 
 org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation 
 cancelled
   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121)
   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109)
   at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231)
   at 
 org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451)
   at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:195)
   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
   at java.sql.DriverManager.getConnection(DriverManager.java:664)
   at java.sql.DriverManager.getConnection(DriverManager.java:270)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
   at 
 scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
   at 
 scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
   at 
 scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Failed: 3
 Finishing: 3
 java.sql.SQLException: 
 org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation 
 cancelled
   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121)
   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109)
   at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231)
   at 
 org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451)
   at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:195)
   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
   at java.sql.DriverManager.getConnection(DriverManager.java:664)
   at java.sql.DriverManager.getConnection(DriverManager.java:270)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
   at 
 

[jira] [Commented] (SPARK-4908) Spark SQL built for Hive 13 fails under concurrent metadata queries

2014-12-22 Thread David Ross (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256174#comment-14256174
 ] 

David Ross commented on SPARK-4908:
---

Note that I noticed this line from native Hive logging:

{code}
14/12/19 21:44:55 INFO ql.Driver: Concurrency mode is disabled, not creating a 
lock manager
{code}

It seems to be tied to this config:
https://github.com/apache/hive/blob/branch-0.13/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L719

I have added this to our {{hive-site.xml}} in the Spark {{conf}} directory:

{code}
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
{code}

And I still have the issue.

Perhaps there is more I need to do to support concurrency?

 Spark SQL built for Hive 13 fails under concurrent metadata queries
 ---

 Key: SPARK-4908
 URL: https://issues.apache.org/jira/browse/SPARK-4908
 Project: Spark
  Issue Type: Bug
Reporter: David Ross

 We are on trunk: {{1.3.0-SNAPSHOT}}, as of this commit: 
 https://github.com/apache/spark/commit/3d0c37b8118f6057a663f959321a79b8061132b6
 We are using Spark built for Hive 13, using this option:
 {{-Phive-0.13.1}}
 In single-threaded mode, normal operations look fine. However, under 
 concurrency, with at least 2 concurrent connections, metadata queries fail.
 For example, {{USE some_db}}, {{SHOW TABLES}}, and the implicit {{USE}} 
 statement when you pass a default schema in the JDBC URL, all fail.
 {{SELECT}} queries like {{SELECT * FROM some_table}} do not have this issue.
 Here is some example code:
 {code}
 object main extends App {
   import java.sql._
   import scala.concurrent._
   import scala.concurrent.duration._
   import scala.concurrent.ExecutionContext.Implicits.global
   Class.forName("org.apache.hive.jdbc.HiveDriver")
   val host = "localhost" // update this
   val url = s"jdbc:hive2://${host}:10511/some_db" // update this
   val future = Future.traverse(1 to 3) { i =>
     Future {
       println("Starting: " + i)
       try {
         val conn = DriverManager.getConnection(url)
       } catch {
         case e: Throwable =>
           e.printStackTrace()
           println("Failed: " + i)
       }
       println("Finishing: " + i)
     }
   }
   Await.result(future, 2.minutes)
   println("done!")
 }
 {code}
 Here is the output:
 {code}
 Starting: 1
 Starting: 3
 Starting: 2
 java.sql.SQLException: 
 org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation 
 cancelled
   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121)
   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109)
   at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231)
   at 
 org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451)
   at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:195)
   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
   at java.sql.DriverManager.getConnection(DriverManager.java:664)
   at java.sql.DriverManager.getConnection(DriverManager.java:270)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
   at 
 scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
   at 
 scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
   at 
 scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Failed: 3
 Finishing: 3
 java.sql.SQLException: 
 org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation 
 cancelled
   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121)
   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109)
   at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231)
   at 
 org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451)
   at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:195)
   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
   at java.sql.DriverManager.getConnection(DriverManager.java:664)
   at java.sql.DriverManager.getConnection(DriverManager.java:270)
   at 
 

[jira] [Comment Edited] (SPARK-4908) Spark SQL built for Hive 13 fails under concurrent metadata queries

2014-12-22 Thread David Ross (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256174#comment-14256174
 ] 

David Ross edited comment on SPARK-4908 at 12/22/14 8:43 PM:
-

Note that I noticed this line in the logs that seems to come from Hive logging 
(not spark code):

{code}
14/12/19 21:44:55 INFO ql.Driver: Concurrency mode is disabled, not creating a 
lock manager
{code}

It seems to be tied to this config:
https://github.com/apache/hive/blob/branch-0.13/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L719

I have added this to our {{hive-site.xml}} in the Spark {{conf}} directory:

{code}
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
{code}

And I still have the issue.

Perhaps there is more I need to do to support concurrency?


was (Author: dyross):
Note that noticed this line from native Hive logging:

{code}
14/12/19 21:44:55 INFO ql.Driver: Concurrency mode is disabled, not creating a 
lock manager
{code}

It seems to be tied to this config:
https://github.com/apache/hive/blob/branch-0.13/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L719

I have this to our {{hive-site.xml}} in the spark {{conf}} directory:

{code}
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
{code}

And I still have the issue.

Perhaps there is more I need to do to support concurrency?

 Spark SQL built for Hive 13 fails under concurrent metadata queries
 ---

 Key: SPARK-4908
 URL: https://issues.apache.org/jira/browse/SPARK-4908
 Project: Spark
  Issue Type: Bug
Reporter: David Ross

 We are on trunk: {{1.3.0-SNAPSHOT}}, as of this commit: 
 https://github.com/apache/spark/commit/3d0c37b8118f6057a663f959321a79b8061132b6
 We are using Spark built for Hive 13, using this option:
 {{-Phive-0.13.1}}
 In single-threaded mode, normal operations look fine. However, under 
 concurrency, with at least 2 concurrent connections, metadata queries fail.
 For example, {{USE some_db}}, {{SHOW TABLES}}, and the implicit {{USE}} 
 statement when you pass a default schema in the JDBC URL, all fail.
 {{SELECT}} queries like {{SELECT * FROM some_table}} do not have this issue.
 Here is some example code:
 {code}
 object main extends App {
   import java.sql._
   import scala.concurrent._
   import scala.concurrent.duration._
   import scala.concurrent.ExecutionContext.Implicits.global
   Class.forName("org.apache.hive.jdbc.HiveDriver")
   val host = "localhost" // update this
   val url = s"jdbc:hive2://${host}:10511/some_db" // update this
   val future = Future.traverse(1 to 3) { i =>
     Future {
       println("Starting: " + i)
       try {
         val conn = DriverManager.getConnection(url)
       } catch {
         case e: Throwable =>
           e.printStackTrace()
           println("Failed: " + i)
       }
       println("Finishing: " + i)
     }
   }
   Await.result(future, 2.minutes)
   println("done!")
 }
 {code}
 Here is the output:
 {code}
 Starting: 1
 Starting: 3
 Starting: 2
 java.sql.SQLException: 
 org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation 
 cancelled
   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121)
   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109)
   at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231)
   at 
 org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451)
   at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:195)
   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
   at java.sql.DriverManager.getConnection(DriverManager.java:664)
   at java.sql.DriverManager.getConnection(DriverManager.java:270)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
   at 
 com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
   at 
 scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
   at 
 scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
   at 
 scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Failed: 3
 Finishing: 3
 java.sql.SQLException: 
 

[jira] [Created] (SPARK-4908) Spark SQL built for Hive 13 fails under concurrent metadata queries

2014-12-19 Thread David Ross (JIRA)
David Ross created SPARK-4908:
-

 Summary: Spark SQL built for Hive 13 fails under concurrent 
metadata queries
 Key: SPARK-4908
 URL: https://issues.apache.org/jira/browse/SPARK-4908
 Project: Spark
  Issue Type: Bug
Reporter: David Ross


We are on trunk: {{1.3.0-SNAPSHOT}}, as of this commit: 
https://github.com/apache/spark/commit/3d0c37b8118f6057a663f959321a79b8061132b6

We are using Spark built for Hive 13, using this option:
{{-Phive-0.13.1}}

In single-threaded mode, normal operations look fine. However, under 
concurrency, with at least 2 concurrent connections, metadata queries fail.

For example, {{USE some_db}}, {{SHOW TABLES}}, and the implicit {{USE}} 
statement when you pass a default schema in the JDBC URL, all fail.

{{SELECT}} queries like {{SELECT * FROM some_table}} do not have this issue.

Here is some example code:

{code}
object main extends App {
  import java.sql._
  import scala.concurrent._
  import scala.concurrent.duration._
  import scala.concurrent.ExecutionContext.Implicits.global

  Class.forName("org.apache.hive.jdbc.HiveDriver")

  val host = "localhost" // update this
  val url = s"jdbc:hive2://${host}:10511/some_db" // update this

  val future = Future.traverse(1 to 3) { i =>
    Future {
      println("Starting: " + i)
      try {
        val conn = DriverManager.getConnection(url)
      } catch {
        case e: Throwable =>
          e.printStackTrace()
          println("Failed: " + i)
      }
      println("Finishing: " + i)
    }
  }

  Await.result(future, 2.minutes)

  println("done!")
}
{code}

Here is the output:

{code}
Starting: 1
Starting: 3
Starting: 2
java.sql.SQLException: org.apache.spark.sql.execution.QueryExecutionException: 
FAILED: Operation cancelled
at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121)
at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231)
at 
org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451)
at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:195)
at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:270)
at 
com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896)
at 
com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
at 
com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at 
scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Failed: 3
Finishing: 3
java.sql.SQLException: org.apache.spark.sql.execution.QueryExecutionException: 
FAILED: Operation cancelled
at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121)
at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231)
at 
org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451)
at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:195)
at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:270)
at 
com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896)
at 
com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
at 
com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at 
scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
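The fan-out in the repro above is generic; a minimal sketch of the same concurrent-connection pattern (Python here, with a stand-in {{connect}} that always succeeds rather than a real JDBC connection) looks like:

```python
from concurrent.futures import ThreadPoolExecutor

def connect(i):
    # Stand-in for DriverManager.getConnection(url); a real test would open
    # a JDBC/ODBC connection here and print any exception it raises.
    print("Starting:", i)
    handle = f"conn-{i}"  # pretend connection handle
    print("Finishing:", i)
    return handle

# Open several "connections" concurrently, mirroring Future.traverse(1 to 3).
with ThreadPoolExecutor(max_workers=3) as pool:
    handles = list(pool.map(connect, range(1, 4)))

print("done!")
```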

[jira] [Commented] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause

2014-12-17 Thread David Ross (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14250611#comment-14250611
 ] 

David Ross commented on SPARK-4296:
---

I can still reproduce this issue. The test case above does appear to be fixed, 
but if you use other types of aggregate functions, it can fail. For example:

{code}
CREATE TABLE test_spark_4296(s STRING);
SELECT UPPER(s) FROM test_spark_4296 GROUP BY UPPER(s);
{code}

That works. But this query doesn't:

{code}
SELECT REGEXP_EXTRACT(s, ".*", 1) FROM test_spark_4296 GROUP BY 
REGEXP_EXTRACT(s, ".*", 1);
{code}

The error is similar to the one above:

{code}
14/12/17 21:39:22 INFO thriftserver.SparkExecuteStatementOperation: Running 
query 'SELECT REGEXP_EXTRACT(s, ".*", 1) FROM test_spark_4296 GROUP BY 
REGEXP_EXTRACT(s, ".*", 1)'
14/12/17 21:39:22 INFO storage.BlockManagerInfo: Removed broadcast_7_piece0 on 
slave0:50816 in memory (size: 5.2 KB, free: 534.4 MB)
14/12/17 21:39:22 INFO storage.BlockManagerInfo: Removed broadcast_7_piece0 on 
slave1:45411 in memory (size: 5.2 KB, free: 534.4 MB)
14/12/17 21:39:22 INFO storage.BlockManagerInfo: Removed broadcast_7_piece0 on 
slave2:59650 in memory (size: 5.2 KB, free: 534.4 MB)
14/12/17 21:39:22 INFO storage.BlockManager: Removing broadcast 7
14/12/17 21:39:22 INFO storage.BlockManager: Removing block broadcast_7_piece0
14/12/17 21:39:22 INFO storage.MemoryStore: Block broadcast_7_piece0 of size 
5308 dropped from memory (free 276233416)
14/12/17 21:39:22 INFO storage.BlockManagerInfo: Removed broadcast_7_piece0 on 
master:34621 in memory (size: 5.2 KB, free: 265.0 MB)
14/12/17 21:39:22 INFO storage.BlockManagerMaster: Updated info of block 
broadcast_7_piece0
14/12/17 21:39:22 INFO storage.BlockManager: Removing block broadcast_7
14/12/17 21:39:22 INFO storage.MemoryStore: Block broadcast_7 of size 9344 
dropped from memory (free 276242760)
14/12/17 21:39:22 INFO spark.ContextCleaner: Cleaned broadcast 7
14/12/17 21:39:22 INFO parse.ParseDriver: Parsing command: SELECT 
REGEXP_EXTRACT(s, ".*", 1) FROM test_spark_4296 GROUP BY REGEXP_EXTRACT(s, 
".*", 1)
14/12/17 21:39:22 INFO parse.ParseDriver: Parse Completed
14/12/17 21:39:22 INFO spark.ContextCleaner: Cleaned shuffle 1
14/12/17 21:39:22 INFO storage.BlockManager: Removing broadcast 6
14/12/17 21:39:22 INFO storage.BlockManager: Removing block broadcast_6_piece0
14/12/17 21:39:22 INFO storage.MemoryStore: Block broadcast_6_piece0 of size 
47235 dropped from memory (free 276289995)
14/12/17 21:39:22 INFO storage.BlockManagerInfo: Removed broadcast_6_piece0 on 
master:34621 in memory (size: 46.1 KB, free: 265.0 MB)
14/12/17 21:39:22 INFO storage.BlockManagerMaster: Updated info of block 
broadcast_6_piece0
14/12/17 21:39:22 INFO storage.BlockManager: Removing block broadcast_6
14/12/17 21:39:22 INFO storage.MemoryStore: Block broadcast_6 of size 523775 
dropped from memory (free 276813770)
14/12/17 21:39:22 INFO spark.ContextCleaner: Cleaned broadcast 6
14/12/17 21:39:22 INFO storage.BlockManager: Removing broadcast 5
14/12/17 21:39:22 INFO storage.BlockManager: Removing block broadcast_5_piece0
14/12/17 21:39:22 INFO storage.MemoryStore: Block broadcast_5_piece0 of size 
7179 dropped from memory (free 276820949)
14/12/17 21:39:23 INFO storage.BlockManagerInfo: Removed broadcast_5_piece0 on 
master:34621 in memory (size: 7.0 KB, free: 265.0 MB)
14/12/17 21:39:23 INFO storage.BlockManagerMaster: Updated info of block 
broadcast_5_piece0
14/12/17 21:39:23 INFO storage.BlockManager: Removing block broadcast_5
14/12/17 21:39:23 INFO storage.MemoryStore: Block broadcast_5 of size 12784 
dropped from memory (free 276833733)
14/12/17 21:39:23 INFO storage.BlockManagerInfo: Removed broadcast_5_piece0 on 
slave0:50816 in memory (size: 7.0 KB, free: 534.4 MB)
14/12/17 21:39:23 INFO storage.BlockManagerInfo: Removed broadcast_5_piece0 on 
slave1:45411 in memory (size: 7.0 KB, free: 534.4 MB)
14/12/17 21:39:23 INFO storage.BlockManagerInfo: Removed broadcast_5_piece0 on 
slave2:59650 in memory (size: 7.0 KB, free: 534.4 MB)
14/12/17 21:39:23 INFO spark.ContextCleaner: Cleaned broadcast 5
14/12/17 21:39:23 INFO storage.BlockManagerInfo: Removed broadcast_4_piece0 on 
slave1:45411 in memory (size: 7.9 KB, free: 534.4 MB)
14/12/17 21:39:23 INFO storage.BlockManagerInfo: Removed broadcast_4_piece0 on 
slave2:59650 in memory (size: 7.9 KB, free: 534.4 MB)
14/12/17 21:39:23 INFO storage.BlockManagerInfo: Removed broadcast_4_piece0 on 
slave0:50816 in memory (size: 7.9 KB, free: 534.4 MB)
14/12/17 21:39:23 ERROR thriftserver.SparkExecuteStatementOperation: Error 
executing query:
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not 
in GROUP BY: 
HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFRegExpExtract(s#609,.*,1) AS 
_c0#608, tree:
Aggregate 
[HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFRegExpExtract(s#609,.*,1)], 
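A rewrite that often sidesteps this class of failure (offered as an untested workaround; shown with {{UPPER}} and SQLite so the demo is self-contained, since {{REGEXP_EXTRACT}} is Hive-specific) is to evaluate the expression once in a subquery and group on its alias:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_spark_4296 (s TEXT)")
conn.executemany("INSERT INTO test_spark_4296 VALUES (?)",
                 [("ab",), ("ab",), ("cd",)])

# Compute the expression in an inner query and group on the alias, so the
# outer SELECT and the GROUP BY refer to the exact same attribute.
rows = conn.execute("""
    SELECT u FROM (SELECT UPPER(s) AS u FROM test_spark_4296) t
    GROUP BY u ORDER BY u
""").fetchall()
print(rows)  # [('AB',), ('CD',)]
```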

[jira] [Commented] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause

2014-12-17 Thread David Ross (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14250674#comment-14250674
 ] 

David Ross commented on SPARK-4296:
---

Hi Michael, we are on trunk: {{1.3.0-SNAPSHOT}}, as of 
https://github.com/apache/spark/commit/3d0c37b8118f6057a663f959321a79b8061132b6

 Throw Expression not in GROUP BY when using same expression in group by 
 clause and  select clause
 ---

 Key: SPARK-4296
 URL: https://issues.apache.org/jira/browse/SPARK-4296
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Shixiong Zhu

 When the input data has a complex structure, using the same expression in the 
 group by clause and the select clause will throw "Expression not in GROUP BY".
 {code:java}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD
 case class Birthday(date: String)
 case class Person(name: String, birthday: Birthday)
 val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), 
 Person("Jim", Birthday("1980-02-28"))))
 people.registerTempTable("people")
 val year = sqlContext.sql("select count(*), upper(birthday.date) from people 
 group by upper(birthday.date)")
 year.collect
 {code}
 Here is the plan of year:
 {code:java}
 SchemaRDD[3] at RDD at SchemaRDD.scala:105
 == Query Plan ==
 == Physical Plan ==
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression 
 not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
 Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date 
 AS date#9) AS c1#3]
  Subquery people
   LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at 
 ExistingRDD.scala:36
 {code}
 The bug is the equality test for `Upper(birthday#1.date)` and 
 `Upper(birthday#1.date AS date#9)`.
 Maybe Spark SQL needs a mechanism to compare an Alias expression with a 
 non-Alias expression.
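The comparison mechanism suggested above can be sketched abstractly (Python, with a toy {{Expr}} class; these are not Spark's actual Catalyst types): equality is checked on a key that strips {{Alias}} nodes first, so an aliased and an unaliased copy of the same expression compare equal.

```python
class Expr:
    """Toy expression node: name is the operator, children the arguments."""
    def __init__(self, name, children=()):
        self.name, self.children = name, tuple(children)

    def semantic_key(self):
        # Strip Alias wrappers so Upper(date) and Upper(date) AS c1 compare equal.
        if self.name == "Alias":
            return self.children[0].semantic_key()
        return (self.name,) + tuple(c.semantic_key() for c in self.children)

grouping = Expr("Upper", [Expr("birthday.date")])                  # GROUP BY expr
selected = Expr("Alias", [Expr("Upper", [Expr("birthday.date")])]) # SELECT expr AS c1
print(grouping.semantic_key() == selected.semantic_key())  # True
```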





[jira] [Resolved] (SPARK-4773) CTAS Doesn't Use the Current Schema

2014-12-08 Thread David Ross (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Ross resolved SPARK-4773.
---
Resolution: Fixed

Looks like this was broken by: 
https://github.com/apache/spark/commit/4b55482abf899c27da3d55401ad26b4e9247b327

And fixed by: 
https://github.com/apache/spark/commit/51b1fe1426ffecac6c4644523633ea1562ff9a4e

Thanks for the quick turnaround!

 CTAS Doesn't Use the Current Schema
 ---

 Key: SPARK-4773
 URL: https://issues.apache.org/jira/browse/SPARK-4773
 Project: Spark
  Issue Type: Bug
Reporter: David Ross

 In a CTAS (CREATE TABLE __ AS SELECT __), the current schema isn't used. For 
 example, this all works:
 {code}
 CREATE DATABASE test_db;
 USE test_db;
 CREATE TABLE test_table_1(s string);
 SELECT * FROM test_table_1;
 CREATE TABLE test_table_2 AS SELECT * FROM test_db.test_table_1;
 SELECT * FROM test_table_2;
 {code}
 But this fails:
 {code}
 CREATE TABLE test_table_3 AS SELECT * FROM test_table_1;
 {code}
 Message:
 {code}
 14/12/06 00:28:57 ERROR thriftserver.SparkExecuteStatementOperation: Error 
 executing query:
 org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:43 Table not found 
 'test_table_1'
   at 
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1324)
   at 
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1053)
   at 
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:8342)
   at 
 org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284)
   at 
 org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation$lzycompute(CreateTableAsSelect.scala:59)
   at 
 org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation(CreateTableAsSelect.scala:55)
   at 
 org.apache.spark.sql.hive.execution.CreateTableAsSelect.sideEffectResult$lzycompute(CreateTableAsSelect.scala:82)
   at 
 org.apache.spark.sql.hive.execution.CreateTableAsSelect.sideEffectResult(CreateTableAsSelect.scala:70)
   at 

[jira] [Created] (SPARK-4773) CTAS Doesn't Use the Current Schema

2014-12-05 Thread David Ross (JIRA)
David Ross created SPARK-4773:
-

 Summary: CTAS Doesn't Use the Current Schema
 Key: SPARK-4773
 URL: https://issues.apache.org/jira/browse/SPARK-4773
 Project: Spark
  Issue Type: Bug
Reporter: David Ross


In a CTAS statement (CREATE TABLE __ AS SELECT __), the current schema isn't used. For
example, this all works:

{code}
CREATE DATABASE test_db;
USE test_db;
CREATE TABLE test_table_1(s string);
SELECT * FROM test_table_1;
CREATE TABLE test_table_2 AS SELECT * FROM test_db.test_table_1;
SELECT * FROM test_table_2;
{code}

But this fails:

{code}
CREATE TABLE test_table_3 AS SELECT * FROM test_table_1;
{code}
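For illustration only (this is not Spark's actual code): the working and failing statements differ only in whether the table name is database-qualified, which suggests the CTAS analysis path never prepends the session's current database to unqualified names before calling into Hive's SemanticAnalyzer. A minimal sketch of that missing qualification step, with hypothetical names:

```java
// Hypothetical sketch of the name-qualification step that appears to be
// skipped on the CTAS path. `qualify` and its arguments are illustrative
// names, not Spark or Hive APIs.
class TableNameResolution {
    static String qualify(String tableName, String currentDb) {
        // Already database-qualified ("db.table"): leave unchanged.
        if (tableName.contains(".")) {
            return tableName;
        }
        // Unqualified: prepend the session's current database,
        // e.g. "test_table_1" -> "test_db.test_table_1".
        return currentDb + "." + tableName;
    }

    public static void main(String[] args) {
        System.out.println(qualify("test_table_1", "test_db"));
        System.out.println(qualify("test_db.test_table_1", "test_db"));
    }
}
```

With this in place, the unqualified `test_table_1` would resolve against `test_db` (the database selected by `USE test_db`) rather than falling through to a "Table not found" error.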

Message:
{code}
14/12/06 00:28:57 ERROR thriftserver.SparkExecuteStatementOperation: Error executing query:
org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:43 Table not found 'test_table_1'
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1324)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1053)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:8342)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284)
at org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation$lzycompute(CreateTableAsSelect.scala:59)
at org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation(CreateTableAsSelect.scala:55)
at org.apache.spark.sql.hive.execution.CreateTableAsSelect.sideEffectResult$lzycompute(CreateTableAsSelect.scala:82)
at org.apache.spark.sql.hive.execution.CreateTableAsSelect.sideEffectResult(CreateTableAsSelect.scala:70)
at org.apache.spark.sql.hive.execution.CreateTableAsSelect.execute(CreateTableAsSelect.scala:89)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:108)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim12.scala:190)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:193)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:175)
at org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:150)
at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:207)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1133)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1118)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
at org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:526)
at org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:55)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:43 Table not found 'test_table_1'
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1079)
... 33 more
14/12/06 00:28:57 WARN thrift.ThriftCLIService: Error fetching results:
org.apache.hive.service.cli.HiveSQLException: org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:43 Table not found 'test_table_1'
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim12.scala:221)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:193)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:175)
at