Re: SparkSQL failing while writing into S3 for 'insert into table'

2015-05-23 Thread Cheolsoo Park
 It seems it generates the query results into a tmp dir first and then
 tries to rename it into the right folder at the end, but it fails while
 renaming.

This problem exists not only in SparkSQL but in any Hadoop tool (e.g. Hive,
Pig, etc.) when used with S3, since S3 has no real rename (a rename is a
copy followed by a delete). It is usually better to write task outputs to
local disk and copy them to the final S3 location in the task commit phase.
In fact, this is how EMR Hive does insert overwrite, and that's why EMR Hive
works well with S3 while Apache Hive doesn't.

If you look at SparkHiveWriterContainer, you will see how it mimics a Hadoop
task. Basically, you can modify that code to make it write to local disk
first and then commit to the final S3 location. In fact, I am doing the same
at work on the 1.4 branch.
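
To make the idea concrete, here is a minimal Scala sketch of such a
commit-phase copy. This is not the actual SparkHiveWriterContainer change;
LocalThenS3Commit, commitTask, localTmpDir, and finalS3Dir are made-up names
for illustration only:

import java.io.File
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: tasks write only to local disk, and the commit
// phase uploads the finished files straight to the final S3 directory.
object LocalThenS3Commit {
  def commitTask(conf: Configuration, localTmpDir: String, finalS3Dir: String): Unit = {
    val dst = new Path(finalS3Dir)
    val fs: FileSystem = dst.getFileSystem(conf) // S3 filesystem, resolved from the path's scheme
    val files = Option(new File(localTmpDir).listFiles()).getOrElse(Array.empty[File])
    for (f <- files if f.isFile) {
      // copyFromLocalFile is one PUT per file: no S3 "rename" (copy + delete)
      // and no _temporary directory is ever created under the destination.
      fs.copyFromLocalFile(
        false, // delSrc: keep the local file until the whole job commits
        true,  // overwrite leftovers from a failed previous attempt
        new Path(f.getAbsolutePath),
        new Path(dst, f.getName))
    }
  }
}

Since nothing is ever moved on S3 itself, the checkPaths failure below (Hive
refusing to load a directory that still contains _temporary) cannot occur.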


On Fri, May 22, 2015 at 5:50 PM, ogoh oke...@gmail.com wrote:


 Hello,
 I am using Spark 1.3 and Hive 0.13.1 in AWS.
 From Spark-SQL, when running a Hive query to export the query result into
 AWS S3, it failed with the following message:
 ==
 org.apache.hadoop.hive.ql.metadata.HiveException: checkPaths:
 s3://test-dev/tmp/hive-hadoop/hive_2015-05-23_00-33-06_943_4594473380941885173-1/-ext-1
 has nested directory
 s3://test-dev/tmp/hive-hadoop/hive_2015-05-23_00-33-06_943_4594473380941885173-1/-ext-1/_temporary
 at org.apache.hadoop.hive.ql.metadata.Hive.checkPaths(Hive.java:2157)
 at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2298)
 at org.apache.hadoop.hive.ql.metadata.Table.copyFiles(Table.java:686)
 at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1469)
 at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:230)
 at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:124)
 at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.execute(InsertIntoHiveTable.scala:249)
 at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1088)
 at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1088)
 ==

 The query tested is:

 spark-sql> create external table s3_dwserver_sql_t1 (q string) location
 's3://test-dev/s3_dwserver_sql_t1';

 spark-sql> insert into table s3_dwserver_sql_t1 select q from api_search
 where pdate='2015-05-12' limit 100;
 ==

 It seems it generates the query results into a tmp dir first and then
 tries to rename it into the right folder at the end, but it fails while
 renaming.

 I appreciate any advice.
 Thanks,
 Okehee






SparkSQL failing while writing into S3 for 'insert into table'

2015-05-22 Thread ogoh

Hello,
I am using Spark 1.3 and Hive 0.13.1 in AWS.
From Spark-SQL, when running a Hive query to export the query result into
AWS S3, it failed with the following message:
==
org.apache.hadoop.hive.ql.metadata.HiveException: checkPaths:
s3://test-dev/tmp/hive-hadoop/hive_2015-05-23_00-33-06_943_4594473380941885173-1/-ext-1
has nested directory
s3://test-dev/tmp/hive-hadoop/hive_2015-05-23_00-33-06_943_4594473380941885173-1/-ext-1/_temporary
at org.apache.hadoop.hive.ql.metadata.Hive.checkPaths(Hive.java:2157)
at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2298)
at org.apache.hadoop.hive.ql.metadata.Table.copyFiles(Table.java:686)
at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1469)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:230)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:124)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.execute(InsertIntoHiveTable.scala:249)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1088)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1088)
==

The query tested is:

spark-sql> create external table s3_dwserver_sql_t1 (q string) location
's3://test-dev/s3_dwserver_sql_t1';

spark-sql> insert into table s3_dwserver_sql_t1 select q from api_search
where pdate='2015-05-12' limit 100;
==

It seems it generates the query results into a tmp dir first and then tries
to rename it into the right folder at the end, but it fails while renaming.

I appreciate any advice.
Thanks,
Okehee

 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-failing-while-writing-into-S3-for-insert-into-table-tp23000.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org