>> It seems it generated the query results into a tmp dir first, and then
tried to rename them into the right folder at the end. But it failed while
renaming.

This problem exists not only in Spark SQL but also in other Hadoop tools
(e.g. Hive, Pig) when they are used with S3. It is usually better to write
task outputs to local disk and copy them to the final S3 location in the
task commit phase. In fact, this is how EMR Hive implements insert
overwrite, and that's why EMR Hive works well with S3 while Apache Hive
doesn't.

If you look at SparkHiveWriterContainer, you will see how it mimics a
Hadoop task. Basically, you can modify that code to make it write to local
disk first and then commit to the final S3 location. I am actually doing
the same at work on the 1.4 branch.
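To make the idea concrete, here is a minimal sketch of the
write-locally-then-commit pattern described above. It uses plain local
directories to stand in for S3, and the file/class names are hypothetical
illustrations, not Spark, Hive, or EMR APIs:

```java
import java.io.IOException;
import java.nio.file.*;

public class LocalThenCommit {
    // Write task output to local scratch space. Nothing touches the
    // final destination until the task is ready to commit.
    static Path writeTaskOutput(Path scratchDir, String data) throws IOException {
        Path out = scratchDir.resolve("part-00000");
        Files.write(out, data.getBytes());
        return out;
    }

    // Commit phase: copy the finished file to the final location.
    // Against real S3 this would be a single PUT of the whole object,
    // avoiding the non-atomic rename of a _temporary directory that
    // fails in the stack trace below.
    static void commitTaskOutput(Path localFile, Path finalDir) throws IOException {
        Files.createDirectories(finalDir);
        Files.copy(localFile, finalDir.resolve(localFile.getFileName().toString()),
                   StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path scratch = Files.createTempDirectory("task-scratch");
        Path finalDir = Files.createTempDirectory("table").resolve("s3_output");
        Path local = writeTaskOutput(scratch, "q1\nq2\n");
        commitTaskOutput(local, finalDir);
        System.out.println(Files.exists(finalDir.resolve("part-00000")));
    }
}
```

The key point is that the final location only ever sees complete files,
so there is no rename step for S3's flat object store to break.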


On Fri, May 22, 2015 at 5:50 PM, ogoh <oke...@gmail.com> wrote:

>
> Hello,
> I am using spark 1.3 & Hive 0.13.1 in AWS.
> From Spark-SQL, when running Hive query to export Hive query result into
> AWS
> S3, it failed with the following message:
> ==
> org.apache.hadoop.hive.ql.metadata.HiveException: checkPaths:
> s3://test-dev/tmp/hive-hadoop/hive_2015-05-23_00-33-06_943_4594473380941885173-1/-ext-10000
> has nested directory
> s3://test-dev/tmp/hive-hadoop/hive_2015-05-23_00-33-06_943_4594473380941885173-1/-ext-10000/_temporary
>   at org.apache.hadoop.hive.ql.metadata.Hive.checkPaths(Hive.java:2157)
>   at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2298)
>   at org.apache.hadoop.hive.ql.metadata.Table.copyFiles(Table.java:686)
>   at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1469)
>   at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:230)
>   at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:124)
>   at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.execute(InsertIntoHiveTable.scala:249)
>   at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1088)
>   at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1088)
> ==
>
> The query tested is
>
> spark-sql> create external table s3_dwserver_sql_t1 (q string) location
> 's3://test-dev/s3_dwserver_sql_t1';
>
> spark-sql> insert into table s3_dwserver_sql_t1 select q from api_search
> where pdate='2015-05-12' limit 100;
> ==
>
> It seems it generated the query results into a tmp dir first, and then
> tried to rename them into the right folder at the end. But it failed
> while renaming.
>
> I appreciate any advice.
> Thanks,
> Okehee
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-failing-while-writing-into-S3-for-insert-into-table-tp23000.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
