Re: Error when saving a dataframe as ORC file

Zhan Zhang Sun, 23 Aug 2015 13:25:50 -0700

If you are using spark-1.4.0, it is probably caused by SPARK-8458:
https://issues.apache.org/jira/browse/SPARK-8458


Thanks.

Zhan Zhang

On Aug 23, 2015, at 12:49 PM, lostrain A <donotlikeworkingh...@gmail.com> wrote:

Ted,
  Thanks for the suggestions. I actually tried both s3n and s3, and the result 
is the same in both cases.


On Sun, Aug 23, 2015 at 12:27 PM, Ted Yu <yuzhih...@gmail.com> wrote:
In your case, I would specify "fs.s3.awsAccessKeyId" / 
"fs.s3.awsSecretAccessKey" since you use s3 protocol.
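
The point above is that the credential property names follow the URI scheme 
of the output path. A minimal sketch of that mapping, assuming only the s3 
and s3n schemes discussed in this thread (s3a uses differently named 
properties and is not covered); credentialKeys is a hypothetical helper, not 
part of Spark or Hadoop:

```scala
import java.net.URI

// Hypothetical helper: derive the credential property names from the
// scheme of the output path. Only meant for the s3 and s3n schemes.
def credentialKeys(path: String): (String, String) = {
  val scheme = new URI(path).getScheme
  (s"fs.$scheme.awsAccessKeyId", s"fs.$scheme.awsSecretAccessKey")
}

val (idKey, secretKey) = credentialKeys("s3://logs/dummy")

// In spark-shell, where `sc` is the SparkContext, the keys would then be
// set like this (left commented out so the sketch runs without Spark):
// sc.hadoopConfiguration.set(idKey, "***")
// sc.hadoopConfiguration.set(secretKey, "***")
```

For an s3n:// path the same helper yields the fs.s3n.* names, so the keys 
and the path scheme stay in step.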

On Sun, Aug 23, 2015 at 11:03 AM, lostrain A <donotlikeworkingh...@gmail.com> wrote:
Hi Ted,
  Thanks for the reply. I tried setting both the access key ID and the secret 
access key via

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "**")

However, the error still occurs for ORC format.

If I change the format to JSON, the error still appears, but the JSON files 
are saved successfully.




On Sun, Aug 23, 2015 at 5:51 AM, Ted Yu <yuzhih...@gmail.com> wrote:
You may have seen this:
http://search-hadoop.com/m/q3RTtdSyM52urAyI



On Aug 23, 2015, at 1:01 AM, lostrain A <donotlikeworkingh...@gmail.com> wrote:

Hi,
  I'm trying to save a simple dataframe to S3 in ORC format. The code is as 
follows:


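      // ORC support in Spark 1.4 requires a HiveContext
      val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
      import sqlContext.implicits._
      // Single-column dataframe holding the integers 1 to 1000
      val df = sc.parallelize(1 to 1000).toDF()
      df.write.format("orc").save("s3://logs/dummy")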

I ran the above code in spark-shell and only the _SUCCESS file was saved under 
the directory.
The last part of the spark-shell log said:

15/08/23 07:38:23 task-result-getter-1 INFO TaskSetManager: Finished task 95.0 
in stage 2.0 (TID 295) in 801 ms on ip-*-*-*-*.ec2.internal (100/100)

15/08/23 07:38:23 dag-scheduler-event-loop INFO DAGScheduler: ResultStage 2 
(save at <console>:29) finished in 0.834 s

15/08/23 07:38:23 task-result-getter-1 INFO YarnScheduler: Removed TaskSet 2.0, 
whose tasks have all completed, from pool

15/08/23 07:38:23 main INFO DAGScheduler: Job 2 finished: save at <console>:29, 
took 0.895912 s

15/08/23 07:38:24 main INFO LocalDirAllocator$AllocatorPerContext$DirSelector: 
Returning directory: /media/ephemeral0/s3/output-

15/08/23 07:38:24 main ERROR NativeS3FileSystem: md5Hash for dummy/_SUCCESS is 
[-44, 29, -128, -39, -113, 0, -78,
 4, -23, -103, 9, -104, -20, -8, 66, 126]

15/08/23 07:38:24 main INFO DefaultWriterContainer: Job job_****_**** committed.

Has anyone experienced this before?
Thanks!
