The fix for SPARK-8458 is in the 1.4.1 release. You can upgrade to 1.4.1 or wait for the upcoming 1.5.0 release.
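
For reference, here is a minimal end-to-end sketch assembled from the snippets quoted below; the bucket/path and the "***" credential values are placeholders, and the fs.s3.* keys assume the s3:// scheme (use the fs.s3n.* keys for s3n://):

  // Placeholder S3 credentials for the s3:// scheme -- not real keys.
  sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "***")
  sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "***")

  // The ORC data source lives in the hive module in Spark 1.4.x,
  // so a HiveContext is needed.
  val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
  import sqlContext.implicits._

  // Write a trivial DataFrame as ORC; on 1.4.1+ the part files
  // should appear under the target directory alongside _SUCCESS.
  val df = sc.parallelize(1 to 1000).toDF()
  df.write.format("orc").save("s3://logs/dummy")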
On Sun, Aug 23, 2015 at 2:05 PM, lostrain A <donotlikeworkingh...@gmail.com> wrote:
> Hi Zhan,
> Thanks for the pointer. Yes, I'm using a cluster with spark-1.4.0, and it
> looks like this is most likely the reason. I'll verify this again once we
> make the upgrade.
>
> Best,
> los
>
> On Sun, Aug 23, 2015 at 1:25 PM, Zhan Zhang <zzh...@hortonworks.com> wrote:
>
>> If you are using spark-1.4.0, it is probably caused by SPARK-8458
>> <https://issues.apache.org/jira/browse/SPARK-8458>.
>>
>> Thanks.
>>
>> Zhan Zhang
>>
>> On Aug 23, 2015, at 12:49 PM, lostrain A <donotlikeworkingh...@gmail.com> wrote:
>>
>> Ted,
>> Thanks for the suggestions. Actually, I tried both s3n and s3, and the
>> result remains the same.
>>
>> On Sun, Aug 23, 2015 at 12:27 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> In your case, I would specify "fs.s3.awsAccessKeyId" /
>>> "fs.s3.awsSecretAccessKey", since you use the s3 protocol.
>>>
>>> On Sun, Aug 23, 2015 at 11:03 AM, lostrain A <donotlikeworkingh...@gmail.com> wrote:
>>>
>>>> Hi Ted,
>>>> Thanks for the reply. I tried setting both the key ID and the access
>>>> key via
>>>>
>>>>   sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
>>>>   sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "**")
>>>>
>>>> However, the error still occurs for the ORC format.
>>>>
>>>> If I change the format to JSON, the JSON files can be saved
>>>> successfully, although the error does not go away.
>>>>
>>>> On Sun, Aug 23, 2015 at 5:51 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> You may have seen this:
>>>>> http://search-hadoop.com/m/q3RTtdSyM52urAyI
>>>>>
>>>>> On Aug 23, 2015, at 1:01 AM, lostrain A <donotlikeworkingh...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>> I'm trying to save a simple DataFrame to S3 in ORC format. The code
>>>>> is as follows:
>>>>>
>>>>>   val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>>>>   import sqlContext.implicits._
>>>>>   val df = sc.parallelize(1 to 1000).toDF()
>>>>>   df.write.format("orc").save("s3://logs/dummy")
>>>>>
>>>>> I ran the above code in spark-shell, and only the _SUCCESS file was
>>>>> saved under the directory.
>>>>> The last part of the spark-shell log says:
>>>>>
>>>>>   15/08/23 07:38:23 task-result-getter-1 INFO TaskSetManager: Finished task 95.0 in stage 2.0 (TID 295) in 801 ms on ip-*-*-*-*.ec2.internal (100/100)
>>>>>   15/08/23 07:38:23 dag-scheduler-event-loop INFO DAGScheduler: ResultStage 2 (save at <console>:29) finished in 0.834 s
>>>>>   15/08/23 07:38:23 task-result-getter-1 INFO YarnScheduler: Removed TaskSet 2.0, whose tasks have all completed, from pool
>>>>>   15/08/23 07:38:23 main INFO DAGScheduler: Job 2 finished: save at <console>:29, took 0.895912 s
>>>>>   15/08/23 07:38:24 main INFO LocalDirAllocator$AllocatorPerContext$DirSelector: Returning directory: /media/ephemeral0/s3/output-
>>>>>   15/08/23 07:38:24 main ERROR NativeS3FileSystem: md5Hash for dummy/_SUCCESS is [-44, 29, -128, -39, -113, 0, -78, 4, -23, -103, 9, -104, -20, -8, 66, 126]
>>>>>   15/08/23 07:38:24 main INFO DefaultWriterContainer: Job job_****_**** committed.
>>>>>
>>>>> Has anyone experienced this before?
>>>>> Thanks!
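
After upgrading, a quick way to confirm that the write actually produced data (a sketch, using the same placeholder path as above) is to read the files back and count the rows:

  // Read the ORC files back and verify the row count.
  val readBack = sqlContext.read.format("orc").load("s3://logs/dummy")
  readBack.count()  // expect 1000 for the example above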