Hi Zhan,
  Thanks for the pointer. Yes, I'm using a cluster with spark-1.4.0, so SPARK-8458 looks like the most likely cause. I'll verify this again once we make the upgrade.
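
Once we're on the newer version I'll rerun roughly the following in spark-shell (the same repro as in my earlier mail below, with a read-back count added as a sanity check; the bucket/path is just the placeholder from that mail):

    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    import sqlContext.implicits._
    val df = sc.parallelize(1 to 1000).toDF()
    df.write.format("orc").save("s3://logs/dummy")
    // Should print 1000 if the ORC part files were actually written,
    // not just the _SUCCESS marker.
    println(sqlContext.read.format("orc").load("s3://logs/dummy").count())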
Best,
los

On Sun, Aug 23, 2015 at 1:25 PM, Zhan Zhang <zzh...@hortonworks.com> wrote:

> If you are using spark-1.4.0, probably it is caused by SPARK-8458
> <https://issues.apache.org/jira/browse/SPARK-8458>
>
> Thanks.
>
> Zhan Zhang
>
> On Aug 23, 2015, at 12:49 PM, lostrain A <donotlikeworkingh...@gmail.com> wrote:
>
> Ted,
>   Thanks for the suggestions. Actually I tried both s3n and s3 and the
> result remains the same.
>
> On Sun, Aug 23, 2015 at 12:27 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> In your case, I would specify "fs.s3.awsAccessKeyId" /
>> "fs.s3.awsSecretAccessKey" since you use the s3 protocol.
>>
>> On Sun, Aug 23, 2015 at 11:03 AM, lostrain A <donotlikeworkingh...@gmail.com> wrote:
>>
>>> Hi Ted,
>>>   Thanks for the reply. I tried setting both the key ID and access key via
>>>
>>>> sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
>>>> sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "**")
>>>
>>> However, the error still occurs for the ORC format.
>>>
>>> If I change the format to JSON, the JSON files are saved successfully,
>>> although the error does not go away.
>>>
>>> On Sun, Aug 23, 2015 at 5:51 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> You may have seen this:
>>>> http://search-hadoop.com/m/q3RTtdSyM52urAyI
>>>>
>>>> On Aug 23, 2015, at 1:01 AM, lostrain A <donotlikeworkingh...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>   I'm trying to save a simple dataframe to S3 in ORC format. The code
>>>> is as follows:
>>>>
>>>>> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>>>> import sqlContext.implicits._
>>>>> val df = sc.parallelize(1 to 1000).toDF()
>>>>> df.write.format("orc").save("s3://logs/dummy")
>>>>
>>>> I ran the above code in spark-shell and only the _SUCCESS file was
>>>> saved under the directory.
>>>> The last part of the spark-shell log said:
>>>>
>>>>> 15/08/23 07:38:23 task-result-getter-1 INFO TaskSetManager: Finished
>>>>> task 95.0 in stage 2.0 (TID 295) in 801 ms on ip-*-*-*-*.ec2.internal
>>>>> (100/100)
>>>>> 15/08/23 07:38:23 dag-scheduler-event-loop INFO DAGScheduler:
>>>>> ResultStage 2 (save at <console>:29) finished in 0.834 s
>>>>> 15/08/23 07:38:23 task-result-getter-1 INFO YarnScheduler: Removed
>>>>> TaskSet 2.0, whose tasks have all completed, from pool
>>>>> 15/08/23 07:38:23 main INFO DAGScheduler: Job 2 finished: save at
>>>>> <console>:29, took 0.895912 s
>>>>> 15/08/23 07:38:24 main INFO
>>>>> LocalDirAllocator$AllocatorPerContext$DirSelector: Returning directory:
>>>>> /media/ephemeral0/s3/output-
>>>>> 15/08/23 07:38:24 main ERROR NativeS3FileSystem: md5Hash for
>>>>> dummy/_SUCCESS is [-44, 29, -128, -39, -113, 0, -78,
>>>>> 4, -23, -103, 9, -104, -20, -8, 66, 126]
>>>>> 15/08/23 07:38:24 main INFO DefaultWriterContainer: Job job_****_****
>>>>> committed.
>>>>
>>>> Has anyone experienced this before?
>>>> Thanks!
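
P.S. For anyone finding this thread in the archives: when I said above that I tried both s3n and s3, the s3-scheme settings were of the form Ted suggested, i.e. (actual keys elided):

    sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "***")
    sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "***")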