Re: Error when saving a dataframe as ORC file
Hi Ted,

Thanks for the reply. I tried setting both the access key ID and the secret access key via

  sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
  sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "***")

However, the error still occurs for the ORC format. If I change the format to JSON, the error still appears, but the JSON files are saved successfully.

On Sun, Aug 23, 2015 at 5:51 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> You may have seen this: http://search-hadoop.com/m/q3RTtdSyM52urAyI
>
> On Aug 23, 2015, at 1:01 AM, lostrain A <donotlikeworkingh...@gmail.com> wrote:
>> Hi,
>>
>> I'm trying to save a simple dataframe to S3 in ORC format. The code is as follows:
>>
>>   val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>   import sqlContext.implicits._
>>   val df = sc.parallelize(1 to 1000).toDF()
>>   df.write.format("orc").save("s3://logs/dummy")
>>
>> I ran the above code in spark-shell, and only the _SUCCESS file was saved under the directory. The last part of the spark-shell log reads:
>>
>>   15/08/23 07:38:23 task-result-getter-1 INFO TaskSetManager: Finished task 95.0 in stage 2.0 (TID 295) in 801 ms on ip-*-*-*-*.ec2.internal (100/100)
>>   15/08/23 07:38:23 dag-scheduler-event-loop INFO DAGScheduler: ResultStage 2 (save at <console>:29) finished in 0.834 s
>>   15/08/23 07:38:23 task-result-getter-1 INFO YarnScheduler: Removed TaskSet 2.0, whose tasks have all completed, from pool
>>   15/08/23 07:38:23 main INFO DAGScheduler: Job 2 finished: save at <console>:29, took 0.895912 s
>>   15/08/23 07:38:24 main INFO LocalDirAllocator$AllocatorPerContext$DirSelector: Returning directory: /media/ephemeral0/s3/output-
>>   15/08/23 07:38:24 main ERROR NativeS3FileSystem: md5Hash for dummy/_SUCCESS is [-44, 29, -128, -39, -113, 0, -78, 4, -23, -103, 9, -104, -20, -8, 66, 126]
>>   15/08/23 07:38:24 main INFO DefaultWriterContainer: Job job__ committed.
>>
>> Has anyone experienced this before? Thanks!
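A minimal spark-shell sketch of the two write attempts described above, with placeholder credentials; the dummy-json prefix is hypothetical (only s3://logs/dummy appears in the original post):

  // Sketch only: key values are placeholders and the JSON output prefix is made up for illustration.
  val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
  import sqlContext.implicits._

  sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<ACCESS_KEY_ID>")
  sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<SECRET_ACCESS_KEY>")

  val df = sc.parallelize(1 to 1000).toDF()
  df.write.format("json").save("s3://logs/dummy-json")  // JSON part files are written successfully
  df.write.format("orc").save("s3://logs/dummy")        // only the _SUCCESS file ends up under the prefix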
Re: Error when saving a dataframe as ORC file
Ted,

Thanks for the suggestions. I actually tried both s3n and s3, and the result remains the same.

On Sun, Aug 23, 2015 at 12:27 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> In your case, I would specify fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey, since you use the s3 protocol.
Re: Error when saving a dataframe as ORC file
In your case, I would specify fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey, since you use the s3 protocol.

On Sun, Aug 23, 2015 at 11:03 AM, lostrain A <donotlikeworkingh...@gmail.com> wrote:
> Hi Ted,
> Thanks for the reply. I tried setting both the access key ID and the secret access key via
>   sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
>   sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "***")
> However, the error still occurs for the ORC format. If I change the format to JSON, the error still appears, but the JSON files are saved successfully.
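A minimal spark-shell sketch of this suggestion, with placeholder key values; df stands for the dataframe from the original post:

  // Sketch only: set the s3 (not s3n) credential properties so they match the s3:// URI being written to.
  sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "<ACCESS_KEY_ID>")
  sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "<SECRET_ACCESS_KEY>")
  df.write.format("orc").save("s3://logs/dummy")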
Re: Error when saving a dataframe as ORC file
You may have seen this: http://search-hadoop.com/m/q3RTtdSyM52urAyI
Re: Error when saving a dataframe as ORC file
SPARK-8458 is in the 1.4.1 release. You can upgrade to 1.4.1 or wait for the upcoming 1.5.0 release.

On Sun, Aug 23, 2015 at 2:05 PM, lostrain A <donotlikeworkingh...@gmail.com> wrote:
> Hi Zhan,
> Thanks for the pointer. Yes, I'm using a cluster with spark-1.4.0, and it looks like this is most likely the reason. I'll verify this again once we make the upgrade.
> Best,
> los
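A quick way to confirm which Spark version the shell is actually running, before and after such an upgrade (sc.version is a standard SparkContext field):

  // Prints the running Spark version, e.g. "1.4.0"; per the note above, SPARK-8458 is fixed as of 1.4.1.
  println(sc.version)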
Re: Error when saving a dataframe as ORC file
If you are using spark-1.4.0, it is probably caused by SPARK-8458 (https://issues.apache.org/jira/browse/SPARK-8458).

Thanks.

Zhan Zhang

On Aug 23, 2015, at 12:49 PM, lostrain A <donotlikeworkingh...@gmail.com> wrote:
> Ted,
> Thanks for the suggestions. I actually tried both s3n and s3, and the result remains the same.
Re: Error when saving a dataframe as ORC file
Hi Zhan,

Thanks for the pointer. Yes, I'm using a cluster with spark-1.4.0, and it looks like this is most likely the reason. I'll verify this again once we make the upgrade.

Best,
los

On Sun, Aug 23, 2015 at 1:25 PM, Zhan Zhang <zzh...@hortonworks.com> wrote:
> If you are using spark-1.4.0, it is probably caused by SPARK-8458 (https://issues.apache.org/jira/browse/SPARK-8458).