Re: Error when saving a dataframe as ORC file

2015-08-23 Thread lostrain A
Hi Ted,
  Thanks for the reply. I tried setting both the access key ID and the secret access key via

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "**")


However, the error still occurs for ORC format.

If I change the format to JSON, the error does not go away, but the JSON
files are saved successfully.




On Sun, Aug 23, 2015 at 5:51 AM, Ted Yu yuzhih...@gmail.com wrote:

 You may have seen this:
 http://search-hadoop.com/m/q3RTtdSyM52urAyI



 On Aug 23, 2015, at 1:01 AM, lostrain A donotlikeworkingh...@gmail.com
 wrote:

 Hi,
   I'm trying to save a simple dataframe to S3 in ORC format. The code is
 as follows:


  val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
  import sqlContext.implicits._
  val df = sc.parallelize(1 to 1000).toDF()
  df.write.format("orc").save("s3://logs/dummy")


 I ran the above code in spark-shell and only the _SUCCESS file was saved
 under the directory.
 The last part of the spark-shell log said:

 15/08/23 07:38:23 task-result-getter-1 INFO TaskSetManager: Finished task 95.0 in stage 2.0 (TID 295) in 801 ms on ip-*-*-*-*.ec2.internal (100/100)
 15/08/23 07:38:23 dag-scheduler-event-loop INFO DAGScheduler: ResultStage 2 (save at <console>:29) finished in 0.834 s
 15/08/23 07:38:23 task-result-getter-1 INFO YarnScheduler: Removed TaskSet 2.0, whose tasks have all completed, from pool
 15/08/23 07:38:23 main INFO DAGScheduler: Job 2 finished: save at <console>:29, took 0.895912 s
 15/08/23 07:38:24 main INFO LocalDirAllocator$AllocatorPerContext$DirSelector: Returning directory: /media/ephemeral0/s3/output-
 15/08/23 07:38:24 main ERROR NativeS3FileSystem: md5Hash for dummy/_SUCCESS is [-44, 29, -128, -39, -113, 0, -78, 4, -23, -103, 9, -104, -20, -8, 66, 126]
 15/08/23 07:38:24 main INFO DefaultWriterContainer: Job job__ committed.


 Has anyone experienced this before?
 Thanks!





Re: Error when saving a dataframe as ORC file

2015-08-23 Thread lostrain A
Ted,
  Thanks for the suggestions. Actually, I tried both s3n and s3, and the
result remains the same.
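
A minimal sketch of what "tried both" could look like in the same spark-shell session, assuming the df and bucket path from the original snippet; only the URI scheme in the save path changes between the two attempts:

  // attempt with the s3n scheme (paired with the fs.s3n.* keys set earlier)
  df.write.format("orc").save("s3n://logs/dummy")
  // ...and the same write with the s3 scheme, as in the original snippet
  df.write.format("orc").save("s3://logs/dummy")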





Re: Error when saving a dataframe as ORC file

2015-08-23 Thread Ted Yu
In your case, I would specify fs.s3.awsAccessKeyId /
fs.s3.awsSecretAccessKey since you use the s3 protocol.
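
A minimal spark-shell sketch of this suggestion, assuming the same session and redacted credentials as in the earlier snippets; the property names simply swap the fs.s3n prefix for fs.s3 to match the s3:// URI in the save path:

  // credential properties named for the s3 scheme, per the suggestion above
  sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "***")
  sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "***")
  // then retry the original write
  df.write.format("orc").save("s3://logs/dummy")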



Re: Error when saving a dataframe as ORC file

2015-08-23 Thread Ted Yu
You may have seen this:
http://search-hadoop.com/m/q3RTtdSyM52urAyI





Re: Error when saving a dataframe as ORC file

2015-08-23 Thread Ted Yu
SPARK-8458 is in the 1.4.1 release.

You can upgrade to 1.4.1 or wait for the upcoming 1.5.0 release.
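
As a quick sanity check before and after the upgrade, the running version can be read from the spark-shell session itself (a small sketch; sc is the SparkContext that spark-shell provides):

  // returns the Spark version string, e.g. "1.4.0" now, "1.4.1" or "1.5.0" after upgrading
  sc.version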




Re: Error when saving a dataframe as ORC file

2015-08-23 Thread Zhan Zhang
If you are using spark-1.4.0, it is probably caused by SPARK-8458
(https://issues.apache.org/jira/browse/SPARK-8458).

Thanks.

Zhan Zhang



Re: Error when saving a dataframe as ORC file

2015-08-23 Thread lostrain A
Hi Zhan,
  Thanks for the pointer. Yes, I'm using a cluster with spark-1.4.0, and it
looks like this is most likely the reason. I'll verify this again once we
make the upgrade.

Best,
los
