Re: AWS-Credentials fails with org.apache.hadoop.fs.s3.S3Exception: FORBIDDEN

2015-05-08 Thread Akhil Das
Have a look at this SO question:
http://stackoverflow.com/questions/24048729/how-to-read-input-from-s3-in-a-spark-streaming-ec2-cluster-application
It has a discussion of various ways of accessing S3.
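
For reference, here is a minimal PySpark sketch of two common approaches
(the bucket, path and key values are placeholders, and it assumes the
s3n:// filesystem backed by the Hadoop/jets3t layer):

from pyspark import SparkContext

sc = SparkContext(appName="s3-credentials-example")

# Approach 1: set the standard Hadoop s3n credential keys directly on the
# Hadoop configuration backing this SparkContext.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

rdd = sc.textFile("s3n://your-bucket/path/to/input")
print(rdd.count())

# Approach 2: export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in the
# environment before launching the driver; Spark normally copies them into
# the Hadoop configuration, so the same s3n:// path works without the
# explicit set() calls above.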

Thanks
Best Regards

On Fri, May 8, 2015 at 1:21 AM, in4maniac sa...@skimlinks.com wrote:

 Hi Guys,

 I think this problem is related to :

 http://apache-spark-user-list.1001560.n3.nabble.com/AWS-Credentials-for-private-S3-reads-td8689.html

 I am running pyspark 1.2.1 in AWS with my AWS credentials exported to the
 master node as environment variables.

 Halfway through my application, the following exception is thrown:
 org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException:
 S3 HEAD request failed for file path - ResponseCode=403,
 ResponseMessage=Forbidden

 Here is some important information about my job:
 + my AWS credentials are exported to the master node as environment variables
 + there are no '/'s in my secret key
 + The earlier steps that use this parquet file actually complete
 successfully
 + The step before the count() does the following:
+ reads the parquet file (SELECT STATEMENT)
+ maps it to an RDD
+ runs a filter on the RDD
 + The filter works as follows (a rough sketch follows this list):
+ extracts one field from each RDD line
+ checks with a list of 40,000 hashes for presence (if field in
 LIST_OF_HASHES.value)
+ LIST_OF_HASHES is a broadcast object
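
 For illustration, a minimal sketch of that broadcast-and-filter pattern (the
 sample data, field position and hash values below are made up):

 from pyspark import SparkContext

 sc = SparkContext(appName="broadcast-filter-example")

 # Broadcast the set of hashes once so every task reuses the same copy
 # instead of re-shipping it with each closure (a set also makes the
 # membership test O(1) instead of a 40,000-element list scan).
 LIST_OF_HASHES = sc.broadcast(set(["a1b2", "c3d4"]))

 lines = sc.parallelize([("a1b2", "keep me"), ("zzzz", "drop me")])

 # Extract one field from each line and keep the line only if that field
 # is present in the broadcast set.
 filtered = lines.filter(lambda line: line[0] in LIST_OF_HASHES.value)
 print(filtered.count())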

 The weirdness is that I am using this parquet file in earlier steps and it
 works fine. The other confusing part is that it only
 starts failing halfway through the stage: it completes a fraction of the tasks
 and then starts failing.

 Hoping to hear something positive. Many thanks in advance

 Sahanbull

 The stack trace is as follows:
  >>> negativeObs.count()
 [Stage 9:==   (161 + 240) /
 800]

 15/05/07 07:55:59 ERROR TaskSetManager: Task 277 in stage 9.0 failed 4
 times; aborting job
 Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
   File /root/spark/python/pyspark/rdd.py, line 829, in count
 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   File /root/spark/python/pyspark/rdd.py, line 820, in sum
 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
   File /root/spark/python/pyspark/rdd.py, line 725, in reduce
 vals = self.mapPartitions(func).collect()
   File /root/spark/python/pyspark/rdd.py, line 686, in collect
 bytesInJava = self._jrdd.collect().iterator()
   File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py,
 line 538, in __call__
   File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line
 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling o139.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task
 277 in stage 9.0 failed 4 times, most recent failure: Lost task 277.3 in
 stage 9.0 (TID 4832, ip-172-31-1-185.ec2.internal):
 org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException:
 S3 HEAD request failed for

 '/subbucket%2Fpath%2F2Fpath%2F2Fpath%2F2Fpath%2F2Fpath%2Ffilename.parquet%2Fpart-r-349.parquet'
 - ResponseCode=403, ResponseMessage=Forbidden
 at

 org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:122)
 at sun.reflect.GeneratedMethodAccessor116.invoke(Unknown Source)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at

 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
 at

 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
 at org.apache.hadoop.fs.s3native.$Proxy9.retrieveMetadata(Unknown
 Source)
 at

 org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:326)
 at
 parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
 at

 parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
 at
 parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
 at
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:135)
 at
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:120)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at 

Re: AWS-Credentials fails with org.apache.hadoop.fs.s3.S3Exception: FORBIDDEN

2015-05-08 Thread in4maniac
Hi guys... I realised that it was a bug in my code that caused it to
break: I was running the filter on a SchemaRDD when I was supposed to be
running it on an RDD.

But I still don't understand why the stderr showed an S3 request failure rather
than a type-checking error such as "no tuple position 0 found in Row type". The
error was misleading enough that I overlooked this logical error in my code.
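
For anyone hitting the same thing, a minimal sketch of the distinction (the
parquet path, field name and hash values are placeholders):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="schemardd-vs-rdd")
sqlContext = SQLContext(sc)

hashes = sc.broadcast(set(["a1b2", "c3d4"]))

# parquetFile() returns a SchemaRDD whose elements are Row objects.
schema_rdd = sqlContext.parquetFile("s3n://bucket/path/filename.parquet")

# Map the Rows down to plain values first, then filter the resulting RDD;
# the bug above was running the filter on the SchemaRDD itself instead of
# on an RDD produced by map().
negativeObs = (schema_rdd
               .map(lambda row: row.some_field)   # placeholder field name
               .filter(lambda v: v in hashes.value))
print(negativeObs.count())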

Just thought I should keep this posted.

-in4



