[ https://issues.apache.org/jira/browse/SPARK-5866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-5866:
-----------------------------
    Priority: Major  (was: Blocker)

The immediate error is:

{code}
: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
not exist: s3://bucketName/pathS3/1111_1417479684
{code}

Please first verify that this path even exists.
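One detail worth checking alongside the path itself: the failing URL uses the s3:// scheme, while the credentials in the report below are set under the fs.s3n.* properties. Hadoop's "s3" and "s3n" filesystems read their credentials from different configuration keys, so a key set for one scheme is invisible to the other. The sketch below is only an illustration of that mapping (the helper name is hypothetical; the property names are the ones used by Hadoop's S3 connectors), not Spark or Hadoop code:

```python
from urllib.parse import urlparse

# Hadoop's "s3" and "s3n" filesystems read credentials from different
# property names; setting one pair does not authenticate the other scheme.
CREDENTIAL_KEYS = {
    "s3":  ("fs.s3.awsAccessKeyId", "fs.s3.awsSecretAccessKey"),
    "s3n": ("fs.s3n.awsAccessKeyId", "fs.s3n.awsSecretAccessKey"),
}

def credential_keys_for(path):
    """Return the Hadoop configuration keys matching the path's URL scheme."""
    return CREDENTIAL_KEYS[urlparse(path).scheme]
```

For the path in the error, credential_keys_for("s3://bucketName/pathS3/1111_1417479684") returns the fs.s3.* pair, not the fs.s3n.* pair that was set, which could make a readable path appear missing.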

> pyspark read from s3
> --------------------
>
>                 Key: SPARK-5866
>                 URL: https://issues.apache.org/jira/browse/SPARK-5866
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.1
>         Environment: Mac OS X and EC2 Ubuntu
>            Reporter: venu k tangirala
>
> I am trying to read data from S3 via PySpark. I provided the credentials with:
> {code}
> sc = SparkContext()
> sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "key")
> sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "secret_key")
> {code}
> I also tried setting the credentials in a core-site.xml placed in the conf/ dir.
> Interestingly, the same works with the Scala version of Spark, both when the S3 access key and secret key are set in Scala code and when they are set in core-site.xml.
> The PySpark error is as follows:
> {code}
> File "/Users/myname/leeo/path/./spark_json.py", line 55, in <module>
>     vals_table = sqlContext.inferSchema(values)
>   File "/Users/myname/spark-1.2.1/python/pyspark/sql.py", line 1332, in inferSchema
>     first = rdd.first()
>   File "/Users/myname/spark-1.2.1/python/pyspark/rdd.py", line 1139, in first
>     rs = self.take(1)
>   File "/Users/myname/spark-1.2.1/python/pyspark/rdd.py", line 1091, in take
>     totalParts = self._jrdd.partitions().size()
>   File "/anaconda/lib/python2.7/site-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.py", line 538, in __call__
>     self.target_id, self.name)
>   File "/anaconda/lib/python2.7/site-packages/py4j-0.8.2.1-py2.7.egg/py4j/protocol.py", line 300, in get_return_value
>     format(target_id, '.', name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling o24.partitions.
> : org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: s3://bucketName/pathS3/1111_1417479684
>       at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
>       at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:61)
>       at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:269)
>       at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
>       at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
>       at scala.Option.getOrElse(Option.scala:120)
>       at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
>       at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)
>       at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
>       at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
>       at scala.Option.getOrElse(Option.scala:120)
>       at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
>       at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:53)
>       at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:606)
>       at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>       at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>       at py4j.Gateway.invoke(Gateway.java:259)
>       at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>       at py4j.commands.CallCommand.execute(CallCommand.java:79)
>       at py4j.GatewayConnection.run(GatewayConnection.java:207)
>       at java.lang.Thread.run(Thread.java:724)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
