Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-21 Thread Riccardo Ferrari
Hi Aakash,

Can you share how you are adding those jars? Are you using the packages
method? I assume you're running on a cluster, and those dependencies might
not have been properly distributed.

How are you submitting your app? What kind of resource manager are you
using: standalone, YARN, ...?
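One way to address the distribution question above is to let Spark resolve the S3A dependencies as Maven packages at submit time, so they are shipped to every executor instead of depending on local jar paths. The sketch below is illustrative only (the script name and master URL are hypothetical placeholders, not from this thread):

```python
# Sketch: build a spark-submit command that uses --packages so the
# hadoop-aws dependency (and its transitive AWS SDK) is resolved by
# Spark and distributed to all executors automatically.
packages = [
    "org.apache.hadoop:hadoop-aws:2.7.3",  # should match the Hadoop build
]

submit_cmd = " ".join([
    "spark-submit",
    "--master", "yarn",                  # or spark://..., or local[*]
    "--packages", ",".join(packages),    # resolved once, shipped everywhere
    "my_s3_poc.py",                      # hypothetical application script
])

print(submit_cmd)
```

With `--packages`, Spark handles both driver and executor classpaths, which sidesteps the "jars present on the driver but not the workers" failure mode Riccardo is probing for.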

Best,

On Fri, Dec 21, 2018 at 1:18 PM Aakash Basu 
wrote:

> Any help, anyone?
>
> On Fri, Dec 21, 2018 at 2:21 PM Aakash Basu 
> wrote:
>
>> Hey Shuporno,
>>
>> With the updated config too, I am getting the same error. While trying to
>> figure that out, I found this link which says I need aws-java-sdk (which I
>> already have):
>> https://github.com/amazon-archives/kinesis-storm-spout/issues/8
>>
>> Now, these are my Java details:
>>
>> java version "1.8.0_181"
>>
>> Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>>
>> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>>
>>
>>
>> Is it due to some Java version mismatch, or is it something else I am
>> missing? What do you think?
>>
>> Thanks,
>> Aakash.
>>
>> On Fri, Dec 21, 2018 at 1:43 PM Shuporno Choudhury <
>> shuporno.choudh...@gmail.com> wrote:
>>
>>> Hi,
>>> I don't know whether the following config keys (that you have tried) are
>>> correct:
>>> fs.s3a.awsAccessKeyId
>>> fs.s3a.awsSecretAccessKey
>>>
>>> The correct ones probably are:
>>> fs.s3a.access.key
>>> fs.s3a.secret.key
>>>
>>> On Fri, 21 Dec 2018 at 13:21, Aakash Basu-2 [via Apache Spark User List]
>>>  wrote:
>>>
 Hey Shuporno,

 Thanks for the prompt reply, and for noticing the silly mistake. I
 tried this out, but I am still getting another error, which seems to be
 related to connectivity.

 >>> hadoop_conf.set("fs.s3a.awsAccessKeyId", "abcd")
> >>> hadoop_conf.set("fs.s3a.awsSecretAccessKey", "123abc")
> >>> a =
> spark.read.csv("s3a:///test-bucket/breast-cancer-wisconsin.csv",
> header=True)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/readwriter.py",
> line 441, in csv
> return
> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
> line 1257, in __call__
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py",
> line 63, in deco
> return f(*a, **kw)
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
> line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o220.csv.
> : java.lang.NoClassDefFoundError:
> com/amazonaws/auth/AWSCredentialsProvider
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at
> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
> at
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
> at
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
> at
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
> at
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
> at
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
> at
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.ClassNotFoundException:
> com.amazonaws.auth.AWSCredentialsProvider
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 28 more
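The NoClassDefFoundError in the trace above means the AWS SDK classes never made it onto the driver's JVM classpath. One quick way to probe this from a PySpark session is to ask the JVM for the class directly through the py4j gateway (`spark._jvm`). The helper below is a sketch, not part of the thread, written so it works against any object exposing `java.lang.Class.forName` the way `spark._jvm` does:

```python
def missing_classes(jvm, class_names):
    """Return the subset of class_names the JVM cannot load.

    `jvm` is expected to expose java.lang.Class.forName the way the
    py4j gateway (spark._jvm) does; any lookup failure is treated as
    "class not on the classpath".
    """
    missing = []
    for name in class_names:
        try:
            jvm.java.lang.Class.forName(name)
        except Exception:
            missing.append(name)
    return missing

# In a live session one would call (hypothetical usage):
#   missing_classes(spark._jvm,
#                   ["com.amazonaws.auth.AWSCredentialsProvider",
#                    "org.apache.hadoop.fs.s3a.S3AFileSystem"])
# An empty result would mean both hadoop-aws and the AWS SDK are visible.
```

If `com.amazonaws.auth.AWSCredentialsProvider` comes back missing, the problem is jar delivery or jar versioning, not the S3 credentials themselves.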

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-21 Thread Aakash Basu
Any help, anyone?


Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-21 Thread Aakash Basu
Hey Shuporno,

With the updated config too, I am getting the same error. While trying to
figure that out, I found this link which says I need aws-java-sdk (which I
already have):
https://github.com/amazon-archives/kinesis-storm-spout/issues/8

Now, these are my Java details:

java version "1.8.0_181"

Java(TM) SE Runtime Environment (build 1.8.0_181-b13)

Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)



Is it due to some Java version mismatch, or is it something else I am
missing? What do you think?

Thanks,
Aakash.

On Fri, Dec 21, 2018 at 1:43 PM Shuporno Choudhury <
shuporno.choudh...@gmail.com> wrote:

> Hi,
> I don't know whether the following config (that you have tried) are
> correct:
> fs.s3a.awsAccessKeyId
> fs.s3a.awsSecretAccessKey
>
> The correct ones probably are:
> fs.s3a.access.key
> fs.s3a.secret.key

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-21 Thread Shuporno Choudhury
Hi,
I don't know whether the following config keys (that you have tried) are correct:
fs.s3a.awsAccessKeyId
fs.s3a.awsSecretAccessKey

The correct ones probably are:
fs.s3a.access.key
fs.s3a.secret.key
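The rename Shuporno suggests can be summarized as a mapping and applied mechanically. The sketch below is not from the thread; the `fix_s3a_keys` helper is hypothetical, and the live-session equivalent (calling `hadoop_conf.set` as earlier messages do) is shown only as a comment because it needs a running PySpark session:

```python
# The key names tried in the thread vs. the s3a names suggested here.
S3A_KEY_FIXES = {
    "fs.s3a.awsAccessKeyId": "fs.s3a.access.key",
    "fs.s3a.awsSecretAccessKey": "fs.s3a.secret.key",
}

def fix_s3a_keys(conf):
    """Rewrite the mistaken s3a credential key names in a dict-like conf."""
    return {S3A_KEY_FIXES.get(key, key): value for key, value in conf.items()}

# In a live session this corresponds to (credentials are placeholders):
#   hadoop_conf.set("fs.s3a.access.key", "abcd")
#   hadoop_conf.set("fs.s3a.secret.key", "123abc")
print(fix_s3a_keys({"fs.s3a.awsAccessKeyId": "abcd"}))
```

Note that fixing the key names only helps once the classpath problem is solved; with the AWS SDK classes missing, the read fails before credentials are ever consulted.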

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-20 Thread Aakash Basu
Hey Shuporno,

Thanks for the prompt reply, and for noticing the silly mistake. I tried
this out, but I am still getting another error, which seems to be related
to connectivity.

>>> hadoop_conf.set("fs.s3a.awsAccessKeyId", "abcd")
> >>> hadoop_conf.set("fs.s3a.awsSecretAccessKey", "123abc")
> >>> a = spark.read.csv("s3a:///test-bucket/breast-cancer-wisconsin.csv",
> header=True)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/readwriter.py",
> line 441, in csv
> return
> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
> line 1257, in __call__
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py",
> line 63, in deco
> return f(*a, **kw)
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
> line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o220.csv.
> : java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at
> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
> at
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
> at
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
> at
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.ClassNotFoundException:
> com.amazonaws.auth.AWSCredentialsProvider
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 28 more



Thanks,
Aakash.

On Fri, Dec 21, 2018 at 12:51 PM Shuporno Choudhury <
shuporno.choudh...@gmail.com> wrote:

>
>
> On Fri, 21 Dec 2018 at 12:47, Shuporno Choudhury <
> shuporno.choudh...@gmail.com> wrote:
>
>> Hi,
>> Your connection config uses 's3n' but your read command uses 's3a'.
>> The config keys for s3a are:
>> spark.hadoop.fs.s3a.access.key
>> spark.hadoop.fs.s3a.secret.key
>>
>> I feel this should solve the problem.
Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-20 Thread Shuporno Choudhury
On Fri, 21 Dec 2018 at 12:47, Shuporno Choudhury <
shuporno.choudh...@gmail.com> wrote:

> Hi,
> Your connection config uses 's3n' but your read command uses 's3a'.
> The config keys for s3a are:
> spark.hadoop.fs.s3a.access.key
> spark.hadoop.fs.s3a.secret.key
>
> I feel this should solve the problem.
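The mismatch called out above (credentials set under `fs.s3n.*` while the read uses an `s3a://` URL) is easy to catch programmatically. The checker below is a sketch, not part of the thread: it compares the scheme of the URL being read against the filesystem scheme each `fs.<scheme>.*` credential key is namespaced under:

```python
from urllib.parse import urlparse

def credential_keys_match_scheme(url, conf_keys):
    """True if every fs.<scheme>.* key in conf_keys uses the same
    filesystem scheme (s3a, s3n, ...) as the URL being read."""
    scheme = urlparse(url).scheme                      # e.g. "s3a"
    fs_keys = [k for k in conf_keys if k.startswith("fs.")]
    return all(k.split(".")[1] == scheme for k in fs_keys)

# The combination from the original post fails the check:
print(credential_keys_match_scheme(
    "s3a://test-bucket/breast-cancer-wisconsin.csv",
    ["fs.s3n.awsAccessKeyId", "fs.s3n.awsSecretAccessKey"],
))  # -> False
```

Each Hadoop filesystem connector reads only its own configuration namespace, so `fs.s3n.*` credentials are invisible to the s3a connector.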
Connection issue with AWS S3 from PySpark 2.3.1

2018-12-20 Thread Aakash Basu
Hi,

I am trying to connect to AWS S3 and read a CSV file (running a POC) from a
bucket.

I have s3cmd installed and am able to run ls and other operations from the CLI.

*Present Configuration:*
Python 3.7
Spark 2.3.1

*JARs added:*
hadoop-aws-2.7.3.jar (in sync with the hadoop version used with spark)
aws-java-sdk-1.11.472.jar

Trying out the following code:

>>> sc=spark.sparkContext
>
> >>> hadoop_conf=sc._jsc.hadoopConfiguration()
>
> >>> hadoop_conf.set("fs.s3n.awsAccessKeyId", "abcd")
>
> >>> hadoop_conf.set("fs.s3n.awsSecretAccessKey", "xyz123")
>
> >>> a = spark.read.csv("s3a://test-bucket/breast-cancer-wisconsin.csv",
> header=True)
>
> Traceback (most recent call last):
>
>   File "", line 1, in 
>
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/readwriter.py",
> line 441, in csv
>
> return
> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
>
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
> line 1257, in __call__
>
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py",
> line 63, in deco
>
> return f(*a, **kw)
>
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
> line 328, in get_return_value
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o33.csv.
>
> : java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider
>
> at java.lang.Class.forName0(Native Method)
>
> at java.lang.Class.forName(Class.java:348)
>
> at
> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
>
> at
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
>
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
>
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
>
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
>
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
>
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
>
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
>
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>
> at
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
>
> at
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
>
> at
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:498)
>
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>
> at py4j.Gateway.invoke(Gateway.java:282)
>
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
>
> at java.lang.Thread.run(Thread.java:748)
>
> Caused by: java.lang.ClassNotFoundException:
> com.amazonaws.auth.AWSCredentialsProvider
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> ... 28 more
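Besides the s3n/s3a key mix-up discussed in the thread, a likely cause of the NoClassDefFoundError above is the jar pairing itself: hadoop-aws 2.7.x was built against the monolithic aws-java-sdk 1.7.4, while the 1.11.x line splits the SDK into many artifacts, so placing aws-java-sdk-1.11.472.jar next to hadoop-aws-2.7.3.jar may still leave classes unresolved. The sketch below encodes the commonly cited pairing; it is an assumption worth verifying against the hadoop-aws POM for the exact Hadoop version in use:

```python
# Sketch: pin a consistent hadoop-aws / aws-java-sdk pairing instead of
# mixing jar versions by hand. The 2.7.3 -> 1.7.4 pairing is the commonly
# cited one for hadoop-aws 2.7.x (verify against the hadoop-aws POM).
HADOOP_AWS_SDK_PAIRING = {
    "2.7.3": "1.7.4",
}

def s3a_packages(hadoop_version):
    """Return Maven coordinates for a consistent S3A classpath."""
    sdk = HADOOP_AWS_SDK_PAIRING[hadoop_version]
    return [
        "org.apache.hadoop:hadoop-aws:%s" % hadoop_version,
        "com.amazonaws:aws-java-sdk:%s" % sdk,
    ]

# Passing these via --packages lets Spark resolve matching versions:
print("--packages " + ",".join(s3a_packages("2.7.3")))
```

Using `--packages org.apache.hadoop:hadoop-aws:2.7.3` alone may also suffice, since the matching SDK should then be pulled in transitively.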