PySpark tests fail with java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.sources.FakeSourceOne not found
Hi Team,

I am running the PySpark tests on Spark 3.4.0/3.5.0 and they fail with "Provider org.apache.spark.sql.sources.FakeSourceOne not found".

Spark Version: 3.4.0/3.5.0
Python Version: 3.8.10
OS: Ubuntu 20.04

Steps:
1. /opt/data/spark/build/sbt -Phive clean package
2. /opt/data/spark/build/sbt test:compile
3. pip3 install -r /opt/data/spark/dev/requirements.txt
4. /opt/data/spark/python/run-tests --python-executables=python3

Exception:
==
ERROR [15.081s]: test_read_images (pyspark.ml.tests.test_image.ImageFileFormatTest)
--
Traceback (most recent call last):
  File "/opt/data/spark/python/pyspark/ml/tests/test_image.py", line 29, in test_read_images
    self.spark.read.format("image")
  File "/opt/data/spark/python/pyspark/sql/readwriter.py", line 300, in load
    return self._df(self._jreader.load(path))
  File "/opt/data/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
  File "/opt/data/spark/python/pyspark/errors/exceptions/captured.py", line 176, in deco
    return f(*a, **kw)
  File "/opt/data/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.sources.FakeSourceOne not found
    at java.util.ServiceLoader.fail(ServiceLoader.java:239)
    at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
    at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:372)
    at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
    at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
    at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
    at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
    at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
    at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
    at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:629)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:697)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:208)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:750)

Could someone help me with how to proceed further?

--
Thanks and Regards,
Ranga Reddy
Bangalore, Karnataka, India
Mobile: +91-9986183183 | Email: rangareddy.av...@gmail.com
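For context on what is failing here: the data source lookup in the stack trace is a plain java.util.ServiceLoader scan over every META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file on the classpath. FakeSourceOne is a test-only source (registered via a services file in Spark's SQL test resources, as far as I can tell), so this error usually means a services file advertising it is on the classpath while the compiled class is not — for example a stale build output mixed with a fresh one. The standalone sketch below (not part of the Spark test suite) reproduces the same lookup; iterating the registered providers throws the identical ServiceConfigurationError when a listed class is missing.

import java.util.ServiceLoader
import scala.collection.JavaConverters._

import org.apache.spark.sql.sources.DataSourceRegister

object DataSourceRegistrationCheck {
  def main(args: Array[String]): Unit = {
    // DataSource.lookupDataSource (visible in the stack trace above) asks
    // ServiceLoader for every DataSourceRegister on the classpath. Iterating
    // the providers forces each listed class to load; a class named in a
    // services file but absent from the classpath (e.g. the test-only
    // FakeSourceOne) surfaces as java.util.ServiceConfigurationError here.
    val loader = Thread.currentThread().getContextClassLoader
    val providers = ServiceLoader.load(classOf[DataSourceRegister], loader)
    providers.asScala.foreach(p => println(p.shortName()))
  }
}

Rebuilding so that the test classes and the service registration files come from the same build output (rather than a leftover jar or target directory) is typically what clears this up.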
Re: StorageLevel: OFF_HEAP
Thanks for the information. Will rebuild with 0.6.0 till the patch is merged. On Tue, Mar 17, 2015 at 7:24 PM, Ted Yu yuzhih...@gmail.com wrote: Ranga: Take a look at https://github.com/apache/spark/pull/4867 Cheers On Tue, Mar 17, 2015 at 6:08 PM, fightf...@163.com fightf...@163.com wrote: Hi, Ranga That's true. Typically a version mis-match issue. Note that spark 1.2.1 has tachyon built in with version 0.5.0 , I think you may need to rebuild spark with your current tachyon release. We had used tachyon for several of our spark projects in a production environment. Thanks, Sun. -- fightf...@163.com *From:* Ranga sra...@gmail.com *Date:* 2015-03-18 06:45 *To:* user@spark.apache.org *Subject:* StorageLevel: OFF_HEAP Hi I am trying to use the OFF_HEAP storage level in my Spark (1.2.1) cluster. The Tachyon (0.6.0-SNAPSHOT) nodes seem to be up and running. However, when I try to persist the RDD, I get the following error: ERROR [2015-03-16 22:22:54,017] ({Executor task launch worker-0} TachyonFS.java[connect]:364) - Invalid method name: 'getUserUnderfsTempFolder' ERROR [2015-03-16 22:22:54,050] ({Executor task launch worker-0} TachyonFS.java[getFileId]:1020) - Invalid method name: 'user_getFileId' Is this because of a version mis-match? On a different note, I was wondering if Tachyon has been used in a production environment by anybody in this group? Appreciate your help with this. - Ranga
Re: StorageLevel: OFF_HEAP
I just tested with Spark-1.3.0 + Tachyon-0.6.0 and still see the same issue. Here are the logs: 15/03/18 11:44:11 ERROR : Invalid method name: 'getDataFolder' 15/03/18 11:44:11 ERROR : Invalid method name: 'user_getFileId' 15/03/18 11:44:11 ERROR storage.TachyonBlockManager: Failed 10 attempts to create tachyon dir in /tmp_spark_tachyon/spark-e3538a20-5e42-48a4-ad67-4b97aded90e4/driver Thanks for any other pointers. - Ranga On Wed, Mar 18, 2015 at 9:53 AM, Ranga sra...@gmail.com wrote: Thanks for the information. Will rebuild with 0.6.0 till the patch is merged. On Tue, Mar 17, 2015 at 7:24 PM, Ted Yu yuzhih...@gmail.com wrote: Ranga: Take a look at https://github.com/apache/spark/pull/4867 Cheers On Tue, Mar 17, 2015 at 6:08 PM, fightf...@163.com fightf...@163.com wrote: Hi, Ranga That's true. Typically a version mis-match issue. Note that spark 1.2.1 has tachyon built in with version 0.5.0 , I think you may need to rebuild spark with your current tachyon release. We had used tachyon for several of our spark projects in a production environment. Thanks, Sun. -- fightf...@163.com *From:* Ranga sra...@gmail.com *Date:* 2015-03-18 06:45 *To:* user@spark.apache.org *Subject:* StorageLevel: OFF_HEAP Hi I am trying to use the OFF_HEAP storage level in my Spark (1.2.1) cluster. The Tachyon (0.6.0-SNAPSHOT) nodes seem to be up and running. However, when I try to persist the RDD, I get the following error: ERROR [2015-03-16 22:22:54,017] ({Executor task launch worker-0} TachyonFS.java[connect]:364) - Invalid method name: 'getUserUnderfsTempFolder' ERROR [2015-03-16 22:22:54,050] ({Executor task launch worker-0} TachyonFS.java[getFileId]:1020) - Invalid method name: 'user_getFileId' Is this because of a version mis-match? On a different note, I was wondering if Tachyon has been used in a production environment by anybody in this group? Appreciate your help with this. - Ranga
Re: StorageLevel: OFF_HEAP
Thanks Ted. Will do. On Wed, Mar 18, 2015 at 2:27 PM, Ted Yu yuzhih...@gmail.com wrote: Ranga: Please apply the patch from: https://github.com/apache/spark/pull/4867 And rebuild Spark - the build would use Tachyon-0.6.1 Cheers On Wed, Mar 18, 2015 at 2:23 PM, Ranga sra...@gmail.com wrote: Hi Haoyuan No. I assumed that Spark-1.3.0 was already built with Tachyon-0.6.0. If not, I can rebuild and try. Could you let me know how to rebuild with 0.6.0? Thanks for your help. - Ranga On Wed, Mar 18, 2015 at 12:59 PM, Haoyuan Li haoyuan...@gmail.com wrote: Did you recompile it with Tachyon 0.6.0? Also, Tachyon 0.6.1 has been released this morning: http://tachyon-project.org/ ; https://github.com/amplab/tachyon/releases Best regards, Haoyuan On Wed, Mar 18, 2015 at 11:48 AM, Ranga sra...@gmail.com wrote: I just tested with Spark-1.3.0 + Tachyon-0.6.0 and still see the same issue. Here are the logs: 15/03/18 11:44:11 ERROR : Invalid method name: 'getDataFolder' 15/03/18 11:44:11 ERROR : Invalid method name: 'user_getFileId' 15/03/18 11:44:11 ERROR storage.TachyonBlockManager: Failed 10 attempts to create tachyon dir in /tmp_spark_tachyon/spark-e3538a20-5e42-48a4-ad67-4b97aded90e4/driver Thanks for any other pointers. - Ranga On Wed, Mar 18, 2015 at 9:53 AM, Ranga sra...@gmail.com wrote: Thanks for the information. Will rebuild with 0.6.0 till the patch is merged. On Tue, Mar 17, 2015 at 7:24 PM, Ted Yu yuzhih...@gmail.com wrote: Ranga: Take a look at https://github.com/apache/spark/pull/4867 Cheers On Tue, Mar 17, 2015 at 6:08 PM, fightf...@163.com fightf...@163.com wrote: Hi, Ranga That's true. Typically a version mis-match issue. Note that spark 1.2.1 has tachyon built in with version 0.5.0 , I think you may need to rebuild spark with your current tachyon release. We had used tachyon for several of our spark projects in a production environment. Thanks, Sun. -- fightf...@163.com *From:* Ranga sra...@gmail.com *Date:* 2015-03-18 06:45 *To:* user@spark.apache.org *Subject:* StorageLevel: OFF_HEAP Hi I am trying to use the OFF_HEAP storage level in my Spark (1.2.1) cluster. The Tachyon (0.6.0-SNAPSHOT) nodes seem to be up and running. However, when I try to persist the RDD, I get the following error: ERROR [2015-03-16 22:22:54,017] ({Executor task launch worker-0} TachyonFS.java[connect]:364) - Invalid method name: 'getUserUnderfsTempFolder' ERROR [2015-03-16 22:22:54,050] ({Executor task launch worker-0} TachyonFS.java[getFileId]:1020) - Invalid method name: 'user_getFileId' Is this because of a version mis-match? On a different note, I was wondering if Tachyon has been used in a production environment by anybody in this group? Appreciate your help with this. - Ranga -- Haoyuan Li AMPLab, EECS, UC Berkeley http://www.cs.berkeley.edu/~haoyuan/
Re: StorageLevel: OFF_HEAP
Hi Haoyuan No. I assumed that Spark-1.3.0 was already built with Tachyon-0.6.0. If not, I can rebuild and try. Could you let me know how to rebuild with 0.6.0? Thanks for your help. - Ranga On Wed, Mar 18, 2015 at 12:59 PM, Haoyuan Li haoyuan...@gmail.com wrote: Did you recompile it with Tachyon 0.6.0? Also, Tachyon 0.6.1 has been released this morning: http://tachyon-project.org/ ; https://github.com/amplab/tachyon/releases Best regards, Haoyuan On Wed, Mar 18, 2015 at 11:48 AM, Ranga sra...@gmail.com wrote: I just tested with Spark-1.3.0 + Tachyon-0.6.0 and still see the same issue. Here are the logs: 15/03/18 11:44:11 ERROR : Invalid method name: 'getDataFolder' 15/03/18 11:44:11 ERROR : Invalid method name: 'user_getFileId' 15/03/18 11:44:11 ERROR storage.TachyonBlockManager: Failed 10 attempts to create tachyon dir in /tmp_spark_tachyon/spark-e3538a20-5e42-48a4-ad67-4b97aded90e4/driver Thanks for any other pointers. - Ranga On Wed, Mar 18, 2015 at 9:53 AM, Ranga sra...@gmail.com wrote: Thanks for the information. Will rebuild with 0.6.0 till the patch is merged. On Tue, Mar 17, 2015 at 7:24 PM, Ted Yu yuzhih...@gmail.com wrote: Ranga: Take a look at https://github.com/apache/spark/pull/4867 Cheers On Tue, Mar 17, 2015 at 6:08 PM, fightf...@163.com fightf...@163.com wrote: Hi, Ranga That's true. Typically a version mis-match issue. Note that spark 1.2.1 has tachyon built in with version 0.5.0 , I think you may need to rebuild spark with your current tachyon release. We had used tachyon for several of our spark projects in a production environment. Thanks, Sun. -- fightf...@163.com *From:* Ranga sra...@gmail.com *Date:* 2015-03-18 06:45 *To:* user@spark.apache.org *Subject:* StorageLevel: OFF_HEAP Hi I am trying to use the OFF_HEAP storage level in my Spark (1.2.1) cluster. The Tachyon (0.6.0-SNAPSHOT) nodes seem to be up and running. However, when I try to persist the RDD, I get the following error: ERROR [2015-03-16 22:22:54,017] ({Executor task launch worker-0} TachyonFS.java[connect]:364) - Invalid method name: 'getUserUnderfsTempFolder' ERROR [2015-03-16 22:22:54,050] ({Executor task launch worker-0} TachyonFS.java[getFileId]:1020) - Invalid method name: 'user_getFileId' Is this because of a version mis-match? On a different note, I was wondering if Tachyon has been used in a production environment by anybody in this group? Appreciate your help with this. - Ranga -- Haoyuan Li AMPLab, EECS, UC Berkeley http://www.cs.berkeley.edu/~haoyuan/
StorageLevel: OFF_HEAP
Hi

I am trying to use the OFF_HEAP storage level in my Spark (1.2.1) cluster. The Tachyon (0.6.0-SNAPSHOT) nodes seem to be up and running. However, when I try to persist the RDD, I get the following errors:

ERROR [2015-03-16 22:22:54,017] ({Executor task launch worker-0} TachyonFS.java[connect]:364) - Invalid method name: 'getUserUnderfsTempFolder'
ERROR [2015-03-16 22:22:54,050] ({Executor task launch worker-0} TachyonFS.java[getFileId]:1020) - Invalid method name: 'user_getFileId'

Is this because of a version mismatch? On a different note, I was wondering if Tachyon has been used in a production environment by anybody in this group? Appreciate your help with this.

- Ranga
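For context, a minimal sketch of how an RDD is persisted OFF_HEAP in this Spark line. The master and Tachyon URLs and the data are placeholders, and the property name below is the one documented for Spark 1.2/1.3; verify it against your build.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object OffHeapSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("off-heap-sketch")
      // Point the block manager at the Tachyon master (placeholder host/port).
      .set("spark.tachyonStore.url", "tachyon://tachyon-master:19998")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 1000000)
    // OFF_HEAP stores serialized blocks in Tachyon. The Tachyon client bundled
    // into the Spark build must match the Tachyon server version; a mismatch
    // shows up as the Thrift "Invalid method name" errors quoted in the
    // replies above.
    rdd.persist(StorageLevel.OFF_HEAP)
    println(rdd.count())
    sc.stop()
  }
}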
Re: Spark Streaming: HiveContext within Custom Actor
Thanks. Will look at other options.

On Tue, Dec 30, 2014 at 11:43 AM, Tathagata Das tathagata.das1...@gmail.com wrote:

I am not sure that can be done. Receivers are designed to be run only on the executors/workers, whereas a SQLContext (for using Spark SQL) can only be defined on the driver.

On Mon, Dec 29, 2014 at 6:45 PM, sranga sra...@gmail.com wrote:

Hi

Could Spark-SQL be used from within a custom actor that acts as a receiver for a streaming application? If yes, what is the recommended way of passing the SparkContext to the actor? Thanks for your help.

- Ranga
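For context, a minimal sketch of the pattern this answer points to: keep the receiver on the executors and run the SQL work on the driver inside foreachRDD, so the HiveContext never has to be passed to the actor. The socket source, case class, and table name are placeholders, not from this thread; an actorStream-based receiver would slot into the same place as the socket stream.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSqlSketch {
  case class Word(text: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("streaming-sql-sketch"))
    val ssc = new StreamingContext(sc, Seconds(10))

    // The HiveContext lives on the driver, created once outside the stream logic.
    val hiveContext = new HiveContext(sc)
    import hiveContext.createSchemaRDD

    // Placeholder source; a custom receiver would be used the same way.
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.flatMap(_.split("\\s+")).foreachRDD { rdd =>
      // foreachRDD runs this closure on the driver, so the HiveContext is usable here.
      val words = rdd.map(Word(_))
      words.registerTempTable("words") // via the createSchemaRDD implicit
      hiveContext.sql("SELECT text, COUNT(*) AS c FROM words GROUP BY text")
        .collect()
        .foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}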
Re: RDDs being cleaned too fast
I was having similar issues with my persistent RDDs. After some digging around, I noticed that the partitions were not balanced evenly across the available nodes. After a repartition, the RDD was spread evenly across all available memory. Not sure if that is something that would help your use-case though. You could also increase the spark.storage.memoryFraction if that is an option.

- Ranga

On Wed, Dec 10, 2014 at 10:23 PM, Aaron Davidson ilike...@gmail.com wrote:

The ContextCleaner uncaches RDDs that have gone out of scope on the driver. So it's possible that the given RDD is no longer reachable in your program's control flow, or else it'd be a bug in the ContextCleaner.

On Wed, Dec 10, 2014 at 5:34 PM, ankits ankitso...@gmail.com wrote:

I'm using Spark 1.1.0 and am seeing persisted RDDs being cleaned up too fast. How can I inspect the size of an RDD in memory and get more information about why it was cleaned up? There should be more than enough memory available on the cluster to store them, and by default the spark.cleaner.ttl is infinite, so I want more information about why this is happening and how to prevent it. Spark just logs this when removing RDDs:

[2014-12-11 01:19:34,006] INFO spark.storage.BlockManager [] [] - Removing RDD 33
[2014-12-11 01:19:34,010] INFO pache.spark.ContextCleaner [] [akka://JobServer/user/context-supervisor/job-context1] - Cleaned RDD 33
[2014-12-11 01:19:34,012] INFO spark.storage.BlockManager [] [] - Removing RDD 33
[2014-12-11 01:19:34,016] INFO pache.spark.ContextCleaner [] [akka://JobServer/user/context-supervisor/job-context1] - Cleaned RDD 33
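For reference, a minimal sketch of the two adjustments mentioned in the reply above; the memory fraction, input path, and partition count are illustrative values, not tuning advice for this particular job.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheTuningSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cache-tuning-sketch")
      // Give the block manager a larger share of executor memory for cached
      // blocks (the Spark 1.x default was 0.6); the value here is illustrative.
      .set("spark.storage.memoryFraction", "0.7")
    val sc = new SparkContext(conf)

    val raw = sc.textFile("hdfs:///path/to/input") // placeholder path
    // Repartitioning before persisting spreads the cached partitions evenly
    // across executors instead of concentrating them on a few nodes.
    val balanced = raw.repartition(sc.defaultParallelism * 2)
    balanced.persist(StorageLevel.MEMORY_ONLY)
    println(balanced.count()) // materializes the cache

    sc.stop()
  }
}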
Re: S3 Bucket Access
Thanks for the pointers. I verified that the access key-id/secret used are valid. However, the secret may contain "/" at times. The issues I am facing are as follows:

- The EC2 instances are set up with an IAMRole and don't have a static key-id/secret
- All of the EC2 instances have access to S3 based on this role (I used s3ls and s3cp commands to verify this)
- I can get a temporary access key-id/secret based on the IAMRole, but they generally expire in an hour
- If Spark is not able to use the IAMRole credentials, I may have to generate a static key-id/secret. This may or may not be possible in the environment I am in (from a policy perspective)

- Ranga

On Tue, Oct 14, 2014 at 4:21 AM, Rafal Kwasny m...@entropy.be wrote:

Hi, keep in mind that you're going to have a bad time if your secret key contains a "/". This is due to an old and stupid Hadoop bug: https://issues.apache.org/jira/browse/HADOOP-3733. The best way is to regenerate the key so it does not include a "/".

/Raf

Akhil Das wrote:

Try the following:

1. Set the access key and secret key in the sparkContext:
sparkContext.set("AWS_ACCESS_KEY_ID", "yourAccessKey")
sparkContext.set("AWS_SECRET_ACCESS_KEY", "yourSecretKey")

2. Set the access key and secret key in the environment before starting your application:
export AWS_ACCESS_KEY_ID=<your access>
export AWS_SECRET_ACCESS_KEY=<your secret>

3. Set the access key and secret key inside the hadoop configurations:
val hadoopConf = sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", "yourAccessKey")
hadoopConf.set("fs.s3.awsSecretAccessKey", "yourSecretKey")

4. You can also try:
val lines = sparkContext.textFile("s3n://yourAccessKey:yourSecretKey@yourBucket/path/")

Thanks
Best Regards

On Mon, Oct 13, 2014 at 11:33 PM, Ranga sra...@gmail.com wrote:

Hi

I am trying to access files/buckets in S3 and encountering a permissions issue. The buckets are configured to authenticate using an IAMRole provider. I have set the KeyId and Secret using environment variables (AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still unable to access the S3 buckets.

Before setting the access key and secret the error was: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

After setting the access key and secret, the error is: The AWS Access Key Id you provided does not exist in our records.

The id/secret being set are the right values. This makes me believe that something else (token, etc.) needs to be set as well. Any help is appreciated.

- Ranga
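For reference, a runnable sketch of the hadoopConfiguration route (option 3 above), written with the fs.s3n.* property names from the error message rather than fs.s3.*; the key values and bucket path are placeholders. Note that this path has nowhere to put an STS session token, which is consistent with the conclusion elsewhere in this thread that temporary IAM-role credentials alone are not enough for the s3n connector.

import org.apache.spark.{SparkConf, SparkContext}

object S3ReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("s3-read-sketch"))

    // Static credentials; the values and the bucket/path are placeholders.
    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
    hadoopConf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

    // The s3n connector has no property for an STS session token, so the
    // temporary credentials issued to an IAM role cannot be passed this way.
    val lines = sc.textFile("s3n://your-bucket/path/")
    println(lines.count())

    sc.stop()
  }
}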
Re: S3 Bucket Access
Thanks for the input. Yes, I did use the temporary access credentials provided by the IAM role (also detailed in the link you provided). The session token needs to be specified and I was looking for a way to set that in the header (which doesn't seem possible). Looks like a static key/secret is the only option. On Tue, Oct 14, 2014 at 10:32 AM, Gen gen.tan...@gmail.com wrote: Hi, If I remember well, spark cannot use the IAMrole credentials to access to s3. It use at first the id/key in the environment. If it is null in the environment, it use the value in the file core-site.xml. So, IAMrole is not useful for spark. The same problem happens if you want to use distcp command in hadoop. Do you use curl http://169.254.169.254/latest/meta-data/iam/... to get the temporary access. If yes, this code cannot use directly by spark, for more information, you can take a look http://docs.aws.amazon.com/STS/latest/UsingSTS/using-temp-creds.html http://docs.aws.amazon.com/STS/latest/UsingSTS/using-temp-creds.html sranga wrote Thanks for the pointers. I verified that the access key-id/secret used are valid. However, the secret may contain / at times. The issues I am facing are as follows: - The EC2 instances are setup with an IAMRole () and don't have a static key-id/secret - All of the EC2 instances have access to S3 based on this role (I used s3ls and s3cp commands to verify this) - I can get a temporary access key-id/secret based on the IAMRole but they generally expire in an hour - If Spark is not able to use the IAMRole credentials, I may have to generate a static key-id/secret. This may or may not be possible in the environment I am in (from a policy perspective) - Ranga On Tue, Oct 14, 2014 at 4:21 AM, Rafal Kwasny lt; mag@ gt; wrote: Hi, keep in mind that you're going to have a bad time if your secret key contains a / This is due to old and stupid hadoop bug: https://issues.apache.org/jira/browse/HADOOP-3733 Best way is to regenerate the key so it does not include a / /Raf Akhil Das wrote: Try the following: 1. Set the access key and secret key in the sparkContext: sparkContext.set( AWS_ACCESS_KEY_ID,yourAccessKey) sparkContext.set( AWS_SECRET_ACCESS_KEY,yourSecretKey) 2. Set the access key and secret key in the environment before starting your application: export AWS_ACCESS_KEY_ID= your access export AWS_SECRET_ACCESS_KEY= your secret 3. Set the access key and secret key inside the hadoop configurations val hadoopConf=sparkContext.hadoopConfiguration; hadoopConf.set(fs.s3.impl, org.apache.hadoop.fs.s3native.NativeS3FileSystem) hadoopConf.set(fs.s3.awsAccessKeyId,yourAccessKey) hadoopConf.set(fs.s3.awsSecretAccessKey,yourSecretKey) 4. You can also try: val lines = s parkContext.textFile(s3n://yourAccessKey:yourSecretKey@ yourBucket /path/) Thanks Best Regards On Mon, Oct 13, 2014 at 11:33 PM, Ranga lt; sranga@ gt; wrote: Hi I am trying to access files/buckets in S3 and encountering a permissions issue. The buckets are configured to authenticate using an IAMRole provider. I have set the KeyId and Secret using environment variables ( AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still unable to access the S3 buckets. Before setting the access key and secret the error was: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively). 
After setting the access key and secret, the error is: The AWS Access Key Id you provided does not exist in our records. The id/secret being set are the right values. This makes me believe that something else (token, etc.) needs to be set as well. Any help is appreciated. - Ranga -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/S3-Bucket-Access-tp16303p16397.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: S3 Bucket Access
One related question. Could I specify the com.amazonaws.services.s3.AmazonS3Client implementation for the fs.s3.impl parameter? Let me try that and update this thread with my findings. On Tue, Oct 14, 2014 at 10:48 AM, Ranga sra...@gmail.com wrote: Thanks for the input. Yes, I did use the temporary access credentials provided by the IAM role (also detailed in the link you provided). The session token needs to be specified and I was looking for a way to set that in the header (which doesn't seem possible). Looks like a static key/secret is the only option. On Tue, Oct 14, 2014 at 10:32 AM, Gen gen.tan...@gmail.com wrote: Hi, If I remember well, spark cannot use the IAMrole credentials to access to s3. It use at first the id/key in the environment. If it is null in the environment, it use the value in the file core-site.xml. So, IAMrole is not useful for spark. The same problem happens if you want to use distcp command in hadoop. Do you use curl http://169.254.169.254/latest/meta-data/iam/... to get the temporary access. If yes, this code cannot use directly by spark, for more information, you can take a look http://docs.aws.amazon.com/STS/latest/UsingSTS/using-temp-creds.html http://docs.aws.amazon.com/STS/latest/UsingSTS/using-temp-creds.html sranga wrote Thanks for the pointers. I verified that the access key-id/secret used are valid. However, the secret may contain / at times. The issues I am facing are as follows: - The EC2 instances are setup with an IAMRole () and don't have a static key-id/secret - All of the EC2 instances have access to S3 based on this role (I used s3ls and s3cp commands to verify this) - I can get a temporary access key-id/secret based on the IAMRole but they generally expire in an hour - If Spark is not able to use the IAMRole credentials, I may have to generate a static key-id/secret. This may or may not be possible in the environment I am in (from a policy perspective) - Ranga On Tue, Oct 14, 2014 at 4:21 AM, Rafal Kwasny lt; mag@ gt; wrote: Hi, keep in mind that you're going to have a bad time if your secret key contains a / This is due to old and stupid hadoop bug: https://issues.apache.org/jira/browse/HADOOP-3733 Best way is to regenerate the key so it does not include a / /Raf Akhil Das wrote: Try the following: 1. Set the access key and secret key in the sparkContext: sparkContext.set( AWS_ACCESS_KEY_ID,yourAccessKey) sparkContext.set( AWS_SECRET_ACCESS_KEY,yourSecretKey) 2. Set the access key and secret key in the environment before starting your application: export AWS_ACCESS_KEY_ID= your access export AWS_SECRET_ACCESS_KEY= your secret 3. Set the access key and secret key inside the hadoop configurations val hadoopConf=sparkContext.hadoopConfiguration; hadoopConf.set(fs.s3.impl, org.apache.hadoop.fs.s3native.NativeS3FileSystem) hadoopConf.set(fs.s3.awsAccessKeyId,yourAccessKey) hadoopConf.set(fs.s3.awsSecretAccessKey,yourSecretKey) 4. You can also try: val lines = s parkContext.textFile(s3n://yourAccessKey:yourSecretKey@ yourBucket /path/) Thanks Best Regards On Mon, Oct 13, 2014 at 11:33 PM, Ranga lt; sranga@ gt; wrote: Hi I am trying to access files/buckets in S3 and encountering a permissions issue. The buckets are configured to authenticate using an IAMRole provider. I have set the KeyId and Secret using environment variables ( AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still unable to access the S3 buckets. 
Before setting the access key and secret the error was: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively). After setting the access key and secret, the error is: The AWS Access Key Id you provided does not exist in our records. The id/secret being set are the right values. This makes me believe that something else (token, etc.) needs to be set as well. Any help is appreciated. - Ranga -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/S3-Bucket-Access-tp16303p16397.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: S3 Bucket Access
Thanks Rishi. That is exactly what I am trying to do now :) On Tue, Oct 14, 2014 at 2:41 PM, Rishi Pidva rpi...@pivotal.io wrote: As per EMR documentation: http://docs.amazonaws.cn/en_us/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html Access AWS Resources Using IAM Roles If you've launched your cluster with an IAM role, applications running on the EC2 instances of that cluster can use the IAM role to obtain temporary account credentials to use when calling services in AWS. The version of Hadoop available on AMI 2.3.0 and later has already been updated to make use of IAM roles. If your application runs strictly on top of the Hadoop architecture, and does not directly call any service in AWS, it should work with IAM roles with no modification. If your application calls services in AWS directly, you'll need to update it to take advantage of IAM roles. This means that instead of obtaining account credentials from/home/hadoop/conf/core-site.xml on the EC2 instances in the cluster, your application will now either use an SDK to access the resources using IAM roles, or call the EC2 instance metadata to obtain the temporary credentials. -- Maybe you can use AWS SDK in your application to provide AWS credentials? https://github.com/seratch/AWScala On Oct 14, 2014, at 11:10 AM, Ranga sra...@gmail.com wrote: One related question. Could I specify the com.amazonaws.services.s3.AmazonS3Client implementation for the fs.s3.impl parameter? Let me try that and update this thread with my findings. On Tue, Oct 14, 2014 at 10:48 AM, Ranga sra...@gmail.com wrote: Thanks for the input. Yes, I did use the temporary access credentials provided by the IAM role (also detailed in the link you provided). The session token needs to be specified and I was looking for a way to set that in the header (which doesn't seem possible). Looks like a static key/secret is the only option. On Tue, Oct 14, 2014 at 10:32 AM, Gen gen.tan...@gmail.com wrote: Hi, If I remember well, spark cannot use the IAMrole credentials to access to s3. It use at first the id/key in the environment. If it is null in the environment, it use the value in the file core-site.xml. So, IAMrole is not useful for spark. The same problem happens if you want to use distcp command in hadoop. Do you use curl http://169.254.169.254/latest/meta-data/iam/. http://169.254.169.254/latest/meta-data/iam/.. to get the temporary access. If yes, this code cannot use directly by spark, for more information, you can take a look http://docs.aws.amazon.com/STS/latest/UsingSTS/using-temp-creds.html http://docs.aws.amazon.com/STS/latest/UsingSTS/using-temp-creds.html sranga wrote Thanks for the pointers. I verified that the access key-id/secret used are valid. However, the secret may contain / at times. The issues I am facing are as follows: - The EC2 instances are setup with an IAMRole () and don't have a static key-id/secret - All of the EC2 instances have access to S3 based on this role (I used s3ls and s3cp commands to verify this) - I can get a temporary access key-id/secret based on the IAMRole but they generally expire in an hour - If Spark is not able to use the IAMRole credentials, I may have to generate a static key-id/secret. 
This may or may not be possible in the environment I am in (from a policy perspective) - Ranga On Tue, Oct 14, 2014 at 4:21 AM, Rafal Kwasny lt; mag@ gt; wrote: Hi, keep in mind that you're going to have a bad time if your secret key contains a / This is due to old and stupid hadoop bug: https://issues.apache.org/jira/browse/HADOOP-3733 Best way is to regenerate the key so it does not include a / /Raf Akhil Das wrote: Try the following: 1. Set the access key and secret key in the sparkContext: sparkContext.set( AWS_ACCESS_KEY_ID,yourAccessKey) sparkContext.set( AWS_SECRET_ACCESS_KEY,yourSecretKey) 2. Set the access key and secret key in the environment before starting your application: export AWS_ACCESS_KEY_ID= your access export AWS_SECRET_ACCESS_KEY= your secret 3. Set the access key and secret key inside the hadoop configurations val hadoopConf=sparkContext.hadoopConfiguration; hadoopConf.set(fs.s3.impl, org.apache.hadoop.fs.s3native.NativeS3FileSystem) hadoopConf.set(fs.s3.awsAccessKeyId,yourAccessKey) hadoopConf.set(fs.s3.awsSecretAccessKey,yourSecretKey) 4. You can also try: val lines = s parkContext.textFile(s3n://yourAccessKey:yourSecretKey@ yourBucket /path/) Thanks Best Regards On Mon, Oct 13, 2014 at 11:33 PM, Ranga lt; sranga@ gt; wrote: Hi I am trying to access files/buckets in S3 and encountering a permissions issue. The buckets are configured to authenticate using an IAMRole provider. I have set the KeyId
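For reference, a minimal sketch of the AWS-SDK route suggested above: use the IAM-role temporary credentials (access key, secret, and the session token that the s3n connector cannot carry) directly with the SDK instead of going through the Hadoop filesystem layer. The environment-variable names, bucket, and object key are placeholders.

import scala.io.Source

import com.amazonaws.auth.BasicSessionCredentials
import com.amazonaws.services.s3.AmazonS3Client

object SdkS3Sketch {
  def main(args: Array[String]): Unit = {
    // Temporary IAM-role credentials: access key, secret key, and session token.
    val credentials = new BasicSessionCredentials(
      sys.env("AWS_ACCESS_KEY_ID"),
      sys.env("AWS_SECRET_ACCESS_KEY"),
      sys.env("AWS_SESSION_TOKEN"))

    val s3 = new AmazonS3Client(credentials)
    val obj = s3.getObject("your-bucket", "path/to/object.csv")
    val body = Source.fromInputStream(obj.getObjectContent).mkString
    println(body.take(200))
  }
}

On an EC2 instance launched with an IAM role, passing an InstanceProfileCredentialsProvider to the client instead of building the credentials by hand should avoid handling the temporary keys at all.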
S3 Bucket Access
Hi

I am trying to access files/buckets in S3 and encountering a permissions issue. The buckets are configured to authenticate using an IAMRole provider. I have set the KeyId and Secret using environment variables (AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still unable to access the S3 buckets.

Before setting the access key and secret the error was:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

After setting the access key and secret, the error is:
The AWS Access Key Id you provided does not exist in our records.

The id/secret being set are the right values. This makes me believe that something else (token, etc.) needs to be set as well. Any help is appreciated.

- Ranga
Re: S3 Bucket Access
Is there a way to specify a request header during the sparkContext.textFile call? - Ranga On Mon, Oct 13, 2014 at 11:03 AM, Ranga sra...@gmail.com wrote: Hi I am trying to access files/buckets in S3 and encountering a permissions issue. The buckets are configured to authenticate using an IAMRole provider. I have set the KeyId and Secret using environment variables ( AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still unable to access the S3 buckets. Before setting the access key and secret the error was: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively). After setting the access key and secret, the error is: The AWS Access Key Id you provided does not exist in our records. The id/secret being set are the right values. This makes me believe that something else (token, etc.) needs to be set as well. Any help is appreciated. - Ranga
Re: S3 Bucket Access
Hi Daniil Could you provide some more details on how the cluster should be launched/configured? The EC2 instance that I am dealing with uses the concept of IAMRoles. I don't have any keyfile to specify to the spark-ec2 script. Thanks for your help. - Ranga On Mon, Oct 13, 2014 at 3:04 PM, Daniil Osipov daniil.osi...@shazam.com wrote: (Copying the user list) You should use spark_ec2 script to configure the cluster. If you use trunk version you can use the new --copy-aws-credentials option to configure the S3 parameters automatically, otherwise either include them in your SparkConf variable or add them to /root/spark/ephemeral-hdfs/conf/core-site.xml On Mon, Oct 13, 2014 at 2:56 PM, Ranga sra...@gmail.com wrote: The cluster is deployed on EC2 and I am trying to access the S3 files from within a spark-shell session. On Mon, Oct 13, 2014 at 2:51 PM, Daniil Osipov daniil.osi...@shazam.com wrote: So is your cluster running on EC2, or locally? If you're running locally, you should still be able to access S3 files, you just need to locate the core-site.xml and add the parameters as defined in the error. On Mon, Oct 13, 2014 at 2:49 PM, Ranga sra...@gmail.com wrote: Hi Daniil No. I didn't create the spark-cluster using the ec2 scripts. Is that something that I need to do? I just downloaded Spark-1.1.0 and Hadoop-2.4. However, I am trying to access files on S3 from this cluster. - Ranga On Mon, Oct 13, 2014 at 2:36 PM, Daniil Osipov daniil.osi...@shazam.com wrote: Did you add the fs.s3n.aws* configuration parameters in /root/spark/ephemeral-hdfs/conf/core-ste.xml? On Mon, Oct 13, 2014 at 11:03 AM, Ranga sra...@gmail.com wrote: Hi I am trying to access files/buckets in S3 and encountering a permissions issue. The buckets are configured to authenticate using an IAMRole provider. I have set the KeyId and Secret using environment variables ( AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still unable to access the S3 buckets. Before setting the access key and secret the error was: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively). After setting the access key and secret, the error is: The AWS Access Key Id you provided does not exist in our records. The id/secret being set are the right values. This makes me believe that something else (token, etc.) needs to be set as well. Any help is appreciated. - Ranga
Re: Spark-SQL: SchemaRDD - ClassCastException
Resolution: After realizing that the SerDe (OpenCSV) was causing all the fields to be defined as String type, I modified the Hive load statement to use the default serializer. I was able to modify the CSV input file to use a different delimiter. Although this is a workaround, I am able to proceed with this for now.

- Ranga

On Wed, Oct 8, 2014 at 9:18 PM, Ranga sra...@gmail.com wrote:

This is a bit strange. When I print the schema for the RDD, it reflects the correct data type for each column. But doing any kind of mathematical calculation seems to result in a ClassCastException. Here is a sample that results in the exception:

select c1, c2 ... cast (c18 as int) * cast (c21 as int) ... from table

Any other pointers? Thanks for the help.

- Ranga

On Wed, Oct 8, 2014 at 5:20 PM, Ranga sra...@gmail.com wrote:

Sorry. It's 1.1.0. After digging a bit more into this, it seems like the OpenCSV Deserializer converts all the columns to a String type. This may be throwing the execution off. Planning to create a class and map the rows to this custom class. Will keep this thread updated.

On Wed, Oct 8, 2014 at 5:11 PM, Michael Armbrust mich...@databricks.com wrote:

Which version of Spark are you running?

On Wed, Oct 8, 2014 at 4:18 PM, Ranga sra...@gmail.com wrote:

Thanks Michael. Should the cast be done in the source RDD or while doing the SUM? To give a better picture, here is the code sequence:

val sourceRdd = sql("select ... from source-hive-table")
sourceRdd.registerAsTable("sourceRDD")
val aggRdd = sql("select c1, c2, sum(c3) from sourceRDD group by c1, c2") // This query throws the exception when I collect the results

I tried adding the cast to the aggRdd query above and that didn't help.

- Ranga

On Wed, Oct 8, 2014 at 3:52 PM, Michael Armbrust mich...@databricks.com wrote:

Using SUM on a string should automatically cast the column. Also you can use CAST to change the datatype: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-TypeConversionFunctions

What version of Spark are you running? This could be https://issues.apache.org/jira/browse/SPARK-1994

On Wed, Oct 8, 2014 at 3:47 PM, Ranga sra...@gmail.com wrote:

Hi

I am in the process of migrating some logic in pig scripts to Spark-SQL. As part of this process, I am creating a few Select...Group By queries and registering them as tables using the SchemaRDD.registerAsTable feature. When using such a registered table in a subsequent Select...Group By query, I get a ClassCastException:

java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer

This happens when I use the Sum function on one of the columns. Is there any way to specify the data type for the columns when the registerAsTable function is called? Are there other approaches that I should be looking at? Thanks for your help.

- Ranga
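For reference, a minimal sketch of the workaround described in the resolution: declare the table with Hive's default delimited SerDe over the re-delimited input, so numeric columns keep their declared types, and then aggregate. The table name, column names, '|' delimiter, and input path are placeholders, not taken from the original job.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object DelimitedLoadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("delimited-load-sketch"))
    val hiveContext = new HiveContext(sc)
    import hiveContext.sql

    // Default delimited SerDe keeps c3 as INT, unlike the OpenCSV SerDe,
    // which exposes every column as a string.
    sql("""CREATE TABLE IF NOT EXISTS source_table (c1 STRING, c2 STRING, c3 INT)
           ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
           STORED AS TEXTFILE""")
    sql("LOAD DATA INPATH '/data/source.psv' INTO TABLE source_table") // placeholder path

    // The aggregation that previously failed now runs against typed columns.
    val aggRdd = sql("SELECT c1, c2, SUM(c3) AS total FROM source_table GROUP BY c1, c2")
    aggRdd.registerTempTable("agg_rdd") // reusable in later queries, as in the thread
    aggRdd.collect().foreach(println)

    sc.stop()
  }
}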
Spark-SQL: SchemaRDD - ClassCastException
Hi

I am in the process of migrating some logic in pig scripts to Spark-SQL. As part of this process, I am creating a few Select...Group By queries and registering them as tables using the SchemaRDD.registerAsTable feature. When using such a registered table in a subsequent Select...Group By query, I get a ClassCastException:

java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer

This happens when I use the Sum function on one of the columns. Is there any way to specify the data type for the columns when the registerAsTable function is called? Are there other approaches that I should be looking at? Thanks for your help.

- Ranga
Re: Spark-SQL: SchemaRDD - ClassCastException
Thanks Michael. Should the cast be done in the source RDD or while doing the SUM? To give a better picture, here is the code sequence:

val sourceRdd = sql("select ... from source-hive-table")
sourceRdd.registerAsTable("sourceRDD")
val aggRdd = sql("select c1, c2, sum(c3) from sourceRDD group by c1, c2") // This query throws the exception when I collect the results

I tried adding the cast to the aggRdd query above and that didn't help.

- Ranga

On Wed, Oct 8, 2014 at 3:52 PM, Michael Armbrust mich...@databricks.com wrote:

Using SUM on a string should automatically cast the column. Also you can use CAST to change the datatype: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-TypeConversionFunctions

What version of Spark are you running? This could be https://issues.apache.org/jira/browse/SPARK-1994

On Wed, Oct 8, 2014 at 3:47 PM, Ranga sra...@gmail.com wrote:

Hi

I am in the process of migrating some logic in pig scripts to Spark-SQL. As part of this process, I am creating a few Select...Group By queries and registering them as tables using the SchemaRDD.registerAsTable feature. When using such a registered table in a subsequent Select...Group By query, I get a ClassCastException:

java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer

This happens when I use the Sum function on one of the columns. Is there any way to specify the data type for the columns when the registerAsTable function is called? Are there other approaches that I should be looking at? Thanks for your help.

- Ranga
Re: Spark-SQL: SchemaRDD - ClassCastException
Sorry. It's 1.1.0. After digging a bit more into this, it seems like the OpenCSV Deserializer converts all the columns to a String type. This may be throwing the execution off. Planning to create a class and map the rows to this custom class. Will keep this thread updated.

On Wed, Oct 8, 2014 at 5:11 PM, Michael Armbrust mich...@databricks.com wrote:

Which version of Spark are you running?

On Wed, Oct 8, 2014 at 4:18 PM, Ranga sra...@gmail.com wrote:

Thanks Michael. Should the cast be done in the source RDD or while doing the SUM? To give a better picture, here is the code sequence:

val sourceRdd = sql("select ... from source-hive-table")
sourceRdd.registerAsTable("sourceRDD")
val aggRdd = sql("select c1, c2, sum(c3) from sourceRDD group by c1, c2") // This query throws the exception when I collect the results

I tried adding the cast to the aggRdd query above and that didn't help.

- Ranga

On Wed, Oct 8, 2014 at 3:52 PM, Michael Armbrust mich...@databricks.com wrote:

Using SUM on a string should automatically cast the column. Also you can use CAST to change the datatype: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-TypeConversionFunctions

What version of Spark are you running? This could be https://issues.apache.org/jira/browse/SPARK-1994

On Wed, Oct 8, 2014 at 3:47 PM, Ranga sra...@gmail.com wrote:

Hi

I am in the process of migrating some logic in pig scripts to Spark-SQL. As part of this process, I am creating a few Select...Group By queries and registering them as tables using the SchemaRDD.registerAsTable feature. When using such a registered table in a subsequent Select...Group By query, I get a ClassCastException:

java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer

This happens when I use the Sum function on one of the columns. Is there any way to specify the data type for the columns when the registerAsTable function is called? Are there other approaches that I should be looking at? Thanks for your help.

- Ranga
Re: Spark-SQL: SchemaRDD - ClassCastException
This is a bit strange. When I print the schema for the RDD, it reflects the correct data type for each column. But doing any kind of mathematical calculation seems to result in ClassCastException. Here is a sample that results in the exception: select c1, c2 ... cast (c18 as int) * cast (c21 as int) ... from table Any other pointers? Thanks for the help. - Ranga On Wed, Oct 8, 2014 at 5:20 PM, Ranga sra...@gmail.com wrote: Sorry. Its 1.1.0. After digging a bit more into this, it seems like the OpenCSV Deseralizer converts all the columns to a String type. This maybe throwing the execution off. Planning to create a class and map the rows to this custom class. Will keep this thread updated. On Wed, Oct 8, 2014 at 5:11 PM, Michael Armbrust mich...@databricks.com wrote: Which version of Spark are you running? On Wed, Oct 8, 2014 at 4:18 PM, Ranga sra...@gmail.com wrote: Thanks Michael. Should the cast be done in the source RDD or while doing the SUM? To give a better picture here is the code sequence: val sourceRdd = sql(select ... from source-hive-table) sourceRdd.registerAsTable(sourceRDD) val aggRdd = sql(select c1, c2, sum(c3) from sourceRDD group by c1, c2) // This query throws the exception when I collect the results I tried adding the cast to the aggRdd query above and that didn't help. - Ranga On Wed, Oct 8, 2014 at 3:52 PM, Michael Armbrust mich...@databricks.com wrote: Using SUM on a string should automatically cast the column. Also you can use CAST to change the datatype https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-TypeConversionFunctions . What version of Spark are you running? This could be https://issues.apache.org/jira/browse/SPARK-1994 On Wed, Oct 8, 2014 at 3:47 PM, Ranga sra...@gmail.com wrote: Hi I am in the process of migrating some logic in pig scripts to Spark-SQL. As part of this process, I am creating a few Select...Group By query and registering them as tables using the SchemaRDD.registerAsTable feature. When using such a registered table in a subsequent Select...Group By query, I get a ClassCastException. java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer This happens when I use the Sum function on one of the columns. Is there anyway to specify the data type for the columns when the registerAsTable function is called? Are there other approaches that I should be looking at? Thanks for your help. - Ranga
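For reference, a minimal sketch of the other idea mentioned in this thread — mapping the all-string rows to a custom class before registering the table — written against the Spark 1.1 SchemaRDD API. The column names and positions are illustrative, not those of the original table.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object TypedMappingSketch {
  // Target shape for the all-string rows coming out of the SerDe-backed table.
  case class Typed(c1: String, c2: String, c18: Int, c21: Int)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("typed-mapping-sketch"))
    val hiveContext = new HiveContext(sc)
    import hiveContext.createSchemaRDD
    import hiveContext.sql

    val raw = sql("SELECT c1, c2, c18, c21 FROM source_hive_table") // placeholder table
    // Convert the numeric fields explicitly, so the registered schema carries
    // integer columns instead of strings.
    val typed = raw.map(r =>
      Typed(r.getString(0), r.getString(1), r.getString(2).trim.toInt, r.getString(3).trim.toInt))
    typed.registerTempTable("typed_source") // via the createSchemaRDD implicit

    sql("SELECT c1, c2, SUM(c18 * c21) AS product_sum FROM typed_source GROUP BY c1, c2")
      .collect()
      .foreach(println)

    sc.stop()
  }
}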