PySpark tests fail with java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.sources.FakeSourceOne not found

2023-04-12 Thread Ranga Reddy
Hi Team,

I am running the PySpark tests (Spark and Python versions below) and they fail
with *Provider org.apache.spark.sql.sources.FakeSourceOne not found*.

Spark Version: 3.4.0/3.5.0
Python Version: 3.8.10
OS: Ubuntu 20.04


*Steps:*

# /opt/data/spark/build/sbt -Phive clean package
# /opt/data/spark/build/sbt test:compile
# pip3 install -r /opt/data/spark/dev/requirements.txt
# /opt/data/spark/python/run-tests --python-executables=python3
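
While digging into a single failure, the run can also be narrowed to one module, for example (the module name is an assumption; check the run-tests help output for the exact list):

/opt/data/spark/python/run-tests --python-executables=python3 --modules=pyspark-ml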

*Exception:*

======================================================================
ERROR [15.081s]: test_read_images (pyspark.ml.tests.test_image.ImageFileFormatTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/data/spark/python/pyspark/ml/tests/test_image.py", line 29, in test_read_images
    self.spark.read.format("image")
  File "/opt/data/spark/python/pyspark/sql/readwriter.py", line 300, in load
    return self._df(self._jreader.load(path))
  File "/opt/data/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
  File "/opt/data/spark/python/pyspark/errors/exceptions/captured.py", line 176, in deco
    return f(*a, **kw)
  File "/opt/data/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.sources.FakeSourceOne not found
    at java.util.ServiceLoader.fail(ServiceLoader.java:239)
    at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
    at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:372)
    at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
    at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
    at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
    at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
    at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
    at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
    at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:629)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:697)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:208)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:750)


Could someone help me with how to proceed further?
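
For what it's worth, the failure pattern suggests the service registration for FakeSourceOne is visible on the classpath while the compiled class is not. Assuming FakeSourceOne is a test-only class registered via a META-INF/services file under the sql module's test resources, a quick check would be (paths are guesses based on the layout above):

grep -rl FakeSourceOne /opt/data/spark/sql/core/src/test 2>/dev/null
find /opt/data/spark/sql/core/target -name 'FakeSourceOne*.class' 2>/dev/null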


-- 
Thanks and Regards


*Ranga Reddy*
*--*

*Bangalore, Karnataka, India*
*Mobile: +91-9986183183 | Email: rangareddy.av...@gmail.com*


Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Ranga
Thanks for the information. Will rebuild with 0.6.0 till the patch is
merged.

On Tue, Mar 17, 2015 at 7:24 PM, Ted Yu yuzhih...@gmail.com wrote:

 Ranga:
 Take a look at https://github.com/apache/spark/pull/4867

 Cheers

 On Tue, Mar 17, 2015 at 6:08 PM, fightf...@163.com fightf...@163.com
 wrote:

 Hi, Ranga

 That's true. This is typically a version mismatch issue. Note that Spark 1.2.1
 has Tachyon 0.5.0 built in; I think you may need to rebuild Spark
 with your current Tachyon release.
 We have used Tachyon for several of our Spark projects in a production
 environment.

 Thanks,
 Sun.

 --
 fightf...@163.com


 *From:* Ranga sra...@gmail.com
 *Date:* 2015-03-18 06:45
 *To:* user@spark.apache.org
 *Subject:* StorageLevel: OFF_HEAP
 Hi

 I am trying to use the OFF_HEAP storage level in my Spark (1.2.1)
 cluster. The Tachyon (0.6.0-SNAPSHOT) nodes seem to be up and running.
 However, when I try to persist the RDD, I get the following error:

 ERROR [2015-03-16 22:22:54,017] ({Executor task launch worker-0}
 TachyonFS.java[connect]:364)  - Invalid method name:
 'getUserUnderfsTempFolder'
 ERROR [2015-03-16 22:22:54,050] ({Executor task launch worker-0}
 TachyonFS.java[getFileId]:1020)  - Invalid method name: 'user_getFileId'

 Is this because of a version mismatch?

 On a different note, I was wondering if Tachyon has been used in a
 production environment by anybody in this group?

 Appreciate your help with this.


 - Ranga





Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Ranga
I just tested with Spark-1.3.0 + Tachyon-0.6.0 and still see the same
issue. Here are the logs:
15/03/18 11:44:11 ERROR : Invalid method name: 'getDataFolder'
15/03/18 11:44:11 ERROR : Invalid method name: 'user_getFileId'
15/03/18 11:44:11 ERROR storage.TachyonBlockManager: Failed 10 attempts to
create tachyon dir in
/tmp_spark_tachyon/spark-e3538a20-5e42-48a4-ad67-4b97aded90e4/driver

Thanks for any other pointers.


- Ranga



On Wed, Mar 18, 2015 at 9:53 AM, Ranga sra...@gmail.com wrote:

 Thanks for the information. Will rebuild with 0.6.0 till the patch is
 merged.

 On Tue, Mar 17, 2015 at 7:24 PM, Ted Yu yuzhih...@gmail.com wrote:

 Ranga:
 Take a look at https://github.com/apache/spark/pull/4867

 Cheers

 On Tue, Mar 17, 2015 at 6:08 PM, fightf...@163.com fightf...@163.com
 wrote:

 Hi, Ranga

 That's true. This is typically a version mismatch issue. Note that Spark 1.2.1
 has Tachyon 0.5.0 built in; I think you may need to rebuild Spark
 with your current Tachyon release.
 We have used Tachyon for several of our Spark projects in a production
 environment.

 Thanks,
 Sun.

 --
 fightf...@163.com


 *From:* Ranga sra...@gmail.com
 *Date:* 2015-03-18 06:45
 *To:* user@spark.apache.org
 *Subject:* StorageLevel: OFF_HEAP
 Hi

 I am trying to use the OFF_HEAP storage level in my Spark (1.2.1)
 cluster. The Tachyon (0.6.0-SNAPSHOT) nodes seem to be up and running.
 However, when I try to persist the RDD, I get the following error:

 ERROR [2015-03-16 22:22:54,017] ({Executor task launch worker-0}
 TachyonFS.java[connect]:364)  - Invalid method name:
 'getUserUnderfsTempFolder'
 ERROR [2015-03-16 22:22:54,050] ({Executor task launch worker-0}
 TachyonFS.java[getFileId]:1020)  - Invalid method name: 'user_getFileId'

 Is this because of a version mismatch?

 On a different note, I was wondering if Tachyon has been used in a
 production environment by anybody in this group?

 Appreciate your help with this.


 - Ranga






Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Ranga
Thanks Ted. Will do.
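
For reference, applying that PR and rebuilding might look roughly like the following; the branch name and build flags are illustrative rather than taken from this thread, and origin is assumed to point at the apache/spark GitHub repository:

cd spark
git fetch origin pull/4867/head:tachyon-0.6.1
git checkout tachyon-0.6.1
./make-distribution.sh --tgz -Phadoop-2.4 -Dhadoop.version=2.4.0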

On Wed, Mar 18, 2015 at 2:27 PM, Ted Yu yuzhih...@gmail.com wrote:

 Ranga:
 Please apply the patch from:
 https://github.com/apache/spark/pull/4867

 And rebuild Spark - the build would use Tachyon-0.6.1

 Cheers

 On Wed, Mar 18, 2015 at 2:23 PM, Ranga sra...@gmail.com wrote:

 Hi Haoyuan

 No. I assumed that Spark-1.3.0 was already built with Tachyon-0.6.0. If
 not, I can rebuild and try. Could you let me know how to rebuild with 0.6.0?
 Thanks for your help.


 - Ranga

 On Wed, Mar 18, 2015 at 12:59 PM, Haoyuan Li haoyuan...@gmail.com
 wrote:

 Did you recompile it with Tachyon 0.6.0?

 Also, Tachyon 0.6.1 has been released this morning:
 http://tachyon-project.org/ ; https://github.com/amplab/tachyon/releases

 Best regards,

 Haoyuan

 On Wed, Mar 18, 2015 at 11:48 AM, Ranga sra...@gmail.com wrote:

 I just tested with Spark-1.3.0 + Tachyon-0.6.0 and still see the same
 issue. Here are the logs:
 15/03/18 11:44:11 ERROR : Invalid method name: 'getDataFolder'
 15/03/18 11:44:11 ERROR : Invalid method name: 'user_getFileId'
 15/03/18 11:44:11 ERROR storage.TachyonBlockManager: Failed 10 attempts
 to create tachyon dir in
 /tmp_spark_tachyon/spark-e3538a20-5e42-48a4-ad67-4b97aded90e4/driver

 Thanks for any other pointers.


 - Ranga



 On Wed, Mar 18, 2015 at 9:53 AM, Ranga sra...@gmail.com wrote:

 Thanks for the information. Will rebuild with 0.6.0 till the patch is
 merged.

 On Tue, Mar 17, 2015 at 7:24 PM, Ted Yu yuzhih...@gmail.com wrote:

 Ranga:
 Take a look at https://github.com/apache/spark/pull/4867

 Cheers

 On Tue, Mar 17, 2015 at 6:08 PM, fightf...@163.com fightf...@163.com
  wrote:

 Hi, Ranga

 That's true. This is typically a version mismatch issue. Note that Spark 1.2.1
 has Tachyon 0.5.0 built in; I think you may need to rebuild Spark
 with your current Tachyon release.
 We have used Tachyon for several of our Spark projects in a production
 environment.

 Thanks,
 Sun.

 --
 fightf...@163.com


 *From:* Ranga sra...@gmail.com
 *Date:* 2015-03-18 06:45
 *To:* user@spark.apache.org
 *Subject:* StorageLevel: OFF_HEAP
 Hi

 I am trying to use the OFF_HEAP storage level in my Spark (1.2.1)
 cluster. The Tachyon (0.6.0-SNAPSHOT) nodes seem to be up and running.
 However, when I try to persist the RDD, I get the following error:

 ERROR [2015-03-16 22:22:54,017] ({Executor task launch worker-0}
 TachyonFS.java[connect]:364)  - Invalid method name:
 'getUserUnderfsTempFolder'
 ERROR [2015-03-16 22:22:54,050] ({Executor task launch worker-0}
 TachyonFS.java[getFileId]:1020)  - Invalid method name: 'user_getFileId'

 Is this because of a version mismatch?

 On a different note, I was wondering if Tachyon has been used in a
 production environment by anybody in this group?

 Appreciate your help with this.


 - Ranga







 --
 Haoyuan Li
 AMPLab, EECS, UC Berkeley
 http://www.cs.berkeley.edu/~haoyuan/






Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Ranga
Hi Haoyuan

No. I assumed that Spark-1.3.0 was already built with Tachyon-0.6.0. If
not, I can rebuild and try. Could you let me know how to rebuild with 0.6.0?
Thanks for your help.


- Ranga
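
In the meantime, one way to see which Tachyon client a given Spark checkout actually pulls in, assuming the Maven build and that the Tachyon artifacts sit under the org.tachyonproject group:

cd spark
mvn dependency:tree -Dincludes=org.tachyonproject | grep -i tachyon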

On Wed, Mar 18, 2015 at 12:59 PM, Haoyuan Li haoyuan...@gmail.com wrote:

 Did you recompile it with Tachyon 0.6.0?

 Also, Tachyon 0.6.1 has been released this morning:
 http://tachyon-project.org/ ; https://github.com/amplab/tachyon/releases

 Best regards,

 Haoyuan

 On Wed, Mar 18, 2015 at 11:48 AM, Ranga sra...@gmail.com wrote:

 I just tested with Spark-1.3.0 + Tachyon-0.6.0 and still see the same
 issue. Here are the logs:
 15/03/18 11:44:11 ERROR : Invalid method name: 'getDataFolder'
 15/03/18 11:44:11 ERROR : Invalid method name: 'user_getFileId'
 15/03/18 11:44:11 ERROR storage.TachyonBlockManager: Failed 10 attempts
 to create tachyon dir in
 /tmp_spark_tachyon/spark-e3538a20-5e42-48a4-ad67-4b97aded90e4/driver

 Thanks for any other pointers.


 - Ranga



 On Wed, Mar 18, 2015 at 9:53 AM, Ranga sra...@gmail.com wrote:

 Thanks for the information. Will rebuild with 0.6.0 till the patch is
 merged.

 On Tue, Mar 17, 2015 at 7:24 PM, Ted Yu yuzhih...@gmail.com wrote:

 Ranga:
 Take a look at https://github.com/apache/spark/pull/4867

 Cheers

 On Tue, Mar 17, 2015 at 6:08 PM, fightf...@163.com fightf...@163.com
 wrote:

 Hi, Ranga

 That's true. This is typically a version mismatch issue. Note that Spark 1.2.1
 has Tachyon 0.5.0 built in; I think you may need to rebuild Spark
 with your current Tachyon release.
 We have used Tachyon for several of our Spark projects in a production
 environment.

 Thanks,
 Sun.

 --
 fightf...@163.com


 *From:* Ranga sra...@gmail.com
 *Date:* 2015-03-18 06:45
 *To:* user@spark.apache.org
 *Subject:* StorageLevel: OFF_HEAP
 Hi

 I am trying to use the OFF_HEAP storage level in my Spark (1.2.1)
 cluster. The Tachyon (0.6.0-SNAPSHOT) nodes seem to be up and running.
 However, when I try to persist the RDD, I get the following error:

 ERROR [2015-03-16 22:22:54,017] ({Executor task launch worker-0}
 TachyonFS.java[connect]:364)  - Invalid method name:
 'getUserUnderfsTempFolder'
 ERROR [2015-03-16 22:22:54,050] ({Executor task launch worker-0}
 TachyonFS.java[getFileId]:1020)  - Invalid method name: 'user_getFileId'

 Is this because of a version mismatch?

 On a different note, I was wondering if Tachyon has been used in a
 production environment by anybody in this group?

 Appreciate your help with this.


 - Ranga







 --
 Haoyuan Li
 AMPLab, EECS, UC Berkeley
 http://www.cs.berkeley.edu/~haoyuan/



StorageLevel: OFF_HEAP

2015-03-17 Thread Ranga
Hi

I am trying to use the OFF_HEAP storage level in my Spark (1.2.1) cluster.
The Tachyon (0.6.0-SNAPSHOT) nodes seem to be up and running. However, when
I try to persist the RDD, I get the following error:

ERROR [2015-03-16 22:22:54,017] ({Executor task launch worker-0}
TachyonFS.java[connect]:364)  - Invalid method name:
'getUserUnderfsTempFolder'
ERROR [2015-03-16 22:22:54,050] ({Executor task launch worker-0}
TachyonFS.java[getFileId]:1020)  - Invalid method name: 'user_getFileId'

Is this because of a version mismatch?

On a different note, I was wondering if Tachyon has been used in a
production environment by anybody in this group?

Appreciate your help with this.
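
For anyone trying to reproduce this, a minimal OFF_HEAP example for the Spark 1.2/1.3 line might look like the sketch below; the Tachyon master URL and base directory are placeholders rather than values from this cluster:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// OFF_HEAP in this era stores blocks in Tachyon, configured via spark.tachyonStore.*
val conf = new SparkConf()
  .setAppName("OffHeapExample")
  .set("spark.tachyonStore.url", "tachyon://tachyon-master:19998")
  .set("spark.tachyonStore.baseDir", "/tmp_spark_tachyon")
val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 1000000)
rdd.persist(StorageLevel.OFF_HEAP)
println(rdd.count())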


- Ranga


Re: Spark Streaming: HiveContext within Custom Actor

2014-12-30 Thread Ranga
Thanks. Will look at other options.

On Tue, Dec 30, 2014 at 11:43 AM, Tathagata Das tathagata.das1...@gmail.com
 wrote:

 I am not sure that can be done. Receivers are designed to be run only
 on the executors/workers, whereas a SQLContext (for using Spark SQL)
 can only be defined on the driver.


 On Mon, Dec 29, 2014 at 6:45 PM, sranga sra...@gmail.com wrote:
  Hi
 
  Could Spark-SQL be used from within a custom actor that acts as a
 receiver
  for a streaming application? If yes, what is the recommended way of
 passing
  the SparkContext to the actor?
  Thanks for your help.
 
 
  - Ranga
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-HiveContext-within-Custom-Actor-tp20892.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 



Re: RDDs being cleaned too fast

2014-12-11 Thread Ranga
I was having similar issues with my persistent RDDs. After some digging
around, I noticed that the partitions were not balanced evenly across the
available nodes. After a repartition, the RDD was spread evenly across
all available memory. Not sure if that is something that would help your
use-case though.
You could also increase the spark.storage.memoryFraction if that is an
option.
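A rough sketch of those two ideas; the input path, partition count and fraction
are illustrative, and the memory fraction has to be set before the SparkContext
is created (for example via --conf spark.storage.memoryFraction=0.7):

import org.apache.spark.storage.StorageLevel

// assumes a running spark-shell, i.e. an existing SparkContext named sc
val rdd = sc.textFile("hdfs:///path/to/input")   // placeholder input
val balanced = rdd.repartition(64)               // pick a count that suits the cluster
balanced.persist(StorageLevel.MEMORY_AND_DISK)
balanced.count()                                 // materialize the cache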


- Ranga

On Wed, Dec 10, 2014 at 10:23 PM, Aaron Davidson ilike...@gmail.com wrote:

 The ContextCleaner uncaches RDDs that have gone out of scope on the
 driver. So it's possible that the given RDD is no longer reachable in your
 program's control flow, or else it'd be a bug in the ContextCleaner.
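 To make that concrete, one hedged way to keep persisted RDDs reachable is to
 hold them in a long-lived structure on the driver (names here are illustrative):

 import scala.collection.mutable
 import org.apache.spark.rdd.RDD

 // as long as this registry is reachable from the driver program, the RDDs it
 // holds are not out of scope, so the ContextCleaner will not uncache them
 object CacheRegistry {
   private val pinned = mutable.Map.empty[String, RDD[_]]

   def pin[T](name: String, rdd: RDD[T]): RDD[T] = {
     rdd.cache()
     pinned(name) = rdd
     rdd
   }

   def unpin(name: String): Unit = pinned.remove(name).foreach(_.unpersist())
 }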

 On Wed, Dec 10, 2014 at 5:34 PM, ankits ankitso...@gmail.com wrote:

 I'm using spark 1.1.0 and am seeing persisted RDDs being cleaned up too fast.
 How can I inspect the size of an RDD in memory and get more information about
 why it was cleaned up? There should be more than enough memory available on
 the cluster to store them, and by default, the spark.cleaner.ttl is infinite,
 so I want more information about why this is happening and how to prevent it.

 Spark just logs this when removing RDDs:

 [2014-12-11 01:19:34,006] INFO  spark.storage.BlockManager [] [] - Removing RDD 33
 [2014-12-11 01:19:34,010] INFO  pache.spark.ContextCleaner [] [akka://JobServer/user/context-supervisor/job-context1] - Cleaned RDD 33
 [2014-12-11 01:19:34,012] INFO  spark.storage.BlockManager [] [] - Removing RDD 33
 [2014-12-11 01:19:34,016] INFO  pache.spark.ContextCleaner [] [akka://JobServer/user/context-supervisor/job-context1] - Cleaned RDD 33



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-being-cleaned-too-fast-tp20613.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: S3 Bucket Access

2014-10-14 Thread Ranga
Thanks for the pointers.
I verified that the access key-id/secret used are valid. However, the
secret may contain / at times. The issues I am facing are as follows:

   - The EC2 instances are setup with an IAMRole () and don't have a static
   key-id/secret
   - All of the EC2 instances have access to S3 based on this role (I used
   s3ls and s3cp commands to verify this)
   - I can get a temporary access key-id/secret based on the IAMRole but
   they generally expire in an hour
   - If Spark is not able to use the IAMRole credentials, I may have to
   generate a static key-id/secret. This may or may not be possible in the
   environment I am in (from a policy perspective)



- Ranga

On Tue, Oct 14, 2014 at 4:21 AM, Rafal Kwasny m...@entropy.be wrote:

 Hi,
 keep in mind that you're going to have a bad time if your secret key
 contains a /
 This is due to old and stupid hadoop bug:
 https://issues.apache.org/jira/browse/HADOOP-3733

 Best way is to regenerate the key so it does not include a /

 /Raf


 Akhil Das wrote:

 Try the following:

 1. Set the access key and secret key in the sparkContext:

 sparkContext.set("AWS_ACCESS_KEY_ID", "yourAccessKey")
 sparkContext.set("AWS_SECRET_ACCESS_KEY", "yourSecretKey")

 2. Set the access key and secret key in the environment before starting
 your application:

 export AWS_ACCESS_KEY_ID="your access"
 export AWS_SECRET_ACCESS_KEY="your secret"

 3. Set the access key and secret key inside the hadoop configurations

 val hadoopConf = sparkContext.hadoopConfiguration
 hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
 hadoopConf.set("fs.s3.awsAccessKeyId", "yourAccessKey")
 hadoopConf.set("fs.s3.awsSecretAccessKey", "yourSecretKey")

 4. You can also try:

 val lines = sparkContext.textFile("s3n://yourAccessKey:yourSecretKey@yourBucket/path/")


 Thanks
 Best Regards

 On Mon, Oct 13, 2014 at 11:33 PM, Ranga sra...@gmail.com wrote:

 Hi

 I am trying to access files/buckets in S3 and encountering a permissions
 issue. The buckets are configured to authenticate using an IAMRole provider.
 I have set the KeyId and Secret using environment variables (
 AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still unable
 to access the S3 buckets.

 Before setting the access key and secret the error was: 
 java.lang.IllegalArgumentException:
 AWS Access Key ID and Secret Access Key must be specified as the username
 or password (respectively) of a s3n URL, or by setting the
 fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties
 (respectively).

 After setting the access key and secret, the error is: The AWS Access
 Key Id you provided does not exist in our records.

 The id/secret being set are the right values. This makes me believe that
 something else (token, etc.) needs to be set as well.
 Any help is appreciated.


 - Ranga






Re: S3 Bucket Access

2014-10-14 Thread Ranga
Thanks for the input.
Yes, I did use the temporary access credentials provided by the IAM role
(also detailed in the link you provided). The session token needs to be
specified and I was looking for a way to set that in the header (which
doesn't seem possible).
Looks like a static key/secret is the only option.

On Tue, Oct 14, 2014 at 10:32 AM, Gen gen.tan...@gmail.com wrote:

 Hi,

 If I remember correctly, Spark cannot use the IAM role credentials to access
 S3. It first uses the id/key from the environment; if that is null, it uses
 the value in core-site.xml. So an IAM role is not useful for Spark. The same
 problem happens if you want to use the distcp command in Hadoop.

 Do you use curl http://169.254.169.254/latest/meta-data/iam/... to get the
 temporary credentials? If yes, those credentials cannot be used directly by
 Spark. For more information, you can take a look at
 http://docs.aws.amazon.com/STS/latest/UsingSTS/using-temp-creds.html
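
 For reference, that metadata lookup goes roughly like this (the role name is a
 placeholder); the JSON response carries AccessKeyId, SecretAccessKey, Token and
 Expiration:

 curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
 curl http://169.254.169.254/latest/meta-data/iam/security-credentials/your-role-name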



 sranga wrote
  Thanks for the pointers.
  I verified that the access key-id/secret used are valid. However, the
  secret may contain / at times. The issues I am facing are as follows:
 
 - The EC2 instances are setup with an IAMRole () and don't have a
  static
 key-id/secret
 - All of the EC2 instances have access to S3 based on this role (I
 used
 s3ls and s3cp commands to verify this)
 - I can get a temporary access key-id/secret based on the IAMRole
 but
 they generally expire in an hour
 - If Spark is not able to use the IAMRole credentials, I may have to
 generate a static key-id/secret. This may or may not be possible in
 the
 environment I am in (from a policy perspective)
 
 
 
  - Ranga
 
  On Tue, Oct 14, 2014 at 4:21 AM, Rafal Kwasny lt;

  mag@

  gt; wrote:
 
  Hi,
  keep in mind that you're going to have a bad time if your secret key
  contains a /
  This is due to old and stupid hadoop bug:
  https://issues.apache.org/jira/browse/HADOOP-3733
 
  Best way is to regenerate the key so it does not include a /
 
  /Raf
 
 
  Akhil Das wrote:
 
  Try the following:

  1. Set the access key and secret key in the sparkContext:

  sparkContext.set("AWS_ACCESS_KEY_ID", "yourAccessKey")
  sparkContext.set("AWS_SECRET_ACCESS_KEY", "yourSecretKey")

  2. Set the access key and secret key in the environment before starting
  your application:

  export AWS_ACCESS_KEY_ID="your access"
  export AWS_SECRET_ACCESS_KEY="your secret"

  3. Set the access key and secret key inside the hadoop configurations

  val hadoopConf = sparkContext.hadoopConfiguration
  hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
  hadoopConf.set("fs.s3.awsAccessKeyId", "yourAccessKey")
  hadoopConf.set("fs.s3.awsSecretAccessKey", "yourSecretKey")

  4. You can also try:

  val lines = sparkContext.textFile("s3n://yourAccessKey:yourSecretKey@yourBucket/path/")
 
 
  Thanks
  Best Regards
 
  On Mon, Oct 13, 2014 at 11:33 PM, Ranga lt;

  sranga@

  gt; wrote:
 
  Hi
 
  I am trying to access files/buckets in S3 and encountering a
 permissions
  issue. The buckets are configured to authenticate using an IAMRole
  provider.
  I have set the KeyId and Secret using environment variables (
  AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still
 unable
  to access the S3 buckets.
 
  Before setting the access key and secret the error was:
  java.lang.IllegalArgumentException:
  AWS Access Key ID and Secret Access Key must be specified as the
  username
  or password (respectively) of a s3n URL, or by setting the
  fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties
  (respectively).
 
  After setting the access key and secret, the error is: The AWS Access
  Key Id you provided does not exist in our records.
 
  The id/secret being set are the right values. This makes me believe
 that
  something else (token, etc.) needs to be set as well.
  Any help is appreciated.
 
 
  - Ranga
 
 
 
 





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/S3-Bucket-Access-tp16303p16397.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: S3 Bucket Access

2014-10-14 Thread Ranga
One related question. Could I specify the 
com.amazonaws.services.s3.AmazonS3Client implementation for the  
fs.s3.impl parameter? Let me try that and update this thread with my
findings.

On Tue, Oct 14, 2014 at 10:48 AM, Ranga sra...@gmail.com wrote:

 Thanks for the input.
 Yes, I did use the temporary access credentials provided by the IAM role
 (also detailed in the link you provided). The session token needs to be
 specified and I was looking for a way to set that in the header (which
 doesn't seem possible).
 Looks like a static key/secret is the only option.

 On Tue, Oct 14, 2014 at 10:32 AM, Gen gen.tan...@gmail.com wrote:

 Hi,

 If I remember correctly, Spark cannot use the IAM role credentials to access
 S3. It first uses the id/key from the environment; if that is null, it uses
 the value in core-site.xml. So an IAM role is not useful for Spark. The same
 problem happens if you want to use the distcp command in Hadoop.

 Do you use curl http://169.254.169.254/latest/meta-data/iam/... to get the
 temporary credentials? If yes, those credentials cannot be used directly by
 Spark. For more information, you can take a look at
 http://docs.aws.amazon.com/STS/latest/UsingSTS/using-temp-creds.html



 sranga wrote
  Thanks for the pointers.
  I verified that the access key-id/secret used are valid. However, the
  secret may contain / at times. The issues I am facing are as follows:
 
 - The EC2 instances are setup with an IAMRole () and don't have a
  static
 key-id/secret
 - All of the EC2 instances have access to S3 based on this role (I
 used
 s3ls and s3cp commands to verify this)
 - I can get a temporary access key-id/secret based on the IAMRole
 but
 they generally expire in an hour
 - If Spark is not able to use the IAMRole credentials, I may have to
 generate a static key-id/secret. This may or may not be possible in
 the
 environment I am in (from a policy perspective)
 
 
 
  - Ranga
 
  On Tue, Oct 14, 2014 at 4:21 AM, Rafal Kwasny lt;

  mag@

  gt; wrote:
 
  Hi,
  keep in mind that you're going to have a bad time if your secret key
  contains a /
  This is due to old and stupid hadoop bug:
  https://issues.apache.org/jira/browse/HADOOP-3733
 
  Best way is to regenerate the key so it does not include a /
 
  /Raf
 
 
  Akhil Das wrote:
 
  Try the following:

  1. Set the access key and secret key in the sparkContext:

  sparkContext.set("AWS_ACCESS_KEY_ID", "yourAccessKey")
  sparkContext.set("AWS_SECRET_ACCESS_KEY", "yourSecretKey")

  2. Set the access key and secret key in the environment before starting
  your application:

  export AWS_ACCESS_KEY_ID="your access"
  export AWS_SECRET_ACCESS_KEY="your secret"

  3. Set the access key and secret key inside the hadoop configurations

  val hadoopConf = sparkContext.hadoopConfiguration
  hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
  hadoopConf.set("fs.s3.awsAccessKeyId", "yourAccessKey")
  hadoopConf.set("fs.s3.awsSecretAccessKey", "yourSecretKey")

  4. You can also try:

  val lines = sparkContext.textFile("s3n://yourAccessKey:yourSecretKey@yourBucket/path/")
 
 
  Thanks
  Best Regards
 
  On Mon, Oct 13, 2014 at 11:33 PM, Ranga lt;

  sranga@

  gt; wrote:
 
  Hi
 
  I am trying to access files/buckets in S3 and encountering a
 permissions
  issue. The buckets are configured to authenticate using an IAMRole
  provider.
  I have set the KeyId and Secret using environment variables (
  AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still
 unable
  to access the S3 buckets.
 
  Before setting the access key and secret the error was:
  java.lang.IllegalArgumentException:
  AWS Access Key ID and Secret Access Key must be specified as the
  username
  or password (respectively) of a s3n URL, or by setting the
  fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties
  (respectively).
 
  After setting the access key and secret, the error is: The AWS Access
  Key Id you provided does not exist in our records.
 
  The id/secret being set are the right values. This makes me believe
 that
  something else (token, etc.) needs to be set as well.
  Any help is appreciated.
 
 
  - Ranga
 
 
 
 





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/S3-Bucket-Access-tp16303p16397.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: S3 Bucket Access

2014-10-14 Thread Ranga
Thanks Rishi. That is exactly what I am trying to do now :)

On Tue, Oct 14, 2014 at 2:41 PM, Rishi Pidva rpi...@pivotal.io wrote:


 As per EMR documentation:
 http://docs.amazonaws.cn/en_us/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html
 Access AWS Resources Using IAM Roles

 If you've launched your cluster with an IAM role, applications running on
 the EC2 instances of that cluster can use the IAM role to obtain temporary
 account credentials to use when calling services in AWS.

 The version of Hadoop available on AMI 2.3.0 and later has already been
 updated to make use of IAM roles. If your application runs strictly on top
 of the Hadoop architecture, and does not directly call any service in AWS,
 it should work with IAM roles with no modification.

 If your application calls services in AWS directly, you'll need to update
 it to take advantage of IAM roles. This means that instead of obtaining
 account credentials from /home/hadoop/conf/core-site.xml on the EC2
 instances in the cluster, your application will now either use an SDK to
 access the resources using IAM roles, or call the EC2 instance metadata to
 obtain the temporary credentials.
 --

 Maybe you can use AWS SDK in your application to provide AWS credentials?

 https://github.com/seratch/AWScala
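
 Along those lines, a minimal sketch with the AWS SDK for Java; the bucket and
 key are placeholders:

 import com.amazonaws.auth.InstanceProfileCredentialsProvider
 import com.amazonaws.services.s3.AmazonS3Client

 // the SDK refreshes the temporary credentials (including the session token)
 // from the instance metadata service on its own
 val s3 = new AmazonS3Client(new InstanceProfileCredentialsProvider())
 val obj = s3.getObject("your-bucket", "path/to/object")
 scala.io.Source.fromInputStream(obj.getObjectContent).getLines().take(5).foreach(println)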


 On Oct 14, 2014, at 11:10 AM, Ranga sra...@gmail.com wrote:

 One related question. Could I specify the 
 com.amazonaws.services.s3.AmazonS3Client implementation for the  
 fs.s3.impl parameter? Let me try that and update this thread with my
 findings.

 On Tue, Oct 14, 2014 at 10:48 AM, Ranga sra...@gmail.com wrote:

 Thanks for the input.
 Yes, I did use the temporary access credentials provided by the IAM
 role (also detailed in the link you provided). The session token needs to
 be specified and I was looking for a way to set that in the header (which
 doesn't seem possible).
 Looks like a static key/secret is the only option.

 On Tue, Oct 14, 2014 at 10:32 AM, Gen gen.tan...@gmail.com wrote:

 Hi,

 If I remember correctly, Spark cannot use the IAM role credentials to access
 S3. It first uses the id/key from the environment; if that is null, it uses
 the value in core-site.xml. So an IAM role is not useful for Spark. The same
 problem happens if you want to use the distcp command in Hadoop.

 Do you use curl http://169.254.169.254/latest/meta-data/iam/... to get the
 temporary credentials? If yes, those credentials cannot be used directly by
 Spark. For more information, you can take a look at
 http://docs.aws.amazon.com/STS/latest/UsingSTS/using-temp-creds.html



 sranga wrote
  Thanks for the pointers.
  I verified that the access key-id/secret used are valid. However, the
  secret may contain / at times. The issues I am facing are as follows:
 
 - The EC2 instances are setup with an IAMRole () and don't have a
  static
 key-id/secret
 - All of the EC2 instances have access to S3 based on this role (I
 used
 s3ls and s3cp commands to verify this)
 - I can get a temporary access key-id/secret based on the IAMRole
 but
 they generally expire in an hour
 - If Spark is not able to use the IAMRole credentials, I may have to
 generate a static key-id/secret. This may or may not be possible in
 the
 environment I am in (from a policy perspective)
 
 
 
  - Ranga
 
  On Tue, Oct 14, 2014 at 4:21 AM, Rafal Kwasny lt;

  mag@

  gt; wrote:
 
  Hi,
  keep in mind that you're going to have a bad time if your secret key
  contains a /
  This is due to old and stupid hadoop bug:
  https://issues.apache.org/jira/browse/HADOOP-3733
 
  Best way is to regenerate the key so it does not include a /
 
  /Raf
 
 
  Akhil Das wrote:
 
  Try the following:

  1. Set the access key and secret key in the sparkContext:

  sparkContext.set("AWS_ACCESS_KEY_ID", "yourAccessKey")
  sparkContext.set("AWS_SECRET_ACCESS_KEY", "yourSecretKey")

  2. Set the access key and secret key in the environment before starting
  your application:

  export AWS_ACCESS_KEY_ID="your access"
  export AWS_SECRET_ACCESS_KEY="your secret"

  3. Set the access key and secret key inside the hadoop configurations

  val hadoopConf = sparkContext.hadoopConfiguration
  hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
  hadoopConf.set("fs.s3.awsAccessKeyId", "yourAccessKey")
  hadoopConf.set("fs.s3.awsSecretAccessKey", "yourSecretKey")

  4. You can also try:

  val lines = sparkContext.textFile("s3n://yourAccessKey:yourSecretKey@yourBucket/path/")
 
 
  Thanks
  Best Regards
 
  On Mon, Oct 13, 2014 at 11:33 PM, Ranga lt;

  sranga@

  gt; wrote:
 
  Hi
 
  I am trying to access files/buckets in S3 and encountering a
 permissions
  issue. The buckets are configured to authenticate using an IAMRole
  provider.
  I have set the KeyId

S3 Bucket Access

2014-10-13 Thread Ranga
Hi

I am trying to access files/buckets in S3 and encountering a permissions
issue. The buckets are configured to authenticate using an IAMRole provider.
I have set the KeyId and Secret using environment variables (
AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still unable to
access the S3 buckets.

Before setting the access key and secret the error was:
java.lang.IllegalArgumentException:
AWS Access Key ID and Secret Access Key must be specified as the username
or password (respectively) of a s3n URL, or by setting the
fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties
(respectively).

After setting the access key and secret, the error is: The AWS Access Key
Id you provided does not exist in our records.

The id/secret being set are the right values. This makes me believe that
something else (token, etc.) needs to be set as well.
Any help is appreciated.


- Ranga


Re: S3 Bucket Access

2014-10-13 Thread Ranga
Is there a way to specify a request header during the
sparkContext.textFile call?


- Ranga

On Mon, Oct 13, 2014 at 11:03 AM, Ranga sra...@gmail.com wrote:

 Hi

 I am trying to access files/buckets in S3 and encountering a permissions
 issue. The buckets are configured to authenticate using an IAMRole provider.
 I have set the KeyId and Secret using environment variables (
 AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still unable
 to access the S3 buckets.

 Before setting the access key and secret the error was: 
 java.lang.IllegalArgumentException:
 AWS Access Key ID and Secret Access Key must be specified as the username
 or password (respectively) of a s3n URL, or by setting the
 fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties
 (respectively).

 After setting the access key and secret, the error is: The AWS Access
 Key Id you provided does not exist in our records.

 The id/secret being set are the right values. This makes me believe that
 something else (token, etc.) needs to be set as well.
 Any help is appreciated.


 - Ranga



Re: S3 Bucket Access

2014-10-13 Thread Ranga
Hi Daniil

Could you provide some more details on how the cluster should be
launched/configured? The EC2 instance that I am dealing with uses the
concept of IAMRoles. I don't have any keyfile to specify to the spark-ec2
script.
Thanks for your help.


- Ranga

On Mon, Oct 13, 2014 at 3:04 PM, Daniil Osipov daniil.osi...@shazam.com
wrote:

 (Copying the user list)
 You should use the spark_ec2 script to configure the cluster. If you use the
 trunk version you can use the new --copy-aws-credentials option to configure
 the S3 parameters automatically; otherwise either include them in your
 SparkConf variable or add them to
 /root/spark/ephemeral-hdfs/conf/core-site.xml
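
 For example, a launch along these lines, where the cluster name, key pair and
 identity file are placeholders and the option is only available on checkouts
 that include it:

 ./ec2/spark-ec2 --key-pair=my-key --identity-file=my-key.pem \
   --copy-aws-credentials launch my-cluster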

 On Mon, Oct 13, 2014 at 2:56 PM, Ranga sra...@gmail.com wrote:

 The cluster is deployed on EC2 and I am trying to access the S3 files
 from within a spark-shell session.

 On Mon, Oct 13, 2014 at 2:51 PM, Daniil Osipov daniil.osi...@shazam.com
 wrote:

 So is your cluster running on EC2, or locally? If you're running
 locally, you should still be able to access S3 files, you just need to
 locate the core-site.xml and add the parameters as defined in the error.

 On Mon, Oct 13, 2014 at 2:49 PM, Ranga sra...@gmail.com wrote:

 Hi Daniil

 No. I didn't create the spark-cluster using the ec2 scripts. Is that
 something that I need to do? I just downloaded Spark-1.1.0 and Hadoop-2.4.
 However, I am trying to access files on S3 from this cluster.


 - Ranga

 On Mon, Oct 13, 2014 at 2:36 PM, Daniil Osipov 
 daniil.osi...@shazam.com wrote:

 Did you add the fs.s3n.aws* configuration parameters in
 /root/spark/ephemeral-hdfs/conf/core-ste.xml?

 On Mon, Oct 13, 2014 at 11:03 AM, Ranga sra...@gmail.com wrote:

 Hi

 I am trying to access files/buckets in S3 and encountering a
 permissions issue. The buckets are configured to authenticate using an
 IAMRole provider.
 I have set the KeyId and Secret using environment variables (
 AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still
 unable to access the S3 buckets.

 Before setting the access key and secret the error was: 
 java.lang.IllegalArgumentException:
 AWS Access Key ID and Secret Access Key must be specified as the username
 or password (respectively) of a s3n URL, or by setting the
 fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties
 (respectively).

 After setting the access key and secret, the error is: The AWS
 Access Key Id you provided does not exist in our records.

 The id/secret being set are the right values. This makes me believe
 that something else (token, etc.) needs to be set as well.
 Any help is appreciated.


 - Ranga









Re: Spark-SQL: SchemaRDD - ClassCastException

2014-10-09 Thread Ranga
Resolution:
After realizing that the SerDe (OpenCSV) was causing all the fields to be
defined as String type, I modified the Hive load statement to use the
default serializer. I was able to modify the CSV input file to use a
different delimiter. Although this is a workaround, I am able to proceed
with this for now.
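
For anyone hitting the same thing, a table definition along these lines avoids the string-only SerDe; the column names, types and delimiter here are illustrative rather than the exact DDL used:

-- default delimited SerDe with typed columns instead of OpenCSV
CREATE TABLE source_table (
  c1  STRING,
  c2  STRING,
  c18 INT,
  c21 INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;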


- Ranga

On Wed, Oct 8, 2014 at 9:18 PM, Ranga sra...@gmail.com wrote:

 This is a bit strange. When I print the schema for the RDD, it reflects
 the correct data type for each column. But doing any kind of mathematical
 calculation seems to result in ClassCastException. Here is a sample that
 results in the exception:
 select c1, c2
 ...
 cast (c18 as int) * cast (c21 as int)
 ...
 from table

 Any other pointers? Thanks for the help.


 - Ranga

 On Wed, Oct 8, 2014 at 5:20 PM, Ranga sra...@gmail.com wrote:

 Sorry. It's 1.1.0.
 After digging a bit more into this, it seems like the OpenCSV Deserializer
 converts all the columns to a String type. This may be throwing the
 execution off. Planning to create a class and map the rows to this custom
 class. Will keep this thread updated.

 On Wed, Oct 8, 2014 at 5:11 PM, Michael Armbrust mich...@databricks.com
 wrote:

 Which version of Spark are you running?

 On Wed, Oct 8, 2014 at 4:18 PM, Ranga sra...@gmail.com wrote:

 Thanks Michael. Should the cast be done in the source RDD or while
 doing the SUM?
 To give a better picture here is the code sequence:

 val sourceRdd = sql(select ... from source-hive-table)
 sourceRdd.registerAsTable(sourceRDD)
 val aggRdd = sql(select c1, c2, sum(c3) from sourceRDD group by c1,
 c2)  // This query throws the exception when I collect the results

 I tried adding the cast to the aggRdd query above and that didn't help.


 - Ranga

 On Wed, Oct 8, 2014 at 3:52 PM, Michael Armbrust 
 mich...@databricks.com wrote:

 Using SUM on a string should automatically cast the column.  Also you
 can use CAST to change the datatype
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-TypeConversionFunctions
 .

 What version of Spark are you running?  This could be
 https://issues.apache.org/jira/browse/SPARK-1994

 On Wed, Oct 8, 2014 at 3:47 PM, Ranga sra...@gmail.com wrote:

 Hi

 I am in the process of migrating some logic in pig scripts to
 Spark-SQL. As part of this process, I am creating a few Select...Group 
 By
 query and registering them as tables using the SchemaRDD.registerAsTable
 feature.
 When using such a registered table in a subsequent Select...Group
 By query, I get a ClassCastException.
 java.lang.ClassCastException: java.lang.String cannot be cast to
 java.lang.Integer

 This happens when I use the Sum function on one of the columns. Is
 there any way to specify the data type for the columns when the
 registerAsTable function is called? Are there other approaches that I
 should be looking at?

 Thanks for your help.



 - Ranga









Spark-SQL: SchemaRDD - ClassCastException

2014-10-08 Thread Ranga
Hi

I am in the process of migrating some logic in pig scripts to Spark-SQL. As
part of this process, I am creating a few Select...Group By query and
registering them as tables using the SchemaRDD.registerAsTable feature.
When using such a registered table in a subsequent Select...Group By
query, I get a ClassCastException.
java.lang.ClassCastException: java.lang.String cannot be cast to
java.lang.Integer

This happens when I use the Sum function on one of the columns. Is there
any way to specify the data type for the columns when the registerAsTable
function is called? Are there other approaches that I should be looking at?

Thanks for your help.



- Ranga


Re: Spark-SQL: SchemaRDD - ClassCastException

2014-10-08 Thread Ranga
Thanks Michael. Should the cast be done in the source RDD or while doing
the SUM?
To give a better picture here is the code sequence:

val sourceRdd = sql(select ... from source-hive-table)
sourceRdd.registerAsTable(sourceRDD)
val aggRdd = sql(select c1, c2, sum(c3) from sourceRDD group by c1, c2)
 // This query throws the exception when I collect the results

I tried adding the cast to the aggRdd query above and that didn't help.


- Ranga

On Wed, Oct 8, 2014 at 3:52 PM, Michael Armbrust mich...@databricks.com
wrote:

 Using SUM on a string should automatically cast the column.  Also you can
 use CAST to change the datatype
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-TypeConversionFunctions
 .

 What version of Spark are you running?  This could be
 https://issues.apache.org/jira/browse/SPARK-1994

 On Wed, Oct 8, 2014 at 3:47 PM, Ranga sra...@gmail.com wrote:

 Hi

 I am in the process of migrating some logic in pig scripts to Spark-SQL.
 As part of this process, I am creating a few Select...Group By query and
 registering them as tables using the SchemaRDD.registerAsTable feature.
 When using such a registered table in a subsequent Select...Group By
 query, I get a ClassCastException.
 java.lang.ClassCastException: java.lang.String cannot be cast to
 java.lang.Integer

 This happens when I use the Sum function on one of the columns. Is
 there any way to specify the data type for the columns when the
 registerAsTable function is called? Are there other approaches that I
 should be looking at?

 Thanks for your help.



 - Ranga





Re: Spark-SQL: SchemaRDD - ClassCastException

2014-10-08 Thread Ranga
Sorry. It's 1.1.0.
After digging a bit more into this, it seems like the OpenCSV Deserializer
converts all the columns to a String type. This may be throwing the
execution off. Planning to create a class and map the rows to this custom
class. Will keep this thread updated.
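
A minimal sketch of that mapping against the Spark 1.1 API; the case class, column positions and table name are assumptions for illustration, and it assumes the usual sqlContext implicits are in scope:

// convert the string-typed rows into a case class so the registered
// table carries real numeric types
case class Record(c1: String, c2: String, c18: Int, c21: Int)

val typed = sourceRdd.map { r =>
  Record(r.getString(0), r.getString(1), r.getString(17).trim.toInt, r.getString(20).trim.toInt)
}

import sqlContext.createSchemaRDD   // implicit RDD[case class] -> SchemaRDD conversion
typed.registerTempTable("typedSource")   // registerAsTable on older builds
val agg = sql("select c1, c2, sum(c18 * c21) from typedSource group by c1, c2")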

On Wed, Oct 8, 2014 at 5:11 PM, Michael Armbrust mich...@databricks.com
wrote:

 Which version of Spark are you running?

 On Wed, Oct 8, 2014 at 4:18 PM, Ranga sra...@gmail.com wrote:

 Thanks Michael. Should the cast be done in the source RDD or while doing
 the SUM?
 To give a better picture here is the code sequence:

 val sourceRdd = sql(select ... from source-hive-table)
 sourceRdd.registerAsTable(sourceRDD)
 val aggRdd = sql(select c1, c2, sum(c3) from sourceRDD group by c1, c2)
  // This query throws the exception when I collect the results

 I tried adding the cast to the aggRdd query above and that didn't help.


 - Ranga

 On Wed, Oct 8, 2014 at 3:52 PM, Michael Armbrust mich...@databricks.com
 wrote:

 Using SUM on a string should automatically cast the column.  Also you
 can use CAST to change the datatype
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-TypeConversionFunctions
 .

 What version of Spark are you running?  This could be
 https://issues.apache.org/jira/browse/SPARK-1994

 On Wed, Oct 8, 2014 at 3:47 PM, Ranga sra...@gmail.com wrote:

 Hi

 I am in the process of migrating some logic in pig scripts to
 Spark-SQL. As part of this process, I am creating a few Select...Group By
 query and registering them as tables using the SchemaRDD.registerAsTable
 feature.
 When using such a registered table in a subsequent Select...Group By
 query, I get a ClassCastException.
 java.lang.ClassCastException: java.lang.String cannot be cast to
 java.lang.Integer

 This happens when I use the Sum function on one of the columns. Is
 there any way to specify the data type for the columns when the
 registerAsTable function is called? Are there other approaches that I
 should be looking at?

 Thanks for your help.



 - Ranga







Re: Spark-SQL: SchemaRDD - ClassCastException

2014-10-08 Thread Ranga
This is a bit strange. When I print the schema for the RDD, it reflects the
correct data type for each column. But doing any kind of mathematical
calculation seems to result in ClassCastException. Here is a sample that
results in the exception:
select c1, c2
...
cast (c18 as int) * cast (c21 as int)
...
from table

Any other pointers? Thanks for the help.


- Ranga

On Wed, Oct 8, 2014 at 5:20 PM, Ranga sra...@gmail.com wrote:

 Sorry. It's 1.1.0.
 After digging a bit more into this, it seems like the OpenCSV Deserializer
 converts all the columns to a String type. This may be throwing the
 execution off. Planning to create a class and map the rows to this custom
 class. Will keep this thread updated.

 On Wed, Oct 8, 2014 at 5:11 PM, Michael Armbrust mich...@databricks.com
 wrote:

 Which version of Spark are you running?

 On Wed, Oct 8, 2014 at 4:18 PM, Ranga sra...@gmail.com wrote:

 Thanks Michael. Should the cast be done in the source RDD or while doing
 the SUM?
 To give a better picture here is the code sequence:

 val sourceRdd = sql(select ... from source-hive-table)
 sourceRdd.registerAsTable(sourceRDD)
 val aggRdd = sql(select c1, c2, sum(c3) from sourceRDD group by c1, c2)
  // This query throws the exception when I collect the results

 I tried adding the cast to the aggRdd query above and that didn't help.


 - Ranga

 On Wed, Oct 8, 2014 at 3:52 PM, Michael Armbrust mich...@databricks.com
  wrote:

 Using SUM on a string should automatically cast the column.  Also you
 can use CAST to change the datatype
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-TypeConversionFunctions
 .

 What version of Spark are you running?  This could be
 https://issues.apache.org/jira/browse/SPARK-1994

 On Wed, Oct 8, 2014 at 3:47 PM, Ranga sra...@gmail.com wrote:

 Hi

 I am in the process of migrating some logic in pig scripts to
 Spark-SQL. As part of this process, I am creating a few Select...Group 
 By
 query and registering them as tables using the SchemaRDD.registerAsTable
 feature.
 When using such a registered table in a subsequent Select...Group By
 query, I get a ClassCastException.
 java.lang.ClassCastException: java.lang.String cannot be cast to
 java.lang.Integer

 This happens when I use the Sum function on one of the columns. Is
 there any way to specify the data type for the columns when the
 registerAsTable function is called? Are there other approaches that I
 should be looking at?

 Thanks for your help.



 - Ranga