Re: How to access Spark UI through AWS

2015-08-25 Thread Kelly, Jonathan
I'm not sure why the UI appears broken like that either and haven't
investigated it myself yet, but if you instead go to the YARN
ResourceManager UI (port 8088 if you are using emr-4.x; port 9026 for 3.x,
I believe), then you should be able to click on the ApplicationMaster link
(or the History link for completed applications) to get to the Spark UI
from there. The ApplicationMaster link will use the YARN Proxy Service
(port 20888 on emr-4.x; not sure about 3.x) to proxy through the Spark
application's UI, regardless of what port it's running on. For completed
applications, the History link will send you directly to the Spark History
Server UI on port 18080. Hope that helps!
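
For anyone trying this, a rough sketch of the tunnel setup described above (the
key file, user name and host names are placeholders, and the ports assume emr-4.x):

# Open a dynamic (SOCKS) tunnel to the master node; 8157 is just an arbitrary local port
ssh -i ~/mykey.pem -N -D 8157 hadoop@<master-public-dns>

# With the browser proxying through localhost:8157 (e.g. via FoxyProxy), go to:
#   http://<master-public-dns>:8088/    YARN ResourceManager UI
#   (from there, the ApplicationMaster link goes through the proxy on port 20888 to the Spark UI)
#   http://<master-public-dns>:18080/   Spark History Server, for completed applications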

~ Jonathan




On 8/24/15, 10:51 PM, Justin Pihony justin.pih...@gmail.com wrote:

I am using the steps from this article
(https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923) to
get Spark up and running on EMR through YARN. Once up and running, I ssh in
and cd to the Spark bin directory and run spark-shell --master yarn. Once this spins
up, I can see that the UI is started at the internal IP on port 4040. If I hit the
public DNS at 4040 with dynamic port tunneling and FoxyProxy, then I get a
crude UI (the CSS seems broken); however, the proxy continuously redirects me to
the main page, so I cannot drill into anything. So I tried static
tunneling, but can't seem to get through.

So, how can I access the Spark UI when running a spark-shell in AWS YARN?






Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Kelly, Jonathan
Would there be any problem in having spark.executor.instances (or 
--num-executors) be completely ignored (i.e., even for non-zero values) if 
spark.dynamicAllocation.enabled is true (i.e., rather than throwing an 
exception)?

I can see how the exception would be helpful if, say, you tried to pass both 
-c spark.executor.instances (or --num-executors) *and* -c 
spark.dynamicAllocation.enabled=true to spark-submit on the command line (as 
opposed to having one of them in spark-defaults.conf and one of them in the 
spark-submit args), but currently there doesn't seem to be any way to 
distinguish between arguments that were actually passed to spark-submit and 
settings that simply came from spark-defaults.conf.

If there were a way to distinguish them, I think the ideal situation would be 
for the validation exception to be thrown only if spark.executor.instances and 
spark.dynamicAllocation.enabled=true were both passed via spark-submit args or 
were both present in spark-defaults.conf, but passing 
spark.dynamicAllocation.enabled=true to spark-submit would take precedence over 
spark.executor.instances configured in spark-defaults.conf, and vice versa.

Jonathan Kelly
Elastic MapReduce - SDE
Blackfoot (SEA33) 06.850.F0

From: Jonathan Kelly jonat...@amazon.com
Date: Tuesday, July 14, 2015 at 4:23 PM
To: user@spark.apache.org
Subject: Unable to use dynamicAllocation if spark.executor.instances is set in 
spark-defaults.conf

I've set up my cluster with a pre-calculated value for spark.executor.instances 
in spark-defaults.conf such that I can run a job and have it maximize the 
utilization of the cluster resources by default. However, if I want to run a 
job with dynamicAllocation (by passing -c spark.dynamicAllocation.enabled=true 
to spark-submit), I get this exception:

Exception in thread "main" java.lang.IllegalArgumentException: Explicitly 
setting the number of executors is not compatible with 
spark.dynamicAllocation.enabled!
at 
org.apache.spark.deploy.yarn.ClientArguments.parseArgs(ClientArguments.scala:192)
at org.apache.spark.deploy.yarn.ClientArguments.<init>(ClientArguments.scala:59)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:54)
...

The exception makes sense, of course, but ideally I would like it to ignore 
what I've put in spark-defaults.conf for spark.executor.instances if I've 
enabled dynamicAllocation. The most annoying thing about this is that if I have 
spark.executor.instances present in spark-defaults.conf, I cannot figure out 
any way to spark-submit a job with spark.dynamicAllocation.enabled=true without 
getting this error. That is, even if I pass -c spark.executor.instances=0 -c 
spark.dynamicAllocation.enabled=true, I still get this error because the 
validation in ClientArguments.parseArgs() that's checking for this condition 
simply checks for the presence of spark.executor.instances rather than whether 
or not its value is > 0.

Should the check be changed to allow spark.executor.instances to be set to 0 if 
spark.dynamicAllocation.enabled is true? That would be an OK compromise, but 
I'd really prefer to be able to enable dynamicAllocation simply by setting 
spark.dynamicAllocation.enabled=true rather than by also having to set 
spark.executor.instances to 0.
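
For reference, here is a sketch of the kind of invocation I'd like to be able to
use (written with spark-submit's --conf flag; the class and jar names are
placeholders):

spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.executor.instances=0 \
  --class com.example.MyApp myapp.jar    # class/jar are placeholders
# ...but this is still rejected by the validation in ClientArguments.parseArgs()
# described above, because spark.executor.instances is present at all.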

Thanks,
Jonathan


Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Kelly, Jonathan
bump

From: Jonathan Kelly jonat...@amazon.com
Date: Tuesday, July 14, 2015 at 4:23 PM
To: user@spark.apache.org
Subject: Unable to use dynamicAllocation if spark.executor.instances is set in 
spark-defaults.conf

I've set up my cluster with a pre-calculated value for spark.executor.instances 
in spark-defaults.conf such that I can run a job and have it maximize the 
utilization of the cluster resources by default. However, if I want to run a 
job with dynamicAllocation (by passing -c spark.dynamicAllocation.enabled=true 
to spark-submit), I get this exception:

Exception in thread "main" java.lang.IllegalArgumentException: Explicitly 
setting the number of executors is not compatible with 
spark.dynamicAllocation.enabled!
at 
org.apache.spark.deploy.yarn.ClientArguments.parseArgs(ClientArguments.scala:192)
at org.apache.spark.deploy.yarn.ClientArguments.<init>(ClientArguments.scala:59)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:54)
...

The exception makes sense, of course, but ideally I would like it to ignore 
what I've put in spark-defaults.conf for spark.executor.instances if I've 
enabled dynamicAllocation. The most annoying thing about this is that if I have 
spark.executor.instances present in spark-defaults.conf, I cannot figure out 
any way to spark-submit a job with spark.dynamicAllocation.enabled=true without 
getting this error. That is, even if I pass -c spark.executor.instances=0 -c 
spark.dynamicAllocation.enabled=true, I still get this error because the 
validation in ClientArguments.parseArgs() that's checking for this condition 
simply checks for the presence of spark.executor.instances rather than whether 
or not its value is > 0.

Should the check be changed to allow spark.executor.instances to be set to 0 if 
spark.dynamicAllocation.enabled is true? That would be an OK compromise, but 
I'd really prefer to be able to enable dynamicAllocation simply by setting 
spark.dynamicAllocation.enabled=true rather than by also having to set 
spark.executor.instances to 0.

Thanks,
Jonathan


Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-14 Thread Kelly, Jonathan
I've set up my cluster with a pre-calculated value for spark.executor.instances 
in spark-defaults.conf such that I can run a job and have it maximize the 
utilization of the cluster resources by default. However, if I want to run a 
job with dynamicAllocation (by passing -c spark.dynamicAllocation.enabled=true 
to spark-submit), I get this exception:

Exception in thread "main" java.lang.IllegalArgumentException: Explicitly 
setting the number of executors is not compatible with 
spark.dynamicAllocation.enabled!
at 
org.apache.spark.deploy.yarn.ClientArguments.parseArgs(ClientArguments.scala:192)
at org.apache.spark.deploy.yarn.ClientArguments.<init>(ClientArguments.scala:59)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:54)
...

The exception makes sense, of course, but ideally I would like it to ignore 
what I've put in spark-defaults.conf for spark.executor.instances if I've 
enabled dynamicAllocation. The most annoying thing about this is that if I have 
spark.executor.instances present in spark-defaults.conf, I cannot figure out 
any way to spark-submit a job with spark.dynamicAllocation.enabled=true without 
getting this error. That is, even if I pass -c spark.executor.instances=0 -c 
spark.dynamicAllocation.enabled=true, I still get this error because the 
validation in ClientArguments.parseArgs() that's checking for this condition 
simply checks for the presence of spark.executor.instances rather than whether 
or not its value is > 0.

Should the check be changed to allow spark.executor.instances to be set to 0 if 
spark.dynamicAllocation.enabled is true? That would be an OK compromise, but 
I'd really prefer to be able to enable dynamicAllocation simply by setting 
spark.dynamicAllocation.enabled=true rather than by also having to set 
spark.executor.instances to 0.

Thanks,
Jonathan


Re: Spark ec2 cluster lost worker

2015-06-24 Thread Kelly, Jonathan
Yeah, sorry, I didn't really answer your question due to my bias for EMR. =P 
Unfortunately, also due to my bias, I have not actually tried using a straight 
EC2 cluster as opposed to Spark on EMR. I'm not sure how the slave nodes get 
set up when running Spark on EC2, but I would imagine that it must set them up 
with SSH keys so that the master can connect to them without a password. So are 
you able to SSH to the master node then from there SSH to the slave node? If 
so, maybe you can find out what's going on with that slave node and maybe 
restart the Spark worker?
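
Something along these lines, as a rough sketch only (the key file and host names
are placeholders, and I'm assuming the usual spark-ec2 layout where you log in as
root and Spark lives under the install directory on each node):

ssh -i mykey.pem root@<master-public-dns>
ssh <slave-private-dns>    # should be password-less if spark-ec2 set up the keys
# then look at the worker logs under the Spark installation's logs/ directory
# and try restarting the worker process via the scripts in its sbin/ directory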

~ Jonathan Kelly

From: Anny Chen anny9...@gmail.com
Date: Wednesday, June 24, 2015 at 6:15 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org
Subject: Re: Spark ec2 cluster lost worker

Hi Jonathan,

Thanks for this information! I will take a look into it. However, is there a way
to reconnect the lost node? Or is there nothing I can do to get the
lost worker back?

Thanks!
Anny

On Wed, Jun 24, 2015 at 6:06 PM, Kelly, Jonathan jonat...@amazon.com wrote:
Just curious, would you be able to use Spark on EMR rather than on EC2?
Spark on EMR will handle lost nodes for you, and it will let you scale
your cluster up and down or clone a cluster (its config, that is, not the
data stored in HDFS), among other things. We also recently announced
official support for Spark on EMR: http://aws.amazon.com/emr/spark

~ Jonathan Kelly (from Amazon AWS EMR)


On 6/24/15, 5:58 PM, anny9699 anny9...@gmail.com wrote:

Hi,

According to the Spark UI, one worker is lost after a failed job. It is not
a "lost executor" error; rather, the UI now only shows 8 workers (I have 9
workers). However, the EC2 console shows that the machine is running
and has no check alarms. So I am confused about how I could reconnect the lost
machine in AWS EC2.

I met this problem before, and my solution was to rebuild a new cluster.
However, now it is a little hard to rebuild a cluster, so I am wondering if
there's some way to get the lost machine back?

Thanks a lot!








Re: Spark ec2 cluster lost worker

2015-06-24 Thread Kelly, Jonathan
Just curious, would you be able to use Spark on EMR rather than on EC2?
Spark on EMR will handle lost nodes for you, and it will let you scale
your cluster up and down or clone a cluster (its config, that is, not the
data stored in HDFS), among other things. We also recently announced
official support for Spark on EMR: http://aws.amazon.com/emr/spark

~ Jonathan Kelly (from Amazon AWS EMR)


On 6/24/15, 5:58 PM, anny9699 anny9...@gmail.com wrote:

Hi,

According to the Spark UI, one worker is lost after a failed job. It is not
a "lost executor" error; rather, the UI now only shows 8 workers (I have 9
workers). However, the EC2 console shows that the machine is running
and has no check alarms. So I am confused about how I could reconnect the lost
machine in AWS EC2.

I met this problem before, and my solution was to rebuild a new cluster.
However, now it is a little hard to rebuild a cluster, so I am wondering if
there's some way to get the lost machine back?

Thanks a lot!






Re: [ERROR] Insufficient Space

2015-06-19 Thread Kelly, Jonathan
Would you be able to use Spark on EMR rather than on EC2? EMR clusters allow 
easy resizing of the cluster, and EMR also now supports Spark 1.3.1 as of EMR 
AMI 3.8.0.  See http://aws.amazon.com/emr/spark

~ Jonathan

From: Vadim Bichutskiy vadim.bichuts...@gmail.com
Date: Friday, June 19, 2015 at 7:15 AM
To: user user@spark.apache.org
Subject: [ERROR] Insufficient Space

Hello Spark Experts,

I've been running a standalone Spark cluster on EC2 for a few months now, and 
today I get this error:

IOError: [Errno 28] No space left on device
Spark assembly has been built with Hive, including Datanucleus jars on classpath
OpenJDK 64-Bit Server VM warning: Insufficient space for shared memory file

I guess I need to resize the cluster. What's the best way to do that?

Thanks,
Vadim



Re: Spark on EMR

2015-06-17 Thread Kelly, Jonathan
Yes, for now it is a wrapper around the old install-spark BA, but that will 
change soon. The currently supported version in AMI 3.8.0 is 1.3.1, as 1.4.0 
was released too late to include it in AMI 3.8.0. Spark 1.4.0 support is coming 
soon though, of course. Unfortunately, though install-spark is currently being 
used under the hood, passing -v,1.4.0 in the options is not supported.

Sent from Nine (http://www.9folders.com/)

From: Eugen Cepoi cepoi.eu...@gmail.com
Sent: Jun 17, 2015 6:37 AM
To: Hideyoshi Maeda
Cc: ayan guha;kamatsuoka;user
Subject: Re: Spark on EMR

It looks like it is a wrapper around 
https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark
So basically adding an option -v,1.4.0.a should work.

https://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-configure.html

2015-06-17 15:32 GMT+02:00 Hideyoshi Maeda hideyoshi.ma...@gmail.com:
Any ideas what version of Spark is underneath?

i.e. is it 1.4? and is SparkR supported on Amazon EMR?

On Wed, Jun 17, 2015 at 12:06 AM, ayan guha guha.a...@gmail.com wrote:

That's great news. Can I assume spark on EMR supports kinesis to hbase pipeline?

On 17 Jun 2015 05:29, kamatsuoka ken...@gmail.com wrote:
Spark is now officially supported on Amazon Elastic Map Reduce:
http://aws.amazon.com/elasticmapreduce/details/spark/








Re: Spark + Kinesis

2015-04-03 Thread Kelly, Jonathan
spark-streaming-kinesis-asl is not part of the Spark distribution on your 
cluster, so you cannot have it be just a "provided" dependency.  This is also 
why the KCL and its dependencies were not included in the assembly (but yes, 
they should be).

~ Jonathan Kelly

From: Vadim Bichutskiy vadim.bichuts...@gmail.com
Date: Friday, April 3, 2015 at 12:26 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org
Subject: Re: Spark + Kinesis

Hi all,

Good news! I was able to create a Kinesis consumer and assemble it into an 
uber jar following 
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html and 
example 
https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala.

However when I try to spark-submit it I get the following exception:

Exception in thread "main" java.lang.NoClassDefFoundError: 
com/amazonaws/auth/AWSCredentialsProvider

Do I need to include the KCL dependency in build.sbt? Here's what it looks like currently:

import AssemblyKeys._
name := "Kinesis Consumer"
version := "1.0"
organization := "com.myconsumer"
scalaVersion := "2.11.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0" % "provided"

assemblySettings
jarName in assembly := "consumer-assembly.jar"
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala=false)

Any help appreciated.

Thanks,
Vadim


On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan jonat...@amazon.com wrote:
It looks like you're attempting to mix Scala versions, so that's going to cause 
some problems.  If you really want to use Scala 2.11.5, you must also use Spark 
package versions built for Scala 2.11 rather than 2.10.  Anyway, that's not 
quite the correct way to specify Scala dependencies in build.sbt.  Instead of 
placing the Scala version after the artifactId (like "spark-core_2.10"), what 
you actually want is to use just "spark-core" with two percent signs before it. 
 Using two percent signs will make it use the version of Scala that matches 
your declared scalaVersion.  For example:

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

I think that may get you a little closer, though I think you're probably going 
to run into the same problems I ran into in this thread: 
https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I never 
really got an answer for that, and I temporarily moved on to other things for 
now.

~ Jonathan Kelly

From: 'Vadim Bichutskiy' vadim.bichuts...@gmail.com
Date: Thursday, April 2, 2015 at 9:53 AM
To: user@spark.apache.org
Subject: Spark + Kinesis

Hi all,

I am trying to write an Amazon Kinesis consumer Scala app that processes data 
in the
Kinesis stream. Is this the correct way to specify build.sbt:

---
import AssemblyKeys._
name := "Kinesis Consumer"
version := "1.0"
organization := "com.myconsumer"
scalaVersion := "2.11.5"

libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
  "org.apache.spark" % "spark-streaming_2.10" % "1.3.0",
  "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")

assemblySettings
jarName in assembly := "consumer-assembly.jar"
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala=false)


In project/assembly.sbt I have only the following line:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

I am using sbt 0.13.7. I adapted Example 7.7 in the Learning Spark book.

Thanks,
Vadim




Re: Spark + Kinesis

2015-04-03 Thread Kelly, Jonathan
Just remove "provided" from the end of the line where you specify the 
spark-streaming-kinesis-asl dependency.  That will cause that package and all 
of its transitive dependencies (including the KCL, the AWS Java SDK libraries 
and other transitive dependencies) to be included in your uber jar.  They all 
must be in there because they are not part of the Spark distribution in your 
cluster.
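
In other words, something roughly like this once that change is made (the main
class is a placeholder, and the jar path assumes the consumer-assembly.jar name
and Scala 2.11 setting from your build.sbt):

sbt assembly
spark-submit --class com.myconsumer.KinesisConsumer \
  target/scala-2.11/consumer-assembly.jar    # class name is a placeholder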

However, as I mentioned before, I think making this change might cause you to 
run into the same problems I spoke of in the thread I linked below 
(https://www.mail-archive.com/user@spark.apache.org/msg23891.html), and 
unfortunately I haven't solved that yet.

~ Jonathan Kelly

From: Vadim Bichutskiy vadim.bichuts...@gmail.com
Date: Friday, April 3, 2015 at 12:45 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org
Subject: Re: Spark + Kinesis

Thanks. So how do I fix it?


On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan jonat...@amazon.com wrote:
spark-streaming-kinesis-asl is not part of the Spark distribution on your 
cluster, so you cannot have it be just a "provided" dependency.  This is also 
why the KCL and its dependencies were not included in the assembly (but yes, 
they should be).

~ Jonathan Kelly

From: Vadim Bichutskiy vadim.bichuts...@gmail.com
Date: Friday, April 3, 2015 at 12:26 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org
Subject: Re: Spark + Kinesis

Hi all,

Good news! I was able to create a Kinesis consumer and assemble it into an 
uber jar following 
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html and 
example 
https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala.

However when I try to spark-submit it I get the following exception:

Exception in thread "main" java.lang.NoClassDefFoundError: 
com/amazonaws/auth/AWSCredentialsProvider

Do I need to include the KCL dependency in build.sbt? Here's what it looks like currently:

import AssemblyKeys._
name := "Kinesis Consumer"
version := "1.0"
organization := "com.myconsumer"
scalaVersion := "2.11.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0" % "provided"

assemblySettings
jarName in assembly := "consumer-assembly.jar"
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala=false)

Any help appreciated.

Thanks,
Vadim

On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan jonat...@amazon.com wrote:
It looks like you're attempting to mix Scala versions, so that's going to cause 
some problems.  If you really want to use Scala 2.11.5, you must also use Spark 
package versions built for Scala 2.11 rather than 2.10.  Anyway, that's not 
quite the correct way to specify Scala dependencies in build.sbt.  Instead of 
placing the Scala version after the artifactId (like "spark-core_2.10"), what 
you actually want is to use just "spark-core" with two percent signs before it. 
 Using two percent signs will make it use the version of Scala that matches 
your declared scalaVersion.  For example:

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

I think that may get you a little closer, though I think you're probably going 
to run into the same problems I ran into in this thread: 
https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I never 
really got an answer for that, and I temporarily moved on to other things for 
now.

~ Jonathan Kelly

From: 'Vadim Bichutskiy' vadim.bichuts...@gmail.com
Date: Thursday, April 2, 2015 at 9:53 AM
To: user@spark.apache.org
Subject: Spark + Kinesis

Hi all,

I am trying to write an Amazon Kinesis consumer Scala app that processes data 
in the
Kinesis stream. Is this the correct way to specify build.sbt:

---
import AssemblyKeys._
name := "Kinesis Consumer"
version := "1.0"
organization := "com.myconsumer"
scalaVersion := "2.11.5"

libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
  "org.apache.spark" % "spark-streaming_2.10" % "1.3.0",
  "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0"

Re: Spark + Kinesis

2015-04-02 Thread Kelly, Jonathan
It looks like you're attempting to mix Scala versions, so that's going to cause 
some problems.  If you really want to use Scala 2.11.5, you must also use Spark 
package versions built for Scala 2.11 rather than 2.10.  Anyway, that's not 
quite the correct way to specify Scala dependencies in build.sbt.  Instead of 
placing the Scala version after the artifactId (like "spark-core_2.10"), what 
you actually want is to use just "spark-core" with two percent signs before it. 
 Using two percent signs will make it use the version of Scala that matches 
your declared scalaVersion.  For example:

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

I think that may get you a little closer, though I think you're probably going 
to run into the same problems I ran into in this thread: 
https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I never 
really got an answer for that, and I temporarily moved on to other things for 
now.

~ Jonathan Kelly

From: 'Vadim Bichutskiy' vadim.bichuts...@gmail.com
Date: Thursday, April 2, 2015 at 9:53 AM
To: user@spark.apache.org
Subject: Spark + Kinesis

Hi all,

I am trying to write an Amazon Kinesis consumer Scala app that processes data 
in the
Kinesis stream. Is this the correct way to specify build.sbt:

---
import AssemblyKeys._
name := "Kinesis Consumer"
version := "1.0"
organization := "com.myconsumer"
scalaVersion := "2.11.5"

libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
  "org.apache.spark" % "spark-streaming_2.10" % "1.3.0",
  "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")

assemblySettings
jarName in assembly := "consumer-assembly.jar"
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala=false)


In project/assembly.sbt I have only the following line:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

I am using sbt 0.13.7. I adapted Example 7.7 in the Learning Spark book.

Thanks,
Vadim




Spark and OpenJDK - jar: No such file or directory

2015-03-30 Thread Kelly, Jonathan
I'm trying to use OpenJDK 7 with Spark 1.3.0 and noticed that the 
compute-classpath.sh script is not adding the datanucleus jars to the classpath 
because compute-classpath.sh is assuming to find the jar command in 
$JAVA_HOME/bin/jar, which does not exist for OpenJDK.  Is this an issue anybody 
else has run into?  Would it be possible to use the unzip command instead?

The fact that $JAVA_HOME/bin/jar is missing also breaks the check that ensures 
that Spark was built with a compatible version of java to the one being used to 
launch spark.  The unzip tool of course wouldn't work for this, but there's 
probably another easy alternative to $JAVA_HOME/bin/jar.

~ Jonathan Kelly


Re: When will 1.3.1 release?

2015-03-30 Thread Kelly, Jonathan
Are you referring to SPARK-6330 (https://issues.apache.org/jira/browse/SPARK-6330)?

If you are able to build Spark from source yourself, I believe you should just 
need to cherry-pick the following commits in order to backport the fix:

67fa6d1f830dee37244b5a30684d797093c7c134 [SPARK-6330] Fix filesystem bug in 
newParquet relation
9d88f0cbdb3e994c3ab62eb7534b0b5308cc5265 [SPARK-6330] [SQL] Add a test case for 
SPARK-6330

(And if it's a different bug you're talking about, you could probably find the 
commit(s) that fix that bug and backport them instead.)
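
For example, a sketch of the backport, assuming a clone of the Spark repo and
starting from the v1.3.0 tag (the branch name is arbitrary):

git checkout -b v1.3.0-with-SPARK-6330 v1.3.0
git cherry-pick 67fa6d1f830dee37244b5a30684d797093c7c134
git cherry-pick 9d88f0cbdb3e994c3ab62eb7534b0b5308cc5265
# then rebuild Spark as usual (e.g. make-distribution.sh or mvn package)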

Jonathan Kelly
Elastic MapReduce - SDE
Port 99 (SEA35) 08.220.C2

From: Shuai Zheng szheng.c...@gmail.com
Date: Monday, March 30, 2015 at 12:34 PM
To: user@spark.apache.org
Subject: When will 1.3.1 release?

Hi All,

I am waiting for Spark 1.3.1 to fix the bug so it works with the S3 file system.

Does anyone know the release date for 1.3.1? I can't downgrade to 1.2.1 because
there is a jar compatibility issue with the AWS SDK.

Regards,

Shuai


Re: Spark and OpenJDK - jar: No such file or directory

2015-03-30 Thread Kelly, Jonathan
Ah, never mind, I found the jar command in the java-1.7.0-openjdk-devel 
package.  I only had java-1.7.0-openjdk installed.  Looks like I just need to 
install java-1.7.0-openjdk-devel then set JAVA_HOME to /usr/lib/jvm/java 
instead of /usr/lib/jvm/jre.
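
In other words, roughly (assuming a yum-based system; the package and path names
are the ones mentioned above):

sudo yum install -y java-1.7.0-openjdk-devel   # provides $JAVA_HOME/bin/jar
export JAVA_HOME=/usr/lib/jvm/java             # rather than /usr/lib/jvm/jre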

~ Jonathan Kelly

From: Kelly, Jonathan jonat...@amazon.com
Date: Monday, March 30, 2015 at 1:03 PM
To: user@spark.apache.org
Subject: Spark and OpenJDK - jar: No such file or directory

I'm trying to use OpenJDK 7 with Spark 1.3.0 and noticed that the 
compute-classpath.sh script is not adding the datanucleus jars to the classpath 
because compute-classpath.sh is assuming to find the jar command in 
$JAVA_HOME/bin/jar, which does not exist for OpenJDK.  Is this an issue anybody 
else has run into?  Would it be possible to use the unzip command instead?

The fact that $JAVA_HOME/bin/jar is missing also breaks the check that ensures 
that Spark was built with a compatible version of java to the one being used to 
launch spark.  The unzip tool of course wouldn't work for this, but there's 
probably another easy alternative to $JAVA_HOME/bin/jar.

~ Jonathan Kelly


Using Spark with a SOCKS proxy

2015-03-17 Thread Kelly, Jonathan
I'm trying to figure out how I might be able to use Spark with a SOCKS proxy.  
That is, my dream is to be able to write code in my IDE then run it without 
much trouble on a remote cluster, accessible only via a SOCKS proxy between the 
local development machine and the master node of the cluster (ignoring, for 
now, any dependencies that would need to be transferred--assume it's a very 
simple app with no dependencies that aren't part of the Spark classpath on the 
cluster).  This is possible with Hadoop by setting 
hadoop.rpc.socket.factory.class.default to 
org.apache.hadoop.net.SocksSocketFactory and hadoop.socks.server to 
localhost:<port>, where <port> is a port on which a SOCKS proxy has been opened via ssh -D to the 
master node.  However, I can't seem to find anything like this for Spark, and 
I only see very few mentions of it on the user list and on stackoverflow, with 
no real answers.  (See links below.)

I thought I might be able to use the JVM's -DsocksProxyHost and 
-DsocksProxyPort system properties, but it still does not seem to work.  That 
is, if I start a SOCKS proxy to my master node using something like 
"ssh -D 2600 <master node public name>" and then run a simple Spark app that calls 
SparkConf.setMaster("spark://<master node private IP>:7077"), passing in JVM 
args of -DsocksProxyHost=localhost -DsocksProxyPort=2600, the driver hangs for 
a while before finally giving up ("Application has been killed. Reason: All 
masters are unresponsive! Giving up.").  It seems like it is not even 
attempting to use the SOCKS proxy.  Do -DsocksProxyHost/-DsocksProxyPort not 
even work for Spark?
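
For concreteness, here is a sketch of what I'm attempting (host names are
placeholders; the Hadoop properties are the ones mentioned above, and the
socksProxy* properties are the standard JVM ones):

# Open the SOCKS proxy to the cluster's master node
ssh -N -D 2600 <master node public name>

# What works for Hadoop clients:
#   -Dhadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.SocksSocketFactory
#   -Dhadoop.socks.server=localhost:2600

# What I tried (unsuccessfully) for the Spark driver JVM:
#   -DsocksProxyHost=localhost -DsocksProxyPort=2600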

http://stackoverflow.com/questions/28047000/connect-to-spark-through-a-socks-proxy
 (unanswered similar question from somebody else about a month ago)
https://issues.apache.org/jira/browse/SPARK-5004 (unresolved, somewhat related 
JIRA from a few months ago)

Thanks,
Jonathan


Re: sqlContext.parquetFile doesn't work with s3n in version 1.3.0

2015-03-16 Thread Kelly, Jonathan
See https://issues.apache.org/jira/browse/SPARK-6351

~ Jonathan

From: Shuai Zheng szheng.c...@gmail.com
Date: Monday, March 16, 2015 at 11:46 AM
To: user@spark.apache.org
Subject: sqlContext.parquetFile doesn't work with s3n in version 1.3.0

Hi All,

I just upgraded the system to use version 1.3.0, but now 
sqlContext.parquetFile doesn't work with s3n. I have tested the same code with 
1.2.1 and it works.

A simple test running in spark-shell:

val parquetFile = sqlContext.parquetFile("s3n:///test/2.parq")
java.lang.IllegalArgumentException: Wrong FS: s3n:///test/2.parq, expected: file:///

And the same test works with spark-shell under 1.2.1.

Regards,

Shuai


problems with spark-streaming-kinesis-asl and sbt assembly (different file contents found)

2015-03-16 Thread Kelly, Jonathan
I'm attempting to use the Spark Kinesis Connector, so I've added the following 
dependency in my build.sbt:

libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

My app works fine with "sbt run", but I can't seem to get "sbt assembly" to 
work without failing with "different file contents found" errors due to 
different versions of various packages getting pulled into the assembly.  This 
only occurs when I've added spark-streaming-kinesis-asl as a dependency; "sbt 
assembly" works fine otherwise.

Here are the conflicts that I see:

com.esotericsoftware.kryo:kryo:2.21
com.esotericsoftware.minlog:minlog:1.2

com.google.guava:guava:15.0
org.apache.spark:spark-network-common_2.10:1.3.0

(Note: The conflict is with javac.sh; why is this even getting included?)
org.apache.spark:spark-streaming-kinesis-asl_2.10:1.3.0
org.apache.spark:spark-streaming_2.10:1.3.0
org.apache.spark:spark-core_2.10:1.3.0
org.apache.spark:spark-network-common_2.10:1.3.0
org.apache.spark:spark-network-shuffle_2.10:1.3.0

(Note: I'm actually using my own custom-built version of Spark-1.3.0 where I've 
upgraded to v1.9.24 of the AWS Java SDK, but that has nothing to do with all of 
these conflicts, as I upgraded the dependency *because* I was getting all of 
these conflicts with the Spark 1.3.0 artifacts from the central repo.)
com.amazonaws:aws-java-sdk-s3:1.9.24
net.java.dev.jets3t:jets3t:0.9.3

commons-collections:commons-collections:3.2.1
commons-beanutils-commons-beanutils:1.7.0
commons-beanutils:commons-beanutils-core:1.8.0

commons-logging:commons-logging:1.1.3
org.slf4j:jcl-over-slf4j:1.7.10

(Note: The conflict is with a few package-info.class files, which seems really 
silly.)
org.apache.hadoop:hadoop-yarn-common:2.4.0
org.apache.hadoop:hadoop-yarn-api:2.4.0

(Note: The conflict is with org/apache/spark/unused/UnusedStubClass.class, 
which seems even more silly.)
org.apache.spark:spark-streaming-kinesis-asl_2.10:1.3.0
org.apache.spark:spark-streaming_2.10:1.3.0
org.apache.spark:spark-core_2.10:1.3.0
org.apache.spark:spark-network-common_2.10:1.3.0
org.spark-project.spark:unused:1.0.0 (?!?!?!)
org.apache.spark:spark-network-shuffle_2.10:1.3.0

I can get rid of some of the conflicts by using excludeAll() to exclude 
artifacts with organization = "org.apache.hadoop" or organization = 
"org.apache.spark" and name = "spark-streaming", and I might be able to resolve 
a few other conflicts this way, but the bottom line is that this is way more 
complicated than it should be, so either something is really broken or I'm just 
doing something wrong.

Many of these don't even make sense to me.  For example, the very first 
conflict is between classes in com.esotericsoftware.kryo:kryo:2.21 and in 
com.esotericsoftware.minlog:minlog:1.2, but the former *depends* upon the 
latter, so ???  It seems wrong to me that one package would contain different 
versions of the same classes that are included in one of its dependencies.  I 
guess it doesn't make too much difference though if I could only get my 
assembly to include/exclude the right packages.  I of course don't want any of 
the spark or hadoop dependencies included (other than 
spark-streaming-kinesis-asl itself), but I want all of 
spark-streaming-kinesis-asl's dependencies included (such as the AWS Java SDK 
and its dependencies).  That doesn't seem to be possible without what I imagine 
will become an unruly and fragile exclusion list though.
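
One thing that helps a little when untangling these reports is listing the
contents of the offending jars straight out of the local caches, e.g. (a sketch;
the cache locations will vary):

# See which artifacts actually ship the duplicated minlog classes
for jar in $(find ~/.ivy2/cache ~/.m2/repository -name 'kryo-2.21.jar' -o -name 'minlog-1.2.jar'); do
  echo "== $jar"; unzip -l "$jar" | grep -i minlog
done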

Thanks,
Jonathan


Re: problems with spark-streaming-kinesis-asl and sbt assembly (different file contents found)

2015-03-16 Thread Kelly, Jonathan
Yes, I do have the following dependencies marked as "provided":

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"

However, spark-streaming-kinesis-asl has a compile time dependency on 
spark-streaming, so I think that causes it and its dependencies to be pulled 
into the assembly.  I expected that simply excluding spark-streaming in the 
spark-streaming-kinesis-asl dependency would solve this problem, but it does 
not.  That is, this doesn't work either:

libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0" exclude("org.apache.spark", "spark-streaming")

As I mentioned originally, the following solved some but not all conflicts:

libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0" excludeAll(
  ExclusionRule(organization = "org.apache.hadoop"),
  ExclusionRule(organization = "org.apache.spark", name = "spark-streaming")
)

(Note that ExclusionRule(organization = "org.apache.spark") without the name 
attribute does not work because that apparently causes it to exclude even 
spark-streaming-kinesis-asl.)

Jonathan Kelly
Elastic MapReduce - SDE
Port 99 (SEA35) 08.220.C2

From: Tathagata Das t...@databricks.com
Date: Monday, March 16, 2015 at 12:45 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org
Subject: Re: problems with spark-streaming-kinesis-asl and sbt assembly 
(different file contents found)

If you are creating an assembly, make sure spark-streaming is marked as 
provided. spark-streaming is already part of the spark installation so will be 
present at run time. That might solve some of these, may be!?

TD

On Mon, Mar 16, 2015 at 11:30 AM, Kelly, Jonathan jonat...@amazon.com wrote:
I'm attempting to use the Spark Kinesis Connector, so I've added the following 
dependency in my build.sbt:

libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

My app works fine with "sbt run", but I can't seem to get "sbt assembly" to 
work without failing with "different file contents found" errors due to 
different versions of various packages getting pulled into the assembly.  This 
only occurs when I've added spark-streaming-kinesis-asl as a dependency; "sbt 
assembly" works fine otherwise.

Here are the conflicts that I see:

com.esotericsoftware.kryo:kryo:2.21
com.esotericsoftware.minlog:minlog:1.2

com.google.guava:guava:15.0
org.apache.spark:spark-network-common_2.10:1.3.0

(Note: The conflict is with javac.sh; why is this even getting included?)
org.apache.spark:spark-streaming-kinesis-asl_2.10:1.3.0
org.apache.spark:spark-streaming_2.10:1.3.0
org.apache.spark:spark-core_2.10:1.3.0
org.apache.spark:spark-network-common_2.10:1.3.0
org.apache.spark:spark-network-shuffle_2.10:1.3.0

(Note: I'm actually using my own custom-built version of Spark-1.3.0 where I've 
upgraded to v1.9.24 of the AWS Java SDK, but that has nothing to do with all of 
these conflicts, as I upgraded the dependency *because* I was getting all of 
these conflicts with the Spark 1.3.0 artifacts from the central repo.)
com.amazonaws:aws-java-sdk-s3:1.9.24
net.java.dev.jets3t:jets3t:0.9.3

commons-collections:commons-collections:3.2.1
commons-beanutils-commons-beanutils:1.7.0
commons-beanutils:commons-beanutils-core:1.8.0

commons-logging:commons-logging:1.1.3
org.slf4j:jcl-over-slf4j:1.7.10

(Note: The conflict is with a few package-info.class files, which seems really 
silly.)
org.apache.hadoop:hadoop-yarn-common:2.4.0
org.apache.hadoop:hadoop-yarn-api:2.4.0

(Note: The conflict is with org/apache/spark/unused/UnusedStubClass.class, 
which seems even more silly.)
org.apache.spark:spark-streaming-kinesis-asl_2.10:1.3.0
org.apache.spark:spark-streaming_2.10:1.3.0
org.apache.spark:spark-core_2.10:1.3.0
org.apache.spark:spark-network-common_2.10:1.3.0
org.spark-project.spark:unused:1.0.0 (?!?!?!)
org.apache.spark:spark-network-shuffle_2.10:1.3.0

I can get rid of some of the conflicts by using excludeAll() to exclude 
artifacts with organization = "org.apache.hadoop" or organization = 
"org.apache.spark" and name = "spark-streaming", and I might be able to resolve 
a few other conflicts this way, but the bottom line is that this is way more 
complicated than it should be, so either something is really broken or I'm just 
doing something wrong.

Many of these don't even make sense to me.  For example, the very first 
conflict is between classes in com.esotericsoftware.kryo:kryo:2.21 and in 
com.esotericsoftware.minlog:minlog:1.2, but the former *depends* upon the 
latter, so ???  It seems wrong to me that one package would contain different 
versions of the same classes

Re: problems with spark-streaming-kinesis-asl and sbt assembly (different file contents found)

2015-03-16 Thread Kelly, Jonathan
Here's build.sbt, minus blank lines for brevity, and without any of the 
exclude/excludeAll options that I've attempted:

name := "spark-sandbox"
version := "1.0"
scalaVersion := "2.10.4"
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

And here's project/plugins.sbt:

logLevel := Level.Warn
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

My app is just a simple word count app so far.

Jonathan Kelly
Elastic MapReduce - SDE
Port 99 (SEA35) 08.220.C2

From: Tathagata Das t...@databricks.com
Date: Monday, March 16, 2015 at 1:00 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org
Subject: Re: problems with spark-streaming-kinesis-asl and sbt assembly 
(different file contents found)

Can you give us your SBT project? Minus the source code, if you don't wish to 
expose it.

TD

On Mon, Mar 16, 2015 at 12:54 PM, Kelly, Jonathan jonat...@amazon.com wrote:
Yes, I do have the following dependencies marked as "provided":

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"

However, spark-streaming-kinesis-asl has a compile time dependency on 
spark-streaming, so I think that causes it and its dependencies to be pulled 
into the assembly.  I expected that simply excluding spark-streaming in the 
spark-streaming-kinesis-asl dependency would solve this problem, but it does 
not.  That is, this doesn't work either:

libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0" exclude("org.apache.spark", "spark-streaming")

As I mentioned originally, the following solved some but not all conflicts:

libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0" excludeAll(
  ExclusionRule(organization = "org.apache.hadoop"),
  ExclusionRule(organization = "org.apache.spark", name = "spark-streaming")
)

(Note that ExclusionRule(organization = "org.apache.spark") without the name 
attribute does not work because that apparently causes it to exclude even 
spark-streaming-kinesis-asl.)

Jonathan Kelly
Elastic MapReduce - SDE
Port 99 (SEA35) 08.220.C2

From: Tathagata Das t...@databricks.com
Date: Monday, March 16, 2015 at 12:45 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org
Subject: Re: problems with spark-streaming-kinesis-asl and sbt assembly 
(different file contents found)

If you are creating an assembly, make sure spark-streaming is marked as 
provided. spark-streaming is already part of the spark installation so will be 
present at run time. That might solve some of these, may be!?

TD

On Mon, Mar 16, 2015 at 11:30 AM, Kelly, Jonathan jonat...@amazon.com wrote:
I'm attempting to use the Spark Kinesis Connector, so I've added the following 
dependency in my build.sbt:

libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

My app works fine with "sbt run", but I can't seem to get "sbt assembly" to 
work without failing with "different file contents found" errors due to 
different versions of various packages getting pulled into the assembly.  This 
only occurs when I've added spark-streaming-kinesis-asl as a dependency; "sbt 
assembly" works fine otherwise.

Here are the conflicts that I see:

com.esotericsoftware.kryo:kryo:2.21
com.esotericsoftware.minlog:minlog:1.2

com.google.guava:guava:15.0
org.apache.spark:spark-network-common_2.10:1.3.0

(Note: The conflict is with javac.sh; why is this even getting included?)
org.apache.spark:spark-streaming-kinesis-asl_2.10:1.3.0
org.apache.spark:spark-streaming_2.10:1.3.0
org.apache.spark:spark-core_2.10:1.3.0
org.apache.spark:spark-network-common_2.10:1.3.0
org.apache.spark:spark-network-shuffle_2.10:1.3.0

(Note: I'm actually using my own custom-built version of Spark-1.3.0 where I've 
upgraded to v1.9.24 of the AWS Java SDK, but that has nothing to do with all of 
these conflicts, as I upgraded the dependency *because* I was getting all of 
these conflicts with the Spark 1.3.0 artifacts from the central repo.)
com.amazonaws:aws-java-sdk-s3

Re: Spark v1.2.1 failing under BigTop build in External Flume Sink (due to missing Netty library)

2015-03-05 Thread Kelly, Jonathan
That's probably a good thing to have, so I'll add it, but unfortunately it
did not help this issue.  It looks like the hadoop-2.4 profile only sets
these properties, which don't seem like they would affect anything related
to Netty:

  <properties>
    <hadoop.version>2.4.0</hadoop.version>
    <protobuf.version>2.5.0</protobuf.version>
    <jets3t.version>0.9.0</jets3t.version>
    <commons.math3.version>3.1.1</commons.math3.version>
    <avro.mapred.classifier>hadoop2</avro.mapred.classifier>
  </properties>


Thanks,
Jonathan Kelly




On 3/5/15, 1:09 PM, Patrick Wendell pwend...@gmail.com wrote:

You may need to add the -Phadoop-2.4 profile. When building or release
packages for Hadoop 2.4 we use the following flags:

-Phadoop-2.4 -Phive -Phive-thriftserver -Pyarn

- Patrick

On Thu, Mar 5, 2015 at 12:47 PM, Kelly, Jonathan jonat...@amazon.com
wrote:
 I confirmed that this has nothing to do with BigTop by running the same
mvn
 command directly in a fresh clone of the Spark package at the v1.2.1
tag.  I
 got the same exact error.


 Jonathan Kelly

 Elastic MapReduce - SDE

 Port 99 (SEA35) 08.220.C2


 From: Kelly, Jonathan Kelly jonat...@amazon.com
 Date: Thursday, March 5, 2015 at 10:39 AM
 To: user@spark.apache.org user@spark.apache.org
 Subject: Spark v1.2.1 failing under BigTop build in External Flume Sink
(due
 to missing Netty library)

 I'm running into an issue building Spark v1.2.1 (as well as the latest
in
 branch-1.2 and v1.3.0-rc2 and the latest in branch-1.3) with BigTop
(v0.9,
 which is not quite released yet).  The build fails in the External Flume
 Sink subproject with the following error:

 [INFO] Compiling 5 Scala sources and 3 Java sources to
 
/workspace/workspace/bigtop.spark-rpm/build/spark/rpm/BUILD/spark-1.3.0/e
xternal/flume-sink/target/scala-2.10/classes...
 [WARNING] Class org.jboss.netty.channel.ChannelFactory not found -
 continuing with a stub.
 [ERROR] error while loading NettyServer, class file
 
'/home/ec2-user/.m2/repository/org/apache/avro/avro-ipc/1.7.6/avro-ipc-1.
7.6.jar(org/apache/avro/ipc/NettyServer.class)'
 is broken
 (class java.lang.NullPointerException/null)
 [WARNING] one warning found
 [ERROR] one error found

 It seems like what is happening is that the Netty library is missing at
 build time, which happens because it is explicitly excluded in the
pom.xml
 (see
 
https://github.com/apache/spark/blob/v1.2.1/external/flume-sink/pom.xml#L
42).
 I attempted removing the exclusions and the explicit re-add for the test
 scope on lines 77-88, and that allowed the build to succeed, though I
don't
 know if that will cause problems at runtime.  I don't have any
experience
 with the Flume Sink, so I don't really know how to test it.  (And, to be
 clear, I'm not necessarily trying to get the Flume Sink to work-- I just
 want the project to build successfully, though of course I'd still want
the
 Flume Sink to work for whomever does need it.)

 Does anybody have any idea what's going on here?  Here is the command
BigTop
 is running to build Spark:

 mvn -Pbigtop-dist -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl
 -Divy.home=/home/ec2-user/.ivy2 -Dsbt.ivy.home=/home/ec2-user/.ivy2
 -Duser.home=/home/ec2-user -Drepo.maven.org=
 -Dreactor.repo=file:///home/ec2-user/.m2/repository
 -Dhadoop.version=2.4.0-amzn-3-SNAPSHOT
-Dyarn.version=2.4.0-amzn-3-SNAPSHOT
 -Dprotobuf.version=2.5.0 -Dscala.version=2.10.3
-Dscala.binary.version=2.10
 -DskipTests -DrecompileMode=all install

 As I mentioned above, if I switch to the latest in branch-1.2, to
 v1.3.0-rc2, or to the latest in branch-1.3, I get the same exact error.
 I
 was not getting the error with Spark v1.1.0, though there weren't any
 changes to the external/flume-sink/pom.xml between v1.1.0 and v1.2.1.


 ~ Jonathan Kelly





Re: Spark v1.2.1 failing under BigTop build in External Flume Sink (due to missing Netty library)

2015-03-05 Thread Kelly, Jonathan
I confirmed that this has nothing to do with BigTop by running the same mvn 
command directly in a fresh clone of the Spark package at the v1.2.1 tag.  I 
got the same exact error.

Jonathan Kelly
Elastic MapReduce - SDE
Port 99 (SEA35) 08.220.C2

From: Kelly, Jonathan jonat...@amazon.com
Date: Thursday, March 5, 2015 at 10:39 AM
To: user@spark.apache.org
Subject: Spark v1.2.1 failing under BigTop build in External Flume Sink (due to 
missing Netty library)

I'm running into an issue building Spark v1.2.1 (as well as the latest in 
branch-1.2 and v1.3.0-rc2 and the latest in branch-1.3) with BigTop (v0.9, 
which is not quite released yet).  The build fails in the External Flume Sink 
subproject with the following error:

[INFO] Compiling 5 Scala sources and 3 Java sources to 
/workspace/workspace/bigtop.spark-rpm/build/spark/rpm/BUILD/spark-1.3.0/external/flume-sink/target/scala-2.10/classes...
[WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing 
with a stub.
[ERROR] error while loading NettyServer, class file 
'/home/ec2-user/.m2/repository/org/apache/avro/avro-ipc/1.7.6/avro-ipc-1.7.6.jar(org/apache/avro/ipc/NettyServer.class)'
 is broken
(class java.lang.NullPointerException/null)
[WARNING] one warning found
[ERROR] one error found

It seems like what is happening is that the Netty library is missing at build 
time, which happens because it is explicitly excluded in the pom.xml (see 
https://github.com/apache/spark/blob/v1.2.1/external/flume-sink/pom.xml#L42).  
I attempted removing the exclusions and the explicit re-add for the test scope 
on lines 77-88, and that allowed the build to succeed, though I don't know if 
that will cause problems at runtime.  I don't have any experience with the 
Flume Sink, so I don't really know how to test it.  (And, to be clear, I'm not 
necessarily trying to get the Flume Sink to work-- I just want the project to 
build successfully, though of course I'd still want the Flume Sink to work for 
whomever does need it.)

Does anybody have any idea what's going on here?  Here is the command BigTop is 
running to build Spark:

mvn -Pbigtop-dist -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl 
-Divy.home=/home/ec2-user/.ivy2 -Dsbt.ivy.home=/home/ec2-user/.ivy2 
-Duser.home=/home/ec2-user -Drepo.maven.org= 
-Dreactor.repo=file:///home/ec2-user/.m2/repository 
-Dhadoop.version=2.4.0-amzn-3-SNAPSHOT -Dyarn.version=2.4.0-amzn-3-SNAPSHOT 
-Dprotobuf.version=2.5.0 -Dscala.version=2.10.3 -Dscala.binary.version=2.10 
-DskipTests -DrecompileMode=all install

As I mentioned above, if I switch to the latest in branch-1.2, to v1.3.0-rc2, 
or to the latest in branch-1.3, I get the same exact error.  I was not getting 
the error with Spark v1.1.0, though there weren't any changes to the 
external/flume-sink/pom.xml between v1.1.0 and v1.2.1.

~ Jonathan Kelly


Re: kinesis multiple records adding into stream

2015-01-16 Thread Kelly, Jonathan
Are you referring to the PutRecords method, which was added in 1.9.9?  (See 
http://aws.amazon.com/releasenotes/1369906126177804)  If so, can't you just 
depend upon this later version of the SDK in your app even though 
spark-streaming-kinesis-asl is depending upon this earlier 1.9.3 version that 
does not yet have the PutRecords call?
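
For what it's worth, a rough sketch of what that could look like in an sbt
build (the version numbers below are illustrative only; I haven't verified
that these particular versions play nicely together):

// build.sbt -- pull a newer aws-java-sdk into your own application so that
// the PutRecords API (added in 1.9.9, per the release notes above) is on the
// classpath, alongside the Kinesis receiver
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.2.0",
  "com.amazonaws" % "aws-java-sdk" % "1.9.9"
)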

Also, could you please explain your use case more fully?  
spark-streaming-kinesis-asl is for *reading* data from Kinesis in Spark, not 
for writing data, so I would expect that the part of your code that would be 
writing to Kinesis would be a totally separate app anyway (unless you are 
reading from Kinesis using spark-streaming-kinesis-asl, transforming it 
somehow, then writing it back out to Kinesis).
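
If it is the latter, the write side could simply use the plain Kinesis client
from the newer SDK and batch the records itself. Here is a minimal, untested
sketch (it assumes an SDK version that has PutRecords; the 500-record batch
size is the per-request limit for that API, and the random partition key is
just a placeholder):

import java.nio.ByteBuffer
import scala.collection.JavaConverters._
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.model.{PutRecordsRequest, PutRecordsRequestEntry}

// Send a sequence of payloads to a Kinesis stream, up to 500 records per
// PutRecords request.
def putBatches(client: AmazonKinesisClient, streamName: String,
               payloads: Seq[Array[Byte]]): Unit = {
  payloads.grouped(500).foreach { batch =>
    val entries = batch.map { bytes =>
      new PutRecordsRequestEntry()
        .withData(ByteBuffer.wrap(bytes))
        .withPartitionKey(java.util.UUID.randomUUID().toString) // placeholder key
    }
    val result = client.putRecords(
      new PutRecordsRequest()
        .withStreamName(streamName)
        .withRecords(entries.asJava))
    // A real application should check result.getFailedRecordCount and retry
    // any entries that were throttled or otherwise rejected.
  }
}

That part would live entirely in your own application, with the newer SDK on
its classpath, independent of what spark-streaming-kinesis-asl uses internally.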

~ Jonathan

From: Aniket Bhatnagar aniket.bhatna...@gmail.com
Date: Friday, January 16, 2015 at 9:13 AM
To: Hafiz Mujadid hafizmujadi...@gmail.com, user@spark.apache.org
Subject: Re: kinesis multiple records adding into stream


Sorry, I couldn't understand the issue. Are you trying to send data to Kinesis
from a Spark batch/real-time job?

- Aniket

On Fri, Jan 16, 2015, 9:40 PM Hafiz Mujadid hafizmujadi...@gmail.com wrote:
Hi Experts!

I am using the Kinesis dependency as follows:
groupId = org.apache.spark
artifactId = spark-streaming-kinesis-asl_2.10
version = 1.2.0

This pulls in AWS SDK version 1.8.3, and in that SDK multiple records
cannot be put in a single request. Is it possible to put multiple records
in a single request?


thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/kinesis-multiple-records-adding-into-stream-tp21191.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

2014-11-27 Thread Kelly, Jonathan
Yeah, only a few hours after I sent my message I saw some correspondence on 
this other thread: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-complex-types-like-map-lt-string-map-lt-string-int-gt-gt-in-spark-sql-td19603.html,
 which is the exact same issue.  Glad to find that this should be fixed in 
1.2.0!  I'll give that a try later.
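
(In the meantime, for anyone stuck on 1.1.x, the explicit-schema workaround
from my earlier message below looks roughly like this in the spark-shell --
just a sketch, using the toy test.json example:)

scala> import org.apache.spark.sql.catalyst.types._
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> // declare containsNull = true so the schema matches what Hive records for the table
scala> val schema = StructType(Seq(StructField("values", ArrayType(IntegerType, containsNull = true), nullable = true)))
scala> val test = sqlContext.jsonFile("test.json", schema)
scala> test.saveAsTable("test")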

Thanks a lot,
Jonathan

From: Yin Huai huaiyin@gmail.com
Date: Thursday, November 27, 2014 at 4:37 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org
Subject: Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded 
from a JSON file using schema auto-detection

Hello Jonathan,

There was a bug regarding casting data types before inserting into a Hive
table. Hive does not have the notion of containsNull for array values, so
for a Hive table, containsNull will always be true for an array, and we
should ignore this field for Hive. This issue has been fixed by
https://issues.apache.org/jira/browse/SPARK-4245, which will be released with
1.2.

Thanks,

Yin

On Wed, Nov 26, 2014 at 9:01 PM, Kelly, Jonathan jonat...@amazon.com wrote:
After playing around with this a little more, I discovered that:

1. If test.json contains something like {"values":[null,1,2,3]}, the
schema auto-determined by SchemaRDD.jsonFile() will have element: integer
(containsNull = true), and then
SchemaRDD.saveAsTable()/SchemaRDD.insertInto() will work (which of course
makes sense but doesn't really help).
2. If I specify the schema myself (e.g., sqlContext.jsonFile("test.json",
StructType(Seq(StructField("values", ArrayType(IntegerType, true), true))))),
that also makes SchemaRDD.saveAsTable()/SchemaRDD.insertInto()
work, though as I mentioned before, this is less than ideal.

Why don't saveAsTable/insertInto work when the containsNull properties
don't match?  I can understand how inserting data with containsNull=true
into a column where containsNull=false might fail, but I think the other
way around (which is the case here) should work.

~ Jonathan


On 11/26/14, 5:23 PM, Kelly, Jonathan jonat...@amazon.com wrote:

I've noticed some strange behavior when I try to use
SchemaRDD.saveAsTable() with a SchemaRDD that I've loaded from a JSON file
that contains elements with nested arrays.  For example, with a file
test.json that contains the single line:

   {"values":[1,2,3]}

and with code like the following:

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val test = sqlContext.jsonFile("test.json")
scala> test.saveAsTable("test")

it creates the table but fails when inserting the data into it.  Here's
the exception:

scala.MatchError: ArrayType(IntegerType,true) (of class org.apache.spark.sql.catalyst.types.ArrayType)
    at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
    at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
    at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
    at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
    at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
    at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    at org.apache.spark.scheduler.Task.run(Task.scala:54)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

I'm guessing that this is due to the slight difference in the schemas of
these tables:

scala> test.printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = false)


scala> sqlContext.table("test").printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = true)

If I reload the file using the schema that was created for the Hive table

SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

2014-11-26 Thread Kelly, Jonathan
I've noticed some strange behavior when I try to use
SchemaRDD.saveAsTable() with a SchemaRDD that I've loaded from a JSON file
that contains elements with nested arrays.  For example, with a file
test.json that contains the single line:

{"values":[1,2,3]}

and with code like the following:

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val test = sqlContext.jsonFile("test.json")
scala> test.saveAsTable("test")

it creates the table but fails when inserting the data into it.  Here's
the exception:

scala.MatchError: ArrayType(IntegerType,true) (of class org.apache.spark.sql.catalyst.types.ArrayType)
    at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
    at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
    at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
    at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
    at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
    at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    at org.apache.spark.scheduler.Task.run(Task.scala:54)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

I'm guessing that this is due to the slight difference in the schemas of
these tables:

scala> test.printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = false)


scala> sqlContext.table("test").printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = true)

If I reload the file using the schema that was created for the Hive table
then try inserting the data into the table, it works:

scala> sqlContext.jsonFile("file:///home/hadoop/test.json",
sqlContext.table("test").schema).insertInto("test")
scala> sqlContext.sql("select * from test").collect().foreach(println)
[ArrayBuffer(1, 2, 3)]

Does this mean that there is a bug with how the schema is being
automatically determined when you use HiveContext.jsonFile() for JSON
files that contain nested arrays?  (i.e., should containsNull be true for
the array elements?)  Or is there a bug with how the Hive table is created
from the SchemaRDD?  (i.e., should containsNull in fact be false?)  I can
probably get around this by defining the schema myself rather than using
auto-detection, but for now I'd like to use auto-detection.

By the way, I'm using Spark 1.1.0.

Thanks,
Jonathan


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

2014-11-26 Thread Kelly, Jonathan
After playing around with this a little more, I discovered that:

1. If test.json contains something like {"values":[null,1,2,3]}, the
schema auto-determined by SchemaRDD.jsonFile() will have element: integer
(containsNull = true), and then
SchemaRDD.saveAsTable()/SchemaRDD.insertInto() will work (which of course
makes sense but doesn't really help).
2. If I specify the schema myself (e.g., sqlContext.jsonFile("test.json",
StructType(Seq(StructField("values", ArrayType(IntegerType, true), true))))),
that also makes SchemaRDD.saveAsTable()/SchemaRDD.insertInto()
work, though as I mentioned before, this is less than ideal.

Why don't saveAsTable/insertInto work when the containsNull properties
don't match?  I can understand how inserting data with containsNull=true
into a column where containsNull=false might fail, but I think the other
way around (which is the case here) should work.
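
(For what it's worth, the two schemas differ only in that containsNull flag,
and comparing them directly in the shell -- e.g. with something like the line
below -- shows they are not considered equal:)

scala> test.schema == sqlContext.table("test").schema   // false: only containsNull differs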

~ Jonathan


On 11/26/14, 5:23 PM, Kelly, Jonathan jonat...@amazon.com wrote:

I've noticed some strange behavior when I try to use
SchemaRDD.saveAsTable() with a SchemaRDD that I've loaded from a JSON file
that contains elements with nested arrays.  For example, with a file
test.json that contains the single line:

   {"values":[1,2,3]}

and with code like the following:

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val test = sqlContext.jsonFile("test.json")
scala> test.saveAsTable("test")

it creates the table but fails when inserting the data into it.  Here's
the exception:

scala.MatchError: ArrayType(IntegerType,true) (of class org.apache.spark.sql.catalyst.types.ArrayType)
    at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
    at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
    at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
    at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
    at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
    at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    at org.apache.spark.scheduler.Task.run(Task.scala:54)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

I'm guessing that this is due to the slight difference in the schemas of
these tables:

scala> test.printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = false)


scala> sqlContext.table("test").printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = true)

If I reload the file using the schema that was created for the Hive table
then try inserting the data into the table, it works:

scala> sqlContext.jsonFile("file:///home/hadoop/test.json",
sqlContext.table("test").schema).insertInto("test")
scala> sqlContext.sql("select * from test").collect().foreach(println)
[ArrayBuffer(1, 2, 3)]

Does this mean that there is a bug with how the schema is being
automatically determined when you use HiveContext.jsonFile() for JSON
files that contain nested arrays?  (i.e., should containsNull be true for
the array elements?)  Or is there a bug with how the Hive table is created
from the SchemaRDD?  (i.e., should containsNull in fact be false?)  I can
probably get around this by defining the schema myself rather than using
auto-detection, but for now I'd like to use auto-detection.

By the way, I'm using Spark 1.1.0.

Thanks,
Jonathan



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org