Re: How to access Spark UI through AWS
I'm not sure why the UI appears broken like that either and haven't investigated it myself yet, but if you instead go to the YARN ResourceManager UI (port 8088 if you are using emr-4.x; port 9026 for 3.x, I believe), you should be able to click on the ApplicationMaster link (or the History link for completed applications) to get to the Spark UI from there. The ApplicationMaster link will use the YARN Proxy Service (port 20888 on emr-4.x; not sure about 3.x) to proxy through to the Spark application's UI, regardless of what port it's running on. For completed applications, the History link will send you directly to the Spark History Server UI on port 18080. Hope that helps!

~ Jonathan

On 8/24/15, 10:51 PM, Justin Pihony justin.pih...@gmail.com wrote:

I am using the steps from this article (https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923) to get Spark up and running on EMR through YARN. Once it's up and running, I ssh in, cd to the Spark bin directory, and run spark-shell --master yarn. Once this spins up, I can see that the UI is started on port 4040 of the internal IP. If I hit the public DNS on port 4040 with dynamic port tunneling and FoxyProxy, I get a crude UI (the CSS seems broken), but the proxy continuously redirects me to the main page, so I cannot drill into anything. So I tried static tunneling, but I can't seem to get through. How can I access the Spark UI when running a Spark shell in AWS on YARN?
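For reference, the route described above can be reached over a standard SSH dynamic tunnel. This is an illustrative sketch, not verbatim from the thread: the key file, hostnames, and local port are placeholders, and it assumes an emr-4.x cluster (whose master accepts SSH as user "hadoop"):

    # open a SOCKS tunnel to the master node
    ssh -i ~/mykey.pem -N -D 8157 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

    # with a browser proxy (e.g., FoxyProxy) pointed at localhost:8157, browse to:
    #   http://<master-private-dns>:8088/   (YARN ResourceManager UI on emr-4.x)
    # then follow the ApplicationMaster link (running apps) or the History link
    # (completed apps); the former proxies to the Spark UI via port 20888, the
    # latter goes to the Spark History Server on port 18080.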
Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf
Would there be any problem in having spark.executor.instances (or --num-executors) be completely ignored (i.e., even for non-zero values) if spark.dynamicAllocation.enabled is true, rather than throwing an exception? I can see how the exception would be helpful if, say, you tried to pass both -c spark.executor.instances (or --num-executors) *and* -c spark.dynamicAllocation.enabled=true to spark-submit on the command line (as opposed to having one of them in spark-defaults.conf and one of them in the spark-submit args), but currently there doesn't seem to be any way to distinguish between arguments that were actually passed to spark-submit and settings that simply came from spark-defaults.conf.

If there were a way to distinguish them, I think the ideal behavior would be for the validation exception to be thrown only if spark.executor.instances and spark.dynamicAllocation.enabled=true were both passed via spark-submit args or were both present in spark-defaults.conf; passing spark.dynamicAllocation.enabled=true to spark-submit would take precedence over spark.executor.instances configured in spark-defaults.conf, and vice versa.

Jonathan Kelly
Elastic MapReduce - SDE
Blackfoot (SEA33) 06.850.F0

From: Jonathan Kelly jonat...@amazon.com
Date: Tuesday, July 14, 2015 at 4:23 PM
To: user@spark.apache.org
Subject: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

I've set up my cluster with a pre-calculated value for spark.executor.instances in spark-defaults.conf such that I can run a job and have it maximize the utilization of the cluster resources by default. However, if I want to run a job with dynamicAllocation (by passing -c spark.dynamicAllocation.enabled=true to spark-submit), I get this exception:

    Exception in thread "main" java.lang.IllegalArgumentException: Explicitly setting the number of executors is not compatible with spark.dynamicAllocation.enabled!
        at org.apache.spark.deploy.yarn.ClientArguments.parseArgs(ClientArguments.scala:192)
        at org.apache.spark.deploy.yarn.ClientArguments.<init>(ClientArguments.scala:59)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:54)
        ...

The exception makes sense, of course, but ideally I would like it to ignore what I've put in spark-defaults.conf for spark.executor.instances if I've enabled dynamicAllocation. The most annoying thing about this is that if I have spark.executor.instances present in spark-defaults.conf, I cannot figure out any way to spark-submit a job with spark.dynamicAllocation.enabled=true without getting this error. That is, even if I pass -c spark.executor.instances=0 -c spark.dynamicAllocation.enabled=true, I still get this error, because the validation in ClientArguments.parseArgs() that checks for this condition simply checks for the presence of spark.executor.instances rather than whether its value is 0.

Should the check be changed to allow spark.executor.instances to be set to 0 if spark.dynamicAllocation.enabled is true? That would be an OK compromise, but I'd really prefer to be able to enable dynamicAllocation simply by setting spark.dynamicAllocation.enabled=true rather than by also having to set spark.executor.instances to 0.

Thanks,
Jonathan
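A minimal sketch of the relaxed check proposed above, treating spark.executor.instances=0 the same as "not set" (hypothetical code, not the actual implementation in ClientArguments):

    import org.apache.spark.SparkConf

    val sparkConf = new SparkConf()  // picks up spark-defaults.conf and -c/--conf args

    val dynamicAllocationEnabled =
      sparkConf.getBoolean("spark.dynamicAllocation.enabled", defaultValue = false)
    // Treat an explicit 0 the same as "unset" so that a value in
    // spark-defaults.conf can be neutralized from the command line.
    val numExecutors = sparkConf.getInt("spark.executor.instances", defaultValue = 0)
    if (dynamicAllocationEnabled && numExecutors > 0) {
      throw new IllegalArgumentException(
        "Explicitly setting the number of executors is not compatible with " +
        "spark.dynamicAllocation.enabled!")
    }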
Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf
bump

From: Jonathan Kelly jonat...@amazon.com
Date: Tuesday, July 14, 2015 at 4:23 PM
To: user@spark.apache.org
Subject: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf
Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf
I've set up my cluster with a pre-calculated value for spark.executor.instances in spark-defaults.conf such that I can run a job and have it maximize the utilization of the cluster resources by default. However, if I want to run a job with dynamicAllocation (by passing -c spark.dynamicAllocation.enabled=true to spark-submit), I get this exception:

    Exception in thread "main" java.lang.IllegalArgumentException: Explicitly setting the number of executors is not compatible with spark.dynamicAllocation.enabled!
        at org.apache.spark.deploy.yarn.ClientArguments.parseArgs(ClientArguments.scala:192)
        at org.apache.spark.deploy.yarn.ClientArguments.<init>(ClientArguments.scala:59)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:54)
        ...

The exception makes sense, of course, but ideally I would like it to ignore what I've put in spark-defaults.conf for spark.executor.instances if I've enabled dynamicAllocation. The most annoying thing about this is that if I have spark.executor.instances present in spark-defaults.conf, I cannot figure out any way to spark-submit a job with spark.dynamicAllocation.enabled=true without getting this error. That is, even if I pass -c spark.executor.instances=0 -c spark.dynamicAllocation.enabled=true, I still get this error, because the validation in ClientArguments.parseArgs() that checks for this condition simply checks for the presence of spark.executor.instances rather than whether its value is 0.

Should the check be changed to allow spark.executor.instances to be set to 0 if spark.dynamicAllocation.enabled is true? That would be an OK compromise, but I'd really prefer to be able to enable dynamicAllocation simply by setting spark.dynamicAllocation.enabled=true rather than by also having to set spark.executor.instances to 0.

Thanks,
Jonathan
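To make the failure mode concrete, here is an illustrative reproduction (the executor count, application class, and jar name are made-up placeholders; shown with spark-submit's --conf option, which the messages above abbreviate as -c):

    # spark-defaults.conf
    spark.executor.instances    10

    # submitting with dynamic allocation enabled then fails up front:
    spark-submit --conf spark.dynamicAllocation.enabled=true \
        --class com.example.MyApp myapp.jar
    # -> java.lang.IllegalArgumentException: Explicitly setting the number of
    #    executors is not compatible with spark.dynamicAllocation.enabled!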
Re: Spark ec2 cluster lost worker
Yeah, sorry, I didn't really answer your question due to my bias for EMR. =P Unfortunately, also due to my bias, I have not actually tried using a straight EC2 cluster as opposed to Spark on EMR. I'm not sure how the slave nodes get set up when running Spark on EC2, but I would imagine that it must set them up with SSH keys so that the master can connect to them without a password. So are you able to SSH to the master node and then, from there, SSH to the slave node? If so, maybe you can find out what's going on with that slave node and restart the Spark worker (see the sketch after this message).

~ Jonathan Kelly

From: Anny Chen anny9...@gmail.com
Date: Wednesday, June 24, 2015 at 6:15 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org
Subject: Re: Spark ec2 cluster lost worker

Hi Jonathan,

Thanks for this information! I will take a look into it. However, is there a way to reconnect the lost node, or is there nothing I can do to get the lost worker back?

Thanks!
Anny

On Wed, Jun 24, 2015 at 6:06 PM, Kelly, Jonathan jonat...@amazon.com wrote:

Just curious, would you be able to use Spark on EMR rather than on EC2? Spark on EMR will handle lost nodes for you, and it will let you scale your cluster up and down or clone a cluster (its config, that is, not the data stored in HDFS), among other things. We also recently announced official support for Spark on EMR: http://aws.amazon.com/emr/spark

~ Jonathan Kelly (from Amazon AWS EMR)

On 6/24/15, 5:58 PM, anny9699 anny9...@gmail.com wrote:

Hi,

According to the Spark UI, one worker is lost after a failed job. It is not a "lost executor" error; rather, the UI now only shows 8 workers (I have 9). However, the EC2 console shows the machine is running with no check alarms, so I am confused about how I could reconnect the lost machine in AWS EC2. I met this problem before, and my solution then was to rebuild a new cluster; but now it is a little hard to rebuild the cluster, so I am wondering if there's some way to get the lost machine back.

Thanks a lot!
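If you can SSH to the slave, one way to bring the worker back by hand is to relaunch the Worker process pointed at the master. This is a sketch under assumptions: a spark-ec2 style layout under /root/spark, and the standalone master listening on port 7077 (the exact path and URL depend on how the cluster was set up):

    # on the affected slave node:
    /root/spark/bin/spark-class org.apache.spark.deploy.worker.Worker \
        spark://<master-private-ip>:7077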
Re: Spark ec2 cluster lost worker
Just curious, would you be able to use Spark on EMR rather than on EC2? Spark on EMR will handle lost nodes for you, and it will let you scale your cluster up and down or clone a cluster (its config, that is, not the data stored in HDFS), among other things. We also recently announced official support for Spark on EMR: http://aws.amazon.com/emr/spark

~ Jonathan Kelly (from Amazon AWS EMR)

On 6/24/15, 5:58 PM, anny9699 anny9...@gmail.com wrote:

Hi,

According to the Spark UI, one worker is lost after a failed job. It is not a "lost executor" error; rather, the UI now only shows 8 workers (I have 9). However, the EC2 console shows the machine is running with no check alarms, so I am confused about how I could reconnect the lost machine in AWS EC2. I met this problem before, and my solution then was to rebuild a new cluster; but now it is a little hard to rebuild the cluster, so I am wondering if there's some way to get the lost machine back.

Thanks a lot!
Re: [ERROR] Insufficient Space
Would you be able to use Spark on EMR rather than on EC2? EMR clusters are easy to resize, and EMR also now supports Spark 1.3.1 as of EMR AMI 3.8.0. See http://aws.amazon.com/emr/spark

~ Jonathan

From: Vadim Bichutskiy vadim.bichuts...@gmail.com
Date: Friday, June 19, 2015 at 7:15 AM
To: user@spark.apache.org
Subject: [ERROR] Insufficient Space

Hello Spark Experts,

I've been running a standalone Spark cluster on EC2 for a few months now, and today I get this error:

    IOError: [Errno 28] No space left on device
    Spark assembly has been built with Hive, including Datanucleus jars on classpath
    OpenJDK 64-Bit Server VM warning: Insufficient space for shared memory file

I guess I need to resize the cluster. What's the best way to do that?

Thanks,
Vadim
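Before resizing, it can also be worth checking which volume is actually full, since shuffle scratch space is a common culprit for this error. A quick check, assuming a spark-ec2 style cluster where spark.local.dir points at /mnt/spark (adjust the path for your setup):

    df -h                 # which filesystem is at 100%?
    du -sh /mnt/spark/*   # shuffle/scratch files under spark.local.dir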
Re: Spark on EMR
Yes, for now it is a wrapper around the old install-spark BA, but that will change soon. The currently supported version in AMI 3.8.0 is 1.3.1, as 1.4.0 was released too late to be included in AMI 3.8.0. Spark 1.4.0 support is coming soon, though, of course. Unfortunately, although install-spark is currently being used under the hood, passing -v,1.4.0 in the options is not supported.

Sent from Nine (http://www.9folders.com/)

From: Eugen Cepoi cepoi.eu...@gmail.com
Sent: Jun 17, 2015 6:37 AM
To: Hideyoshi Maeda
Cc: ayan guha; kamatsuoka; user
Subject: Re: Spark on EMR

It looks like it is a wrapper around https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark, so basically adding the option -v,1.4.0.a should work. See https://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-configure.html

2015-06-17 15:32 GMT+02:00 Hideyoshi Maeda hideyoshi.ma...@gmail.com:

Any ideas what version of Spark is underneath? i.e., is it 1.4? And is SparkR supported on Amazon EMR?

On Wed, Jun 17, 2015 at 12:06 AM, ayan guha guha.a...@gmail.com wrote:

That's great news. Can I assume Spark on EMR supports a Kinesis-to-HBase pipeline?

On 17 Jun 2015 05:29, kamatsuoka ken...@gmail.com wrote:

Spark is now officially supported on Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/details/spark/
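For context, the -v flag being discussed belongs to the standalone install-spark bootstrap action from the awslabs repository linked above (as opposed to the official EMR Spark option, where it is not supported). An illustrative invocation of the standalone BA; the cluster parameters are placeholders, and the S3 path is the one historically documented in that repository:

    aws emr create-cluster --name "SparkCluster" --ami-version 3.8.0 \
        --instance-type m3.xlarge --instance-count 3 \
        --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args=[-v,1.3.1]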
Re: Spark + Kinesis
spark-streaming-kinesis-asl is not part of the Spark distribution on your cluster, so you cannot have it be just a "provided" dependency. This is also why the KCL and its dependencies were not included in the assembly (but yes, they should be).

~ Jonathan Kelly

From: Vadim Bichutskiy vadim.bichuts...@gmail.com
Date: Friday, April 3, 2015 at 12:26 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org
Subject: Re: Spark + Kinesis

Hi all,

Good news! I was able to create a Kinesis consumer and assemble it into an uber jar following http://spark.apache.org/docs/latest/streaming-kinesis-integration.html and the example at https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala. However, when I try to spark-submit it, I get the following exception:

    Exception in thread "main" java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider

Do I need to include the KCL dependency in build.sbt? Here's what it looks like currently:

    import AssemblyKeys._

    name := "Kinesis Consumer"
    version := "1.0"
    organization := "com.myconsumer"
    scalaVersion := "2.11.5"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0" % "provided"

    assemblySettings
    jarName in assembly := "consumer-assembly.jar"
    assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

Any help appreciated.

Thanks,
Vadim

On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan jonat...@amazon.com wrote:

It looks like you're attempting to mix Scala versions, so that's going to cause some problems. If you really want to use Scala 2.11.5, you must also use Spark package versions built for Scala 2.11 rather than 2.10. Anyway, that's not quite the correct way to specify Scala dependencies in build.sbt. Instead of placing the Scala version after the artifactId (like "spark-core_2.10"), what you actually want is to use just "spark-core" with two percent signs before it. Using two percent signs will make it use the version of Scala that matches your declared scalaVersion. For example:

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

I think that may get you a little closer, though I think you're probably going to run into the same problems I ran into in this thread: https://www.mail-archive.com/user@spark.apache.org/msg23891.html. I never really got an answer for that, and I temporarily moved on to other things for now.

~ Jonathan Kelly

From: 'Vadim Bichutskiy' vadim.bichuts...@gmail.com
Date: Thursday, April 2, 2015 at 9:53 AM
To: user@spark.apache.org
Subject: Spark + Kinesis

Hi all,

I am trying to write an Amazon Kinesis consumer Scala app that processes data in the Kinesis stream. Is this the correct way to specify build.sbt:

    import AssemblyKeys._

    name := "Kinesis Consumer"
    version := "1.0"
    organization := "com.myconsumer"
    scalaVersion := "2.11.5"

    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
      "org.apache.spark" % "spark-streaming_2.10" % "1.3.0",
      "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")

    assemblySettings
    jarName in assembly := "consumer-assembly.jar"
    assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

In project/assembly.sbt I have only the following line:

    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

I am using sbt 0.13.7. I adapted Example 7.7 in the Learning Spark book.

Thanks,
Vadim
Re: Spark + Kinesis
Just remove "provided" from the end of the line where you specify the spark-streaming-kinesis-asl dependency. That will cause that package and all of its transitive dependencies (including the KCL, the AWS Java SDK libraries, and other transitive dependencies) to be included in your uber jar. They all must be in there because they are not part of the Spark distribution on your cluster. The corrected dependency block is shown after this message. However, as I mentioned before, I think making this change might cause you to run into the same problems I spoke of in the thread I linked below (https://www.mail-archive.com/user@spark.apache.org/msg23891.html), and unfortunately I haven't solved that yet.

~ Jonathan Kelly

From: Vadim Bichutskiy vadim.bichuts...@gmail.com
Date: Friday, April 3, 2015 at 12:45 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org
Subject: Re: Spark + Kinesis

Thanks. So how do I fix it?

On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan jonat...@amazon.com wrote:

spark-streaming-kinesis-asl is not part of the Spark distribution on your cluster, so you cannot have it be just a "provided" dependency. This is also why the KCL and its dependencies were not included in the assembly (but yes, they should be).

~ Jonathan Kelly
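Concretely, the fix described above changes the dependency section of build.sbt to the following (versions as quoted in this thread):

    // spark-core and spark-streaming are in the cluster's Spark distribution,
    // so they stay "provided"; the Kinesis connector is not, so it must be
    // bundled into the assembly along with its transitive dependencies
    // (the KCL, the AWS Java SDK, ...).
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"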
Re: Spark + Kinesis
It looks like you're attempting to mix Scala versions, so that's going to cause some problems. If you really want to use Scala 2.11.5, you must also use Spark package versions built for Scala 2.11 rather than 2.10.

Anyway, that's not quite the correct way to specify Scala dependencies in build.sbt. Instead of placing the Scala version after the artifactId (like "spark-core_2.10"), what you actually want is to use just "spark-core" with two percent signs before it. Using two percent signs will make it use the version of Scala that matches your declared scalaVersion. For example:

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

I think that may get you a little closer, though I think you're probably going to run into the same problems I ran into in this thread: https://www.mail-archive.com/user@spark.apache.org/msg23891.html. I never really got an answer for that, and I temporarily moved on to other things for now.

~ Jonathan Kelly

From: 'Vadim Bichutskiy' vadim.bichuts...@gmail.com
Date: Thursday, April 2, 2015 at 9:53 AM
To: user@spark.apache.org
Subject: Spark + Kinesis

Hi all,

I am trying to write an Amazon Kinesis consumer Scala app that processes data in the Kinesis stream. Is this the correct way to specify build.sbt:

    import AssemblyKeys._

    name := "Kinesis Consumer"
    version := "1.0"
    organization := "com.myconsumer"
    scalaVersion := "2.11.5"

    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
      "org.apache.spark" % "spark-streaming_2.10" % "1.3.0",
      "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")

    assemblySettings
    jarName in assembly := "consumer-assembly.jar"
    assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

In project/assembly.sbt I have only the following line:

    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

I am using sbt 0.13.7. I adapted Example 7.7 in the Learning Spark book.

Thanks,
Vadim
Spark and OpenJDK - jar: No such file or directory
I'm trying to use OpenJDK 7 with Spark 1.3.0 and noticed that the compute-classpath.sh script is not adding the datanucleus jars to the classpath, because compute-classpath.sh assumes it will find the jar command at $JAVA_HOME/bin/jar, which does not exist for OpenJDK. Is this an issue anybody else has run into? Would it be possible to use the unzip command instead? The fact that $JAVA_HOME/bin/jar is missing also breaks the check that ensures that Spark was built with a version of Java compatible with the one being used to launch Spark. The unzip tool of course wouldn't work for that check, but there's probably another easy alternative to $JAVA_HOME/bin/jar.

~ Jonathan Kelly
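To illustrate the substitution being asked about: for the purpose of listing a jar's entries, unzip can stand in for the JDK's jar tool (a jar file is a zip archive), e.g.:

    # listing entries with the JDK jar tool:
    $JAVA_HOME/bin/jar -tf some.jar
    # roughly equivalent listing without a JDK installed:
    unzip -l some.jar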
Re: When will 1.3.1 release?
Are you referring to SPARK-6330 (https://issues.apache.org/jira/browse/SPARK-6330)? If you are able to build Spark from source yourself, I believe you should just need to cherry-pick the following commits in order to backport the fix:

    67fa6d1f830dee37244b5a30684d797093c7c134  [SPARK-6330] Fix filesystem bug in newParquet relation
    9d88f0cbdb3e994c3ab62eb7534b0b5308cc5265  [SPARK-6330] [SQL] Add a test case for SPARK-6330

(And if it's a different bug you're talking about, you could probably find the commit(s) that fix that bug and backport them instead.)

Jonathan Kelly
Elastic MapReduce - SDE
Port 99 (SEA35) 08.220.C2

From: Shuai Zheng szheng.c...@gmail.com
Date: Monday, March 30, 2015 at 12:34 PM
To: user@spark.apache.org
Subject: When will 1.3.1 release?

Hi All,

I am waiting for Spark 1.3.1 to fix the bug that prevents me from working with the S3 file system. Does anyone know the release date for 1.3.1? I can't downgrade to 1.2.1 because there is a jar compatibility issue when working with the AWS SDK.

Regards,
Shuai
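The cherry-pick described above might look like this (a sketch; it assumes a local clone of the Spark repository, a branch based on the v1.3.0 tag, and that the picks apply cleanly):

    git checkout -b v1.3.0-s3-fix v1.3.0
    git cherry-pick 67fa6d1f830dee37244b5a30684d797093c7c134
    git cherry-pick 9d88f0cbdb3e994c3ab62eb7534b0b5308cc5265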
Re: Spark and OpenJDK - jar: No such file or directory
Ah, never mind, I found the jar command in the java-1.7.0-openjdk-devel package; I only had java-1.7.0-openjdk installed. Looks like I just need to install java-1.7.0-openjdk-devel and then set JAVA_HOME to /usr/lib/jvm/java instead of /usr/lib/jvm/jre.

~ Jonathan Kelly

From: Kelly, Jonathan jonat...@amazon.com
Date: Monday, March 30, 2015 at 1:03 PM
To: user@spark.apache.org
Subject: Spark and OpenJDK - jar: No such file or directory

I'm trying to use OpenJDK 7 with Spark 1.3.0 and noticed that the compute-classpath.sh script is not adding the datanucleus jars to the classpath, because compute-classpath.sh assumes it will find the jar command at $JAVA_HOME/bin/jar, which does not exist for OpenJDK. Is this an issue anybody else has run into? Would it be possible to use the unzip command instead? The fact that $JAVA_HOME/bin/jar is missing also breaks the check that ensures that Spark was built with a version of Java compatible with the one being used to launch Spark. The unzip tool of course wouldn't work for that check, but there's probably another easy alternative to $JAVA_HOME/bin/jar.

~ Jonathan Kelly
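On a yum-based distribution, the resolution above amounts to (illustrative; the package and path names are the ones from the message, which match Amazon Linux / RHEL-style OpenJDK layouts):

    sudo yum install -y java-1.7.0-openjdk-devel
    export JAVA_HOME=/usr/lib/jvm/java   # not /usr/lib/jvm/jre, which lacks bin/jar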
Using Spark with a SOCKS proxy
I'm trying to figure out how I might be able to use Spark with a SOCKS proxy. That is, my dream is to be able to write code in my IDE, then run it without much trouble on a remote cluster that is accessible only via a SOCKS proxy between the local development machine and the master node of the cluster (ignoring, for now, any dependencies that would need to be transferred--assume it's a very simple app with no dependencies that aren't part of the Spark classpath on the cluster).

This is possible with Hadoop by setting hadoop.rpc.socket.factory.class.default to org.apache.hadoop.net.SocksSocketFactory and hadoop.socks.server to localhost:<port>, where <port> is a port on which a SOCKS proxy has been opened via ssh -D to the master node. However, I can't seem to find anything like this for Spark, and I only see very few mentions of it on the user list and on Stack Overflow, with no real answers. (See links below.)

I thought I might be able to use the JVM's -DsocksProxyHost and -DsocksProxyPort system properties, but it still does not seem to work. That is, if I start a SOCKS proxy to my master node using something like:

    ssh -D 2600 <master node public name>

then run a simple Spark app that calls SparkConf.setMaster("spark://<master node private IP>:7077"), passing in JVM args of -DsocksProxyHost=localhost -DsocksProxyPort=2600, the driver hangs for a while before finally giving up ("Application has been killed. Reason: All masters are unresponsive! Giving up."). It seems like it is not even attempting to use the SOCKS proxy. Do -DsocksProxyHost/-DsocksProxyPort not even work for Spark?

http://stackoverflow.com/questions/28047000/connect-to-spark-through-a-socks-proxy (unanswered similar question from somebody else about a month ago)
https://issues.apache.org/jira/browse/SPARK-5004 (unresolved, somewhat related JIRA from a few months ago)

Thanks,
Jonathan
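For reference, the Hadoop-side configuration mentioned above looks like this when written out (property names as given in the message; the port matches the ssh -D example and is otherwise a placeholder):

    <!-- core-site.xml on the client machine -->
    <property>
      <name>hadoop.rpc.socket.factory.class.default</name>
      <value>org.apache.hadoop.net.SocksSocketFactory</value>
    </property>
    <property>
      <name>hadoop.socks.server</name>
      <value>localhost:2600</value>
    </property>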
Re: sqlContext.parquetFile doesn't work with s3n in version 1.3.0
See https://issues.apache.org/jira/browse/SPARK-6351

~ Jonathan

From: Shuai Zheng szheng.c...@gmail.com
Date: Monday, March 16, 2015 at 11:46 AM
To: user@spark.apache.org
Subject: sqlContext.parquetFile doesn't work with s3n in version 1.3.0

Hi All,

I just upgraded the system to use version 1.3.0, but now sqlContext.parquetFile doesn't work with s3n. I have tested the same code with 1.2.1, and it works. A simple test running in spark-shell:

    val parquetFile = sqlContext.parquetFile("s3n:///test/2.parq")

    java.lang.IllegalArgumentException: Wrong FS: s3n:///test/2.parq, expected: file:///

The same test works with spark-shell under 1.2.1.

Regards,
Shuai
problems with spark-streaming-kinesis-asl and sbt assembly (different file contents found)
I'm attempting to use the Spark Kinesis Connector, so I've added the following dependency in my build.sbt:

    libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

My app works fine with sbt run, but I can't seem to get sbt assembly to work without failing with "different file contents found" errors due to different versions of various packages getting pulled into the assembly. This only occurs when I've added spark-streaming-kinesis-asl as a dependency; sbt assembly works fine otherwise. Here are the conflicts that I see:

    com.esotericsoftware.kryo:kryo:2.21
    com.esotericsoftware.minlog:minlog:1.2

    com.google.guava:guava:15.0
    org.apache.spark:spark-network-common_2.10:1.3.0
    (Note: The conflict is with javac.sh; why is this even getting included?)

    org.apache.spark:spark-streaming-kinesis-asl_2.10:1.3.0
    org.apache.spark:spark-streaming_2.10:1.3.0
    org.apache.spark:spark-core_2.10:1.3.0
    org.apache.spark:spark-network-common_2.10:1.3.0
    org.apache.spark:spark-network-shuffle_2.10:1.3.0
    (Note: I'm actually using my own custom-built version of Spark 1.3.0 where I've upgraded to v1.9.24 of the AWS Java SDK, but that has nothing to do with all of these conflicts, as I upgraded the dependency *because* I was getting all of these conflicts with the Spark 1.3.0 artifacts from the central repo.)

    com.amazonaws:aws-java-sdk-s3:1.9.24
    net.java.dev.jets3t:jets3t:0.9.3

    commons-collections:commons-collections:3.2.1
    commons-beanutils:commons-beanutils:1.7.0
    commons-beanutils:commons-beanutils-core:1.8.0

    commons-logging:commons-logging:1.1.3
    org.slf4j:jcl-over-slf4j:1.7.10
    (Note: The conflict is with a few package-info.class files, which seems really silly.)

    org.apache.hadoop:hadoop-yarn-common:2.4.0
    org.apache.hadoop:hadoop-yarn-api:2.4.0
    (Note: The conflict is with org/apache/spark/unused/UnusedStubClass.class, which seems even more silly.)

    org.apache.spark:spark-streaming-kinesis-asl_2.10:1.3.0
    org.apache.spark:spark-streaming_2.10:1.3.0
    org.apache.spark:spark-core_2.10:1.3.0
    org.apache.spark:spark-network-common_2.10:1.3.0
    org.spark-project.spark:unused:1.0.0 (?!?!?!)
    org.apache.spark:spark-network-shuffle_2.10:1.3.0

I can get rid of some of the conflicts by using excludeAll() to exclude artifacts with organization = "org.apache.hadoop", or organization = "org.apache.spark" and name = "spark-streaming", and I might be able to resolve a few other conflicts this way, but the bottom line is that this is way more complicated than it should be, so either something is really broken or I'm just doing something wrong.

Many of these don't even make sense to me. For example, the very first conflict is between classes in com.esotericsoftware.kryo:kryo:2.21 and in com.esotericsoftware.minlog:minlog:1.2, but the former *depends* upon the latter. It seems wrong to me that one package would contain different versions of the same classes that are included in one of its dependencies. I guess it doesn't make too much difference, though, if I could only get my assembly to include/exclude the right packages. I of course don't want any of the Spark or Hadoop dependencies included (other than spark-streaming-kinesis-asl itself), but I want all of spark-streaming-kinesis-asl's dependencies included (such as the AWS Java SDK and its dependencies). That doesn't seem to be possible without what I imagine will become an unruly and fragile exclusion list, though.

Thanks,
Jonathan
Re: problems with spark-streaming-kinesis-asl and sbt assembly (different file contents found)
Yes, I do have the following dependencies marked as "provided":

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"

However, spark-streaming-kinesis-asl has a compile-time dependency on spark-streaming, so I think that causes it and its dependencies to be pulled into the assembly. I expected that simply excluding spark-streaming in the spark-streaming-kinesis-asl dependency would solve this problem, but it does not. That is, this doesn't work either:

    libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0" exclude("org.apache.spark", "spark-streaming")

As I mentioned originally, the following solved some but not all conflicts:

    libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0" excludeAll(
      ExclusionRule(organization = "org.apache.hadoop"),
      ExclusionRule(organization = "org.apache.spark", name = "spark-streaming")
    )

(Note that ExclusionRule(organization = "org.apache.spark") without the name attribute does not work, because that apparently causes it to exclude even spark-streaming-kinesis-asl itself.)

Jonathan Kelly
Elastic MapReduce - SDE
Port 99 (SEA35) 08.220.C2

From: Tathagata Das t...@databricks.com
Date: Monday, March 16, 2015 at 12:45 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org
Subject: Re: problems with spark-streaming-kinesis-asl and sbt assembly ("different file contents found")

If you are creating an assembly, make sure spark-streaming is marked as "provided". spark-streaming is already part of the Spark installation, so it will be present at run time. That might solve some of these, maybe!?

TD
Re: problems with spark-streaming-kinesis-asl and sbt assembly (different file contents found)
Here's build.sbt, minus blank lines for brevity, and without any of the exclude/excludeAll options that I've attempted:

    name := "spark-sandbox"

    version := "1.0"

    scalaVersion := "2.10.4"

    resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

    run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

And here's project/plugins.sbt:

    logLevel := Level.Warn

    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

My app is just a simple word count app so far.

Jonathan Kelly
Elastic MapReduce - SDE
Port 99 (SEA35) 08.220.C2

From: Tathagata Das t...@databricks.com
Date: Monday, March 16, 2015 at 1:00 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org
Subject: Re: problems with spark-streaming-kinesis-asl and sbt assembly ("different file contents found")

Can you give us your SBT project? Minus the source code, if you don't wish to expose it.

TD
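When exclusions alone can't resolve every "different file contents found" error, sbt-assembly also lets you override how duplicate paths are merged. This is a sketch, not something arrived at in the thread: the matched paths are illustrative (chosen from the conflicts listed above), the key shape follows the older mergeStrategy API (newer plugin versions spell it assemblyMergeStrategy), and discarding or picking-first is only safe for files that don't affect runtime behavior:

    mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
      {
        // the stub class that several Spark artifacts all ship:
        case PathList("org", "apache", "spark", "unused", "UnusedStubClass.class") =>
          MergeStrategy.first
        // duplicated package-info.class files (commons-logging vs. jcl-over-slf4j):
        case PathList(ps @ _*) if ps.last == "package-info.class" =>
          MergeStrategy.discard
        // fall back to the plugin's default behavior for everything else:
        case x => old(x)
      }
    }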
Re: Spark v1.2.1 failing under BigTop build in External Flume Sink (due to missing Netty library)
That's probably a good thing to have, so I'll add it, but unfortunately it did not help this issue. It looks like the hadoop-2.4 profile only sets these properties, which don't seem like they would affect anything related to Netty:

    <properties>
      <hadoop.version>2.4.0</hadoop.version>
      <protobuf.version>2.5.0</protobuf.version>
      <jets3t.version>0.9.0</jets3t.version>
      <commons.math3.version>3.1.1</commons.math3.version>
      <avro.mapred.classifier>hadoop2</avro.mapred.classifier>
    </properties>

Thanks,
Jonathan Kelly

On 3/5/15, 1:09 PM, Patrick Wendell pwend...@gmail.com wrote:

You may need to add the -Phadoop-2.4 profile. When building or releasing packages for Hadoop 2.4, we use the following flags: -Phadoop-2.4 -Phive -Phive-thriftserver -Pyarn

- Patrick

On Thu, Mar 5, 2015 at 12:47 PM, Kelly, Jonathan jonat...@amazon.com wrote:

I confirmed that this has nothing to do with BigTop by running the same mvn command directly in a fresh clone of the Spark package at the v1.2.1 tag. I got the same exact error.

Jonathan Kelly
Re: Spark v1.2.1 failing under BigTop build in External Flume Sink (due to missing Netty library)
I confirmed that this has nothing to do with BigTop by running the same mvn command directly in a fresh clone of the Spark package at the v1.2.1 tag. I got the same exact error.

Jonathan Kelly
Elastic MapReduce - SDE
Port 99 (SEA35) 08.220.C2

From: Kelly, Jonathan jonat...@amazon.com
Date: Thursday, March 5, 2015 at 10:39 AM
To: user@spark.apache.org
Subject: Spark v1.2.1 failing under BigTop build in External Flume Sink (due to missing Netty library)
Spark v1.2.1 failing under BigTop build in External Flume Sink (due to missing Netty library)
I'm running into an issue building Spark v1.2.1 (as well as the latest in branch-1.2, v1.3.0-rc2, and the latest in branch-1.3) with BigTop (v0.9, which is not quite released yet). The build fails in the External Flume Sink subproject with the following error:

    [INFO] Compiling 5 Scala sources and 3 Java sources to /workspace/workspace/bigtop.spark-rpm/build/spark/rpm/BUILD/spark-1.3.0/external/flume-sink/target/scala-2.10/classes...
    [WARNING] Class org.jboss.netty.channel.ChannelFactory not found - continuing with a stub.
    [ERROR] error while loading NettyServer, class file '/home/ec2-user/.m2/repository/org/apache/avro/avro-ipc/1.7.6/avro-ipc-1.7.6.jar(org/apache/avro/ipc/NettyServer.class)' is broken
    (class java.lang.NullPointerException/null)
    [WARNING] one warning found
    [ERROR] one error found

It seems like what is happening is that the Netty library is missing at build time, because it is explicitly excluded in the pom.xml (see https://github.com/apache/spark/blob/v1.2.1/external/flume-sink/pom.xml#L42). I attempted removing the exclusions and the explicit re-add for the test scope on lines 77-88, and that allowed the build to succeed, though I don't know if that will cause problems at runtime. I don't have any experience with the Flume sink, so I don't really know how to test it. (And, to be clear, I'm not necessarily trying to get the Flume sink to work--I just want the project to build successfully, though of course I'd still want the Flume sink to work for whomever does need it.) Does anybody have any idea what's going on here?

Here is the command BigTop is running to build Spark:

    mvn -Pbigtop-dist -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl \
        -Divy.home=/home/ec2-user/.ivy2 -Dsbt.ivy.home=/home/ec2-user/.ivy2 \
        -Duser.home=/home/ec2-user -Drepo.maven.org= \
        -Dreactor.repo=file:///home/ec2-user/.m2/repository \
        -Dhadoop.version=2.4.0-amzn-3-SNAPSHOT -Dyarn.version=2.4.0-amzn-3-SNAPSHOT \
        -Dprotobuf.version=2.5.0 -Dscala.version=2.10.3 -Dscala.binary.version=2.10 \
        -DskipTests -DrecompileMode=all install

As I mentioned above, if I switch to the latest in branch-1.2, to v1.3.0-rc2, or to the latest in branch-1.3, I get the same exact error. I was not getting the error with Spark v1.1.0, though there weren't any changes to external/flume-sink/pom.xml between v1.1.0 and v1.2.1.

~ Jonathan Kelly
Re: kinesis multiple records adding into stream
Are you referring to the PutRecords method, which was added in 1.9.9? (See http://aws.amazon.com/releasenotes/1369906126177804.) If so, can't you just depend on this later version of the SDK in your app, even though spark-streaming-kinesis-asl depends on an earlier version that does not yet have the PutRecords call?

Also, could you please explain your use case more fully? spark-streaming-kinesis-asl is for *reading* data from Kinesis in Spark, not for writing data, so I would expect that the part of your code that writes to Kinesis would be a totally separate app anyway (unless you are reading from Kinesis using spark-streaming-kinesis-asl, transforming it somehow, and then writing it back out to Kinesis).

~ Jonathan

From: Aniket Bhatnagar aniket.bhatna...@gmail.com
Date: Friday, January 16, 2015 at 9:13 AM
To: Hafiz Mujadid hafizmujadi...@gmail.com, user@spark.apache.org
Subject: Re: kinesis multiple records adding into stream

Sorry, I couldn't understand the issue. Are you trying to send data to Kinesis from a Spark batch/real-time job?

- Aniket

On Fri, Jan 16, 2015, 9:40 PM Hafiz Mujadid hafizmujadi...@gmail.com wrote:

Hi Experts!

I am using the Kinesis dependency as follows:

    groupId = org.apache.spark
    artifactId = spark-streaming-kinesis-asl_2.10
    version = 1.2.0

In this, AWS SDK version 1.8.3 is being used, and in this SDK multiple records cannot be put in a single request. Is it possible to put multiple records in a single request?

Thanks
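Overriding the SDK version as suggested above is just a matter of declaring the newer artifact directly so it wins over the transitive one. A sketch in sbt; the 1.9.9 version number is the one cited above for PutRecords support, and compatibility with the connector should be verified:

    // pin a newer AWS SDK than the one spark-streaming-kinesis-asl pulls in
    libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.9.9"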
Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection
Yeah, only a few hours after I sent my message I saw some correspondence on this other thread: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-complex-types-like-map-lt-string-map-lt-string-int-gt-gt-in-spark-sql-td19603.html, which is the exact same issue. Glad to find that this should be fixed in 1.2.0! I'll give that a try later. Thanks a lot, Jonathan

From: Yin Huai huaiyin@gmail.com
Date: Thursday, November 27, 2014 at 4:37 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org
Subject: Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

Hello Jonathan,

There was a bug regarding casting data types before inserting into a Hive table. Hive does not have the notion of containsNull for array values, so for a Hive table, containsNull will always be true for an array, and we should ignore this field for Hive. This issue has been fixed by https://issues.apache.org/jira/browse/SPARK-4245, which will be released with 1.2.

Thanks, Yin

On Wed, Nov 26, 2014 at 9:01 PM, Kelly, Jonathan jonat...@amazon.com wrote:

After playing around with this a little more, I discovered that:

1. If test.json contains something like {"values":[null,1,2,3]}, the schema auto-determined by SchemaRDD.jsonFile() will have element: integer (containsNull = true), and then SchemaRDD.saveAsTable()/SchemaRDD.insertInto() will work (which of course makes sense but doesn't really help).

2. If I specify the schema myself (e.g., sqlContext.jsonFile("test.json", StructType(Seq(StructField("values", ArrayType(IntegerType, true), true))))), that also makes SchemaRDD.saveAsTable()/SchemaRDD.insertInto() work, though as I mentioned before, this is less than ideal.

Why don't saveAsTable/insertInto work when the containsNull properties don't match? I can understand how inserting data with containsNull=true into a column where containsNull=false might fail, but I think the other way around (which is the case here) should work.

~ Jonathan

On 11/26/14, 5:23 PM, Kelly, Jonathan jonat...@amazon.com wrote:

I've noticed some strange behavior when I try to use SchemaRDD.saveAsTable() with a SchemaRDD that I've loaded from a JSON file that contains elements with nested arrays. For example, with a file test.json that contains the single line:

{"values":[1,2,3]}

and with code like the following:

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val test = sqlContext.jsonFile("test.json")
scala> test.saveAsTable("test")

it creates the table but fails when inserting the data into it.
Here's the exception:

scala.MatchError: ArrayType(IntegerType,true) (of class org.apache.spark.sql.catalyst.types.ArrayType)
at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

I'm guessing that this is due to the slight difference in the schemas of these tables:

scala> test.printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = false)

scala> sqlContext.table("test").printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = true)

If I reload the file using the schema that was created for the Hive table then try inserting the data into the table, it works.
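(A generic version of that workaround could rewrite the auto-detected schema so array fields report containsNull = true before reloading. The following is an untested sketch against the Spark 1.1 API; relaxArrays is a hypothetical helper, not part of Spark.)

import org.apache.spark.sql._

// Hypothetical helper (not part of Spark): rewrite an auto-detected schema
// so every top-level array field reports containsNull = true, matching what
// the Hive table expects. Untested sketch against the Spark 1.1 API.
def relaxArrays(schema: StructType): StructType =
  StructType(schema.fields.map {
    case StructField(name, ArrayType(elemType, _), nullable) =>
      StructField(name, ArrayType(elemType, containsNull = true), nullable)
    case other => other
  })

val auto = sqlContext.jsonFile("test.json")      // auto-detects containsNull = false
val fixed = sqlContext.jsonFile("test.json", relaxArrays(auto.schema))
fixed.insertInto("test")                         // schemas now match the Hive table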
SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection
I've noticed some strange behavior when I try to use SchemaRDD.saveAsTable() with a SchemaRDD that I've loaded from a JSON file that contains elements with nested arrays. For example, with a file test.json that contains the single line:

{"values":[1,2,3]}

and with code like the following:

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val test = sqlContext.jsonFile("test.json")
scala> test.saveAsTable("test")

it creates the table but fails when inserting the data into it. Here's the exception:

scala.MatchError: ArrayType(IntegerType,true) (of class org.apache.spark.sql.catalyst.types.ArrayType)
at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

I'm guessing that this is due to the slight difference in the schemas of these tables:

scala> test.printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = false)

scala> sqlContext.table("test").printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = true)

If I reload the file using the schema that was created for the Hive table and then try inserting the data into the table, it works:

scala> sqlContext.jsonFile("file:///home/hadoop/test.json", sqlContext.table("test").schema).insertInto("test")
scala> sqlContext.sql("select * from test").collect().foreach(println)
[ArrayBuffer(1, 2, 3)]

Does this mean that there is a bug in how the schema is automatically determined when you use HiveContext.jsonFile() on JSON files that contain nested arrays (i.e., should containsNull be true for the array elements)? Or is there a bug in how the Hive table is created from the SchemaRDD (i.e., should containsNull in fact be false)? I can probably get around this by defining the schema myself rather than using auto-detection, but for now I'd like to use auto-detection. By the way, I'm using Spark 1.1.0.

Thanks, Jonathan
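(For anyone who wants the define-the-schema-yourself workaround mentioned at the end, here is a minimal untested sketch against the Spark 1.1 API, reusing the file path and table name from the message above.)

import org.apache.spark.sql._
import org.apache.spark.sql.hive.HiveContext

// Minimal untested sketch (Spark 1.1): declare the schema up front with
// containsNull = true so it matches what the Hive table will use, instead
// of relying on jsonFile's auto-detection.
val sqlContext = new HiveContext(sc)

val schema = StructType(Seq(
  StructField("values", ArrayType(IntegerType, containsNull = true), nullable = true)))

val test = sqlContext.jsonFile("file:///home/hadoop/test.json", schema)
test.saveAsTable("test")  // no containsNull mismatch, so the insert should succeed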