AbstractMethodError
I wrote an example, MyWordCount, and just set spark.akka.frameSize larger than the default. But when I run this jar, there is a problem:

13/12/19 18:53:48 INFO ClusterTaskSetManager: Lost TID 0 (task 0.0:0)
13/12/19 18:53:48 INFO ClusterTaskSetManager: Loss was due to java.lang.AbstractMethodError
java.lang.AbstractMethodError: org.apache.spark.api.java.function.WrappedFunction1.call(Ljava/lang/Object;)Ljava/lang/Object;
        at org.apache.spark.api.java.function.WrappedFunction1.apply(WrappedFunction1.scala:31)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:90)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:90)
        at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$class.foreach(Iterator.scala:772)
        at scala.collection.Iterator$$anon$21.foreach(Iterator.scala:437)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:102)
        at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:250)
        at scala.collection.Iterator$$anon$21.toBuffer(Iterator.scala:437)
        at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:237)
        at scala.collection.Iterator$$anon$21.toArray(Iterator.scala:437)
        at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:560)
        at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:560)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:758)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:758)

It is caused by this code:

    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
        public Iterable<String> call(String s) {
            return Arrays.asList(s.split(" "));
        }
    });

Here is the parent class:

    private[spark] abstract class WrappedFunction1[T, R] extends AbstractFunction1[T, R] {
      @throws(classOf[Exception])
      def call(t: T): R

      final def apply(t: T): R = call(t)
    }

My code is the same as the JavaWordCount example; I don't know what the error is.

Thanks,
Leo
leosand...@gmail.com
Re: AbstractMethodError
Leo, which version of Spark are you using? This is caused by code compiled with Scala 2.10: Spark 0.8.x is built with Scala 2.9, so you must use the same major Scala version to compile your Spark code.
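A build definition that keeps the application's Scala version aligned with Spark 0.8.x might look like the sketch below; the project name, Scala patch version, and Spark version string are illustrative, not taken from the thread:

    // build.sbt -- hypothetical project settings; the key point is that scalaVersion
    // must match the Scala major version your Spark artifacts were built with.
    name := "my-word-count"

    scalaVersion := "2.9.3"  // Spark 0.8.x is built against Scala 2.9.x

    // %% appends the Scala version, resolving to spark-core_2.9.3 here.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.8.1-incubating"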
Re: ADD_JARS and jar dependencies in sbt
In your own project, use something like the sbt-assembly plugin to build a jar of your code and all of its dependencies. Once you have that, use ADD_JARS to add that jar alone and you should be set.

On Mon, Dec 23, 2013 at 7:29 AM, Aureliano Buendia buendia...@gmail.com wrote:

Hi, It seems ADD_JARS can be used to add some jars to the classpath of spark-shell. This works in simple cases of a few jars, but what happens when those jars depend on other jars? Do we have to list them in ADD_JARS too? Also, do we have to manually download the jars and keep them in parallel with sbt? Our Spark app already uses sbt to maintain dependencies. Is there a way to tell spark-shell to use the jars downloaded by sbt?
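A minimal sketch of this workflow, assuming an sbt-assembly release from that era (the plugin version, jar name, and paths are examples only, and the plugin also needs its settings wired into the build definition):

    // project/plugins.sbt -- pull in sbt-assembly (version is an example)
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.10.2")

    // Then build the fat jar and point spark-shell at it:
    //   sbt assembly
    //   ADD_JARS=target/scala-2.9.3/my-app-assembly-0.1.jar ./spark-shell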
Re: ADD_JARS doubt.!!!!!
I would not recommend putting your text files in via ADD_JARS. The better thing to do is to put those files in HDFS or locally on your driver server, load them into memory, and then use Spark's broadcast variable concept to spread the data out across the cluster.

On Mon, Dec 23, 2013 at 1:57 AM, Archit Thakur archit279tha...@gmail.com wrote:

Hi, What does the add_jars parameter in the SparkContext constructor exactly do? Does it add all the files to the classpath of the worker JVM? I have some text files that I read data from while processing. Can I add them via add_jars so that they don't have to be read again from HDFS and can be read locally instead (something like the Distributed Cache in Hadoop MapReduce)? What path would I read them from? Thanks and Regards, Archit Thakur.
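A small Scala sketch of the broadcast approach being suggested; the file path, the parsing, and someRdd are made up for illustration:

    import scala.io.Source

    // Driver side: read the lookup data once, then broadcast it to the executors.
    val lookupSet = Source.fromFile("/path/on/driver/lookup.txt").getLines().toSet  // hypothetical local file
    val lookup = sc.broadcast(lookupSet)

    // Worker side: task closures reference lookup.value instead of re-reading the file.
    val filtered = someRdd.filter(record => lookup.value.contains(record))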
Deploy my application on spark cluster
Hi, I have a scenario where Kafka is going to be the input source for data, so how can I deploy my application, which has all the logic for transforming the Kafka input stream? I am a little bit confused about the usage of Spark in cluster mode. After running Spark in cluster mode, I want to deploy my application on the cluster, so why do I need to run one more Java application that keeps running forever? Is it possible to deploy my application jar on the cluster and run only the master/slave processes? I am not sure if I make any sense. Thanks Pankaj
Re: debugging NotSerializableException while using Kryo
Thanks Imran. I tried setting spark.closure.serializer to org.apache.spark.serializer.KryoSerializer and now end up seeing a NullPointerException when the executor starts up. This is a snippet of the executor's log. Notice how "registered TileIdWritable" and "registered ArgWritable" are printed, so I see that my KryoRegistrator is being called. However, it's not clear why there's a follow-on NPE. My Spark log level is set to DEBUG in log4j.properties (log4j.rootCategory=DEBUG), so I'm not sure if there's some other way to get the executor to be more verbose as to the cause of the NPE. When I take out the spark.closure.serializer setting (i.e., go back to the default Java serialization), the executors start up fine and execute other RDD actions, but of course not the lookup action (my original problem). With spark.closure.serializer set to Kryo, it dies with an NPE during executor startup.

13/12/23 11:00:36 INFO StandaloneExecutorBackend: Connecting to driver: akka.tcp://[redacted]:48147/user/StandaloneScheduler
13/12/23 11:00:36 INFO StandaloneExecutorBackend: Successfully registered with driver
13/12/23 11:00:36 INFO Slf4jLogger: Slf4jLogger started
13/12/23 11:00:36 INFO Remoting: Starting remoting
13/12/23 11:00:36 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[redacted]:56483]
13/12/23 11:00:36 INFO Remoting: Remoting now listens on addresses: [akka.tcp://[redacted]:56483]
13/12/23 11:00:36 INFO SparkEnv: Connecting to BlockManagerMaster: akka.tcp://[redacted]:48147/user/BlockManagerMaster
13/12/23 11:00:36 INFO MemoryStore: MemoryStore started with capacity 323.9 MB.
13/12/23 11:00:36 DEBUG DiskStore: Creating local directories at root dirs '/tmp'
13/12/23 11:00:36 INFO DiskStore: Created local directory at /tmp/spark-local-20131223110036-4335
13/12/23 11:00:36 INFO ConnectionManager: Bound socket to port 41617 with id = ConnectionManagerId([redacted],41617)
13/12/23 11:00:36 INFO BlockManagerMaster: Trying to register BlockManager
13/12/23 11:00:36 INFO BlockManagerMaster: Registered BlockManager
13/12/23 11:00:36 INFO SparkEnv: Connecting to MapOutputTracker: akka.tcp://[redacted]:48147/user/MapOutputTracker
13/12/23 11:00:36 INFO HttpFileServer: HTTP File server directory is /tmp/spark-e71e0a2b-a247-4bb8-b06d-19c12467b65a
13/12/23 11:00:36 INFO StandaloneExecutorBackend: Got assigned task 0
13/12/23 11:00:36 INFO StandaloneExecutorBackend: Got assigned task 1
13/12/23 11:00:36 INFO StandaloneExecutorBackend: Got assigned task 2
13/12/23 11:00:36 INFO StandaloneExecutorBackend: Got assigned task 3
13/12/23 11:00:36 DEBUG KryoSerializer: Running user registrator: geotrellis.spark.KryoRegistrator
13/12/23 11:00:36 DEBUG KryoSerializer: Running user registrator: geotrellis.spark.KryoRegistrator
13/12/23 11:00:36 DEBUG KryoSerializer: Running user registrator: geotrellis.spark.KryoRegistrator
13/12/23 11:00:36 DEBUG KryoSerializer: Running user registrator: geotrellis.spark.KryoRegistrator
registered TileIdWritable
registered TileIdWritable
registered TileIdWritable
registered TileIdWritable
registered ArgWritable
registered ArgWritable
registered ArgWritable
registered ArgWritable
13/12/23 11:00:37 INFO Executor: Running task ID 2
13/12/23 11:00:37 INFO Executor: Running task ID 1
13/12/23 11:00:37 INFO Executor: Running task ID 3
13/12/23 11:00:37 INFO Executor: Running task ID 0
13/12/23 11:00:37 INFO Executor: Fetching http://10.194.70.21:51166/jars/geotrellis-spark_2.10-0.9.0-SNAPSHOT.jar with timestamp 1387814434436
13/12/23 11:00:37 INFO Utils: Fetching http://10.194.70.21:51166/jars/geotrellis-spark_2.10-0.9.0-SNAPSHOT.jar to /tmp/fetchFileTemp2456419097284083628.tmp
13/12/23 11:00:37 INFO Executor: Adding file[redacted]/spark/work/app-20131223110034-/0/./geotrellis-spark_2.10-0.9.0-SNAPSHOT.jar to class loader
13/12/23 11:00:37 ERROR Executor: Uncaught exception in thread Thread[pool-7-thread-4,5,main]
java.lang.NullPointerException
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:195)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:195)
        at scala.Option.flatMap(Option.scala:170)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
13/12/23 11:00:37 ERROR Executor: Uncaught exception in thread Thread[pool-7-thread-2,5,main]
java.lang.NullPointerException
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:195)
        at
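For reference, a registrator like the geotrellis.spark.KryoRegistrator named in the log might look roughly like the sketch below. This is a reconstruction for illustration only; TileIdWritable and ArgWritable are classes from the poster's project and are not shown in the thread.

    import com.esotericsoftware.kryo.Kryo

    // Hypothetical reconstruction of the user-supplied registrator referenced in the executor log.
    class KryoRegistrator extends org.apache.spark.serializer.KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[TileIdWritable])  // assumed project-specific class
        println("registered TileIdWritable")
        kryo.register(classOf[ArgWritable])     // assumed project-specific class
        println("registered ArgWritable")
      }
    }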
Unable to load additional JARs in yarn-client mode
Hi All, For our application we need to use the yarn-client mode featured in 0.8.1 (YARN 2.0.5). We've successfully run our Java applications in both yarn-client and yarn-standalone mode. While in yarn-standalone there is a way to add external JARs, we couldn't find a way to add those in yarn-client. Adding jars in the SparkContext constructor or setting SPARK_CLASSPATH didn't work either. Are we missing something? Can you please advise? If it is currently impossible, can you advise a patch / workaround? It is crucial for us to get this working with external dependencies. Many Thanks, Ido
Re: failed to compile spark because of the missing packages
Hey Nan, You shouldn't copy lib_managed manually; SBT will deal with that. Try just using the same .gitignore settings that we have in the Spark github. It seems you are accidentally including some files that cause this to get messed up. - Patrick

On Mon, Dec 23, 2013 at 8:37 AM, Nan Zhu zhunanmcg...@gmail.com wrote:

Hi all, I just downloaded Spark 0.8.1, made some modifications, and compiled it on my laptop; everything works fine. I synced the source code directory with my desktop via github (ignoring all .jars and target), and then I copied the lib_managed directory to my desktop. I tried to compile with sbt, and it throws the following errors. Can anyone tell me what the reason for these errors might be? Thank you very much!

[error] /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkJLineReader.scala:12: object jline is not a member of package tools
[error] import scala.tools.jline.console.completer._
[error] ^
[error] /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkJLineCompletion.scala:11: object jline is not a member of package tools
[error] import scala.tools.jline._
[error] ^
[error] /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala:819: type mismatch;
[error]  found   : org.apache.spark.repl.SparkJLineReader
[error]  required: scala.tools.nsc.interpreter.InteractiveReader
[error]     else try SparkJLineReader(
[error] ^
[error] /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala:1012: type mismatch;
[error]  found   : org.apache.spark.repl.SparkJLineReader
[error]  required: scala.tools.nsc.interpreter.InteractiveReader
[error]     repl.in = SparkJLineReader(repl)
[error] ^
[error] /home/zhunan/spark/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala:258: not found: value kafka
[error]     kafkaStream[String, kafka.serializer.StringDecoder](kafkaParams, topics, storageLevel)
[error] ^
[error] /home/zhunan/spark/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala:269: not found: value kafka
[error]   def kafkaStream[T: ClassManifest, D <: kafka.serializer.Decoder[_]: Manifest](
[error] ^
[error] /home/zhunan/spark/streaming/src/main/scala/org/apache/spark/streaming/dstream/KafkaInputDStream.scala:27: not found: object kafka
[error] import kafka.consumer._
[error] ^
[error] /home/zhunan/spark/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala:274: ambiguous implicit values:
[error]  both method fallbackStringCanBuildFrom in class LowPriorityImplicits of type [T]=> scala.collection.generic.CanBuildFrom[String,T,scala.collection.immutable.IndexedSeq[T]]
[error]  and value evidence$5 of type Manifest[D]
[error]  match expected type <error>
[error]     val inputStream = new KafkaInputDStream[T, D](this, kafkaParams, topics, storageLevel)
[error] ^
[warn] /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkJLineCompletion.scala:32: type <error> in type pattern <error> is unchecked since it is eliminated by erasure
[warn]     catch { case _: MissingRequirementError => None }
[warn] ^
[error] /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkJLineCompletion.scala:290: value executionFor is not a member of object SparkJLineCompletion.this.ids
[error]     (ids executionFor parsed) orElse
[error] ^
[warn] /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkJLineCompletion.scala:373: type <error> in type pattern <error> is unchecked since it is eliminated by erasure
[warn]     case ex: Exception =>
[warn] ^
[error] /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkJLineReader.scala:11: object jline is not a member of package tools
[error] import scala.tools.jline.console.ConsoleReader
[error] ^
[warn] /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkJLineReader.scala:24: type <error> in type pattern <error> is unchecked since it is eliminated by erasure
[warn]     catch { case _: Exception => Nil }
[warn] ^
[error] /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkJLineReader.scala:39: class file needed by ConsoleReaderHelper is missing.
[error] reference value jline of package tools refers to nonexisting symbol.
[error]   class JLineConsoleReader extends ConsoleReader with ConsoleReaderHelper {
[error] ^
[error] /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkJLineReader.scala:26: value getTerminal is not a member of SparkJLineReader.this.JLineConsoleReader
[error]   private def term = consoleReader.getTerminal()
[error]
Re: debugging NotSerializableException while using Kryo
Maybe try implementing your class with Serializable...
Noob Spark questions
Hello, I am new to Spark and have installed it and played with it a bit; mostly I am reading through the Fast Data Processing with Spark book. One of the first things I realized is that I have to learn Scala, since the real-time data analytics part is not supported by the Python API, correct? I don't mind, Scala seems to be a lovely language! :) Anyway, I would like to set up a data analysis pipeline where I have already done the job of exposing a port on the internet (Amazon Elastic Load Balancer) that feeds real-time data from tens to hundreds of thousands of clients in real time into a set of internal instances which are essentially zeroMQ sockets (I do this via mongrel2 and associated handlers). These handlers can themselves create 0mq sockets to feed data into a pipeline via a 0mq push/pull, pub/sub, or whatever mechanism works best. One of the pipelines I am evaluating is Spark. There seems to be information on Spark, but for some reason I find it to be very Hadoop specific; HDFS is mentioned a lot, for example. What if I don't use Hadoop/HDFS? What do people do when they want to inhale real-time information? Let's say I want to use 0mq. Does Spark allow for that? How would I go about doing this? What about dumping all the data into a persistent store? Can I dump into DynamoDB or Mongo or...? How about Amazon S3? I suppose my 0mq handlers can do that upon receipt of data before it sees the pipeline, but sometimes storing intermediate results helps too... Thanks! OD
Re: failed to compile spark because of the missing packages
Hi Patrick, thanks for the reply. I still failed to compile the code, even after making the following attempts:
1. download spark-0.8.1.tgz,
2. decompress, and copy the files to the local github repo directory (.gitignore is just copied from https://github.com/apache/incubator-spark/blob/master/.gitignore),
3. push the files to the git repo,
4. pull the files on the desktop,
5. sbt/sbt assembly/assembly, which failed with the same errors as in my last email.
Any further comments? Best, -- Nan Zhu
Re: Noob Spark questions
I am using Java, and Spark has APIs for Java as well. Though there is a saying that Java in Spark is slower than the Scala shell; well, it depends on your requirements. I am not an expert in Spark, but as far as I know, Spark provides different levels of storage, including memory or disk. And for the disk part, HDFS is just one choice. I am not using HDFS myself, but then you lose the benefits of HDFS as well. In other words, it's also just based on your requirements. And MongoDB or S3 are also doable, at least with the Java APIs, I suppose.
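For the S3 part, a minimal Scala sketch of reading and writing through Hadoop's s3n:// filesystem from Spark; the bucket name and paths are made up, and the AWS credentials are assumed to be set in the Hadoop configuration:

    import org.apache.spark.SparkContext._  // brings in reduceByKey and friends

    // Assumes fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey are configured.
    val events = sc.textFile("s3n://my-bucket/incoming/events-2013-12-23.log")  // hypothetical bucket/path
    val counts = events.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("s3n://my-bucket/output/word-counts")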
Re: failed to compile spark because of the missing packages
I finally solved the issue manually. I found that when I compile with sbt, the lib/ directory under streaming/ and repl/ is missing. The reason is that the official .gitignore intends to ignore "lib/", while in the distributed tgz files these two lib/ directories are included.... Best, -- Nan Zhu
Re: debugging NotSerializableException while using Kryo
Using Java serialization would make the NPE go away, but it would be a less preferable solution. My application is network-intensive, and serialization cost is significant. In other words, these objects are ideal candidates for Kryo.
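The direction this thread points in is using Kryo for data serialization while leaving the closure serializer at its Java default. A hedged Scala sketch of the 0.8-era system-property configuration; the registrator class name is taken from the executor log above, while the master URL and app name are illustrative:

    // Set before the SparkContext is created.
    System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    System.setProperty("spark.kryo.registrator", "geotrellis.spark.KryoRegistrator")
    // spark.closure.serializer is intentionally left at the default Java serializer,
    // since setting it to Kryo is what triggered the executor-side NPE in this thread.

    val sc = new SparkContext("spark://master:7077", "kryo-example")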
Re: debugging NotSerializableException while using Kryo
What Spark version are you using? By looking at the code at Executor.scala line 195, you will at least know what causes the NPE. We can start from there.
Re: Noob Spark questions
"Though there is a saying that Java in Spark is slower than Scala shell" - that shouldn't be said. The Java API is mostly a thin wrapper over the Scala implementation, and the performance of the Java API is intended to be equivalent to that of the Scala API. If you're finding that not to be true, then that is something that the Spark developers would like to know.
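To illustrate the thin-wrapper point, the Java FlatMapFunction-based word splitting from the AbstractMethodError thread above has a direct Scala counterpart expressing the same job; this assumes lines is an RDD[String]:

    // Scala equivalent of the Java flatMap shown earlier in this digest.
    val words = lines.flatMap(_.split(" "))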
Re: Noob Spark questions
Hello, I guess that answers the question of whether it is doable. Where/how do I find out how it is doable? :) I am guessing every pipeline is a custom job of sorts - hence it is the developer's job to write the connectors to 0mq or DynamoDB, for example? Or is there some kind of a plug-in system for Spark? Thanks!
mapPartitions versus map overhead?
Hi all, is there any overhead of mapPartitions versus map, i.e. if I implement an algorithm using map + reduce versus mapPartitions + reduce? Thanks, Huan Dao
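A small Scala sketch of the two shapes being compared. With mapPartitions, the per-record function call and any per-call setup cost are paid once per partition instead of once per element, which is usually the only overhead difference; the data RDD and the expensiveSetup/score helpers are made up for illustration:

    // Element at a time: one function invocation (and one setup) per record.
    val perElement = data.map(x => expensiveSetup().score(x)).reduce(_ + _)

    // Partition at a time: amortize the setup across a whole partition.
    val perPartition = data.mapPartitions { iter =>
      val scorer = expensiveSetup()  // done once per partition
      iter.map(x => scorer.score(x))
    }.reduce(_ + _)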
Re: Unable to load additional JARs in yarn-client mode
I’m surprised by this, but one way that will definitely work is to assemble your application into a single JAR. If passing them to the constructor doesn’t work, that’s probably a bug. Matei
RE: Unable to load additional JARs in yarn-client mode
Ido, when you say add external JARs, do you mean -addJars, which adds some jars for the SparkContext to use in the AM env? If so, I think you don't need it for yarn-client mode at all: in yarn-client mode the SparkContext runs locally, so I think you just need to make sure those jars are on the Java classpath. And for those needed by executors / tasks, I think you can package them as Matei said. Or maybe we can expose some env variable for yarn-client mode to allow adding multiple jars as needed. Best Regards, Raymond Liu
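For the executor-side dependencies, a hedged Scala sketch of the two suggestions, passing jars to the SparkContext constructor and/or building a single assembly jar; the paths and app name are illustrative, not from the thread:

    // Jars listed here are shipped to the executors; in yarn-client mode the driver-side
    // classpath still has to contain them locally, per Raymond's note.
    val sc = new SparkContext(
      "yarn-client",
      "my-yarn-app",
      System.getenv("SPARK_HOME"),
      Seq("/path/to/my-app-assembly.jar", "/path/to/extra-dependency.jar"))  // hypothetical paths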