AbstractMethodError

2013-12-23 Thread leosand...@gmail.com
I wrote an example, MyWordCount, that just sets spark.akka.frameSize larger than
the default. But when I run the jar, there is a problem:

13/12/19 18:53:48 INFO ClusterTaskSetManager: Lost TID 0 (task 0.0:0)
13/12/19 18:53:48 INFO ClusterTaskSetManager: Loss was due to 
java.lang.AbstractMethodError
java.lang.AbstractMethodError: 
org.apache.spark.api.java.function.WrappedFunction1.call(Ljava/lang/Object;)Ljava/lang/Object;
at 
org.apache.spark.api.java.function.WrappedFunction1.apply(WrappedFunction1.scala:31)
at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:90)
at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:90)
at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:440)
at scala.collection.Iterator$class.foreach(Iterator.scala:772)
at scala.collection.Iterator$$anon$21.foreach(Iterator.scala:437)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:102)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:250)
at scala.collection.Iterator$$anon$21.toBuffer(Iterator.scala:437)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:237)
at scala.collection.Iterator$$anon$21.toArray(Iterator.scala:437)
at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:560)
at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:560)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:758)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:758)

It is caused by this code:

JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) {
    return Arrays.asList(s.split(" "));
  }
});

Here is the parent class:

private[spark] abstract class WrappedFunction1[T, R] extends 
AbstractFunction1[T, R] {
  @throws(classOf[Exception])
  def call(t: T): R

  final def apply(t: T): R = call(t)
}
 
My code is the same as JavaWordCount; I don't know what the error is.

Thanks 

Leo




leosand...@gmail.com

Re: AbstractMethodError

2013-12-23 Thread Azuryy Yu
Leo,
Which Spark version are you using? This error is caused by code compiled
against Scala 2.10.

Spark 0.8.x uses Scala 2.9, so you must use the same major Scala version to
compile your Spark code.
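
For example, in an sbt build you would pin the Scala version and the matching
Spark artifact together. A minimal sketch (artifact name and version numbers
are what a Spark 0.8.1 / Scala 2.9.3 build would typically use; adjust to your
setup):

// build.sbt
scalaVersion := "2.9.3"
libraryDependencies += "org.apache.spark" % "spark-core_2.9.3" % "0.8.1-incubating"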





Re: ADD_JARS and jar dependencies in sbt

2013-12-23 Thread Gary Malouf
In your own project, use something like the sbt-assembly plugin to build a
jar of your code and all of its dependencies.  Once you have that, use
ADD_JARS to add that jar alone and you should be set.
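
A rough sketch of that setup, assuming the pre-1.0 sbt-assembly plugin (the
plugin version, keys, and paths below are placeholders and may differ for your
sbt version):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.10.1")

// build.sbt
import AssemblyKeys._
assemblySettings

// then build the fat jar and point ADD_JARS at it, e.g.:
//   sbt assembly
//   ADD_JARS=target/scala-2.9.3/myapp-assembly-0.1.jar ./spark-shell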


On Mon, Dec 23, 2013 at 7:29 AM, Aureliano Buendia buendia...@gmail.comwrote:

 Hi,

 It seems ADD_JARS can be used to add some jars to class path of
 spark-shell. This works in simple cases of a few jars. But what happens
 when those jars depend on other jars? Do we have to list them in ADD_JARS
 too?

 Also, do we have to manually download the jars and keep them in parallel
 with sbt? Our spark app already uses sbt to maintain dependencies. Is there
 a way to tell spark-shell to use the jars downloaded by sbt?



Re: ADD_JARS doubt.!!!!!

2013-12-23 Thread Gary Malouf
I would not recommend putting your text files in via ADD_JARS.  The better
thing to do is to put those files in HDFS or locally on your driver server,
load them into memory and then use Spark's broadcast variable concept to
spread the data out across the cluster.
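
A minimal sketch of that broadcast approach, assuming a small tab-separated
lookup file on the driver (the path, file format, and someRdd are hypothetical):

val lookup: Map[String, String] =
  scala.io.Source.fromFile("/path/on/driver/lookup.txt").getLines()
    .map { line => val Array(k, v) = line.split("\t"); (k, v) }
    .toMap
val lookupBc = sc.broadcast(lookup)   // shipped once per node rather than per task
val enriched = someRdd.map(key => (key, lookupBc.value.getOrElse(key, "unknown")))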


On Mon, Dec 23, 2013 at 1:57 AM, Archit Thakur archit279tha...@gmail.comwrote:

 Hi,

 What does the parameter add_jars in the sc constructor exactly do?
 Does it add all the files to the classpath of worker JVM?

 I have some text files that I read data from while processing.
 Can I add it in add jars so that it doesn't have to read it again from
 HDFS and read from local (Something like Distributed Cache in Hadoop
 Mapreduce). What path would I read it from?

 Thanks and Regards,
 Archit Thakur.



Deploy my application on spark cluster

2013-12-23 Thread Pankaj Mittal
Hi,

I have a scenario where Kafka is going to be the input source for data. How
can I deploy my application, which has all the logic for transforming the
Kafka input stream?

I am a little confused about the usage of Spark in cluster mode. After
running Spark in cluster mode, I want to deploy my application on the
cluster, so why do I need to run one more Java application that runs forever?
Is it possible to deploy only my application jar on the cluster and run just
the master/slave processes? I am not sure if I make any sense.
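
For reference, a driver program along these lines is what usually plays that
long-running role; a hedged sketch against the Spark Streaming 0.8-style API
(master URL, ZooKeeper host, group, and topic names are all placeholders):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaTransformJob {
  def main(args: Array[String]) {
    // the driver connects to the standalone master and keeps running while
    // the streaming job processes Kafka batches
    val ssc = new StreamingContext("spark://master:7077", "KafkaTransform", Seconds(5))
    val lines = ssc.kafkaStream("zk-host:2181", "my-group", Map("my-topic" -> 1),
      StorageLevel.MEMORY_ONLY_SER_2)
    lines.map(_.toUpperCase).print()   // hypothetical transformation
    ssc.start()
  }
}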

Thanks
Pankaj


Re: debugging NotSerializableException while using Kryo

2013-12-23 Thread Ameet Kini
Thanks Imran.

I tried setting spark.closure.serializer to
org.apache.spark.serializer.KryoSerializer and now end up seeing
NullPointerException when the executor starts up. This is a snippet of the
executor's log. Notice how registered TileIdWritable and registered
ArgWritable are logged, so I see that my KryoRegistrator is being called.
However, it's not clear why there's a follow-on NPE. My Spark log level is
set to DEBUG in log4j.properties (log4j.rootCategory=DEBUG), so I'm not sure
if there is some other way to get the executor to be more verbose as to the
cause of the NPE.

When I take out the spark.closure.serializer setting (i.e., go back to the
default Java serialization), the executors start up fine, and executes
other RDD actions, but of course not the lookup action (my original
problem). With the spark.closure.serializer setting to kryo, it dies with
an NPE during executor startup.
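
For reference, this is the combination of settings being discussed, expressed
as Spark 0.8-style system properties (the registrator class name is the one
from the log below; the master URL is a placeholder):

import org.apache.spark.SparkContext

System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.closure.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "geotrellis.spark.KryoRegistrator")
val sc = new SparkContext("spark://master:7077", "KryoTest")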


13/12/23 11:00:36 INFO StandaloneExecutorBackend: Connecting to driver:
akka.tcp://[redacted]:48147/user/StandaloneScheduler
13/12/23 11:00:36 INFO StandaloneExecutorBackend: Successfully registered
with driver
13/12/23 11:00:36 INFO Slf4jLogger: Slf4jLogger started
13/12/23 11:00:36 INFO Remoting: Starting remoting
13/12/23 11:00:36 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://[redacted]:56483]
13/12/23 11:00:36 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://[redacted]:56483]
13/12/23 11:00:36 INFO SparkEnv: Connecting to BlockManagerMaster:
akka.tcp://[redacted]:48147/user/BlockManagerMaster
13/12/23 11:00:36 INFO MemoryStore: MemoryStore started with capacity 323.9
MB.
13/12/23 11:00:36 DEBUG DiskStore: Creating local directories at root dirs
'/tmp'
13/12/23 11:00:36 INFO DiskStore: Created local directory at
/tmp/spark-local-20131223110036-4335
13/12/23 11:00:36 INFO ConnectionManager: Bound socket to port 41617 with
id = ConnectionManagerId([redacted],41617)
13/12/23 11:00:36 INFO BlockManagerMaster: Trying to register BlockManager
13/12/23 11:00:36 INFO BlockManagerMaster: Registered BlockManager
13/12/23 11:00:36 INFO SparkEnv: Connecting to MapOutputTracker:
akka.tcp://[redacted]:48147/user/MapOutputTracker
13/12/23 11:00:36 INFO HttpFileServer: HTTP File server directory is
/tmp/spark-e71e0a2b-a247-4bb8-b06d-19c12467b65a
13/12/23 11:00:36 INFO StandaloneExecutorBackend: Got assigned task 0
13/12/23 11:00:36 INFO StandaloneExecutorBackend: Got assigned task 1
13/12/23 11:00:36 INFO StandaloneExecutorBackend: Got assigned task 2
13/12/23 11:00:36 INFO StandaloneExecutorBackend: Got assigned task 3
13/12/23 11:00:36 DEBUG KryoSerializer: Running user registrator:
geotrellis.spark.KryoRegistrator
13/12/23 11:00:36 DEBUG KryoSerializer: Running user registrator:
geotrellis.spark.KryoRegistrator
13/12/23 11:00:36 DEBUG KryoSerializer: Running user registrator:
geotrellis.spark.KryoRegistrator
13/12/23 11:00:36 DEBUG KryoSerializer: Running user registrator:
geotrellis.spark.KryoRegistrator
registered TileIdWritable
registered TileIdWritable
registered TileIdWritable
registered TileIdWritable
registered ArgWritable
registered ArgWritable
registered ArgWritable
registered ArgWritable
13/12/23 11:00:37 INFO Executor: Running task ID 2
13/12/23 11:00:37 INFO Executor: Running task ID 1
13/12/23 11:00:37 INFO Executor: Running task ID 3
13/12/23 11:00:37 INFO Executor: Running task ID 0
13/12/23 11:00:37 INFO Executor: Fetching
http://10.194.70.21:51166/jars/geotrellis-spark_2.10-0.9.0-SNAPSHOT.jar with
timestamp 1387814434436
13/12/23 11:00:37 INFO Utils: Fetching
http://10.194.70.21:51166/jars/geotrellis-spark_2.10-0.9.0-SNAPSHOT.jar to
/tmp/fetchFileTemp2456419097284083628.tmp
13/12/23 11:00:37 INFO Executor: Adding
file[redacted]/spark/work/app-20131223110034-/0/./geotrellis-spark_2.10-0.9.0-SNAPSHOT.jar
to class loader
13/12/23 11:00:37 ERROR Executor: Uncaught exception in thread
Thread[pool-7-thread-4,5,main]
java.lang.NullPointerException
at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:195)
at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:195)
at scala.Option.flatMap(Option.scala:170)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
13/12/23 11:00:37 ERROR Executor: Uncaught exception in thread
Thread[pool-7-thread-2,5,main]
java.lang.NullPointerException
at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:195)
at

Unable to load additional JARs in yarn-client mode

2013-12-23 Thread Karavany, Ido
Hi All,

For our application we need to use the yarn-client mode featured in 0.8.1
(YARN 2.0.5). We've successfully executed our Java applications in both
yarn-client and yarn-standalone mode.

While in yarn-standalone there is a way to add external JARs, we couldn't find
a way to add those in yarn-client.

Adding jars in the SparkContext constructor or setting SPARK_CLASSPATH didn't
work either.

Are we missing something?
Can you please advise?
If it is currently impossible - can you advise a patch / workaround?

It is crucial for us to get it working with external dependencies.

Many Thanks,
Ido


-
Intel Electronics Ltd.

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


Re: failed to compile spark because of the missing packages

2013-12-23 Thread Patrick Wendell
Hey Nan,

You shouldn't copy lib_managed manually. SBT will deal with that. Try
just using the same .gitignore settings that we have in the spark
github. Seems like you are accidentally including some files that
cause this to get messed up.

- Patrick

On Mon, Dec 23, 2013 at 8:37 AM, Nan Zhu zhunanmcg...@gmail.com wrote:
 Hi, all

 I just downloaded spark 0.8.1, made some modifications, and compiled on my
 laptop; everything works fine.

 I synced the source code directory to my desktop via GitHub (ignoring all
 .jars and target), and then I copied the lib_managed directory to my desktop.

 I tried to compile with sbt. It throws the following errors:

 Can anyone tell me what the reason for these errors could be?

 Thank you very much!


 [error]
 /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkJLineReader.scala:12:
 object jline is not a member of package tools
 [error] import scala.tools.jline.console.completer._
 [error]^
 [error]
 /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkJLineCompletion.scala:11:
 object jline is not a member of package tools
 [error] import scala.tools.jline._
 [error]^
 [error]
 /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala:819:
 type mismatch;
 [error]  found   : org.apache.spark.repl.SparkJLineReader
 [error]  required: scala.tools.nsc.interpreter.InteractiveReader
 [error] else try SparkJLineReader(
 [error]  ^
 [error]
 /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala:1012:
 type mismatch;
 [error]  found   : org.apache.spark.repl.SparkJLineReader
 [error]  required: scala.tools.nsc.interpreter.InteractiveReader
 [error] repl.in = SparkJLineReader(repl)
 [error]   ^
 [error]
 /home/zhunan/spark/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala:258:
 not found: value kafka
 [error] kafkaStream[String, kafka.serializer.StringDecoder](kafkaParams,
 topics, storageLevel)
 [error] ^
 [error]
 /home/zhunan/spark/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala:269:
 not found: value kafka
 [error]   def kafkaStream[T: ClassManifest, D :
 kafka.serializer.Decoder[_]: Manifest](
 [error]  ^
 [error]
 /home/zhunan/spark/streaming/src/main/scala/org/apache/spark/streaming/dstream/KafkaInputDStream.scala:27:
 not found: object kafka
 [error] import kafka.consumer._
 [error]^
 [error]
 /home/zhunan/spark/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala:274:
 ambiguous implicit values:
 [error]  both method fallbackStringCanBuildFrom in class
 LowPriorityImplicits of type [T]=
 scala.collection.generic.CanBuildFrom[String,T,scala.collection.immutable.IndexedSeq[T]]
 [error]  and value evidence$5 of type Manifest[D]
 [error]  match expected type error
 [error] val inputStream = new KafkaInputDStream[T, D](this, kafkaParams,
 topics, storageLevel)
 [error]   ^
 [warn]
 /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkJLineCompletion.scala:32:
 type error in type pattern error is unchecked since it is eliminated by
 erasure
 [warn] catch { case _: MissingRequirementError = None }
 [warn] ^
 [error]
 /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkJLineCompletion.scala:290:
 value executionFor is not a member of object SparkJLineCompletion.this.ids
 [error]   (ids executionFor parsed) orElse
 [error]^
 [warn]
 /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkJLineCompletion.scala:373:
 type error in type pattern error is unchecked since it is eliminated by
 erasure
 [warn] case ex: Exception =
 [warn]  ^
 [error]
 /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkJLineReader.scala:11:
 object jline is not a member of package tools
 [error] import scala.tools.jline.console.ConsoleReader
 [error]^
 [warn]
 /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkJLineReader.scala:24:
 type error in type pattern error is unchecked since it is eliminated by
 erasure
 [warn] catch { case _: Exception = Nil }
 [warn] ^
 [error]
 /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkJLineReader.scala:39:
 class file needed by ConsoleReaderHelper is missing.
 [error] reference value jline of package tools refers to nonexisting symbol.
 [error]   class JLineConsoleReader extends ConsoleReader with
 ConsoleReaderHelper {
 [error]   ^
 [error]
 /home/zhunan/spark/repl/src/main/scala/org/apache/spark/repl/SparkJLineReader.scala:26:
 value getTerminal is not a member of
 SparkJLineReader.this.JLineConsoleReader
 [error]   private def term = consoleReader.getTerminal()
 [error]  

Re: debugging NotSerializableException while using Kryo

2013-12-23 Thread Jie Deng
Maybe try making your class implement Serializable...



Noob Spark questions

2013-12-23 Thread Ognen Duzlevski
Hello, I am new to Spark and have installed it, played with it a bit,
mostly I am reading through the Fast data processing with Spark book.

One of the first things I realized is that I have to learn Scala; the
real-time data analytics part is not supported by the Python API, correct?
I don't mind; Scala seems to be a lovely language! :)

Anyways, I would like to set up a data analysis pipeline where I have
already done the job of exposing a port on the internet (amazon elastic
load balancer) that feeds real-time data from tens-hundreds of thousands of
clients in real-time into a set of internal instances which are essentially
zeroMQ sockets (I do this via mongrel2 and associated handlers).

These handlers can themselves create 0mq sockets to feed data into a
pipeline via a 0mq push/pull, pub/sub or whatever mechanism works best.

One of the pipelines I am evaluating is Spark.

There seems to be information on Spark but for some reason I find it to be
very Hadoop specific. HDFS is mentioned a lot, for example. What if I don't
use Hadoop/HDFS?

What do people do when they want to inhale real-time information? Let's say
I want to use 0mq. Does Spark allow for that? How would I go about doing
this?

What about dumping all the data into a persistent store? Can I dump into
DynamoDB or Mongo or...? How about Amazon S3? I suppose my 0mq handlers can
do that upon receipt of data before it sees the pipeline but sometimes
storing intermediate results helps too...

Thanks!
OD


Re: failed to compile spark because of the missing packages

2013-12-23 Thread Nan Zhu
Hi, Patrick 

Thanks for the reply

I still failed to compile the code, even after making the following attempts:

1. download spark-0.8.1.tgz, 

2. decompress, and copy the files to the github local repo directory 
(.gitignore is just copied from 
https://github.com/apache/incubator-spark/blob/master/.gitignore)

3. push files to git repo

4. pull files in the desktop 

5. sbt/sbt assembly/assembly, failed with the same error as my last email

any further comments?

Best, 

-- 
Nan Zhu


On Monday, December 23, 2013 at 12:22 PM, Patrick Wendell wrote:

 Hey Nan,
 
 You shouldn't copy lib_managed manually. SBT will deal with that. Try
 just using the same .gitignore settings that we have in the spark
 github. Seems like you are accidentally including some files that
 cause this to get messed up.
 
 - Patrick
 

Re: Noob Spark questions

2013-12-23 Thread Jie Deng
I am using Java, and Spark has APIs for Java as well. There is a saying that
Java in Spark is slower than the Scala shell, but it depends on your
requirements.
I am not an expert in Spark, but as far as I know, Spark provides different
levels of storage, including memory and disk. For the disk part, HDFS is just
one choice. I am not using HDFS myself, but then you lose the benefits of
HDFS as well. In other words, it's also just based on your requirements.
And MongoDB or S3 are also doable, at least with the Java APIs, I suppose.
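
For example, a hedged sketch of reading from S3 with the Hadoop s3n filesystem
(bucket, path, and credential values are placeholders):

import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "S3Example")
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
val lines = sc.textFile("s3n://my-bucket/path/to/data.txt")
println(lines.count())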





Re: failed to compile spark because of the missing packages

2013-12-23 Thread Nan Zhu
I finally solved the issue manually.

I found that when I compile with sbt, the lib/ directories under streaming/
and repl/ are missing.

The reason is that the official .gitignore ignores "lib/", while in the
distributed tgz files these two lib/ directories are included.

Best,  

--  
Nan Zhu



Re: debugging NotSerializableException while using Kryo

2013-12-23 Thread Ameet Kini
Using Java serialization would make the NPE go away, but it would be a less
preferable solution. My application is network-intensive, and serialization
cost is significant. In other words, these objects are ideal candidates for
Kryo.
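
For what it's worth, the registration side of that boils down to something
like this minimal sketch (the classes below are hypothetical stand-ins for
TileIdWritable and ArgWritable):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// hypothetical stand-ins for the data classes being registered
class TileId(val id: Long)
class Arg(val data: Array[Byte])

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[TileId])
    kryo.register(classOf[Arg])
    println("registered TileId and Arg")
  }
}

with the spark.kryo.registrator property pointing at MyRegistrator.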





On Mon, Dec 23, 2013 at 3:41 PM, Jie Deng deng113...@gmail.com wrote:

 maybe try to implement your class with serializable...



Re: debugging NotSerializableException while using Kryo

2013-12-23 Thread Michael (Bach) Bui
What Spark version are you using? By looking at the code at Executor.scala
line 195, you will at least know what caused the NPE. We can start from
there.




Re: Noob Spark questions

2013-12-23 Thread Mark Hamstra

 Though there is a saying that Java in Spark is slower than Scala shell


That shouldn't be said.  The Java API is mostly a thin wrapper of the Scala
implementation, and the performance of the Java API is intended to be
equivalent to that of the Scala API.  If you're finding that not to be
true, then that is something that the Spark developers would like to know.






Re: Noob Spark questions

2013-12-23 Thread Ognen Duzlevski
Hello,

On Mon, Dec 23, 2013 at 3:23 PM, Jie Deng deng113...@gmail.com wrote:

 I am using Java, and Spark has APIs for Java as well. Though there is a
 saying that Java in Spark is slower than Scala shell, well, depends on your
 requirement.
 I am not an expert in Spark, but as far as I know, Spark provide different
 level of storage including memory or disk. And for the disk part, HDFS is
 just a choice. I am not using hdfs myself, but you will loss the benefit of
 hdfs as well. In other words, it's also just based on your requirements.
 And MongoDB or S3 are also doable, at least with Java APIs, I suppose.


I guess that answers the question of whether it is doable. Where/how do I
find out how it is doable? :)

I am guessing every pipeline is a custom job of sorts, hence it is the
developer's job to write the connectors to 0mq or DynamoDB, for example?
Or is there some kind of a plug-in system for Spark?

Thanks!


mapPartitions versus map overhead?

2013-12-23 Thread Huan Dao
Hi all, is there any overhead difference between map and mapPartitions, i.e.,
if I implement an algorithm using map + reduce versus mapPartitions + reduce?
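
A quick sketch of the two shapes being compared, assuming an RDD[Int] called
rdd and a simple sum:

val viaMap = rdd.map(_ * 2).reduce(_ + _)
val viaMapPartitions = rdd.mapPartitions(iter => iter.map(_ * 2)).reduce(_ + _)

// mapPartitions mainly helps when per-record work can be hoisted into
// once-per-partition setup, e.g. (the setup here is a trivial stand-in):
val viaMapPartitions2 = rdd.mapPartitions { iter =>
  val offset = 10                // pretend this is expensive per-partition setup
  iter.map(_ * 2 + offset)
}.reduce(_ + _)
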
Thanks, 
Huan Dao



Re: Unable to load additional JARs in yarn-client mode

2013-12-23 Thread Matei Zaharia
I’m surprised by this, but one way that will definitely work is to assemble 
your application into a single JAR. If passing them to the constructor doesn’t 
work, that’s probably a bug.
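
For reference, the constructor route looks roughly like this (a sketch against
the 0.8-style SparkContext constructor; the jar paths are placeholders):

import org.apache.spark.SparkContext

val sc = new SparkContext(
  "yarn-client",
  "MyYarnApp",
  System.getenv("SPARK_HOME"),
  Seq("/path/to/myapp.jar", "/path/to/dependency.jar"))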

Matei




RE: Unable to load additional JARs in yarn-client mode

2013-12-23 Thread Liu, Raymond
Ido, when you say add external JARs, do you mean -addJars, which adds some
jars for the SparkContext to use in the AM env?

If so, I think you don't need it for yarn-client mode at all. For yarn-client
mode, the SparkContext runs locally, so I think you just need to make sure
those jars are on the Java classpath.

And for those needed by executors / tasks, I think you can package them as
Matei said. Or maybe we can expose some env variable for yarn-client mode to
allow adding multiple jars as needed.

Best Regards,
Raymond Liu
