In short, ADD_JARS will add the jar to your driver classpath and also send it
to the workers (similar to what you are doing when you call sc.addJar). For
example:

    MASTER=master/url ADD_JARS=/path/to/myJob.jar ./bin/spark-shell

You also have the SPARK_CLASSPATH variable, but it does not distribute the
code; it is only used to compute the driver classpath. By the way, you are
not supposed to change the compute_classpath.sh script.
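A one-line sketch of the programmatic route mentioned above (the jar path is
a stand-in): sc.addJar ships the jar to the executors for subsequent tasks
but, unlike ADD_JARS, does not put it on the driver's classpath.

    // From a job or the shell; the path is hypothetical.
    sc.addJar("/path/to/myJob.jar")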
2014-06-20 19:45 GMT+02:00 Shivani Rao <raoshiv...@gmail.com>:

> Hello Eugen,
>
> You are right about this. I did encounter the PermGen space error in the
> spark-shell. Can you tell me a little more about ADD_JARS? In order to
> ensure my spark-shell has all required jars, I added the jars to the
> $CLASSPATH in the compute_classpath.sh script. Is there another way of
> doing it?
>
> Shivani
>
>
> On Fri, Jun 20, 2014 at 9:47 AM, Eugen Cepoi <cepoi.eu...@gmail.com>
> wrote:
>
>> In my case it was due to a case class I was defining in the spark-shell
>> and not being available on the workers. Packaging it in a jar and adding
>> it with ADD_JARS solved the problem. Note that I don't remember exactly
>> whether it was an out-of-heap-space or a PermGen space exception. Make
>> sure your jarsPath is correct.
>>
>> Usually, to debug this kind of problem, I use the spark-shell (you can
>> do the same in your job, but it's more time-consuming to repackage,
>> deploy, run, iterate). Try, for example:
>> 1) read the lines (without any processing) and count them
>> 2) apply the processing and count
>>
>>
>> 2014-06-20 17:15 GMT+02:00 Shivani Rao <raoshiv...@gmail.com>:
>>
>>> Hello Abhi, I did try that and it did not work.
>>>
>>> And Eugen, yes, I am assembling the argonaut libraries in the fat jar.
>>> So how did you overcome this problem?
>>>
>>> Shivani
>>>
>>>
>>> On Fri, Jun 20, 2014 at 1:59 AM, Eugen Cepoi <cepoi.eu...@gmail.com>
>>> wrote:
>>>
>>>> On 20 June 2014 at 01:46, "Shivani Rao" <raoshiv...@gmail.com> wrote:
>>>> >
>>>> > Hello Andrew,
>>>> >
>>>> > I wish I could share the code, but for proprietary reasons I can't.
>>>> > I can give some idea, though, of what I am trying to do. The job
>>>> > reads a file and processes each of its lines. I am not doing
>>>> > anything intense in the processLogs function.
>>>> >
>>>> > import org.apache.spark.{SparkConf, SparkContext}
>>>> > import argonaut._
>>>> > import argonaut.Argonaut._
>>>> >
>>>> > /* All of these case classes are created from JSON strings
>>>> >  * extracted from the line in the processLogs() function.
>>>> >  */
>>>> > case class struct1…
>>>> > case class struct2…
>>>> > case class value1(struct1, struct2)
>>>> >
>>>> > def processLogs(line: String): Option[(key1, value1)] = {…
>>>> > }
>>>> >
>>>> > def run(sparkMaster: String, appName: String,
>>>> >         executorMemory: String, jarsPath: Seq[String]) {
>>>> >   val sparkConf = new SparkConf()
>>>> >   sparkConf.setMaster(sparkMaster)
>>>> >   sparkConf.setAppName(appName)
>>>> >   sparkConf.set("spark.executor.memory", executorMemory)
>>>> >   sparkConf.setJars(jarsPath) // this includes all the relevant jars
>>>> >   val sc = new SparkContext(sparkConf)
>>>> >   val rawLogs = sc.textFile("hdfs://<my-hadoop-namenode:8020:myfile.txt")
>>>> >   rawLogs.saveAsTextFile("hdfs://<my-hadoop-namenode:8020:writebackForTesting")
>>>> >   rawLogs.flatMap(processLogs).saveAsTextFile("hdfs://<my-hadoop-namenode:8020:outfile.txt")
>>>> > }
>>>> >
>>>> > If I switch to "local" mode, the code runs just fine; in cluster
>>>> > mode it fails with the error I pasted above. In cluster mode, even
>>>> > writing back the file we just read fails
>>>> > (rawLogs.saveAsTextFile("hdfs://<my-hadoop-namenode:8020:writebackForTesting")).
>>>> >
>>>> > I still believe this is a ClassNotFound error in disguise.
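A minimal sketch of the two-step check Eugen suggests above, runnable from
the spark-shell, assuming processLogs is on the classpath (e.g. packaged in a
jar added via ADD_JARS) and using a hypothetical HDFS path:

    // 1) Read the lines without any processing and count them. If this
    //    alone fails, the problem is in reading the data, not in the
    //    processing.
    val rawLogs = sc.textFile("hdfs://my-hadoop-namenode:8020/myfile.txt")
    rawLogs.count()

    // 2) Apply the processing and count again. If step 1 passes and this
    //    throws, suspect processLogs and the classes it needs on the
    //    workers.
    rawLogs.flatMap(processLogs).count()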
>>>>
>>>> Indeed you are right, this can be the reason. I had similar errors
>>>> when defining case classes in the shell and trying to use them in the
>>>> RDDs. Are you shading argonaut in the fat jar?
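A sketch of a build.sbt that would bundle argonaut and the case classes into
a fat jar via the sbt-assembly plugin; the versions and the "provided"
scoping of Spark are assumptions, not something stated in the thread:

    // build.sbt -- assumes the sbt-assembly plugin is enabled in
    // project/plugins.sbt, so that `sbt assembly` produces one fat jar.
    name := "myJob"

    scalaVersion := "2.10.4" // assumed Scala version for Spark 1.0

    libraryDependencies ++= Seq(
      // argonaut must end up inside the fat jar so the workers can see it
      "io.argonaut" %% "argonaut" % "6.0.4",
      // Spark is already installed on the cluster, so keep it out of
      // the fat jar
      "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"
    )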
>>>> > Thanks
>>>> > Shivani
>>>> >
>>>> >
>>>> > On Wed, Jun 18, 2014 at 2:49 PM, Andrew Ash <and...@andrewash.com>
>>>> > wrote:
>>>> >>
>>>> >> Wait, so the file only has four lines and the job is running out
>>>> >> of heap space? Can you share the code you're running that does the
>>>> >> processing? I'd guess that you're doing some intense processing on
>>>> >> every line, but just writing parsed case classes back to disk
>>>> >> sounds very lightweight.
>>>> >>
>>>> >>
>>>> >> On Wed, Jun 18, 2014 at 5:17 PM, Shivani Rao
>>>> >> <raoshiv...@gmail.com> wrote:
>>>> >>>
>>>> >>> I am trying to process a file that contains 4 log lines (not very
>>>> >>> long) and then write my parsed-out case classes to a destination
>>>> >>> folder, and I get the following error:
>>>> >>>
>>>> >>> java.lang.OutOfMemoryError: Java heap space
>>>> >>> at org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183)
>>>> >>> at org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2244)
>>>> >>> at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:280)
>>>> >>> at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:75)
>>>> >>> at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
>>>> >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> >>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>> >>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>> >>> at java.lang.reflect.Method.invoke(Method.java:597)
>>>> >>> at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:974)
>>>> >>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1848)
>>>> >>> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
>>>> >>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
>>>> >>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
>>>> >>> at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>>>> >>> at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:165)
>>>> >>> at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>>>> >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> >>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>> >>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>> >>> at java.lang.reflect.Method.invoke(Method.java:597)
>>>> >>> at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:974)
>>>> >>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1848)
>>>> >>> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
>>>> >>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
>>>> >>> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1946)
>>>> >>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1870)
>>>> >>> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
>>>> >>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
>>>> >>> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1946)
>>>> >>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1870)
>>>> >>> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
>>>> >>>
>>>> >>> Sadly, several folks have faced this error while trying to
>>>> >>> execute Spark jobs, and there are various proposed solutions,
>>>> >>> none of which works for me:
>>>> >>>
>>>> >>> a) I tried changing the number of partitions in my RDD by using
>>>> >>> coalesce(8), as suggested in
>>>> >>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-0-java-lang-outOfMemoryError-Java-Heap-Space-td7735.html#a7736,
>>>> >>> and the error persisted.
>>>> >>>
>>>> >>> b) I tried changing SPARK_WORKER_MEM=2g and
>>>> >>> SPARK_EXECUTOR_MEMORY=10g, and neither worked.
>>>> >>>
>>>> >>> c) I strongly suspect there is a classpath error (see
>>>> >>> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-heap-size-td4719.html),
>>>> >>> mainly because the call stack is repetitive. Maybe the OOM error
>>>> >>> is a disguise?
>>>> >>>
>>>> >>> d) I checked that I am not out of disk space and that I do not
>>>> >>> have too many open files
>>>> >>> (sudo ls /proc/<spark_master_process_id>/fd | wc -l is far below
>>>> >>> the ulimit -n limit).
>>>> >>>
>>>> >>> I am also noticing multiple reflection calls happening to find
>>>> >>> the right class, I guess, so it could be a ClassNotFound error
>>>> >>> disguising itself as a memory error.
>>>> >>>
>>>> >>> Here are other threads describing the same situation, which have
>>>> >>> not been resolved in any way so far:
>>>> >>>
>>>> >>> http://apache-spark-user-list.1001560.n3.nabble.com/no-response-in-spark-web-UI-td4633.html
>>>> >>>
>>>> >>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-program-thows-OutOfMemoryError-td4268.html
>>>> >>>
>>>> >>> Any help is greatly appreciated. I am especially calling out to
>>>> >>> the creators of Spark and the Databricks folks. This seems like a
>>>> >>> "known bug" waiting to happen.
>>>> >>>
>>>> >>> Thanks,
>>>> >>> Shivani
>>>> >>>
>>>> >>> --
>>>> >>> Software Engineer
>>>> >>> Analytics Engineering Team @ Box
>>>> >>> Mountain View, CA
>>>> >
>>>> > --
>>>> > Software Engineer
>>>> > Analytics Engineering Team @ Box
>>>> > Mountain View, CA
>>>
>>> --
>>> Software Engineer
>>> Analytics Engineering Team @ Box
>>> Mountain View, CA
>
> --
> Software Engineer
> Analytics Engineering Team @ Box
> Mountain View, CA
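Pulling the thread's suggestions together: a sketch of the driver setup with
the executor memory applied per-application through SparkConf rather than
through environment variables, and the fat jar shipped via setJars. The
master URL, app name, memory value, and jar path are all hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master:7077")    // hypothetical master URL
      .setAppName("logParser")             // hypothetical app name
      .set("spark.executor.memory", "2g")  // per-application executor heap
      .setJars(Seq("/path/to/myJob.jar"))  // ship the fat jar to the workers
    val sc = new SparkContext(conf)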