On Jun 20, 2014 at 01:46, "Shivani Rao" <raoshiv...@gmail.com> wrote:
>
> Hello Andrew,
>
> I wish I could share the code, but for proprietary reasons I can't. I can
> give some idea, though, of what I am trying to do. The job reads a file and
> processes each of its lines. I am not doing anything intense in the
> "processLogs" function.
>
> import argonaut._
> import argonaut.Argonaut._
>
>
> /* All of these case classes are created from JSON strings extracted from
>  * the line in the processLogs() function.
>  */
> case class struct1…
> case class struct2…
> case class value1(s1: struct1, s2: struct2)
>
> def processLogs(line: String): Option[(key1, value1)] = { …
> }
>
> def run(sparkMaster: String, appName: String, executorMemory: String,
>         jarsPath: Seq[String]): Unit = {
>   val sparkConf = new SparkConf()
>   sparkConf.setMaster(sparkMaster)
>   sparkConf.setAppName(appName)
>   sparkConf.set("spark.executor.memory", executorMemory)
>   sparkConf.setJars(jarsPath) // This includes all the relevant jars
>   val sc = new SparkContext(sparkConf)
>   val rawLogs = sc.textFile("hdfs://<my-hadoop-namenode>:8020/myfile.txt")
>   rawLogs.saveAsTextFile("hdfs://<my-hadoop-namenode>:8020/writebackForTesting")
>   rawLogs.flatMap(processLogs).saveAsTextFile("hdfs://<my-hadoop-namenode>:8020/outfile.txt")
> }
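For what it's worth, a purely hypothetical version of processLogs built on Argonaut codecs might look like the sketch below; the field names, the codecs, and the choice of key are my guesses for illustration, not your proprietary code:

import argonaut._, Argonaut._

object LogJson {
  // Hypothetical shapes standing in for struct1/struct2/value1 above;
  // the real fields are proprietary and unknown.
  case class Struct1(host: String, level: String)
  case class Struct2(message: String)
  case class Value1(s1: Struct1, s2: Struct2)

  implicit val struct1Codec: CodecJson[Struct1] =
    casecodec2(Struct1.apply, Struct1.unapply)("host", "level")
  implicit val struct2Codec: CodecJson[Struct2] =
    casecodec1(Struct2.apply, Struct2.unapply)("message")
  implicit val value1Codec: CodecJson[Value1] =
    casecodec2(Value1.apply, Value1.unapply)("s1", "s2")

  // Decode one log line (assumed to be a single JSON document) into a keyed
  // record; lines that fail to parse yield None and are dropped by flatMap.
  def processLogs(line: String): Option[(String, Value1)] =
    Parse.decodeOption[Value1](line).map(v => (v.s1.host, v))
}

If the executors cannot load these case classes (or Argonaut itself), failures can surface during task deserialization and look unrelated to the real cause.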
>
> If I switch to "local" mode, the code runs just fine; in cluster mode it
> fails with the error I pasted above. In cluster mode, even writing back the
> file we just read fails:
> rawLogs.saveAsTextFile("hdfs://<my-hadoop-namenode>:8020/writebackForTesting")
>
> I still believe this is a ClassNotFound error in disguise.
>

Indeed, you may be right; this can be the reason. I had similar errors when
defining case classes in the shell and trying to use them in RDDs. Are you
shading Argonaut in the fat jar?
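In case it helps, here is a minimal build.sbt sketch of what shading Argonaut with sbt-assembly could look like. The project name and versions are placeholders, and the ShadeRule syntax assumes a recent sbt-assembly (0.14.x or later), so treat it as a sketch rather than your actual build:

// build.sbt -- hypothetical sketch; names and versions are placeholders.
name := "log-parser"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // Spark is "provided" because the cluster already ships it.
  "org.apache.spark" %% "spark-core" % "1.0.0" % "provided",
  "io.argonaut"      %% "argonaut"   % "6.0.4"
)

// Rename Argonaut's packages inside the fat jar so executors load the copy
// bundled with the job instead of whatever happens to be on their classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("argonaut.**" -> "shaded.argonaut.@1").inAll
)

The jar produced by sbt assembly is then the one to hand to setJars (or to spark-submit), so the worker JVMs see both your case classes and Argonaut.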

> Thanks
> Shivani
>
>
>
> On Wed, Jun 18, 2014 at 2:49 PM, Andrew Ash <and...@andrewash.com> wrote:
>>
>> Wait, so the file only has four lines and the job is running out of heap
>> space? Can you share the code you're running that does the processing? I'd
>> guess that you're doing some intense processing on every line, because just
>> writing parsed case classes back to disk sounds very lightweight.
>>
>>
>> On Wed, Jun 18, 2014 at 5:17 PM, Shivani Rao <raoshiv...@gmail.com>
wrote:
>>>
>>> I am trying to process a file that contains 4 log lines (not very long)
>>> and then write my parsed case classes to a destination folder, and I get
>>> the following error:
>>>
>>>
>>> java.lang.OutOfMemoryError: Java heap space
>>>   at org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183)
>>>   at org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2244)
>>>   at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:280)
>>>   at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:75)
>>>   at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>   at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:974)
>>>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1848)
>>>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
>>>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
>>>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
>>>   at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>>>   at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:165)
>>>   at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>   at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:974)
>>>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1848)
>>>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
>>>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
>>>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1946)
>>>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1870)
>>>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
>>>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
>>>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1946)
>>>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1870)
>>>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
>>>
>>>
>>> Sadly, several folks have faced this error while trying to execute Spark
>>> jobs, and there are various proposed solutions, none of which works for me.
>>>
>>>
>>> a) I tried (
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-0-java-lang-outOfMemoryError-Java-Heap-Space-td7735.html#a7736)
>>> changing the number of partitions in my RDD by using coalesce(8), and the
>>> error persisted.
>>>
>>> b) I tried changing SPARK_WORKER_MEM=2g and SPARK_EXECUTOR_MEMORY=10g;
>>> neither worked (see the sketch after this list).
>>>
>>> c) I strongly suspect there is a classpath error (
>>> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-heap-size-td4719.html),
>>> mainly because the call stack is repetitive. Maybe the OOM error is a
>>> disguise?
>>>
>>> d) I checked that I am not out of disk space and that I do not have too
>>> many open files (ulimit -n << sudo ls /proc/<spark_master_process_id>/fd |
>>> wc -l).
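To make (b) concrete, here is a minimal, hypothetical sketch of setting the executor heap through SparkConf before the SparkContext is created; the master URL, app name, and the 10g figure are placeholders, and the setting only takes effect if each worker actually has that much memory to grant:

import org.apache.spark.{SparkConf, SparkContext}

object MemoryConfigSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical sketch: spark.executor.memory must be set before the
    // SparkContext is constructed, and on a standalone cluster it has to
    // fit within what each worker can hand out.
    val conf = new SparkConf()
      .setMaster("spark://<master-host>:7077")   // placeholder master URL
      .setAppName("log-parser")
      .set("spark.executor.memory", "10g")       // per-executor JVM heap
    val sc = new SparkContext(conf)
    sc.stop()
  }
}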
>>>
>>>
>>> I am also noticing multiple reflection calls happening to find the right
>>> class, I guess, so it could be a ClassNotFound error disguising itself as
>>> a memory error.
>>>
>>>
>>> Here are other threads that report the same situation but have not been
>>> resolved so far:
>>>
>>>
>>>
http://apache-spark-user-list.1001560.n3.nabble.com/no-response-in-spark-web-UI-td4633.html
>>>
>>>
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-program-thows-OutOfMemoryError-td4268.html
>>>
>>>
>>> Any help is greatly appreciated. I am especially calling out to the
>>> creators of Spark and the Databricks folks. This seems like a "known bug"
>>> waiting to happen.
>>>
>>>
>>> Thanks,
>>>
>>> Shivani
>>>
>>>
>>> --
>>> Software Engineer
>>> Analytics Engineering Team@ Box
>>> Mountain View, CA
>>
>>
>
>
>
> --
> Software Engineer
> Analytics Engineering Team@ Box
> Mountain View, CA
