Pyspark worker memory

2014-03-18 Thread Jim Blomo
Hello, I'm using the GitHub snapshot of PySpark and having trouble setting
the worker memory correctly. I've set spark.executor.memory to 5g, but
somewhere along the way Xmx is getting capped to 512M. This was not
occurring with the same setup on 0.9.0. How many places do I need to
configure the memory? Thank you!


Re: Pyspark worker memory

2014-03-19 Thread Jim Blomo
Thanks for the suggestion, Matei.  I've tracked this down to a setting
I had to make on the Driver.  It looks like spark-env.sh has no impact
on the Executor, which confused me for a long while with settings like
SPARK_EXECUTOR_MEMORY.  The only thing that mattered was setting the
system property in the *driver* (in this case pyspark/shell.py) or
using -Dspark.executor.memory in SPARK_JAVA_OPTS *on the master*.  I'm
not sure how this differs from the 0.9.0 release, but it seems to work
on SNAPSHOT.
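
For reference, a minimal sketch of the driver-side approach that worked,
assuming a snapshot where pyspark exposes SparkConf (the master URL is a
placeholder):

from pyspark import SparkConf, SparkContext

# spark-env.sh on the workers had no effect on the Executor in my setup;
# the property has to be set from the driver.
conf = SparkConf().set("spark.executor.memory", "5g")
sc = SparkContext("spark://master:7077", "app", conf=conf)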

On Tue, Mar 18, 2014 at 11:52 PM, Matei Zaharia  wrote:
> Try checking spark-env.sh on the workers as well. Maybe code there is
> somehow overriding the spark.executor.memory setting.
>
> Matei
>
> On Mar 18, 2014, at 6:17 PM, Jim Blomo  wrote:
>
> Hello, I'm using the Github snapshot of PySpark and having trouble setting
> the worker memory correctly. I've set spark.executor.memory to 5g, but
> somewhere along the way Xmx is getting capped to 512M. This was not
> occurring with the same setup and 0.9.0. How many places do I need to
> configure the memory? Thank you!
>
>


Re: Pyspark worker memory

2014-03-19 Thread Jim Blomo
To document this, it would be nice to clarify what environment
variables should be used to set which Java system properties, and what
type of process they affect.  I'd be happy to start a page if you can
point me to the right place:

SPARK_JAVA_OPTS:
  -Dspark.executor.memory can be set on the machine running the driver
(typically the master host) and will affect the memory available to
the Executor running on a slave node
  -D

SPARK_DAEMON_OPTS:
  

On Wed, Mar 19, 2014 at 12:48 AM, Jim Blomo  wrote:
> Thanks for the suggestion, Matei.  I've tracked this down to a setting
> I had to make on the Driver.  It looks like spark-env.sh has no impact
> on the Executor, which confused me for a long while with settings like
> SPARK_EXECUTOR_MEMORY.  The only setting that mattered was setting the
> system property in the *driver* (in this case pyspark/shell.py) or
> using -Dspark.executor.memory in SPARK_JAVA_OPTS *on the master*.  I'm
> not sure how this varies from 0.9.0 release, but it seems to work on
> SNAPSHOT.
>
> On Tue, Mar 18, 2014 at 11:52 PM, Matei Zaharia  
> wrote:
>> Try checking spark-env.sh on the workers as well. Maybe code there is
>> somehow overriding the spark.executor.memory setting.
>>
>> Matei
>>
>> On Mar 18, 2014, at 6:17 PM, Jim Blomo  wrote:
>>
>> Hello, I'm using the Github snapshot of PySpark and having trouble setting
>> the worker memory correctly. I've set spark.executor.memory to 5g, but
>> somewhere along the way Xmx is getting capped to 512M. This was not
>> occurring with the same setup and 0.9.0. How many places do I need to
>> configure the memory? Thank you!
>>
>>


Re: PySpark worker fails with IOError Broken Pipe

2014-03-20 Thread Jim Blomo
I think I've encountered the same problem and filed
https://spark-project.atlassian.net/plugins/servlet/mobile#issue/SPARK-1284

For me it hung the worker, though. Can you add reproducible steps and what
version you're running?
On Mar 19, 2014 10:13 PM, "Nicholas Chammas" 
wrote:

> So I have the pyspark shell open and after some idle time I sometimes get
> this:
>
> >>> PySpark worker failed with exception:
>> Traceback (most recent call last):
>>   File "/root/spark/python/pyspark/worker.py", line 77, in main
>> serializer.dump_stream(func(split_index, iterator), outfile)
>>   File "/root/spark/python/pyspark/serializers.py", line 182, in
>> dump_stream
>> self.serializer.dump_stream(self._batched(iterator), stream)
>>   File "/root/spark/python/pyspark/serializers.py", line 118, in
>> dump_stream
>> self._write_with_length(obj, stream)
>>   File "/root/spark/python/pyspark/serializers.py", line 130, in
>> _write_with_length
>> stream.write(serialized)
>> IOError: [Errno 32] Broken pipe
>> Traceback (most recent call last):
>>   File "/root/spark/python/pyspark/daemon.py", line 117, in launch_worker
>> worker(listen_sock)
>>   File "/root/spark/python/pyspark/daemon.py", line 107, in worker
>> outfile.flush()
>> IOError: [Errno 32] Broken pipe
>
>
> The shell is still alive and I can continue to do work.
>
> Is this anything to worry about or fix?
>
> Nick
>
>
>


pySpark memory usage

2014-03-21 Thread Jim Blomo
Hi all, I'm wondering if there are any settings I can use to reduce the
memory needed by the PythonRDD when computing simple stats.  I am
getting OutOfMemoryError exceptions while calculating count() on big,
but not absurd, records.  It seems like PythonRDD is trying to keep
too many of these records in memory, when all that is needed is to
stream through them and count.  Any tips for getting through this
workload?


Code:
from json import loads            # assumption: stdlib json parser for loads()
from pyspark import StorageLevel

session = sc.textFile('s3://...json.gz') # ~54GB of compressed data

# the biggest individual text line is ~3MB
parsed = session.map(lambda l: l.split("\t", 1)) \
                .map(lambda (y, s): (loads(y), loads(s)))
parsed.persist(StorageLevel.MEMORY_AND_DISK)

parsed.count()
# will never finish: executor.Executor: Uncaught exception will FAIL
# all executors

Incidentally, the whole app appears to be killed, but this error is not
propagated to the shell.

Cluster:
15 m2.xlarges (17GB memory, 17GB swap, spark.executor.memory=10GB)

Exception:
java.lang.OutOfMemoryError: Java heap space
at 
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:132)
at 
org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:120)
at 
org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:113)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.api.python.PythonRDD$$anon$1.foreach(PythonRDD.scala:113)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:94)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85)


Re: pySpark memory usage

2014-03-27 Thread Jim Blomo
Thanks, Matei.  I am running "Spark 1.0.0-SNAPSHOT built for Hadoop
1.0.4" from GitHub on 2014-03-18.

I tried batchSizes of 512, 10, and 1; each got me further, but none
succeeded.

I can get this to work -- with manual intervention -- if I omit
`parsed.persist(StorageLevel.MEMORY_AND_DISK)` and set batchSize=1.  Five
of the 175 executors hung, and I had to kill the Python process to get
things going again.  The only indication of this in the logs was `INFO
python.PythonRDD: stdin writer to Python finished early`.
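
For reference, here's how I'm passing batchSize (the PySpark SparkContext
constructor accepts it; the master URL and app name are placeholders):

from pyspark import SparkContext

sc = SparkContext("spark://master:7077", "count-job", batchSize=1)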

With batchSize=1 and persist, a new memory error came up in several
tasks before the app failed:

14/03/28 01:51:15 ERROR executor.Executor: Uncaught exception in
thread Thread[stdin writer for python,5,main]
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:203)
at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:561)
at java.nio.CharBuffer.toString(CharBuffer.java:1201)
at org.apache.hadoop.io.Text.decode(Text.java:350)
at org.apache.hadoop.io.Text.decode(Text.java:327)
at org.apache.hadoop.io.Text.toString(Text.java:254)
at 
org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:349)
at 
org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:349)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:242)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85)

There are other exceptions, but I think they all stem from the above,
e.g. org.apache.spark.SparkException: Error sending message to
BlockManagerMaster

Let me know if there are other settings I should try, or if I should
try a newer snapshot.

Thanks again!


On Mon, Mar 24, 2014 at 9:35 AM, Matei Zaharia  wrote:
> Hey Jim,
>
> In Spark 0.9 we added a "batchSize" parameter to PySpark that makes it group 
> multiple objects together before passing them between Java and Python, but 
> this may be too high by default. Try passing batchSize=10 to your 
> SparkContext constructor to lower it (the default is 1024). Or even 
> batchSize=1 to match earlier versions.
>
> Matei
>
> On Mar 21, 2014, at 6:18 PM, Jim Blomo  wrote:
>
>> Hi all, I'm wondering if there's any settings I can use to reduce the
>> memory needed by the PythonRDD when computing simple stats.  I am
>> getting OutOfMemoryError exceptions while calculating count() on big,
>> but not absurd, records.  It seems like PythonRDD is trying to keep
>> too many of these records in memory, when all that is needed is to
>> stream through them and count.  Any tips for getting through this
>> workload?
>>
>>
>> Code:
>> session = sc.textFile('s3://...json.gz') # ~54GB of compressed data
>>
>> # the biggest individual text line is ~3MB
>> parsed = session.map(lambda l: l.split("\t",1)).map(lambda (y,s):
>> (loads(y), loads(s)))
>> parsed.persist(StorageLevel.MEMORY_AND_DISK)
>>
>> parsed.count()
>> # will never finish: executor.Executor: Uncaught exception will FAIL
>> all executors
>>
>> Incidentally the whole app appears to be killed, but this error is not
>> propagated to the shell.
>>
>> Cluster:
>> 15 m2.xlarges (17GB memory, 17GB swap, spark.executor.memory=10GB)
>>
>> Exception:
>> java.lang.OutOfMemoryError: Java heap space
>>at 
>> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:132)
>>at 
>> org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:120)
>>at 
>> org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:113)
>>at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>at 
>> org.apache.spark.api.python.PythonRDD$$anon$1.foreach(PythonRDD.scala:113)
>>at 
>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>at 
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>>at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:94)
>>at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
>>at 
>> org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85)
>


Re: pySpark memory usage

2014-03-29 Thread Jim Blomo
I've only tried 0.9, in which I ran into the `stdin writer to Python
finished early` error so frequently I wasn't able to load even a 1GB
file.  Let me know if I can provide any other info!

On Thu, Mar 27, 2014 at 8:48 PM, Matei Zaharia  wrote:
> I see, did this also fail with previous versions of Spark (0.9 or 0.8)? We'll 
> try to look into these, seems like a serious error.
>
> Matei
>
> On Mar 27, 2014, at 7:27 PM, Jim Blomo  wrote:
>
>> Thanks, Matei.  I am running "Spark 1.0.0-SNAPSHOT built for Hadoop
>> 1.0.4" from GitHub on 2014-03-18.
>>
>> I tried batchSizes of 512, 10, and 1 and each got me further but none
>> have succeeded.
>>
>> I can get this to work -- with manual interventions -- if I omit
>> `parsed.persist(StorageLevel.MEMORY_AND_DISK)` and set batchSize=1.  5
>> of the 175 executors hung, and I had to kill the python process to get
>> things going again.  The only indication of this in the logs was `INFO
>> python.PythonRDD: stdin writer to Python finished early`.
>>
>> With batchSize=1 and persist, a new memory error came up in several
>> tasks, before the app was failed:
>>
>> 14/03/28 01:51:15 ERROR executor.Executor: Uncaught exception in
>> thread Thread[stdin writer for python,5,main]
>> java.lang.OutOfMemoryError: Java heap space
>>at java.util.Arrays.copyOfRange(Arrays.java:2694)
>>at java.lang.String.<init>(String.java:203)
>>at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:561)
>>at java.nio.CharBuffer.toString(CharBuffer.java:1201)
>>at org.apache.hadoop.io.Text.decode(Text.java:350)
>>at org.apache.hadoop.io.Text.decode(Text.java:327)
>>at org.apache.hadoop.io.Text.toString(Text.java:254)
>>at 
>> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:349)
>>at 
>> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:349)
>>at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>at scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
>>at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>at 
>> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:242)
>>at 
>> org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85)
>>
>> There are other exceptions, but I think they all stem from the above,
>> eg. org.apache.spark.SparkException: Error sending message to
>> BlockManagerMaster
>>
>> Let me know if there are other settings I should try, or if I should
>> try a newer snapshot.
>>
>> Thanks again!
>>
>>
>> On Mon, Mar 24, 2014 at 9:35 AM, Matei Zaharia  
>> wrote:
>>> Hey Jim,
>>>
>>> In Spark 0.9 we added a "batchSize" parameter to PySpark that makes it 
>>> group multiple objects together before passing them between Java and 
>>> Python, but this may be too high by default. Try passing batchSize=10 to 
>>> your SparkContext constructor to lower it (the default is 1024). Or even 
>>> batchSize=1 to match earlier versions.
>>>
>>> Matei
>>>
>>> On Mar 21, 2014, at 6:18 PM, Jim Blomo  wrote:
>>>
>>>> Hi all, I'm wondering if there's any settings I can use to reduce the
>>>> memory needed by the PythonRDD when computing simple stats.  I am
>>>> getting OutOfMemoryError exceptions while calculating count() on big,
>>>> but not absurd, records.  It seems like PythonRDD is trying to keep
>>>> too many of these records in memory, when all that is needed is to
>>>> stream through them and count.  Any tips for getting through this
>>>> workload?
>>>>
>>>>
>>>> Code:
>>>> session = sc.textFile('s3://...json.gz') # ~54GB of compressed data
>>>>
>>>> # the biggest individual text line is ~3MB
>>>> parsed = session.map(lambda l: l.split("\t",1)).map(lambda (y,s):
>>>> (loads(y), loads(s)))
>>>> parsed.persist(StorageLevel.MEMORY_AND_DISK)
>>>>
>>>> parsed.count()
>>>> # will never finish: executor.Executor: Uncaught exception will FAIL
>>>> all executors
>>>>
>>>> Incidentally the whole app appears to be killed, but this error is not
>>>> propagated to the shell.
>>>>
>>>> Cluster:
>>>> 15 m2.xlarges (17GB

Re: pySpark memory usage

2014-03-29 Thread Jim Blomo
I think the problem I ran into in 0.9 is covered in
https://issues.apache.org/jira/browse/SPARK-1323

When I kill the python process, the stacktrace I get indicates that
this happens at initialization.  It looks like the initial write to
the Python process does not go through, and then the iterator hangs
waiting for output.  I haven't had luck turning on debugging for the
executor process.  Still trying to learn the log4j properties that
need to be set.

No luck yet on tracking down the memory leak.

14/03/30 05:15:04 ERROR executor.Executor: Exception in task ID 11
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
at 
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:168)
at 
org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:113)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:231)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:222)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:52)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:212)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:43)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at 
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:42)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)


On Sat, Mar 29, 2014 at 3:17 PM, Jim Blomo  wrote:
> I've only tried 0.9, in which I ran into the `stdin writer to Python
> finished early` so frequently I wasn't able to load even a 1GB file.
> Let me know if I can provide any other info!
>
> On Thu, Mar 27, 2014 at 8:48 PM, Matei Zaharia  
> wrote:
>> I see, did this also fail with previous versions of Spark (0.9 or 0.8)? 
>> We'll try to look into these, seems like a serious error.
>>
>> Matei
>>
>> On Mar 27, 2014, at 7:27 PM, Jim Blomo  wrote:
>>
>>> Thanks, Matei.  I am running "Spark 1.0.0-SNAPSHOT built for Hadoop
>>> 1.0.4" from GitHub on 2014-03-18.
>>>
>>> I tried batchSizes of 512, 10, and 1 and each got me further but none
>>> have succeeded.
>>>
>>> I can get this to work -- with manual interventions -- if I omit
>>> `parsed.persist(StorageLevel.MEMORY_AND_DISK)` and set batchSize=1.  5
>>> of the 175 executors hung, and I had to kill the python process to get
>>> things going again.  The only indication of this in the logs was `INFO
>>> python.PythonRDD: stdin writer to Python finished early`.
>>>
>>> With batchSize=1 and persist, a new memory error came up in several
>>> tasks, before the app was failed:
>>>
>>> 14/03/28 01:51:15 ERROR executor.Executor: Uncaught exception in
>>> thread Thread[stdin writer for python,5,main]
>>> java.lang.OutOfMemoryError: Java heap space
>>>at java.util.Arrays.copyOfRange(Arrays.java:2694)
>>>at java.lang.String.<init>(String.java:203)
>>>at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:561)
>>>at java.nio.CharBuffer.toString(CharBuffer.java:1201)
>>>at org.apache.hadoop.io.Text.decode(Text.java:350)
>>>at org.apache.hadoop.io.Text.decode(Text.java:327)
>>>at org.apache.hadoop.io.Text.toString(Text.java:254)
>>>at 
>>> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:349)
>>>at 
>>> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:349)
>>>at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>>at scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
>>>at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>>at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>>at 
>>> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:242)
>>>at 
>>> org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85)

Re: pySpark memory usage

2014-04-09 Thread Jim Blomo
Hi Matei, thanks for working with me to find these issues.

To summarize, the issues I've seen are:
0.9.0:
- https://issues.apache.org/jira/browse/SPARK-1323

SNAPSHOT 2014-03-18:
- When persist() is used and batchSize=1, java.lang.OutOfMemoryError:
Java heap space.  To me this indicates a memory leak, since Spark
should simply be counting records of size < 3MB (condensed repro below)
- Without persist(), "stdin writer to Python finished early" hangs the
application; the root cause is unknown
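
For reference, the repro in condensed form (the master URL is a
placeholder; the path is elided as in the original):

from pyspark import SparkContext, StorageLevel

sc = SparkContext("spark://master:7077", "repro", batchSize=1)
rdd = sc.textFile('s3://...json.gz')        # ~54GB gzipped JSON
rdd.persist(StorageLevel.MEMORY_AND_DISK)   # with persist(): OOM
rdd.count()                                 # without persist(): writer hang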

I've recently rebuilt another SNAPSHOT, git commit 16b8308, with
debugging turned on.  This gives me the stacktrace for the new "stdin"
problem:

14/04/09 22:22:45 DEBUG PythonRDD: stdin writer to Python finished early
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:196)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
at sun.security.ssl.InputRecord.read(InputRecord.java:509)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at 
org.apache.commons.httpclient.WireLogInputStream.read(WireLogInputStream.java:69)
at 
org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:170)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at 
org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:108)
at 
org.jets3t.service.io.InterruptableInputStream.read(InterruptableInputStream.java:76)
at 
org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.read(HttpMethodReleaseInputStream.java:136)
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.read(NativeS3FileSystem.java:98)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:133)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:38)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:192)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:175)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:242)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85)


On Thu, Apr 3, 2014 at 8:37 PM, Matei Zaharia  wrote:
> Cool, thanks for the update. Have you tried running a branch with this fix 
> (e.g. branch-0.9, or the 0.9.1 release candidate?) Also, what memory leak 
> issue are you referring to, is it separate from this? (Couldn't find it 
> earlier in the thread.)
>
> To turn on debug logging, copy conf/log4j.properties.template to 
> conf/log4j.properties and change the line log4j.rootCategory=INFO, console to 
> log4j.rootCategory=DEBUG, console. Then make sure this file is present in 
> "conf" on all workers.
>
> BTW I've managed to run PySpark with this fix on some reasonably large S3 
> data (multiple GB) and it was fine. It might happen only if records are 
> large, or something like that. How much heap are you giving to your 
> executors, and does it show that much in the web UI?
>
> Matei
>
> On Mar 29, 2014, at 10:44 PM, Jim Blomo  wrote:
>
>> I think the problem I ran into in 0.9 is covered in
>> https://issues.apache.org/jira/browse/SPARK-1323
>>
>> When I kill the python process, the stacktrace I get indicates that
>> this happens at initialization.  It looks like the initial write to
>> the Python process does not go through, and then the iterator hangs
>> waiting for output.  I haven't had luck turning on debugging for the
>> executor process.  Still trying to learn the log4j properties that
>> need to be set.

Re: pySpark memory usage

2014-04-09 Thread Jim Blomo
This dataset is uncompressed text at ~54GB. stats() returns (count:
56757667, mean: 1001.68740583, stdev: 601.775217822, max: 8965, min:
343)
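
(For reference, a sketch of the kind of call that produced those numbers;
the exact metric -- per-line sizes here -- is my assumption:)

sizes = sc.textFile('s3://...').map(len)  # hypothetical metric
print sizes.stats()                       # count, mean, stdev, max, min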

On Wed, Apr 9, 2014 at 6:59 PM, Matei Zaharia  wrote:
> Okay, thanks. Do you have any info on how large your records and data file 
> are? I'd like to reproduce and fix this.
>
> Matei
>
> On Apr 9, 2014, at 3:52 PM, Jim Blomo  wrote:
>
>> Hi Matei, thanks for working with me to find these issues.
>>
>> To summarize, the issues I've seen are:
>> 0.9.0:
>> - https://issues.apache.org/jira/browse/SPARK-1323
>>
>> SNAPSHOT 2014-03-18:
>> - When persist() used and batchSize=1, java.lang.OutOfMemoryError:
>> Java heap space.  To me this indicates a memory leak since Spark
>> should simply be counting records of size < 3MB
>> - Without persist(), "stdin writer to Python finished early" hangs the
>> application, unknown root cause
>>
>> I've recently rebuilt another SNAPSHOT, git commit 16b8308 with
>> debugging turned on.  This gives me the stacktrace on the new "stdin"
>> problem:
>>
>> 14/04/09 22:22:45 DEBUG PythonRDD: stdin writer to Python finished early
>> java.net.SocketException: Connection reset
>>at java.net.SocketInputStream.read(SocketInputStream.java:196)
>>at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
>>at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
>>at sun.security.ssl.InputRecord.read(InputRecord.java:509)
>>at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
>>at 
>> sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
>>at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
>>at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>>at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>>at 
>> org.apache.commons.httpclient.WireLogInputStream.read(WireLogInputStream.java:69)
>>at 
>> org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:170)
>>at java.io.FilterInputStream.read(FilterInputStream.java:133)
>>at 
>> org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:108)
>>at 
>> org.jets3t.service.io.InterruptableInputStream.read(InterruptableInputStream.java:76)
>>at 
>> org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.read(HttpMethodReleaseInputStream.java:136)
>>at 
>> org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.read(NativeS3FileSystem.java:98)
>>at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>>at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>>at java.io.DataInputStream.read(DataInputStream.java:100)
>>at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>>at 
>> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:133)
>>at 
>> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:38)
>>at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:192)
>>at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:175)
>>at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
>>at 
>> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
>>at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
>>at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>at 
>> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:242)
>>at 
>> org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85)
>>
>>
>> On Thu, Apr 3, 2014 at 8:37 PM, Matei Zaharia  
>> wrote:
>>> Cool, thanks for the update. Have you tried running a branch with this fix 
>>> (e.g. branch-0.9, or the 0.9.1 release candidate?) Also, what memory leak 
>>> issue are you referring to, is it separate from this? (Couldn't find it 
>>> earlier in the thread.)
>>>
>>> To turn on debug logging, copy conf/log4j.properties.template to 
>>> conf/log4j.properties and change the line log4j.rootCategory=INFO, console 

Re: Spark - ready for prime time?

2014-04-13 Thread Jim Blomo
On Thu, Apr 10, 2014 at 12:24 PM, Andrew Ash  wrote:
> The biggest issue I've come across is that the cluster is somewhat unstable
> when under memory pressure.  Meaning that if you attempt to persist an RDD
> that's too big for memory, even with MEMORY_AND_DISK, you'll often still get
> OOMs.  I had to carefully modify some of the space tuning parameters and GC
> settings to get some jobs to even finish.

Would you mind sharing some of these settings?  Even just a GitHub
gist would be helpful.  These are the main issues I've run into as
well, and memory pressure also seems to be correlated with akka
timeouts, possibly because of GC pauses.


Finding bad data

2014-04-24 Thread Jim Blomo
I'm using PySpark to load some data and getting an error while
parsing it.  Is it possible to find the source file and line of the bad
data?  I imagine that this would be extremely tricky when dealing with
multiple derived RDDs, so an answer with the caveat of "this only
works when running .map() on a textFile() RDD" is totally fine.
Perhaps if the line number and file were available in pyspark I could
catch the exception and output it with the context?
Any way to narrow down the problem input would be great. Thanks!


Re: pySpark memory usage

2014-04-28 Thread Jim Blomo
FYI, it looks like this "stdin writer to Python finished early" error was
caused by a break in the connection to S3, from which the data was being
pulled.  A recent commit to
PythonRDD (https://github.com/apache/spark/commit/a967b005c8937a3053e215c952d2172ee3dc300d#commitcomment-6114780) noted
that the current exception catching can potentially mask an exception
for the data source, and that is indeed what I see happening.  The
underlying libraries (jets3t and httpclient) do have retry capabilities,
but I don't see a great way of setting them through Spark code.  Instead I
added the patch below which kills the worker on the exception.  This allows
me to completely load the data source after a few worker retries.

Unfortunately, java.net.SocketException is the same error that is sometimes
expected from the client when using methods like take().  One approach
around this conflation is to create a new locally scoped exception class,
eg. WriterException, catch java.net.SocketException during output writing,
then re-throw the new exception.  The worker thread could then distinguish
between the reasons java.net.SocketException might be thrown.  Perhaps
there is a more elegant way to do this in Scala, though?

Let me know if I should open a ticket or discuss this on the developers
list instead.  Best,

Jim

diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
index 0d71fdb..f31158c 100644
--- a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
@@ -95,6 +95,12 @@ private[spark] class PythonRDD[T: ClassTag](
         readerException = e
         Try(worker.shutdownOutput()) // kill Python worker process
 
+      case e: java.net.SocketException =>
+        // This can happen if a connection to the datasource, eg S3, resets
+        // or is otherwise broken
+        readerException = e
+        Try(worker.shutdownOutput()) // kill Python worker process
+
       case e: IOException =>
         // This can happen for legitimate reasons if the Python code stops returning data
         // before we are done passing elements through, e.g., for take(). Just log a message to


On Wed, Apr 9, 2014 at 7:04 PM, Jim Blomo  wrote:

> This dataset is uncompressed text at ~54GB. stats() returns (count:
> 56757667, mean: 1001.68740583, stdev: 601.775217822, max: 8965, min:
> 343)
>
> On Wed, Apr 9, 2014 at 6:59 PM, Matei Zaharia 
> wrote:
> > Okay, thanks. Do you have any info on how large your records and data
> file are? I'd like to reproduce and fix this.
> >
> > Matei
> >
> > On Apr 9, 2014, at 3:52 PM, Jim Blomo  wrote:
> >
> >> Hi Matei, thanks for working with me to find these issues.
> >>
> >> To summarize, the issues I've seen are:
> >> 0.9.0:
> >> - https://issues.apache.org/jira/browse/SPARK-1323
> >>
> >> SNAPSHOT 2014-03-18:
> >> - When persist() used and batchSize=1, java.lang.OutOfMemoryError:
> >> Java heap space.  To me this indicates a memory leak since Spark
> >> should simply be counting records of size < 3MB
> >> - Without persist(), "stdin writer to Python finished early" hangs the
> >> application, unknown root cause
> >>
> >> I've recently rebuilt another SNAPSHOT, git commit 16b8308 with
> >> debugging turned on.  This gives me the stacktrace on the new "stdin"
> >> problem:
> >>
> >> 14/04/09 22:22:45 DEBUG PythonRDD: stdin writer to Python finished early
> >> java.net.SocketException: Connection reset
> >>at java.net.SocketInputStream.read(SocketInputStream.java:196)
> >>at java.net.SocketInputStream.read(SocketInputStream.java:122)
> >>at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
> >>at
> sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
> >>at sun.security.ssl.InputRecord.read(InputRecord.java:509)
> >>at
> sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
> >>at
> sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
> >>at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
> >>at
> java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
> >>at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> >>at
> org.apache.commons.httpclient.WireLogInputStream.read(WireLogInputStream.java:69)
> >>at
> org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:170)

Re: pySpark memory usage

2014-05-12 Thread Jim Blomo
Thanks, Aaron, this looks like a good solution!  Will be trying it out shortly.

I noticed that the S3 exceptions seem to occur more frequently when the
box is swapping.  Why is the box swapping?  combineByKey seems to make
the assumption that it can fit an entire partition in memory when
doing the combineLocally step.  I'm going to try to break this apart,
but will need some sort of heuristic.  Options include looking at memory
usage via the resource module and trying to keep below
'spark.executor.memory', or using batchSize to limit the number of
entries in the dictionary.  Let me know if you have any opinions.
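
Concretely, a sketch of the resource-module check I have in mind (Linux
reports ru_maxrss in kilobytes; the soft limit is a hypothetical fraction
of spark.executor.memory):

import resource

def rss_mb():
    # peak resident set size of this Python worker, in MB
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

SOFT_LIMIT_MB = 8 * 1024  # hypothetical; below spark.executor.memory=10GB
if rss_mb() > SOFT_LIMIT_MB:
    pass  # e.g. flush the combineLocally dictionary early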

On Sun, May 4, 2014 at 8:02 PM, Aaron Davidson  wrote:
> I'd just like to update this thread by pointing to the PR based on our
> initial design: https://github.com/apache/spark/pull/640
>
> This solution is a little more general and avoids catching IOException
> altogether. Long live exception propagation!
>
>
> On Mon, Apr 28, 2014 at 1:28 PM, Patrick Wendell  wrote:
>>
>> Hey Jim,
>>
>> This IOException thing is a general issue that we need to fix and your
>> observation is spot-in. There is actually a JIRA for it here I created a few
>> days ago:
>> https://issues.apache.org/jira/browse/SPARK-1579
>>
>> Aaron is assigned on that one but not actively working on it, so we'd
>> welcome a PR from you on this if you are interested.
>>
>> The first thought we had was to set a volatile flag when the reader sees
>> an exception (indicating there was a failure in the task) and avoid
>> swallowing the IOException in the writer if this happens. But I think there
>> is a race here where the writer sees the error first before the reader knows
>> what is going on.
>>
>> Anyways maybe if you have a simpler solution you could sketch it out in
>> the JIRA and we could talk over there. The current proposal in the JIRA is
>> somewhat complicated...
>>
>> - Patrick
>>
>>
>>
>>
>>
>>
>> On Mon, Apr 28, 2014 at 1:01 PM, Jim Blomo  wrote:
>>>
>>> FYI, it looks like this "stdin writer to Python finished early" error was
>>> caused by a break in the connection to S3, from which the data was being
>>> pulled.  A recent commit to PythonRDD noted that the current exception
>>> catching can potentially mask an exception for the data source, and that is
>>> indeed what I see happening.  The underlying libraries (jets3t and
>>> httpclient) do have retry capabilities, but I don't see a great way of
>>> setting them through Spark code.  Instead I added the patch below which
>>> kills the worker on the exception.  This allows me to completely load the
>>> data source after a few worker retries.
>>>
>>> Unfortunately, java.net.SocketException is the same error that is
>>> sometimes expected from the client when using methods like take().  One
>>> approach around this conflation is to create a new locally scoped exception
>>> class, eg. WriterException, catch java.net.SocketException during output
>>> writing, then re-throw the new exception.  The worker thread could then
>>> distinguish between the reasons java.net.SocketException might be thrown.
>>> Perhaps there is a more elegant way to do this in Scala, though?
>>>
>>> Let me know if I should open a ticket or discuss this on the developers
>>> list instead.  Best,
>>>
>>> Jim
>>>
>>> diff --git
>>> a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
>>> b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
>>> index 0d71fdb..f31158c 100644
>>> --- a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
>>> +++ b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
>>> @@ -95,6 +95,12 @@ private[spark] class PythonRDD[T: ClassTag](
>>>  readerException = e
>>>  Try(worker.shutdownOutput()) // kill Python worker process
>>>
>>> +  case e: java.net.SocketException =>
>>> +   // This can happen if a connection to the datasource, eg S3,
>>> resets
>>> +   // or is otherwise broken
>>> +readerException = e
>>> +Try(worker.shutdownOutput()) // kill Python worker process
>>> +
>>>case e: IOException =>
>>>  // This can happen for legitimate reasons if the Python code
>>> stops returning data
>>>  // before we are done passing elements through, e.g., for
>>> take(). Just log a

Re: pySpark memory usage

2014-05-15 Thread Jim Blomo
Should add that I had to tweak the numbers a bit to keep above the swap
threshold, but below the "Too many open files" limit (`ulimit -n` is
32768).

On Wed, May 14, 2014 at 10:47 AM, Jim Blomo  wrote:
> That worked amazingly well, thank you Matei!  Numbers that worked for
> me were 400 for the textFile()s, 1500 for the join()s.
>
> On Mon, May 12, 2014 at 7:58 PM, Matei Zaharia  
> wrote:
>> Hey Jim, unfortunately external spilling is not implemented in Python right 
>> now. While it would be possible to update combineByKey to do smarter stuff 
>> here, one simple workaround you can try is to launch more map tasks (or more 
>> reduce tasks). To set the minimum number of map tasks, you can pass it as a 
>> second argument to textFile and such (e.g. sc.textFile("s3n://foo.txt", 
>> 1000)).
>>
>> Matei
>>
>> On May 12, 2014, at 5:47 PM, Jim Blomo  wrote:
>>
>>> Thanks, Aaron, this looks like a good solution!  Will be trying it out 
>>> shortly.
>>>
>>> I noticed that the S3 exception seem to occur more frequently when the
>>> box is swapping.  Why is the box swapping?  combineByKey seems to make
>>> the assumption that it can fit an entire partition in memory when
>>> doing the combineLocally step.  I'm going to try to break this apart
>>> but will need some sort of heuristic options include looking at memory
>>> usage via the resource module and trying to keep below
>>> 'spark.executor.memory', or using batchSize to limit the number of
>>> entries in the dictionary.  Let me know if you have any opinions.
>>>
>>> On Sun, May 4, 2014 at 8:02 PM, Aaron Davidson  wrote:
>>>> I'd just like to update this thread by pointing to the PR based on our
>>>> initial design: https://github.com/apache/spark/pull/640
>>>>
>>>> This solution is a little more general and avoids catching IOException
>>>> altogether. Long live exception propagation!
>>>>
>>>>
>>>> On Mon, Apr 28, 2014 at 1:28 PM, Patrick Wendell  
>>>> wrote:
>>>>>
>>>>> Hey Jim,
>>>>>
>>>>> This IOException thing is a general issue that we need to fix and your
>>>>> observation is spot-in. There is actually a JIRA for it here I created a 
>>>>> few
>>>>> days ago:
>>>>> https://issues.apache.org/jira/browse/SPARK-1579
>>>>>
>>>>> Aaron is assigned on that one but not actively working on it, so we'd
>>>>> welcome a PR from you on this if you are interested.
>>>>>
>>>>> The first thought we had was to set a volatile flag when the reader sees
>>>>> an exception (indicating there was a failure in the task) and avoid
>>>>> swallowing the IOException in the writer if this happens. But I think 
>>>>> there
>>>>> is a race here where the writer sees the error first before the reader 
>>>>> knows
>>>>> what is going on.
>>>>>
>>>>> Anyways maybe if you have a simpler solution you could sketch it out in
>>>>> the JIRA and we could talk over there. The current proposal in the JIRA is
>>>>> somewhat complicated...
>>>>>
>>>>> - Patrick
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Apr 28, 2014 at 1:01 PM, Jim Blomo  wrote:
>>>>>>
>>>>>> FYI, it looks like this "stdin writer to Python finished early" error was
>>>>>> caused by a break in the connection to S3, from which the data was being
>>>>>> pulled.  A recent commit to PythonRDD noted that the current exception
>>>>>> catching can potentially mask an exception for the data source, and that 
>>>>>> is
>>>>>> indeed what I see happening.  The underlying libraries (jets3t and
>>>>>> httpclient) do have retry capabilities, but I don't see a great way of
>>>>>> setting them through Spark code.  Instead I added the patch below which
>>>>>> kills the worker on the exception.  This allows me to completely load the
>>>>>> data source after a few worker retries.
>>>>>>
>>>>>> Unfortunately, java.net.SocketException is the same error that is
>>>>>> sometimes expected from the client when using methods like take().  One
>>>>>> app

Re: pySpark memory usage

2014-05-15 Thread Jim Blomo
That worked amazingly well, thank you Matei!  Numbers that worked for
me were 400 for the textFile()s, 1500 for the join()s.
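
For reference, a sketch of where those numbers go (paths and RDD names
are placeholders):

sessions = sc.textFile('s3n://...', 400)  # minimum number of map tasks
joined = sessions.join(other_rdd, 1500)   # partitions for the join's reduce side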

On Mon, May 12, 2014 at 7:58 PM, Matei Zaharia  wrote:
> Hey Jim, unfortunately external spilling is not implemented in Python right 
> now. While it would be possible to update combineByKey to do smarter stuff 
> here, one simple workaround you can try is to launch more map tasks (or more 
> reduce tasks). To set the minimum number of map tasks, you can pass it as a 
> second argument to textFile and such (e.g. sc.textFile("s3n://foo.txt", 
> 1000)).
>
> Matei
>
> On May 12, 2014, at 5:47 PM, Jim Blomo  wrote:
>
>> Thanks, Aaron, this looks like a good solution!  Will be trying it out 
>> shortly.
>>
>> I noticed that the S3 exception seem to occur more frequently when the
>> box is swapping.  Why is the box swapping?  combineByKey seems to make
>> the assumption that it can fit an entire partition in memory when
>> doing the combineLocally step.  I'm going to try to break this apart
>> but will need some sort of heuristic options include looking at memory
>> usage via the resource module and trying to keep below
>> 'spark.executor.memory', or using batchSize to limit the number of
>> entries in the dictionary.  Let me know if you have any opinions.
>>
>> On Sun, May 4, 2014 at 8:02 PM, Aaron Davidson  wrote:
>>> I'd just like to update this thread by pointing to the PR based on our
>>> initial design: https://github.com/apache/spark/pull/640
>>>
>>> This solution is a little more general and avoids catching IOException
>>> altogether. Long live exception propagation!
>>>
>>>
>>> On Mon, Apr 28, 2014 at 1:28 PM, Patrick Wendell  wrote:
>>>>
>>>> Hey Jim,
>>>>
>>>> This IOException thing is a general issue that we need to fix and your
>>>> observation is spot-in. There is actually a JIRA for it here I created a 
>>>> few
>>>> days ago:
>>>> https://issues.apache.org/jira/browse/SPARK-1579
>>>>
>>>> Aaron is assigned on that one but not actively working on it, so we'd
>>>> welcome a PR from you on this if you are interested.
>>>>
>>>> The first thought we had was to set a volatile flag when the reader sees
>>>> an exception (indicating there was a failure in the task) and avoid
>>>> swallowing the IOException in the writer if this happens. But I think there
>>>> is a race here where the writer sees the error first before the reader 
>>>> knows
>>>> what is going on.
>>>>
>>>> Anyways maybe if you have a simpler solution you could sketch it out in
>>>> the JIRA and we could talk over there. The current proposal in the JIRA is
>>>> somewhat complicated...
>>>>
>>>> - Patrick
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Apr 28, 2014 at 1:01 PM, Jim Blomo  wrote:
>>>>>
>>>>> FYI, it looks like this "stdin writer to Python finished early" error was
>>>>> caused by a break in the connection to S3, from which the data was being
>>>>> pulled.  A recent commit to PythonRDD noted that the current exception
>>>>> catching can potentially mask an exception for the data source, and that 
>>>>> is
>>>>> indeed what I see happening.  The underlying libraries (jets3t and
>>>>> httpclient) do have retry capabilities, but I don't see a great way of
>>>>> setting them through Spark code.  Instead I added the patch below which
>>>>> kills the worker on the exception.  This allows me to completely load the
>>>>> data source after a few worker retries.
>>>>>
>>>>> Unfortunately, java.net.SocketException is the same error that is
>>>>> sometimes expected from the client when using methods like take().  One
>>>>> approach around this conflation is to create a new locally scoped 
>>>>> exception
>>>>> class, eg. WriterException, catch java.net.SocketException during output
>>>>> writing, then re-throw the new exception.  The worker thread could then
>>>>> distinguish between the reasons java.net.SocketException might be thrown.
>>>>> Perhaps there is a more elegant way to do this in Scala, though?
>>>>>
>>>>> Let me know if I should open a ticket or discuss this on the developers

Re: Command exited with code 137

2014-06-13 Thread Jim Blomo
I've seen these caused by the OOM killer.  I recommend checking
/var/log/syslog to see if it was activated due to lack of system
memory.

On Thu, Jun 12, 2014 at 11:45 PM, libl <271592...@qq.com> wrote:
> I use standalone mode to submit tasks. But often I get an error. The stacktrace is as follows:
>
> 2014-06-12 11:37:36,578 [INFO] [org.apache.spark.Logging$class]
> [Method:logInfo] [Line:49] [Thread:spark-akka.actor.default-dispatcher-18]
>  - Executor updated: app-20140612092238-0007/0 is now FAILED (Command exited
> with code 137)
> 2014-06-12 11:37:36,670 [INFO] [org.apache.spark.Logging$class]
> [Method:logInfo] [Line:49] [Thread:spark-akka.actor.default-dispatcher-18]
>  - Executor app-20140612092238-0007/0 removed: Command exited with code 137
> 2014-06-12 11:37:36,673 [INFO] [org.apache.spark.Logging$class]
> [Method:logInfo] [Line:49] [Thread:spark-akka.actor.default-dispatcher-15]
>  - Executor 0 disconnected, so removing it
> 2014-06-12 11:37:36,682 [ERROR] [org.apache.spark.Logging$class]
> [Method:logError] [Line:65] [Thread:spark-akka.actor.default-dispatcher-15]
>  - Lost executor 0 on tj-hadoop-1.certus.com: Unknown executor exit code
> (137) (died from signal 9?)
>
>
> spark config is
> spark_worker_timeout=300
> spark_akka_timeout=500
> spark_akka_frameSize=1000
> spark_akka_num_retries=30
> spark_akka_askTimeout=300
>
>
>


Compiling SNAPSHOT

2014-08-14 Thread Jim Blomo
Hi, I'm having trouble compiling a snapshot; any advice would be
appreciated.  I get the error below when compiling either master or
branch-1.1.  The key error is, I believe, "[ERROR] File name too long",
but I don't understand what it is referring to.  Thanks!


./make-distribution.sh --tgz --skip-java-test -Phadoop-2.4  -Pyarn
-Dyarn.version=2.4.0

[ERROR]
 while compiling:
/home/jblomo/src/spark/core/src/test/scala/org/apache/spark/util/random/XORShiftRandomSuite.scala
during phase: jvm
 library version: version 2.10.4
compiler version: version 2.10.4
  reconstructed args: -classpath
/home/jblomo/src/spark/core/target/scala-2.10/test-classes:/home/jblomo/src/spark/core/target/scala-2.10/classes:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-client/2.4.0/hadoop-client-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-common/2.4.0/hadoop-common-2.4.0.jar:/home/jblomo/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar:/home/jblomo/.m2/repository/xmlenc/xmlenc/0.52/xmlenc-0.52.jar:/home/jblomo/.m2/repository/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar:/home/jblomo/.m2/repository/commons-io/commons-io/2.4/commons-io-2.4.jar:/home/jblomo/.m2/repository/commons-collections/commons-collections/3.2.1/commons-collections-3.2.1.jar:/home/jblomo/.m2/repository/commons-lang/commons-lang/2.6/commons-lang-2.6.jar:/home/jblomo/.m2/repository/commons-configuration/commons-configuration/1.6/commons-configuration-1.6.jar:/home/jblomo/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.jar:/home/jblomo/.m2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar:/home/jblomo/.m2/repository/commons-beanutils/commons-beanutils-core/1.8.0/commons-beanutils-core-1.8.0.jar:/home/jblomo/.m2/repository/org/codehaus/jackson/jackson-core-asl/1.8.8/jackson-core-asl-1.8.8.jar:/home/jblomo/.m2/repository/org/codehaus/jackson/jackson-mapper-asl/1.8.8/jackson-mapper-asl-1.8.8.jar:/home/jblomo/.m2/repository/org/apache/avro/avro/1.7.6/avro-1.7.6.jar:/home/jblomo/.m2/repository/com/google/protobuf/protobuf-java/2.5.0/protobuf-java-2.5.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-auth/2.4.0/hadoop-auth-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/commons/commons-compress/1.4.1/commons-compress-1.4.1.jar:/home/jblomo/.m2/repository/org/tukaani/xz/1.0/xz-1.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.4.0/hadoop-hdfs-2.4.0.jar:/home/jblomo/.m2/repository/org/mortbay/jetty/jetty-util/6.1.26/jetty-util-6.1.26.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-app/2.4.0/hadoop-mapreduce-client-app-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-common/2.4.0/hadoop-mapreduce-client-common-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-yarn-client/2.4.0/hadoop-yarn-client-2.4.0.jar:/home/jblomo/.m2/repository/com/sun/jersey/jersey-client/1.9/jersey-client-1.9.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-yarn-server-common/2.4.0/hadoop-yarn-server-common-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-shuffle/2.4.0/hadoop-mapreduce-client-shuffle-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-yarn-api/2.4.0/hadoop-yarn-api-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-core/2.4.0/hadoop-mapreduce-client-core-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-yarn-common/2.4.0/hadoop-yarn-common-2.4.0.jar:/home/jblomo/.m2/repository/javax/xml/bind/jaxb-api/2.2.2/jaxb-api-2.2.2.jar:/home/jblomo/.m2/repository/javax/xml/stream/stax-api/1.0-2/stax-api-1.0-2.jar:/home/jblomo/.m2/repository/javax/activation/activation/1.1/activation-1.1.jar:/home/jblomo/.m2/repository/com/sun/jersey/jersey-core/1.9/jersey-core-1.9.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-jobclient/2.4.0/hadoop-mapreduce-client-jobclient-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-annotations/2.4.0/hadoop-annotations-2.4.0.jar:/home/jblomo/.m2/repository/net/java/dev/jets3t/jets3t/0.9.0/jets3t
-0.9.0.jar:/home/jblomo/.m2/repository/commons-codec/commons-codec/1.5/commons-codec-1.5.jar:/home/jblomo/.m2/repository/org/apache/httpcomponents/httpclient/4.1.2/httpclient-4.1.2.jar:/home/jblomo/.m2/repository/org/apache/httpcomponents/httpcore/4.1.2/httpcore-4.1.2.jar:/home/jblomo/.m2/repository/com/jamesmurty/utils/java-xmlbuilder/0.4/java-xmlbuilder-0.4.jar:/home/jblomo/.m2/repository/org/apache/curator/curator-recipes/2.4.0/curator-recipes-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/curator/curator-framework/2.4.0/curator-framework-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/curator/curator-client/2.4.0/curator-client-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/zookeeper/zookeeper/3.4.5/zookeeper-3.4.5.jar:/home/jblomo/.m2/repository/jline/jline/0.9.94/jline-0.9.94.jar:/home/jblomo/.m2/repository/o

Re: Compiling SNAPSHOT

2014-08-14 Thread Jim Blomo
Tracked this down to an incompatibility between Scala and eCryptfs:
the encrypted filesystem caps file name length, and the class file
names Scala generates exceed it.  Resolved by compiling in a directory
not mounted with encryption (e.g. /tmp).

On Thu, Aug 14, 2014 at 3:25 PM, Jim Blomo  wrote:
> Hi, I'm having trouble compiling a snapshot, any advice would be
> appreciated.  I get the error below when compiling either master or
> branch-1.1.  The key error is, I believe, "[ERROR] File name too long"
> but I don't understand what it is referring to.  Thanks!
>
>
> ./make-distribution.sh --tgz --skip-java-test -Phadoop-2.4  -Pyarn
> -Dyarn.version=2.4.0
>
> [ERROR]
>  while compiling:
> /home/jblomo/src/spark/core/src/test/scala/org/apache/spark/util/random/XORShiftRandomSuite.scala
> during phase: jvm
>  library version: version 2.10.4
> compiler version: version 2.10.4
>   reconstructed args: -classpath
> /home/jblomo/src/spark/core/target/scala-2.10/test-classes:/home/jblomo/src/spark/core/target/scala-2.10/classes:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-client/2.4.0/hadoop-client-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-common/2.4.0/hadoop-common-2.4.0.jar:/home/jblomo/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar:/home/jblomo/.m2/repository/xmlenc/xmlenc/0.52/xmlenc-0.52.jar:/home/jblomo/.m2/repository/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar:/home/jblomo/.m2/repository/commons-io/commons-io/2.4/commons-io-2.4.jar:/home/jblomo/.m2/repository/commons-collections/commons-collections/3.2.1/commons-collections-3.2.1.jar:/home/jblomo/.m2/repository/commons-lang/commons-lang/2.6/commons-lang-2.6.jar:/home/jblomo/.m2/repository/commons-configuration/commons-configuration/1.6/commons-configuration-1.6.jar:/home/jblomo/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.jar:/home/jblomo/.m2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar:/home/jblomo/.m2/repository/commons-beanutils/commons-beanutils-core/1.8.0/commons-beanutils-core-1.8.0.jar:/home/jblomo/.m2/repository/org/codehaus/jackson/jackson-core-asl/1.8.8/jackson-core-asl-1.8.8.jar:/home/jblomo/.m2/repository/org/codehaus/jackson/jackson-mapper-asl/1.8.8/jackson-mapper-asl-1.8.8.jar:/home/jblomo/.m2/repository/org/apache/avro/avro/1.7.6/avro-1.7.6.jar:/home/jblomo/.m2/repository/com/google/protobuf/protobuf-java/2.5.0/protobuf-java-2.5.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-auth/2.4.0/hadoop-auth-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/commons/commons-compress/1.4.1/commons-compress-1.4.1.jar:/home/jblomo/.m2/repository/org/tukaani/xz/1.0/xz-1.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.4.0/hadoop-hdfs-2.4.0.jar:/home/jblomo/.m2/repository/org/mortbay/jetty/jetty-util/6.1.26/jetty-util-6.1.26.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-app/2.4.0/hadoop-mapreduce-client-app-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-common/2.4.0/hadoop-mapreduce-client-common-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-yarn-client/2.4.0/hadoop-yarn-client-2.4.0.jar:/home/jblomo/.m2/repository/com/sun/jersey/jersey-client/1.9/jersey-client-1.9.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-yarn-server-common/2.4.0/hadoop-yarn-server-common-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-shuffle/2.4.0/hadoop-mapreduce-client-shuffle-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-yarn-api/2.4.0/hadoop-yarn-api-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-core/2.4.0/hadoop-mapreduce-client-core-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-yarn-common/2.4.0/hadoop-yarn-common-2.4.0.jar:/home/jblomo/.m2/repository/javax/xml/bind/jaxb-api/2.2.2/jaxb-api-2.2.2.jar:/home/jblomo/.m2/repository/javax/xml/stream/stax-api/1.0-2/stax-api-1.0-2.jar:/home/jblomo/.m2/repository/javax/activation/activation/1.1/activation-1.1.jar:/home/jblomo/.m2/repository/com/sun/jersey/jersey-core/1.9/jersey-core-1.9.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-jobclient/2.4.0/hadoop-mapreduce-client-jobclient-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-annotations/2.4.0/hadoop-annotations-2.4.0.jar:/home/jblomo/.m2/repository/net/java/dev/jets3t/jets3t/0.9.0/jets
3t-0.9.0.jar:/home/jblomo/.m2/repository/commons-codec/commons-codec/1.5/commons-codec-1.5.jar:/home/jblomo/.m2/repository/org/apache/httpcomponents/httpclient/4.1.2/httpclient-4.1.2.jar:/home/jblomo/.m2/repository/org/apache/httpcomponents/httpcore/4.1.2/httpcore-4.1.2.jar:/home/jblomo/.m2/repository/com/jamesmurty/utils/java-xmlbuilder/0.4/java-xmlbuilder-0.4.jar:/home/jblomo/.m2/repository/org/apache/curator/curator-recipes/2.4.0/curator-recipes-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/curator/curator-framework/2.4.0/cur