Should add that I had to tweak the numbers a bit to keep above the swap
threshold, but below the "Too many open files" error (`ulimit -n` is
32768).
On Wed, May 14, 2014 at 10:47 AM, Jim Blomo jim.bl...@gmail.com wrote:
That worked amazingly well, thank you Matei! Numbers that worked for
me were 400 for the textFile()s, 1500 for the join()s.
On Mon, May 12, 2014 at 7:58 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Hey Jim, unfortunately external spilling is not implemented in Python right
now. [...]
Cool, that’s good to hear. We’d also like to add spilling in Python itself, or
at least make it exit with a good message if it can’t do it.
Matei
On May 14, 2014, at 10:47 AM, Jim Blomo jim.bl...@gmail.com wrote:
That worked amazingly well, thank you Matei! Numbers that worked for
me were 400 for the textFile()s, 1500 for the join()s.
Thanks, Aaron, this looks like a good solution! Will be trying it out shortly.
I noticed that the S3 exceptions seem to occur more frequently when the
box is swapping. Why is the box swapping? combineByKey seems to assume
that it can fit an entire partition in memory when doing the [...]
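(Since combineByKey comes up here: in the PySpark API of this era it accepts a numPartitions argument, so asking for more, smaller partitions is one way to shrink what has to fit in memory at once. A minimal sketch; the S3 path and the word-count-style combiner are placeholders, not the actual job:)

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "combine-sketch")
    # placeholder (key, value) pairs; the real records are much larger
    pairs = sc.textFile("s3n://bucket/data.txt").map(lambda line: (line[:8], 1))
    counts = pairs.combineByKey(
        lambda v: v,             # createCombiner
        lambda acc, v: acc + v,  # mergeValue
        lambda a, b: a + b,      # mergeCombiners
        1500)                    # numPartitions: more, smaller partitions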
Hey Jim, unfortunately external spilling is not implemented in Python right
now. While it would be possible to update combineByKey to do smarter stuff
here, one simple workaround you can try is to launch more map tasks (or more
reduce tasks). To set the minimum number of map tasks, you can pass [...]
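(The message is cut off in the archive. For what it's worth, the knobs being described map to the second argument of textFile() and the numPartitions argument of join(); a sketch using the 400/1500 values Jim reports above, with placeholder paths and a placeholder key scheme:)

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "partition-sketch")
    # placeholder tab-separated inputs keyed on the first column
    a = sc.textFile("s3n://bucket/a.txt", 400).map(lambda l: (l.split("\t")[0], l))
    b = sc.textFile("s3n://bucket/b.txt", 400).map(lambda l: (l.split("\t")[0], l))
    joined = a.join(b, 1500)  # 1500 partitions on the reduce side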
I'd just like to update this thread by pointing to the PR based on our
initial design: https://github.com/apache/spark/pull/640
This solution is a little more general and avoids catching IOException
altogether. Long live exception propagation!
On Mon, Apr 28, 2014 at 1:28 PM, Patrick Wendell wrote:
Hey Jim,
This IOException thing is a general issue that we need to fix, and your
observation is spot-on. There is actually a JIRA for it here that I created a
few days ago:
https://issues.apache.org/jira/browse/SPARK-1579
Aaron is assigned to that one but not actively working on it, so we'd
welcome a [...]
Hi Matei, thanks for working with me to find these issues.
To summarize, the issues I've seen are:
0.9.0:
- https://issues.apache.org/jira/browse/SPARK-1323
SNAPSHOT 2014-03-18:
- When persist() is used and batchSize=1: java.lang.OutOfMemoryError:
Java heap space. To me this indicates a memory [...]
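(A minimal way to reproduce that combination, assuming a local build; the input path is a placeholder, and batchSize=1 turns off PySpark's batched serialization:)

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "persist-oom-repro", batchSize=1)
    # persist() plus unbatched serialization is the combination that blew up
    rdd = sc.textFile("big-file.txt").persist()
    print(rdd.count())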
Okay, thanks. Do you have any info on how large your records and data file are?
I’d like to reproduce and fix this.
Matei
On Apr 9, 2014, at 3:52 PM, Jim Blomo jim.bl...@gmail.com wrote:
Hi Matei, thanks for working with me to find these issues.
To summarize, the issues I've seen are: [...]
This dataset is uncompressed text at ~54GB. stats() returns (count:
56757667, mean: 1001.68740583, stdev: 601.775217822, max: 8965, min:
343)
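(Those numbers look like per-record sizes in bytes; a sketch of the kind of call that produces them, assuming stats() is being run over the record lengths, with a placeholder path:)

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "stats-sketch")
    lengths = sc.textFile("s3n://bucket/data.txt").map(lambda line: len(line))
    print(lengths.stats())  # StatCounter: count, mean, stdev, max, min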
On Wed, Apr 9, 2014 at 6:59 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Okay, thanks. Do you have any info on how large your records and data file
are? [...]
I've only tried 0.9, in which I ran into the `stdin writer to Python
finished early` error so frequently that I wasn't able to load even a 1GB file.
Let me know if I can provide any other info!
On Thu, Mar 27, 2014 at 8:48 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
I see, did this also fail with previous versions of Spark (0.9 or 0.8)? [...]
I think the problem I ran into in 0.9 is covered in
https://issues.apache.org/jira/browse/SPARK-1323
When I kill the Python process, the stack trace I get indicates that
this happens at initialization. It looks like the initial write to
the Python process does not go through, and then the [...]
I see, did this also fail with previous versions of Spark (0.9 or 0.8)? We’ll
try to look into these, seems like a serious error.
Matei
On Mar 27, 2014, at 7:27 PM, Jim Blomo jim.bl...@gmail.com wrote:
Thanks, Matei. I am running Spark 1.0.0-SNAPSHOT built for Hadoop
1.0.4 from GitHub on [...]
Hi all, I'm wondering if there are any settings I can use to reduce the
memory needed by PythonRDD when computing simple stats. I am
getting OutOfMemoryError exceptions while calculating count() on big,
but not absurd, records. It seems like PythonRDD is trying to keep
too many of these [...]
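(For reference, the knobs that surface elsewhere in this thread, applied to the count() job described here; the path, split count, and batch size are placeholders:)

    from pyspark import SparkContext

    # smaller serialization batches and more input splits both cap how
    # much PythonRDD has to buffer at once (values are placeholders)
    sc = SparkContext("local[4]", "count-sketch", batchSize=64)
    records = sc.textFile("s3n://bucket/big-records.txt", 400)
    print(records.count())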