Sorting data from sequence files is overly memory intensive

2013-12-11 Thread Matt Cheah
the objects retrieved from the sequence files, there's a ton more memory used than just building the objects manually? It doesn't make sense to me. I'm theoretically performing the same operation on both datasets. Thanks, I'd definitely appreciate the help! -Matt Cheah From: Andrew Winings mch

Re: groupBy() with really big groups fails

2013-12-09 Thread Matt Cheah
there will be. Thanks, -Matt Cheah

Re: groupBy() with really big groups fails

2013-12-09 Thread Matt Cheah
will be (presumably proportional to the size of the dataset). Thanks for the quick response! -Matt Cheah From: Aaron Davidson ilike...@gmail.com Reply-To: user@spark.incubator.apache.org

Re: groupBy() with really big groups fails

2013-12-09 Thread Matt Cheah
Thanks a lot for that. There are definitely a lot of subtleties that we need to consider. We appreciate the thorough explanation! -Matt Cheah From: Aaron Davidson ilike...@gmail.com Reply-To: user@spark.incubator.apache.org

Biggest spark.akka.framesize possible

2013-12-07 Thread Matt Cheah
the ramifications of turning up this value, but I was wondering what the actual maximum number that could be set for it is. I'll benchmark the performance hit accordingly. Thanks! -Matt Cheah

Re: takeSample() computation

2013-12-05 Thread Matt Cheah
Actually, we want the opposite – we want as much data to be computed as possible. It's only for benchmarking purposes, of course. -Matt Cheah From: Matei Zaharia matei.zaha...@gmail.com Reply-To: user@spark.incubator.apache.org

Re: Benchmark numbers for terabytes of data

2013-12-04 Thread Matt Cheah
I'm reading the paper now, thanks. It states 100-node clusters were used. Is it typical in the field to have 100-node clusters at the 1 TB scale? We were expecting to use ~10 nodes. I'm still pretty new to cluster computing, so I'm just not sure how people have set these up. -Matt Cheah

Re: Serializable incompatible with Externalizable error

2013-12-03 Thread Matt Cheah
to me that I'd have to do so. Especially since the tuning guide suggests using Externalizable: http://spark.incubator.apache.org/docs/latest/tuning.html -Matt Cheah From: Andrew Ash and...@andrewash.com Reply-To: user@spark.incubator.apache.org

Serializable incompatible with Externalizable error

2013-12-02 Thread Matt Cheah
:153) I'm running on a Spark cluster generated by the EC2 scripts. This doesn't happen if I'm running things with local[N]. Any ideas? Thanks, -Matt Cheah

Re: Multiple SparkContexts in one JVM

2013-11-20 Thread Matt Cheah
to create a SparkContext per compute-session to sandbox the jars in each user's job. Is this a use case that could be done by only using one SparkContext in the JVM? -Matt Cheah From: Dmitriy Lyubimov dlie...@gmail.com Reply-To: user@spark.incubator.apache.org

EC2 node submit jobs to separate Spark Cluster

2013-11-18 Thread Matt Cheah
EC2 nodes could have their firewalls configured to allow this. We don't want to deploy the web server on the master node of the Spark cluster. Thanks, -Matt Cheah

Take last k elements from RDD?

2013-10-24 Thread Matt Cheah
Hi everyone, I see there is a take() function for RDDs, getting the first n elements. Is there a way to get the last n elements? Thanks, -Matt Cheah

Visitor function to RDD elements

2013-10-22 Thread Matt Cheah
cases, we get a stack trace (running locally with 3 threads). I've included the stack trace below. Thanks, -Matt Cheah org.apache.spark.SparkException: Error communicating with MapOutputTracker at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:84

Re: Visitor function to RDD elements

2013-10-22 Thread Matt Cheah
reduce functions need to be associative and commutative. On Tue, Oct 22, 2013 at 12:28 PM, Matt Cheah mch...@palantir.com wrote: Hi everyone, I have a driver holding a reference to an RDD. The driver would like to visit each item in the RDD in order, say

Re: Visitor function to RDD elements

2013-10-22 Thread Matt Cheah
out again to get this sequential behavior. I appreciate the discussion though. Quite enlightening. Thanks, -Matt Cheah From: Christopher Nguyen c...@adatao.com Date: Tuesday, October 22, 2013 2:23 PM To: user@spark.incubator.apache.org

Re: RDD sample fraction precision

2013-10-21 Thread Matt Cheah
Ah, I misunderstood the functionality then – I was under the impression that exactly that fraction would be returned. Thanks, -Matt Cheah From: Aaron Davidson ilike...@gmail.com Reply-To: user@spark.incubator.apache.org
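
For readers of the last thread: the point that a sample fraction gives an approximate rather than an exact count can be sketched in plain Python. This is only an illustration, assuming the sampling keeps each element independently with probability equal to the fraction (Bernoulli sampling). The function name `bernoulli_sample` and the data are hypothetical and are not Spark's API.

```python
import random


def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction`.

    Hypothetical helper for illustration only; not part of Spark.
    """
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]


data = list(range(10_000))

# Sample sizes for a few seeds: each is close to 1,000 (10% of 10,000),
# but nothing forces any of them to be exactly 1,000.
sizes = [len(bernoulli_sample(data, 0.1, seed)) for seed in range(5)]
print(sizes)
```

If an exact number of elements is needed, a fixed-size draw such as Python's `random.sample(data, k)` is the analogous operation in this sketch.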