Re: What is the location in the source code of the computation of the elements in a map transformation?

2015-05-18 Thread Tom Hubregtsen
Hi Patrick, Thank you very much for your response. I am almost there, but am not sure about my conclusion. Let me try to approach it from a different angle: I would like to time the impact of a particular lambda function, or, if possible, more broadly measure the impact of any map function.
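(A minimal sketch of one way to time a specific lambda from user code, assuming Spark 1.x Scala APIs; the RDD, accumulator name, and lambda are illustrative:)

    import org.apache.spark.{SparkConf, SparkContext}

    object MapTiming {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("map-timing"))
        // Accumulator summing nanoseconds spent inside the lambda across all
        // tasks; the driver can read it only after an action has completed.
        val lambdaNanos = sc.accumulator(0L, "map lambda time")

        val input = sc.parallelize(1 to 1000000)
        val doubled = input.map { x =>
          val start = System.nanoTime()
          val out = x * 2                          // the lambda under measurement
          lambdaNanos += System.nanoTime() - start
          out
        }
        doubled.count()                            // action forces the pipelined map to run
        println(s"time in lambda: ${lambdaNanos.value / 1e9} s, summed over tasks")
        sc.stop()
      }
    }

Note the per-element System.nanoTime calls add overhead of their own, and retried tasks can double-count accumulator updates, so this gives an estimate rather than an exact number.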

What is the location in the source code of the computation of the elements in a map transformation?

2015-05-02 Thread Tom Hubregtsen
I am trying to understand what the data and computation flow is in Spark, and believe I fairly understand the shuffle (both map and reduce side), but I do not get what happens to the computation from the map stages. I know all maps get pipelined on the shuffle (when there is no other action in
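(For reference, in the 1.x source the map computation lives in RDD.scala and MapPartitionsRDD.scala under core/src/main/scala/org/apache/spark/rdd/; the sketch below is paraphrased, so exact signatures may differ by version:)

    // RDD.map only records the function in a new MapPartitionsRDD:
    def map[U: ClassTag](f: T => U): RDD[U] = {
      val cleanF = sc.clean(f)
      new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
    }

    // MapPartitionsRDD.compute applies it lazily, element by element, when a
    // task (ShuffleMapTask or ResultTask) consumes the partition's iterator:
    override def compute(split: Partition, context: TaskContext): Iterator[U] =
      f(context, split.index, firstParent[T].iterator(split, context))

So the elements of a map are computed inside the task that consumes the iterator, which is why consecutive maps pipeline without materializing intermediate results.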

Re: Spilling when not expected

2015-03-13 Thread Tom Hubregtsen
does the web UI say is available? BTW - I don't think any JVM can actually handle a 700G heap ... (maybe Zing). On Thu, Mar 12, 2015 at 4:09 PM, Tom Hubregtsen thubregt...@gmail.com wrote: Hi all, I'm running the teraSort benchmark with a relatively small input set: 5GB. During profiling, I

Spilling when not expected

2015-03-12 Thread Tom Hubregtsen
Hi all, I'm running the teraSort benchmark with a relatively small input set: 5GB. During profiling, I can see I am using a total of 68GB. I've got a terabyte of memory in my system, and have set spark.executor.memory 900g and spark.driver.memory 900g. I use the default for spark.shuffle.memoryFraction
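(For reference, a back-of-the-envelope sketch of the 1.x shuffle budget under the default fractions, assuming the 900g heap is actually granted; the task count is illustrative:)

    // Spark 1.x sizes the shuffle buffer as a fraction of the executor heap,
    // then splits it across concurrently running tasks; a spill is triggered
    // when a task's share fills up, regardless of how much heap is left over.
    val executorHeapGb  = 900.0
    val memoryFraction  = 0.2   // spark.shuffle.memoryFraction (1.x default)
    val safetyFraction  = 0.8   // spark.shuffle.safetyFraction (1.x default)
    val concurrentTasks = 32    // e.g. cores per executor

    val shuffleBudgetGb = executorHeapGb * memoryFraction * safetyFraction
    val perTaskShareGb  = shuffleBudgetGb / concurrentTasks
    println(f"shuffle budget: $shuffleBudgetGb%.0f GB, per-task share: $perTaskShareGb%.1f GB")
    // => shuffle budget: 144 GB, per-task share: 4.5 GB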

Memory

2014-10-23 Thread Tom Hubregtsen
Hi all, I would like to validate my understanding of memory regions in Spark. Any comments on my description below would be appreciated! Execution is split up into stages, based on wide dependencies between RDDs and actions such as save. All transformations involving narrow dependencies before
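(A sketch of the static 1.x split under the default fractions, to make the regions concrete; the heap size is illustrative:)

    // Legacy (pre-unified-memory) static split of the executor heap:
    //   storage - cached RDD blocks        (spark.storage.memoryFraction, default 0.6)
    //   shuffle - aggregation/sort buffers (spark.shuffle.memoryFraction, default 0.2)
    //   rest    - objects created by user code inside tasks
    val heapGb    = 64.0
    val storageGb = heapGb * 0.6  // further scaled by a 0.9 safety fraction in practice
    val shuffleGb = heapGb * 0.2  // further scaled by a 0.8 safety fraction
    val userGb    = heapGb - storageGb - shuffleGb
    println(f"storage=$storageGb%.1f GB  shuffle=$shuffleGb%.1f GB  user=$userGb%.1f GB")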

Impact of input format on timing

2014-10-05 Thread Tom Hubregtsen
Hi, I ran the same version of a program with two different types of input containing equivalent information. Program 1: 10,000 files with on average 50 IDs, one per line. Program 2: 1 file containing 10,000 lines, with on average 50 IDs per line. My program takes the input, creates key/value pairs
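(One factor worth checking is partitioning: sc.textFile creates at least one partition, and thus one task, per file, so the two programs likely run with very different task counts. A sketch with illustrative paths:)

    // Program 1: 10,000 small files -> at least 10,000 partitions/tasks,
    // so per-task scheduling and file-open overhead can dominate.
    val manyFiles = sc.textFile("hdfs:///input/ids-dir/*")
    println(manyFiles.partitions.length)

    // Program 2: one file -> a handful of partitions sized by the HDFS block size.
    val oneFile = sc.textFile("hdfs:///input/ids.txt")
    println(oneFile.partitions.length)

    // The many-file input can be repacked before the heavy work:
    val repacked = manyFiles.coalesce(64)  // 64 is an illustrative target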

RE: spark.local.dir and spark.worker.dir not used

2014-09-27 Thread Tom Hubregtsen
Also, if I am not mistaken, this data is automatically removed after your run. Be sure to check it while running your program.
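(A minimal sketch of pointing Spark at a specific scratch directory; the path is illustrative, and note that cluster managers can override spark.local.dir, e.g. via SPARK_LOCAL_DIRS or YARN's own directories:)

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("local-dir-check")
      .set("spark.local.dir", "/mnt/scratch/spark")  // scratch space for shuffle files
    val sc = new SparkContext(conf)
    // While a job with a shuffle is running, intermediate files appear under
    // /mnt/scratch/spark; they are deleted when the application exits.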

Re: memory size for caching RDD

2014-09-27 Thread Tom Hubregtsen
Use unpersist(), even when not persisted before.
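(That is, the call is safe regardless of the RDD's current storage level; a minimal sketch:)

    val rdd = sc.parallelize(1 to 100)
    rdd.unpersist()   // a no-op if the RDD was never persisted
    rdd.persist()
    rdd.count()       // action materializes the cached blocks
    rdd.unpersist()   // frees them again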