Spot instances on Amazon EMR

2014-09-18 Thread Grzegorz Białek
Hi, I would like to run a Spark application on Amazon EMR. I have some questions about that: 1. I have input data on another HDFS cluster (not on Amazon). Can I send all input data from that cluster to HDFS on the Amazon EMR cluster (if it has enough storage), or do I have to send it to Amazon S3 storage and
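
A minimal sketch of one way to move the data with Spark itself, assuming hypothetical host and bucket names (the snippet does not show the real ones):

    // Read from the external HDFS cluster and write either to S3 or to
    // the EMR cluster's HDFS; namenode host and bucket name are hypothetical.
    val data = sc.textFile("hdfs://external-namenode:8020/path/to/input")
    data.saveAsTextFile("s3n://my-bucket/input")  // or "hdfs://emr-master:9000/input"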

Re: Viewing web UI after fact

2014-09-15 Thread Grzegorz Białek
the application, though the information displayed will be incomplete because the log did not capture all the events (sc.stop() does a final close() on the file written). Andrew 2014-09-05 1:50 GMT-07:00 Grzegorz Białek grzegorz.bia...@codilime.com: Hi Andrew, thank you very much for your answer
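
A minimal sketch of the event-logging setup this thread discusses, assuming a local log directory; the final sc.stop() is what closes the event log file so the history server can render a complete UI:

    val conf = new SparkConf()
      .setAppName("myApp")  // hypothetical app name
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "file:/tmp/spark-events")
    val sc = new SparkContext(conf)
    // ... run jobs ...
    sc.stop()  // performs the final close() on the event log file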

Re: Viewing web UI after fact

2014-09-05 Thread Grzegorz Białek
you're running. To resolve any ambiguity, you may set the log path to file:/tmp/spark-events instead. But first verify whether they actually exist. Let me know if you get it working, -Andrew 2014-08-19 8:23 GMT-07:00 Grzegorz Białek grzegorz.bia...@codilime.com: Hi, Is there any way to view

spark.default.parallelism bug?

2014-08-26 Thread Grzegorz Białek
Hi, consider the following code: import org.apache.spark.{SparkContext, SparkConf} object ParallelismBug extends App { var sConf = new SparkConf() .setMaster("spark://hostName:7077") // .setMaster("local[4]") .set("spark.default.parallelism", "7") // or without it val sc = new
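
A runnable completion of the truncated snippet, assuming the cut-off part simply checks the partition count:

    import org.apache.spark.{SparkContext, SparkConf}

    object ParallelismBug extends App {
      val sConf = new SparkConf()
        .setMaster("local[4]")  // stand-in for spark://hostName:7077
        .set("spark.default.parallelism", "7")
      val sc = new SparkContext(sConf)
      // If the setting is honored, this prints 7:
      println(sc.parallelize(1 to 100).partitions.size)
      sc.stop()
    }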

Prevent too many partitions

2014-08-26 Thread Grzegorz Białek
Hi, I have many union operations in my application. But union increases the number of partitions of the resulting RDDs, and performance with more partitions is sometimes very slow. Is there any cleaner way to prevent the number of partitions from growing than adding coalesce(numPartitions) after each union?
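
A small sketch of the behavior in question: union concatenates the partitions of its inputs, so the count grows, and coalesce shrinks it back:

    val a = sc.parallelize(1 to 100, 4)
    val b = sc.parallelize(101 to 200, 4)
    val u = a.union(b)     // 8 partitions: union concatenates its inputs' partitions
    val c = u.coalesce(4)  // back to 4 partitions, no shuffle by default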

Re: Advantage of using cache()

2014-08-21 Thread Grzegorz Białek
number generation. So it will be hard to isolate the effect of caching. On Wed, Aug 20, 2014 at 7:48 AM, Grzegorz Białek grzegorz.bia...@codilime.com wrote: Hi, I tried to write a small program which shows that using cache() can speed up execution, but results with and without cache were

Tracking memory usage

2014-08-21 Thread Grzegorz Białek
Hi, I would like to ask how to check how much executor memory was used during a run of the application. I know where to check cache memory usage in the logs and in the web UI (in the Storage tab), but where can I check the size of the rest of the heap (used e.g. for aggregation and cogroups during shuffle)? Because it
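
For the storage side, Spark does expose per-executor block-manager memory programmatically; a sketch (note this covers cache memory only, not the shuffle/aggregation heap the question asks about):

    // Max and remaining block-manager memory per executor, in bytes:
    sc.getExecutorMemoryStatus.foreach { case (executor, (max, remaining)) =>
      println(s"$executor: max=$max, remaining=$remaining")
    }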

Web UI doesn't show some stages

2014-08-20 Thread Grzegorz Białek
Hi, I am wondering why some stages (like join, filter) are not visible in the web UI. For example this code: val simple = sc.parallelize(Array.range(0,100)) val simple2 = sc.parallelize(Array.range(0,100)) val toJoin = simple.map(x => (x, x.toString + x.toString)) val rdd = simple2 .map(x =>
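
A hedged completion of the truncated example (the tail is guessed); the usual explanation is that narrow transformations such as map and filter are pipelined into a surrounding stage, so only shuffle boundaries like the join appear as stages of their own:

    val simple  = sc.parallelize(Array.range(0, 100))
    val simple2 = sc.parallelize(Array.range(0, 100))
    val toJoin = simple.map(x => (x, x.toString + x.toString))
    val rdd = simple2
      .map(x => (x, x))                      // hypothetical completion of the cut-off line
      .join(toJoin)                          // the shuffle here produces a visible stage
      .filter { case (k, _) => k % 2 == 0 }  // pipelined; no separate stage in the UI
    rdd.count()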

Advantage of using cache()

2014-08-20 Thread Grzegorz Białek
Hi, I tried to write a small program which shows that using cache() can speed up execution, but results with and without cache were similar. Could you help me with this issue? I tried to compute an rdd and use it later in two places, and I thought that in the second usage this rdd would be recomputed, but it isn't:
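
A minimal sketch of an experiment that should show the speedup, assuming per-element work expensive enough to dominate (the reply above notes that random number generation makes such timings noisy):

    val rdd = sc.parallelize(1 to 10000)
      .map { x => Thread.sleep(1); x * 2 }  // stand-in for expensive work
      .cache()
    rdd.count()  // first action computes and caches the partitions
    rdd.count()  // second action should read from the cache and run much faster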

Re: Viewing web UI after fact

2014-08-19 Thread Grzegorz Białek
but I couldn't find any. Or maybe I'm doing something wrong when launching the history server. Do you have any idea how to solve it? Thanks, Grzegorz On Thu, Aug 14, 2014 at 10:53 AM, Grzegorz Białek grzegorz.bia...@codilime.com wrote: Hi, Thank you both for your answers. Browsing using the Master UI works

Killing spark app problem

2014-08-12 Thread Grzegorz Białek
Hi, when I run some Spark application on my local machine using spark-submit: $SPARK_HOME/bin/spark-submit --driver-memory 1g <class> <jar> When I want to interrupt the computation with ctrl-c, it interrupts the current stage, but then it waits and exits after around 5 minutes, and sometimes doesn't exit at all, and the
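
One hedged workaround, not taken from the thread itself: register a shutdown hook that stops the context, so ctrl-c triggers a clean teardown instead of hanging:

    // Assumes sc is the application's SparkContext.
    sys.addShutdownHook {
      sc.stop()
    }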

Spark app slowing down and I'm unable to kill it

2014-08-11 Thread Grzegorz Białek
Hi, I ran a Spark application in local mode with the command: $SPARK_HOME/bin/spark-submit --driver-memory 1g <class> <jar> with master set to local. After around 10 minutes of computing it started to slow down significantly, so that the next stage took around 50 minutes and the one after that was 80% done after 5 hours, with CPU

Re: Spark app slowing down and I'm unable to kill it

2014-08-11 Thread Grzegorz Białek
I'm using Spark 1.0.0 On Mon, Aug 11, 2014 at 4:14 PM, Grzegorz Białek grzegorz.bia...@codilime.com wrote: Hi, I ran a Spark application in local mode with the command: $SPARK_HOME/bin/spark-submit --driver-memory 1g <class> <jar> with master set to local. After around 10 minutes of computing

Re: Setting spark.executor.memory problem

2014-08-06 Thread Grzegorz Białek
Hi Andrew, Thank you very much for your solution, it works like a charm, and for the very clear explanation. Grzegorz

Setting spark.executor.memory problem

2014-08-05 Thread Grzegorz Białek
Hi, I wanted to make a simple Spark app running in local mode with 2g of spark.executor.memory and 1g for caching. But the following code: val conf = new SparkConf() .setMaster("local") .setAppName("app") .set("spark.executor.memory", "2g") .set("spark.storage.memoryFraction", "0.5") val sc = new
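
A hedged note on the likely resolution (the fix itself is not quoted in the snippet): in local mode the executor lives inside the driver JVM, so spark.executor.memory set from application code cannot resize an already-started heap; the memory has to be given at launch, e.g. via spark-submit --driver-memory 2g:

    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("app")
      // Effectively ignored in local mode; size the heap at launch instead
      // (e.g. --driver-memory 2g on spark-submit):
      .set("spark.executor.memory", "2g")
      .set("spark.storage.memoryFraction", "0.5")
    val sc = new SparkContext(conf)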

master=local vs master=local[*]

2014-08-05 Thread Grzegorz Białek
Hi, I have a Spark application which computes a join of two RDDs. One contains around 150MB of data (7 million entries), the second around 1.5MB (80 thousand entries), and the result of this join contains 50MB of data (2 million entries). When I run it on one core (with master=local) it works correctly (whole
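
A sketch of the setup with synthetic data of roughly the stated entry counts (the real inputs, and the rest of the sentence, are not in the snippet):

    val conf = new SparkConf()
      .setMaster("local")      // vs "local[*]" to use every available core
      .setAppName("joinTest")  // hypothetical app name
    val sc = new SparkContext(conf)
    val big   = sc.parallelize(1 to 7000000).map(x => (x % 80000, x))  // ~7M entries
    val small = sc.parallelize(1 to 80000).map(x => (x, x.toString))   // ~80K entries
    println(big.join(small).count())  // synthetic sizes; counts differ from the thread's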