Hi,
I would like to run Spark application on Amazon EMR. I have some questions
about that:
1. I have input data on another HDFS (not on Amazon). Can I send all input
data from that cluster to HDFS on the Amazon EMR cluster (if it has enough
storage) or do I have to send it to Amazon S3 storage and
the application, though the
information displayed will be incomplete because the log did not capture
all the events (sc.stop() does a final close() on the file written).
Andrew
2014-09-05 1:50 GMT-07:00 Grzegorz Białek grzegorz.bia...@codilime.com:
Hi Andrew,
Thank you very much for your answer.
you're running. To resolve any
ambiguity, you may set the log path to file:/tmp/spark-events instead.
But first verify whether they actually exist.
Let me know if you get it working,
-Andrew
2014-08-19 8:23 GMT-07:00 Grzegorz Białek grzegorz.bia...@codilime.com:
Hi,
Is there any way to view
Hi,
consider the following code:
import org.apache.spark.{SparkContext, SparkConf}

object ParallelismBug extends App {
  var sConf = new SparkConf()
    .setMaster("spark://hostName:7077") // or .setMaster("local[4]")
    .set("spark.default.parallelism", "7") // or without it
  val sc = new SparkContext(sConf)
Hi,
I have many union operations in my application. But union increases the
number of partitions of the resulting RDDs, and performance with more
partitions is sometimes very slow. Is there any cleaner way to prevent the
number of partitions from increasing than adding
coalesce(numPartitions) after each union?
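As a plain illustration of the behavior being asked about (not Spark itself): union concatenates the partition lists of its inputs, so partition counts add up, and coalesce then merges them back down without a full shuffle. A minimal Python sketch of that bookkeeping, under the assumption that an "RDD" is just a list of partitions:

```python
# Toy model of partition bookkeeping: an "RDD" is a list of partitions
# (each partition a list of elements). Plain Python, not Spark -- it only
# illustrates why union grows the partition count.

def union(rdd_a, rdd_b):
    # Spark's RDD union concatenates partitions, so counts add up.
    return rdd_a + rdd_b

def coalesce(rdd, num_partitions):
    # Merge existing partitions down to num_partitions by assigning
    # them round-robin to the new, smaller set.
    merged = [[] for _ in range(num_partitions)]
    for i, part in enumerate(rdd):
        merged[i % num_partitions].extend(part)
    return merged

a = [[1, 2], [3, 4], [5, 6]]   # 3 partitions
b = [[7, 8], [9, 10]]          # 2 partitions

u = union(a, b)
print(len(u))                  # 5: partition counts add up

c = coalesce(u, 3)
print(len(c))                  # back to 3 partitions, same elements
```

This is why a coalesce(numPartitions) after each union keeps the partition count flat: without it, repeated unions keep summing the counts.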
number generation. So it
will be hard to isolate the effect of caching.
On Wed, Aug 20, 2014 at 7:48 AM, Grzegorz Białek
grzegorz.bia...@codilime.com wrote:
Hi,
I tried to write a small program which shows that using cache() can speed
up execution, but the results with and without cache were
Hi,
I would like to ask how to check how much executor memory was used during
a run of the application. I know where to check cache memory usage in the logs
and in the web UI (in the Storage tab), but where can I check the size of the
rest of the heap (used e.g. for aggregation and cogroups during shuffle)? Because it
Hi,
I am wondering why some stages (like join, filter) are not visible in the
web UI. For example this code:
val simple = sc.parallelize(Array.range(0, 100))
val simple2 = sc.parallelize(Array.range(0, 100))
val toJoin = simple.map(x => (x, x.toString + x.toString))
val rdd = simple2
  .map(x =>
Hi,
I tried to write a small program which shows that using cache() can speed up
execution, but the results with and without cache were similar. Could you help
me with this issue? I tried to compute an RDD and use it later in two places,
and I thought that on the second use this RDD would be recomputed, but it isn't:
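The effect such a test program is trying to observe can be sketched without Spark at all: count how often the expensive computation actually runs with and without a cached intermediate result. A plain-Python illustration (a call counter stands in for recomputation of an uncached RDD):

```python
# Plain-Python sketch of the cache() idea: count how many times the
# expensive function really runs. Without caching, every use recomputes;
# with caching, the work happens once and is reused.

calls = {"n": 0}

def expensive(x):
    calls["n"] += 1          # track real computations
    return x * x

data = range(5)

# Without caching: two uses -> two full computations.
first = [expensive(x) for x in data]
second = [expensive(x) for x in data]
print(calls["n"])            # 10

# With caching: materialize once (like rdd.cache() + an action),
# then both uses read the stored result.
calls["n"] = 0
cached = [expensive(x) for x in data]
first = list(cached)
second = list(cached)
print(calls["n"])            # 5
```

If a benchmark shows no difference between the two cases, the work per element may be too cheap to measure, or (as noted above for random number generation) the workload itself may confound the comparison.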
but I
couldn't find any. Or maybe I'm doing something wrong when launching the
history server.
Do you have any idea how to solve it?
Thanks,
Grzegorz
On Thu, Aug 14, 2014 at 10:53 AM, Grzegorz Białek
grzegorz.bia...@codilime.com wrote:
Hi,
Thank you both for your answers. Browsing using Master UI works
Hi,
when I run a Spark application on my local machine using spark-submit:
$SPARK_HOME/bin/spark-submit --driver-memory 1g class jar
When I interrupt the computation with ctrl-c, it interrupts the current stage,
but then it waits and exits after around 5 minutes, and sometimes doesn't exit
at all, and the
Hi,
I ran a Spark application in local mode with the command:
$SPARK_HOME/bin/spark-submit --driver-memory 1g class jar
with master=local.
After around 10 minutes of computing it started to slow down significantly,
so that the next stage took around 50 minutes, and the one after was 80% done
after 5 hours, and the CPU
I'm using Spark 1.0.0
On Mon, Aug 11, 2014 at 4:14 PM, Grzegorz Białek
grzegorz.bia...@codilime.com wrote:
Hi,
I ran a Spark application in local mode with the command:
$SPARK_HOME/bin/spark-submit --driver-memory 1g class jar
with master=local.
After around 10 minutes of computing
Hi Andrew,
Thank you very much for your solution, it works like a charm, and for very
clear explanation.
Grzegorz
Hi,
I wanted to make a simple Spark app running in local mode with 2g of
spark.executor.memory and 1g for caching. But the following code:
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("app")
  .set("spark.executor.memory", "2g")
  .set("spark.storage.memoryFraction", "0.5")
val sc = new SparkContext(conf)
Hi,
I have a Spark application which computes the join of two RDDs. One contains
around 150MB of data (7 million entries), the second around 1.5MB (80 thousand
entries), and the result of this join contains 50MB of data (2 million entries).
When I run it on one core (with master=local) it works correctly (whole
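When one join side is this small (1.5MB against 150MB), a common pattern is a map-side ("broadcast") hash join: build a hash map from the small side once, then stream the large side through it. A plain-Python sketch of the idea (toy data standing in for the two RDDs, not Spark itself):

```python
# Sketch of a map-side ("broadcast") hash join: hash the small side,
# stream the large side. This is the idea behind broadcasting a small
# dataset in Spark; plain Python here for illustration.

small = [(1, "a"), (2, "b"), (3, "c")]          # small side (~1.5MB in the email)
large = [(1, 10), (2, 20), (2, 21), (4, 40)]    # large side (~150MB in the email)

# Build a lookup table from the small side once.
lookup = {}
for k, v in small:
    lookup.setdefault(k, []).append(v)

# Stream the large side through it; emit (key, (large_val, small_val)).
joined = [(k, (lv, sv))
          for k, lv in large
          if k in lookup
          for sv in lookup[k]]

print(joined)   # [(1, (10, 'a')), (2, (20, 'b')), (2, (21, 'b'))]
```

Because only the small side is materialized in memory, this avoids shuffling the large side, which is often where a plain join struggles once it runs on more than one core.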