So increasing Executors without increasing physical resources
If I have a system with 16 GB of RAM and I allocate 1 GB to each executor,
setting the number of executors to 8, then I am increasing the resources, right?
How do you explain this case?
Thank You
On Sun, Feb 22, 2015 at 6:12 AM, Aaron
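[A quick sanity check of the arithmetic in the question above, as a pure-Python sketch with the numbers from the email; the point being discussed is that fitting in RAM is not the same as adding parallelism.]

```python
# Illustrative arithmetic from the question above (16 GB box, 8 x 1 GB executors).
system_ram_gb = 16
executor_mem_gb = 1
num_executors = 8

total_executor_mem_gb = num_executors * executor_mem_gb
print(total_executor_mem_gb)  # 8

# The executors fit comfortably in RAM, but they still share the same
# physical CPU cores: past the core count, extra executors add memory
# pressure and scheduling overhead, not real parallelism.
```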
Has anyone done any work on that?
On Sun, Feb 22, 2015 at 9:57 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Yes, exactly.
On Sun, Feb 22, 2015 at 9:10 AM, Ognen Duzlevski
ognen.duzlev...@gmail.com wrote:
On Sat, Feb 21, 2015 at 8:54 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Note that the parallelism (i.e., number of partitions) is just an upper
bound on how much of the work can be done in parallel. If you have 200
partitions, then you can divide the work among between 1 and 200 cores and
all resources will remain utilized. If you have more than 200 cores,
though,
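[The "upper bound" point above can be seen with a toy pure-Python model, not Spark API: with a fixed per-task time, wall-clock time stops improving once the core count reaches the partition count.]

```python
import math

def wall_clock(partitions, cores, task_time=1.0):
    # Each wave runs min(cores, partitions) tasks in parallel; total time
    # is the number of waves times the per-task time (toy model: uniform tasks).
    waves = math.ceil(partitions / min(cores, partitions))
    return waves * task_time

# 200 partitions: going from 100 to 200 cores halves the time...
print(wall_clock(200, 100))  # 2.0
print(wall_clock(200, 200))  # 1.0
# ...but 400 cores cannot help; the extra cores sit idle.
print(wall_clock(200, 400))  # 1.0
```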
Hi all,
I had a streaming application and midway through things decided to up the
executor memory. I spent a long time launching like this:
~/spark-1.2.0-bin-cdh4/bin/spark-submit --class StreamingTest
--executor-memory 2G --master...
and observing that the executor memory is still at the old 512 MB
No, I am talking about some work parallel to prediction works that are
done on GPUs. Like say, given the data for
Also, If I take SparkPageRank for example (org.apache.spark.examples),
there are various RDDs that are created and transformed in the code that is
written. If I want to increase the number of partitions and test out, what
is the optimum number of partitions that gives me the best performance, I
Yes. As I understand it, it would allow me to write SQL queries against a Spark
context. But the query needs to be specified within a deployed job.
What I want is to be able to run multiple dynamic queries specified at
runtime from a dashboard.
--
Nikhil Bafna
On Sat, Feb 21, 2015 at 8:37 PM,
On Sat, Feb 21, 2015 at 8:54 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
No, I am talking about some work parallel to prediction works that are
done on GPUs. Like say, given the data for smaller number of nodes in a
Spark cluster, the prediction needs to be done about the time that the
Here is the class: https://gist.github.com/aarondav/c513916e72101bbe14ec
You can use it by setting mapred.output.committer.class in the Hadoop
configuration (or spark.hadoop.mapred.output.committer.class in the Spark
configuration). Note that this only works for the old Hadoop APIs, I
believe the
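[A toy model of the config wiring described above: Spark forwards `spark.hadoop.*` settings into the Hadoop configuration by stripping the prefix. The committer class name below is a placeholder, not the class from the gist.]

```python
def to_hadoop_conf(spark_conf):
    """Strip the 'spark.hadoop.' prefix, mimicking how such settings are
    forwarded into the Hadoop Configuration (illustrative model only)."""
    prefix = "spark.hadoop."
    return {k[len(prefix):]: v for k, v in spark_conf.items() if k.startswith(prefix)}

conf = {
    # Placeholder class name for illustration:
    "spark.hadoop.mapred.output.committer.class": "com.example.DirectCommitter",
    "spark.executor.memory": "2g",  # unrelated keys are not forwarded
}
print(to_hadoop_conf(conf))
```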
Spark won't listen on that port, mate. It basically means you have a Flume
source running at a port on your localhost. And when you submit your
application in standalone mode, workers will consume data from that port.
Thanks
Best Regards
On Sat, Feb 21, 2015 at 9:22 AM, bit1...@163.com
- Divide and conquer with reduceByKey (like Ashish mentioned, each pair
being the key) would work - looks like a MapReduce-with-combiners
problem. I think reduceByKey would use combiners while aggregateByKey
wouldn't.
- Could we optimize this further by using combineByKey directly ?
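[The map-side combining being discussed can be sketched in plain Python as a toy model of `combineByKey` over partitions (illustrative, not Spark API): combine within each partition first, then merge the per-partition combiners.]

```python
def combine_by_key(partitions, create, merge_value, merge_combiners):
    # Phase 1: combine within each partition (the "map-side combiner").
    per_partition = []
    for part in partitions:
        acc = {}
        for k, v in part:
            acc[k] = merge_value(acc[k], v) if k in acc else create(v)
        per_partition.append(acc)
    # Phase 2: merge the per-partition combiners (the shuffle/reduce side).
    result = {}
    for acc in per_partition:
        for k, c in acc.items():
            result[k] = merge_combiners(result[k], c) if k in result else c
    return result

parts = [[("a", 1), ("b", 2), ("a", 3)], [("a", 4), ("b", 5)]]
# reduceByKey(_ + _) is combineByKey with an identity create and + for both merges.
print(combine_by_key(parts, lambda v: v, lambda a, v: a + v, lambda a, b: a + b))
# {'a': 8, 'b': 7}
```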
How many executors do you have per machine? It would be helpful if you
could list all the configs.
Could you also try to run it without persist? Caching can hurt more than
it helps if you don't have enough memory.
On Fri, Feb 20, 2015 at 5:18 PM, Lee Bierman leebier...@gmail.com wrote:
Thanks for the
I think the cheapest possible way to force materialization is something like
rdd.foreachPartition(i => None)
I get the use case, but as you can see there is a cost: you are forced
to materialize an RDD and cache it just to measure the computation
time. In principle this could be taking
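[The cost being described - having to consume a lazy pipeline just to time it - can be illustrated in plain Python with a generator standing in for an uncached RDD (a sketch, not Spark API):]

```python
import time

# A lazy pipeline: like an RDD transformation, nothing runs until consumed.
lazy = (x * x for x in range(1_000_000))

start = time.perf_counter()
for _ in lazy:   # analogous to rdd.foreachPartition(_ => ()): consume, keep nothing
    pass
elapsed = time.perf_counter() - start
print(f"materialization took {elapsed:.3f}s")
```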
I agree with your assessment as to why it *doesn't* just work. I don't
think a small batch duration helps as all files it sees at the outset
are processed in one batch. Your timestamps are a user-space concept
not a framework concept.
However, there ought to be a great deal of reusability between
Can you be a bit more specific ?
Are you asking about performance across Spark releases ?
Cheers
On Sat, Feb 21, 2015 at 6:38 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
Has some performance prediction work been done on Spark?
Thank You
In this case, I just wanted to know if a single-node cluster with multiple
workers acts like a simulator of a multi-node cluster. Like, if we have a
single-node cluster with 10 workers, say, then can we say that the same
behavior will take place with a cluster of 10 nodes?
It is
There could be many different things causing this. For example, if you only
have a single partition of data, increasing the number of tasks will only
increase execution time due to higher scheduling overhead. Additionally, how
large is a single partition in your application relative to the
So, with the increase in the number of worker instances, if I also increase
the degree of parallelism, will it make any difference?
I can use this model the other way round too, right? I can always predict
the performance of an app with the increase in the number of worker instances,
the deterioration
Yes, I am talking about standalone single node cluster.
No, I am not increasing parallelism. I just wanted to know if it is
natural. Does message passing across the workers account for what is happening?
I am running SparkKMeans, just to validate one prediction model. I am using
several data sets.
No, I am talking about some work parallel to prediction works that are done
on GPUs. Like say, given the data for smaller number of nodes in a Spark
cluster, the prediction needs to be done about the time that the
application would take when we have larger number of nodes.
On Sat, Feb 21, 2015 at
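[One common shape for the kind of runtime prediction described above is an Amdahl-style model fitted on small clusters and extrapolated to larger ones. A pure-Python sketch follows; all constants are made up for illustration.]

```python
def predicted_runtime(nodes, serial=10.0, parallel_work=100.0, overhead_per_node=0.5):
    """Toy model: a fixed serial part, a perfectly divisible parallel part,
    plus coordination overhead that grows with cluster size.
    The constants are hypothetical; a real model would fit them to measurements."""
    return serial + parallel_work / nodes + overhead_per_node * nodes

# Fit on small clusters, then extrapolate to larger ones:
for n in (2, 4, 8, 16):
    print(n, round(predicted_runtime(n), 2))
```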
What's your storage like? are you adding worker machines that are
remote from where the data lives? I wonder if it just means you are
spending more and more time sending the data over the network as you
try to ship more of it to more remote workers.
To answer your question, no in general more
Have you looked at
http://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
?
Cheers
On Sat, Feb 21, 2015 at 4:24 AM, Nikhil Bafna nikhil.ba...@flipkart.com
wrote:
Hi.
My use case is building a realtime monitoring system over
multi-dimensional data.
The way
No, I just have a single node standalone cluster.
I am not tweaking around with the code to increase parallelism. I am just
running SparkKMeans that is there in Spark-1.0.0
I just wanted to know if this behavior is natural. And if so, what causes it?
Thank you
On Sat, Feb 21, 2015 at 8:32
"Workers" has a specific meaning in Spark. You are running many on one
machine? That's possible but not usual.
Each worker's executors have access to a fraction of your machine's
resources then. If you're not increasing parallelism, maybe you're not
actually using additional workers, so are using
Yes, I have decreased the executor memory.
But, if I have to do this, then I have to tweak the code
corresponding to each configuration, right?
On Sat, Feb 21, 2015 at 8:47 PM, Sean Owen so...@cloudera.com wrote:
Workers has a specific meaning in Spark. You are running many on one
Are you replicating any RDDs?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/java-io-IOException-Filesystem-closed-tp20150p21749.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
So, if I keep the number of instances constant and increase the degree of
parallelism in steps, can I expect the performance to increase?
Thank You
On Sat, Feb 21, 2015 at 9:07 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
So, with the increase in the number of worker instances, if I also
For large jobs, the following error message is shown that seems to indicate
that shuffle files for some reason are missing. It's a rather large job
with many partitions. If the data size is reduced, the problem disappears.
I'm running a build from Spark master post 1.2 (build at 2015-01-16) and
I'm experiencing the same issue. Upon closer inspection I'm noticing that
executors are being lost as well. Thing is, I can't figure out how they are
dying. I'm using MEMORY_AND_DISK_SER and I've got over 1.3 TB of memory
allocated for the application. I was thinking perhaps it was possible that
a
Hi.
My use case is building a realtime monitoring system over multi-dimensional
data.
The way I'm planning to go about it is to use Spark Streaming to store
aggregated counts over all dimensions at 10-second intervals.
Then, from a dashboard, I would be able to specify a query over some
dimensions,
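[The windowed, dimension-keyed counting described above can be sketched in plain Python (an illustrative model, not Spark Streaming API; the event schema and dimension values are hypothetical):]

```python
from collections import Counter

def window_key(ts, width=10):
    # Bucket an event timestamp into the start of its 10-second window.
    return ts - ts % width

counts = Counter()
events = [
    # (timestamp_seconds, dimension tuple) - hypothetical schema
    (3,  ("IN", "mobile")),
    (7,  ("IN", "web")),
    (12, ("IN", "mobile")),
]
for ts, dims in events:
    counts[(window_key(ts), dims)] += 1

# Dashboard-style query: total for ("IN", "mobile") across all windows.
print(sum(c for (w, d), c in counts.items() if d == ("IN", "mobile")))  # 2
```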
Hi,
Has some performance prediction work been done on Spark?
Thank You
I can imagine a few reasons. Adding workers might cause fewer tasks to
execute locally (?), so you may be executing more remotely.
Are you increasing parallelism? For trivial jobs, chopping them up
further may cause you to pay more overhead of managing so many small
tasks, for no speed-up in
Hi,
I have been running some jobs on my local single-node standalone cluster.
I am varying the worker instances for the same job, and the time taken for
the job to complete increases with the number of workers. I
repeated some experiments varying the number of nodes in a cluster, too.
Hi,
I have experienced the same behavior. You are talking about standalone
cluster mode right?
BR
On 21 February 2015 at 14:37, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
I have been running some jobs in my local single node stand alone cluster.
I am varying the worker instances for
The message went through after all. Sorry for spamming.
On 21.2.2015. 21:27, pzecevic wrote:
Hi Spark users.
Does anybody know what are the steps required to be able to post to this
list by sending an email to user@spark.apache.org? I just sent a reply to
Corey Nolet's mail Missing shuffle
Can someone share some ideas about how to tune the GC time?
Thanks
From: java8...@hotmail.com
To: user@spark.apache.org
Subject: Spark performance tuning
Date: Fri, 20 Feb 2015 16:04:23 -0500
Hi,
I am new to Spark, and I am trying to test Spark SQL performance vs. Hive. I
set up a
Hi Spark users.
Does anybody know the steps required to post to this list by sending an
email to user@spark.apache.org? I just sent a reply to Corey Nolet's mail
"Missing shuffle files" but I don't think it was accepted by the engine.
If I look at the Spark user list, I don't
Could you try to turn on the external shuffle service?
spark.shuffle.service.enabled=true
On 21.2.2015. 17:50, Corey Nolet wrote:
I'm experiencing the same issue. Upon closer inspection I'm noticing
that executors are being lost as well. Thing is, I can't figure out
how they are dying. I'm
Josh is that class something you guys would consider open sourcing, or
would you rather the community step up and create an OutputCommitter
implementation optimized for S3?
On Fri, Feb 20, 2015 at 4:02 PM, Josh Rosen rosenvi...@gmail.com wrote:
We (Databricks) use our own DirectOutputCommitter