Hi Bill,
You can try DirectStream and increase the number of partitions on the Kafka
topic. The input DStream will then have as many partitions as the Kafka topic,
without any repartitioning.
Can you please share your event timeline chart from the Spark UI? You need to
tune your configuration according to the computation.
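For reference, a rough sketch of the direct-stream setup (assuming Spark Streaming 1.x with the spark-streaming-kafka artifact on the classpath; the broker address and topic name below are placeholders):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("DirectStreamExample")
val ssc = new StreamingContext(conf, Seconds(10))

// With the direct approach, each Kafka partition becomes one Spark partition
// in the input DStream, so increasing the topic's partition count avoids an
// explicit repartition step.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc,
  Map("metadata.broker.list" -> "broker1:9092"), // placeholder broker
  Set("mytopic"))                                // placeholder topic
```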
Hi Talebzadeh,
Thanks for your quick response.
>>in 1.6, how many executors do you see for each node?
I have 1 executor per node with SPARK_WORKER_INSTANCES=1.
>>in standalone mode how are you increasing the number of worker instances.
Are you starting another slave on each node?
No, I am not
Hi,
in 1.6, how many executors do you see for each node?
in standalone mode how are you increasing the number of worker instances.
Are you starting another slave on each node?
HTH
Dr Mich Talebzadeh
6 Kafka partitions will result in 6 Spark partitions, not 6 Spark RDDs.
The question of whether you will have a backlog isn't just a matter of
having 1 executor per partition. If a single executor can process all of
the partitions fast enough to complete a batch in under the required time,
you
The equivalent for spark-submit --num-executors should be
spark.executor.instances when used in SparkConf:
http://spark.apache.org/docs/latest/running-on-yarn.html
Could you try setting that with sparkR.init()?
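For comparison, the Scala equivalent of setting this programmatically (a sketch; on YARN, spark.executor.instances plays the role of --num-executors):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Equivalent of `spark-submit --num-executors 5` on YARN:
val conf = new SparkConf()
  .setAppName("NumExecutorsExample")
  .set("spark.executor.instances", "5")
val sc = new SparkContext(conf)
```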
From: Franc Carter
Sent:
Thanks, that works
cheers
On 26 December 2015 at 16:53, Felix Cheung
wrote:
> The equivalent for spark-submit --num-executors should be
> spark.executor.instances
> When used in SparkConf?
> http://spark.apache.org/docs/latest/running-on-yarn.html
>
> Could you try
Oh BTW, it's Spark 1.3.1 on Hadoop 2.4, AMI 3.6.
Sorry for leaving out this information.
Appreciate any help!
Ed
2015-05-18 12:53 GMT-04:00 edward cui edwardcu...@gmail.com:
I actually have the same problem, but I am not sure whether it is a spark
problem or a Yarn problem.
On Mon, May 18, 2015 at 9:07 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
Hi Xiaohe,
All Spark options must go before the jar or they won't take effect.
-Sandy
On Sun, May 17, 2015 at 8:59 AM, xiaohe lan zombiexco...@gmail.com
wrote:
Sorry, both of them are actually assigned tasks.
Hi Xiaohe,
All Spark options must go before the jar or they won't take effect.
-Sandy
On Sun, May 17, 2015 at 8:59 AM, xiaohe lan zombiexco...@gmail.com wrote:
Sorry, both of them are actually assigned tasks.
Aggregated Metrics by Executor
Executor ID | Address | Task Time | Total Tasks | Failed
I actually have the same problem, but I am not sure whether it is a Spark
problem or a YARN problem.
I set up a five-node cluster on AWS EMR and started the YARN daemon on the
master (the NodeManager is not started on the master by default; I don't
want to waste any resources since I have to pay).
Yeah, I read that page before, but it does not mention that the options should
come before the application jar. Actually, if I put the --class option
before the application jar, I get a ClassNotFoundException.
Anyway, thanks again Sandy.
On Tue, May 19, 2015 at 11:06 AM, Sandy Ryza
Hi Sandy,
Thanks for your information. Yes, spark-submit --master yarn
--num-executors 5 --executor-cores 4
target/scala-2.10/simple-project_2.10-1.0.jar --class scala.SimpleApp is
working awesomely. Is there any documentation pointing to this?
Thanks,
Xiaohe
On Tue, May 19, 2015 at 12:07 AM,
Awesome!
It's documented here:
https://spark.apache.org/docs/latest/submitting-applications.html
-Sandy
On Mon, May 18, 2015 at 8:03 PM, xiaohe lan zombiexco...@gmail.com wrote:
Hi Sandy,
Thanks for your information. Yes, spark-submit --master yarn
--num-executors 5 --executor-cores 4
Did you try the --executor-cores param? While the job is running, do a ps aux |
grep spark-submit and check the exact command parameters.
Thanks
Best Regards
On Sat, May 16, 2015 at 12:31 PM, xiaohe lan zombiexco...@gmail.com wrote:
Hi,
I have a 5-node YARN cluster; I used spark-submit to submit
Sorry, them both are assigned task actually.
Aggregated Metrics by Executor
Executor ID | Address | Task Time | Total Tasks | Failed Tasks | Succeeded Tasks | Input Size / Records | Shuffle Write Size / Records | Shuffle Spill (Memory) | Shuffle Spill (Disk)
1 | host1:6184 | 1.7 min | 505640.0 MB / 12318400382.3 MB / 121007701630.4
bash-4.1$ ps aux | grep SparkSubmit
xilan 1704 13.2 1.2 5275520 380244 pts/0 Sl+ 08:39 0:13
/scratch/xilan/jdk1.8.0_45/bin/java -cp
What Spark release are you using ?
Can you check driver log to see if there is some clue there ?
Thanks
On Sat, May 16, 2015 at 12:01 AM, xiaohe lan zombiexco...@gmail.com wrote:
Hi,
I have a 5-node YARN cluster, and I used spark-submit to submit a simple app.
spark-submit --master yarn
Hello!
Thank you very much for your response. In the book Learning Spark I
found the following sentence:
Each application will have at most one executor on each worker
So a worker can have one executor process spawned, or none (perhaps the
number depends on the workload distribution).
Best
Hi Spico,
Yes, I think an executor core in Spark is basically a thread in a worker
pool. It's recommended to have one executor core per physical core on your
machine for best performance, but I think in theory you can create as many
threads as your OS allows.
For deployment:
There seems to be
1. On HDFS, files are split into blocks of ~64 MB. When you put the same
file on a local file system (ext3/ext4), the split size is different (in
your case it looks like ~32 MB), and that's why you are seeing 9 output files.
2. You could set --num-executors to increase the number of executors
This one would give you a better understanding
http://stackoverflow.com/questions/24622108/apache-spark-the-number-of-cores-vs-the-number-of-executors
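One way to verify point 1 is to check how many partitions Spark created for the file (a sketch, assuming an existing SparkContext named sc; the path is a placeholder):

```scala
// The partition count reflects the underlying block/split size, which is
// why HDFS (~64 MB blocks) and ext3/ext4 give different output file counts.
val rdd = sc.textFile("hdfs:///data/input.txt") // placeholder path
println(rdd.partitions.length)
```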
Thanks
Best Regards
On Wed, Nov 26, 2014 at 10:32 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
1. On HDFS files are treated as ~64mb in
Hi Tathagata,
I have tried the repartition method. The reduce stage first had 2 executors
and then around 85 executors. I specified repartition(300), and each
executor was given 2 cores when I submitted the job. This
shows that repartition works to increase the number of executors. However,
Hi Tathagata,
It seems repartition does not necessarily force Spark to distribute the
data into different executors. I have launched a new job which uses
repartition right after I received data from Kafka. For the first two
batches, the reduce stage used more than 80 executors. Starting from the
Can you give me a screen shot of the stages page in the web ui, the spark
logs, and the code that is causing this behavior. This seems quite weird to
me.
TD
On Mon, Jul 14, 2014 at 2:11 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi Tathagata,
It seems repartition does not necessarily
If I understand correctly, you cannot change the number of executors at
runtime, right (correct me if I am wrong)? It is defined when we start the
application and stays fixed. Do you mean the number of tasks?
On Fri, Jul 11, 2014 at 6:29 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Can you try
Hi Praveen,
I did not change the number of total executors. I specified 300 as the
number of executors when I submitted the jobs. However, for some stages,
the number of executors is very small, leading to long calculation times
even for small data sets. That means not all executors were used for
Hi Tathagata,
I also tried to use the number of partitions as a parameter to functions
such as groupByKey. It seems the number of executors is around 50 instead
of 300, which is the number of executors I specified in the submission
script. Moreover, the running time of different executors is
Can you show us the program that you are running? If you are setting the
number of partitions in the XYZ-ByKey operation to 300, then there should be
300 tasks for that stage, distributed over the 50 executors allocated to your
context. However, the data distribution may be skewed, in which case you
Hi Tathagata,
Below is my main function. I omit some filtering and data conversion
functions. These functions are just a one-to-one mapping, which should not
noticeably increase the running time. The only reduce function I have here is
groupByKey. There are 4 topics in my Kafka brokers and two of the
Hi folks,
I just ran another job that only received data from Kafka, did some
filtering, and then saved the results as text files in HDFS. There was no
reduce work involved. Surprisingly, the number of executors for the
saveAsTextFiles stage was also 2, although I specified 300 executors in the job
Aah, I get it now. That is because the input data stream is replicated on
two machines, so by locality the data is processed on those two machines.
So the map stage on the data uses 2 executors, but the reduce stage
(after groupByKey) and the saveAsTextFiles would use 300 tasks. And the default
Hi Tathagata,
Do you mean that the data is not shuffled until the reduce stage? That
means groupBy still only uses 2 machines?
I think I used repartition(300) after I read the data from Kafka into
DStream. It seems that it did not guarantee that the map or reduce stages
will be run on 300
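A minimal sketch of that repartition step, assuming an input DStream named kafkaStream:

```scala
// Spread the received Kafka data over 300 partitions so that later
// stages can be scheduled on more executors.
val repartitioned = kafkaStream.repartition(300)
```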
Are you specifying the number of reducers in all the DStream.ByKey
operations? If the reduce by key is not set, then the number of reducers
used in the stages can keep changing across batches.
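Setting the number of reducers explicitly in the ByKey operations would look roughly like this (a sketch, assuming a pair DStream named pairs):

```scala
// Pass the partition count directly so it stays stable across batches,
// instead of relying on the changing default.
val grouped = pairs.groupByKey(300)
val summed  = pairs.reduceByKey(_ + _, 300)
```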
TD
On Wed, Jul 9, 2014 at 4:05 PM, Bill Jay bill.jaypeter...@gmail.com wrote:
Hi all,
I have a
Hi Tathagata,
I set default parallelism as 300 in my configuration file. Sometimes there
are more executors in a job. However, it is still slow. And I further
observed that most executors take less than 20 seconds but two of them take
much longer such as 2 minutes. The data size is very small
Can you try setting the number of partitions explicitly in all the
shuffle-based DStream operations? It may be the case that the default
parallelism (that is, spark.default.parallelism) is not being
respected.
Regarding the unusual delay, I would look at the task details of that stage