But I still have one question. I see that the number of tasks in the stage
is 3. Where does this 3 come from, and how can I increase the parallelism?
Regards,
Junfeng Chen
On Tue, Apr 10, 2018 at 11:31 AM, Junfeng Chen wrote:
> Yeah, I have increased the executor number and executor cores, and it
Yeah, I have increased the executor number and executor cores, and it runs
normally now. HDP Spark 2 has only 2 executors and 1 executor core by
default.
Regards,
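To make the effect of those two settings concrete, here is a rough sketch of the arithmetic. It assumes every task takes the same time, that each task needs one core (Spark's default spark.task.cpus=1), and it ignores scheduling overhead. The class TaskWaves and its methods are hypothetical helpers written for illustration, not part of Spark; the figures (3 tasks per stage, 3 min per task) are the ones quoted in this thread.

```java
// Hypothetical helper, not a Spark API: estimates how many sequential
// "waves" a stage needs when tasks outnumber the available task slots.
final class TaskWaves {

    // Concurrent task slots, assuming one core per task (spark.task.cpus=1).
    static int slots(int numExecutors, int executorCores) {
        return numExecutors * executorCores;
    }

    // Number of sequential waves needed to run all tasks of one stage.
    static int waves(int tasks, int numExecutors, int executorCores) {
        int s = slots(numExecutors, executorCores);
        return (tasks + s - 1) / s; // ceiling division
    }

    public static void main(String[] args) {
        int tasks = 3;          // tasks per stage, as reported in this thread
        int minutesPerTask = 3; // per-task time used in the thread's example

        // 2 executors x 1 core: the HDP default mentioned above
        System.out.println(waves(tasks, 2, 1) * minutesPerTask + " min");
        // 3 executors x 1 core: one slot per task, so a single wave
        System.out.println(waves(tasks, 3, 1) * minutesPerTask + " min");
    }
}
```

Under these assumptions the default of 2 slots needs two waves for 3 tasks, while 3 slots finish in one. On the command line the corresponding spark-submit flags are --num-executors and --executor-cores.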
Junfeng Chen
On Tue, Apr 10, 2018 at 10:19 AM, Saisai Shao
wrote:
> In yarn mode, only two executor
>
> In yarn mode, only two executors are assigned to process the tasks; since
> one executor can process only one task at a time, they need 6 min in total.
>
This is not true. You should set --executor-cores and --num-executors to
increase the task parallelism of the executors. To be fair, Spark application
should
I found the potential reason.
In local mode, all tasks in one stage run concurrently, while tasks in
yarn mode run in sequence.
For example, in one stage, each task takes 3 min. In local mode, they
run together and take 3 min in total. In yarn mode, only two executors
are assigned to
Hi Jorn,
I checked the log info of my application:
ResultStage3 (parquet writing) takes a very long time, nearly 300 s, while
the total processing time of this loop is 6 min.
Regards,
Junfeng Chen
On Mon, Apr 9, 2018 at 2:12 PM, Jörn Franke wrote:
> Probably network /
Hi,
My Kafka topic has three partitions. By the time cost I mentioned, I mean
that each streaming loop takes longer in yarn client mode. For example, yarn
mode takes 300 seconds to process some data, while local mode takes just 200
seconds to process a similar amount of data.
Regards,
Junfeng Chen
On
I read the JSON string values from Kafka, then transform them into a DataFrame:
Dataset<Row> df = spark.read().json(stringjavaRDD);
Then I add some new data to each row:
> JavaRDD<Row> rowJavaRDD = df.javaRDD().map(...);
> StructType type = df.schema().add(...);
> Dataset<Row> newdf = spark.createDataFrame(rowJavaRDD, type);
Hi Junfeng,
Is your Kafka topic partitioned?
Are you referring to the duration or the CPU time spent by the job as being 20%
to 50% higher than when running in local mode?
Thanks & Regards
Gopal
> On 09-Apr-2018, at 11:42 AM, Jörn Franke wrote:
>
> Probably network /
Probably network / shuffling cost? Or broadcast variables? Can you provide more
details on what you do, and some timings?
> On 9. Apr 2018, at 07:07, Junfeng Chen wrote:
>
> I have written a Spark Streaming application that reads Kafka data, converts
> the JSON data to parquet
I have written a Spark Streaming application that reads Kafka data, converts
the JSON data to parquet, and saves it to HDFS.
What puzzles me is that the app takes 20% to 50% more processing time in yarn
mode than in local mode. My cluster has three nodes with three
node managers, and all three