I read JSON string values from Kafka, then transform them into a DataFrame:

Dataset<Row> df = spark.read().json(stringjavaRDD);
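
For context, the JSON strings come from a Kafka direct stream. A minimal sketch of that part, assuming spark-streaming-kafka-0-10 with a placeholder topic name, batch interval, and consumer settings:

import java.util.Arrays;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class KafkaToJsonSketch {
    public static void run(SparkSession spark, Map<String, Object> kafkaParams) throws Exception {
        // Placeholder batch interval; the real job uses its own.
        JavaStreamingContext jssc = new JavaStreamingContext(
                JavaSparkContext.fromSparkContext(spark.sparkContext()), Durations.seconds(30));

        // Direct stream from a placeholder topic "mytopic".
        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Arrays.asList("mytopic"), kafkaParams));

        stream.foreachRDD(rdd -> {
            // Each record's value is a JSON string; parse the whole micro-batch at once.
            JavaRDD<String> stringjavaRDD = rdd.map(ConsumerRecord::value);
            Dataset<Row> df = spark.read().json(stringjavaRDD);
            // ... the transformations and parquet write below go here
        });

        jssc.start();
        jssc.awaitTermination();
    }
}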


Then I add some new data to each row:

JavaRDD<Row> rowJavaRDD = df.javaRDD().map(...);
StructType type = df.schema().add(...);
Dataset<Row> newdf = spark.createDataFrame(rowJavaRDD, type);


...
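
Filling in the elided parts, a minimal self-contained sketch of that add-a-column step, assuming a hypothetical String field "appname" is appended to every row:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class AddColumnSketch {
    public static Dataset<Row> addAppName(SparkSession spark, Dataset<Row> df) {
        // Copy the existing values of each row and append the new one.
        JavaRDD<Row> rowJavaRDD = df.javaRDD().map(row -> {
            Object[] values = new Object[row.length() + 1];
            for (int i = 0; i < row.length(); i++) {
                values[i] = row.get(i);
            }
            values[row.length()] = "myapp";   // hypothetical new value
            return RowFactory.create(values);
        });

        // Extend the schema so it matches the widened rows.
        StructType type = df.schema().add("appname", DataTypes.StringType);
        return spark.createDataFrame(rowJavaRDD, type);
    }
}

For a single literal column, df.withColumn("appname", functions.lit("myapp")) would avoid the RDD round-trip entirely.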

Finally, I write the dataset to a parquet file:

newdf.write().mode(SaveMode.Append).partitionBy("stream","appname","year","month","day","hour").parquet(savePath);


How can I determine whether the extra time is caused by shuffle or by broadcast variables?
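
One thing I can do to see whether shuffle is involved at all is to register a SparkListener and log the shuffle read/write metrics of every completed stage. A minimal sketch against the Spark 2.x listener API (the class name is just for illustration):

import org.apache.spark.executor.TaskMetrics;
import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerStageCompleted;

public class ShuffleMetricsListener extends SparkListener {
    @Override
    public void onStageCompleted(SparkListenerStageCompleted stageCompleted) {
        TaskMetrics metrics = stageCompleted.stageInfo().taskMetrics();
        // Non-zero shuffle read/write means the stage really did shuffle data.
        System.out.println(String.format(
                "stage %d (%s): shuffle read %d bytes (fetch wait %d ms), shuffle write %d bytes",
                stageCompleted.stageInfo().stageId(),
                stageCompleted.stageInfo().name(),
                metrics.shuffleReadMetrics().totalBytesRead(),
                metrics.shuffleReadMetrics().fetchWaitTime(),
                metrics.shuffleWriteMetrics().bytesWritten()));
    }
}

Registered once on the driver before the streaming job starts:

spark.sparkContext().addSparkListener(new ShuffleMetricsListener());

The same numbers are also visible per stage in the Stages tab of the Spark UI.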


Regards,
Junfeng Chen

On Mon, Apr 9, 2018 at 2:12 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Probably network / shuffling cost? Or broadcast variables? Can you provide
> more details on what you do and some timings?
>
> > On 9. Apr 2018, at 07:07, Junfeng Chen <darou...@gmail.com> wrote:
> >
> > I have written a Spark Streaming application that reads Kafka data,
> > converts the JSON data to parquet, and saves it to HDFS.
> > What puzzles me is that the processing time of the app in YARN mode is
> > 20% to 50% longer than in local mode. My cluster has three nodes with
> > three node managers, and all three hosts have the same hardware: 40 cores
> > and 256GB memory.
> >
> > Why? How can I solve it?
> >
> > Regards,
> > Junfeng Chen
>
