---------- Forwarded message ----------
From: gen tang <gen.tan...@gmail.com>
Date: Fri, Nov 6, 2015 at 12:14 AM
Subject: Re: dataframe slow down with tungsten turn on
To: "Cheng, Hao" <hao.ch...@intel.com>


Hi,

My application works as follows:
1. Create a DataFrame from a Hive table.
2. Convert the DataFrame to an RDD of JSON and run some aggregations on it (in
fact, I use PySpark, so it is an RDD of dicts).
3. Convert the RDD back to a DataFrame and cache it (triggered by count).
4. Join several DataFrames created by the steps above.
5. Save the final DataFrame as JSON (via the DataFrame write API).
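
The five steps above can be sketched roughly as follows. This is only a
minimal illustration, not the actual job: the `key` and `n` columns, the
`merge_counts` aggregation, the `tables` mapping, and `out_path` are all
hypothetical, and `hc` is assumed to be a `HiveContext` as in Spark 1.4/1.5.

```python
from functools import reduce


def merge_counts(a, b):
    """Merge two records (plain dicts) that share a key by summing 'n'."""
    return {"key": a["key"], "n": a["n"] + b["n"]}


def build_cached_frame(hc, table_name, value_col):
    """Steps 1-3 for one Hive table. `hc` is assumed to be a HiveContext."""
    # Imported here so merge_counts can be exercised without a Spark install.
    from pyspark.sql import Row

    df = hc.table(table_name)                        # step 1
    records = df.rdd.map(lambda r: r.asDict())       # step 2: RDD of dict
    merged = (records.map(lambda d: (d["key"], d))
                     .reduceByKey(merge_counts)
                     .values())
    # Hypothetical schema: one 'key' column plus one summed value column.
    rows = merged.map(lambda d: Row(**{"key": d["key"], value_col: d["n"]}))
    out = hc.createDataFrame(rows)                   # step 3
    out.cache()
    out.count()  # an action, so the cache is actually populated
    return out


def run(hc, tables, out_path):
    """`tables` maps a hypothetical output column name to a Hive table name."""
    frames = [build_cached_frame(hc, t, col) for col, t in sorted(tables.items())]
    joined = reduce(lambda left, right: left.join(right, on="key"), frames)  # step 4
    joined.write.json(out_path)                      # step 5
```

Giving each frame its own value column name keeps the joined result free of
duplicate column names before it is written out as JSON.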

There are many stages, and the other stages take about the same time under
the two versions of Spark. However, the final step (saving as JSON) takes
1 minute vs. 2 hours. I think it is the write to HDFS that makes the final
stage slow, but I don't know why.

In fact, I made a mistake about the Spark versions I used. The faster Spark
was built from the Spark 1.4.1 source code. The slower one was built from
the Spark 1.5.2 source code, 2 days ago.

Any idea? Thanks a lot.

Cheers
Gen


On Thu, Nov 5, 2015 at 1:01 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

> BTW, 1 min vs. 2 hours seems quite weird. Can you provide more
> information on the ETL work?
>
>
>
> *From:* Cheng, Hao [mailto:hao.ch...@intel.com]
> *Sent:* Thursday, November 5, 2015 12:56 PM
> *To:* gen tang; dev@spark.apache.org
> *Subject:* RE: dataframe slow down with tungsten turn on
>
>
>
> 1.5 has critical performance issues and bugs; you’d better try 1.5.1 or
> a 1.5.2 RC version.
>
>
>
> *From:* gen tang [mailto:gen.tan...@gmail.com <gen.tan...@gmail.com>]
> *Sent:* Thursday, November 5, 2015 12:43 PM
> *To:* dev@spark.apache.org
> *Subject:* Fwd: dataframe slow down with tungsten turn on
>
>
>
> Hi,
>
>
>
> In fact, I tested the same code on Spark 1.5 with Tungsten turned off.
> The result is about the same as with Tungsten turned on.
>
> So it seems the problem is not Tungsten; Spark 1.5 is simply slower than
> Spark 1.4.
>
>
>
> Any idea why this happens?
>
> Thanks a lot in advance.
>
>
>
> Cheers
>
> Gen
>
>
>
>
>
> ---------- Forwarded message ----------
> From: *gen tang* <gen.tan...@gmail.com>
> Date: Wed, Nov 4, 2015 at 3:54 PM
> Subject: dataframe slow down with tungsten turn on
> To: "u...@spark.apache.org" <u...@spark.apache.org>
>
> Hi sparkers,
>
>
>
> I am using DataFrames to run some large ETL jobs.
>
> More precisely, I create a DataFrame from a Hive table, do some operations,
> and then save it as JSON.
>
>
>
> With spark-1.4.1, the whole process is quite fast, about 1 minute.
> However, when I run the same code with spark-1.5.1 (with Tungsten turned on),
> it takes about 2 hours to finish the same job.
>
>
>
> I checked the task details; almost all the time is spent on
> computation.
>
> Any idea about why this happens?
>
>
>
> Thanks a lot in advance for your help.
>
>
>
> Cheers
>
> Gen
>
>
>
>
>
