What is the size of the raw data and of the result data? Are there any other 
changes besides the Spark binary, such as HDFS or Spark configuration, or your 
own code? Can you monitor the IO/CPU state while the final stage is executing? 
It would also be great if you could paste the call stack if you observe high 
CPU utilization.

Also, can you try not caching anything and repeat the same steps? Just to be 
sure it is not caused by memory pressure.
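
Something like the following, for instance (just a sketch of what I mean for step 3 
of your pipeline below; sqlContext and aggregated stand for your own SQLContext and 
rdd of dicts):

    # step 3 of the pipeline, with caching disabled for this test run
    df2 = sqlContext.createDataFrame(aggregated.map(lambda d: Row(**d)))
    # df2.cache()        # <- skip the cache this time
    df2.count()          # still force the computation here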

From: gen tang [mailto:gen.tan...@gmail.com]
Sent: Friday, November 6, 2015 12:18 AM
To: dev@spark.apache.org
Subject: Fwd: dataframe slow down with tungsten turn on


---------- Forwarded message ----------
From: gen tang <gen.tan...@gmail.com>
Date: Fri, Nov 6, 2015 at 12:14 AM
Subject: Re: dataframe slow down with tungsten turn on
To: "Cheng, Hao" <hao.ch...@intel.com<mailto:hao.ch...@intel.com>>

Hi,

My application is as follows (a minimal sketch follows the list):
1. create a dataframe from a hive table
2. transform the dataframe into an rdd of json and do some aggregations on the 
json (in fact, I use pyspark, so it is an rdd of dicts)
3. transform the rdd of json back into a dataframe and cache it (triggered by a count)
4. join several dataframes created by the above steps
5. save the final dataframe as json (via the dataframe write api)
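
For reference, a minimal sketch of what the pipeline looks like (the table name, 
key field, aggregation and output path are placeholders, not my actual logic):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext, Row

    sc = SparkContext(appName="etl-job")
    sqlContext = HiveContext(sc)

    # 1. create a dataframe from a hive table
    df = sqlContext.sql("SELECT * FROM some_table")          # placeholder table name

    # 2. turn the dataframe into an rdd of dicts and aggregate on it
    rdd = df.rdd.map(lambda row: row.asDict())
    aggregated = (rdd
                  .map(lambda d: (d["key"], d))               # placeholder key field
                  .reduceByKey(lambda a, b: a)                # placeholder aggregation
                  .values())

    # 3. turn the rdd of dicts back into a dataframe and cache it
    df2 = sqlContext.createDataFrame(aggregated.map(lambda d: Row(**d)))
    df2.cache()
    df2.count()                                               # triggers the cache

    # 4. join several dataframes built the same way (other_df stands in for one of them)
    other_df = df2                                            # stand-in; in reality built via steps 1-3
    joined = df2.join(other_df, "key")

    # 5. save the final dataframe as json via the dataframe write api
    joined.write.json("hdfs:///tmp/etl_output")               # placeholder output path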

There are a lot of stages, and the other stages take roughly the same time under 
the two versions of spark. However, the final step (save as json) takes 1 min vs. 
2 hours. My guess is that writing to hdfs is what makes the final stage slow, but 
I don't know why...
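
To check that, I can time just the final write step, roughly like this (a sketch; 
final_df stands for the joined dataframe from step 4 and the path is a placeholder):

    import time

    start = time.time()
    final_df.write.json("hdfs:///tmp/etl_output")   # placeholder output path
    print("final write took %.1f seconds" % (time.time() - start))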

In fact, I made a mistake about the versions of spark that I used. The spark that 
runs faster is built from the source code of spark 1.4.1. The spark that runs 
slower is built from the source code of spark 1.5.2, as of 2 days ago.

Any idea? Thanks a lot.

Cheers
Gen


On Thu, Nov 5, 2015 at 1:01 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
BTW, 1 min vs. 2 hours seems quite weird; can you provide more information on 
the ETL work?

From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Thursday, November 5, 2015 12:56 PM
To: gen tang; dev@spark.apache.org
Subject: RE: dataframe slow down with tungsten turn on

1.5 has critical performance / bug issues; you’d better try the 1.5.1 or 1.5.2 
RC version.

From: gen tang [mailto:gen.tan...@gmail.com]
Sent: Thursday, November 5, 2015 12:43 PM
To: dev@spark.apache.org
Subject: Fwd: dataframe slow down with tungsten turn on

Hi,

In fact, I tested the same code on spark 1.5 with tungsten turned off. The 
result is pretty much the same as with tungsten turned on.
It seems that it is not a problem with tungsten; it is simply that spark 1.5 is 
slower than spark 1.4.
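
For reference, this is roughly how the tungsten-off run is configured (a sketch; 
I believe spark.sql.tungsten.enabled is the relevant switch in 1.5.x):

    # turn tungsten off for the whole SQLContext before running the job
    sqlContext.setConf("spark.sql.tungsten.enabled", "false")

    # or equivalently at submit time:
    #   spark-submit --conf spark.sql.tungsten.enabled=false my_job.py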

Does anyone have an idea why this happens?
Thanks a lot in advance.

Cheers
Gen


---------- Forwarded message ----------
From: gen tang <gen.tan...@gmail.com>
Date: Wed, Nov 4, 2015 at 3:54 PM
Subject: dataframe slow down with tungsten turn on
To: "u...@spark.apache.org<mailto:u...@spark.apache.org>" 
<u...@spark.apache.org<mailto:u...@spark.apache.org>>
Hi sparkers,

I am using dataframes to do some large ETL jobs.
More precisely, I create a dataframe from a HIVE table, do some operations on 
it, and then save it as json.

When I used spark-1.4.1, the whole process was quite fast, about 1 min. 
However, when I run the same code with spark-1.5.1 (with tungsten turned on), it 
takes about 2 hours to finish the same job.

I checked the details of the tasks; almost all of the time is consumed by computation.
Any idea why this happens?

Thanks a lot in advance for your help.

Cheers
Gen



