Can you check in the Spark UI which stages and tasks took most of the time?

Even the 45 min looks a bit long, given that for most of the job you are only
working with 50k rows.
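
If it is hard to tell from the UI alone, a crude way to narrow it down in
spark-shell is to time each step and force evaluation with an action. A
minimal sketch (the count() call is just a placeholder action; substitute
whatever your 9 steps are):

def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block                            // run the step
  val secs = (System.nanoTime() - start) / 1e9
  println(f"$label%-25s $secs%.1f s")           // wall-clock time of this step
  result
}

// e.g. time("read + filter + persist") { df.count() }  // count() forces the lazy plan to run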

> On 15 Feb 2017, at 00:03, Timur Shenkao <t...@timshenkao.su> wrote:
> 
> Hello,
> I'm not sure that's your reason but check this discussion:
> 
> http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-td20803.html
> 
>> On Tue, Feb 14, 2017 at 9:25 PM, anatva <arun.na...@gmail.com> wrote:
>> Hi,
>> I am reading an ORC file, performing some joins and aggregations, and
>> finally generating a dense vector to perform analytics.
>> 
>> The code runs in 45 minutes on Spark 1.6 on a 4-node cluster. When the same
>> code is migrated to run on Spark 2.0 on the same cluster, it takes around
>> 4-5 hours, which is surprising and frustrating.
>> 
>> Can anyone take a look and help me figure out what I should change in order
>> to get at least the same performance on Spark 2.0?
>> 
>> spark-shell --master yarn-client --driver-memory 5G --executor-memory 20G \
>> --num-executors 15 --executor-cores 5 --queue grid_analytics --conf
>> spark.yarn.executor.memoryOverhead=5120 --conf
>> spark.executor.heartbeatInterval=1200
>> 
>> import sqlContext.implicits._
>> import org.apache.spark.storage.StorageLevel
>> import org.apache.spark.sql.functions.lit
>> import org.apache.hadoop._
>> import org.apache.spark.sql.functions.udf
>> import org.apache.spark.mllib.linalg.{Vector, Vectors}
>> import org.apache.spark.ml.clustering.KMeans
>> import org.apache.spark.ml.feature.StandardScaler
>> import org.apache.spark.ml.feature.Normalizer
>> 
>> Here are the steps (a rough code sketch follows the list):
>> 1. read the ORC file
>> 2. filter some of the records
>> 3. persist the resulting data frame
>> 4. get the distinct accounts from the DF and take a sample of 50k accounts
>> from the distinct list
>> 5. join the above data frame with the 50k distinct accounts to pull records
>> for only those 50k accounts
>> 6. perform a group by to get the avg, mean, sum, and count of readings for
>> the given 50k accounts
>> 7. join the DF obtained by the groupBy with the original DF
>> 8. convert the resulting DF to an RDD, do a groupByKey(), and generate a
>> dense vector
>> 9. convert the RDD back to a DF and store it in a Parquet file
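>> 
>> Roughly, in code, the pipeline looks like the sketch below (the column names
>> "acct_id" and "reading" are only illustrative, and the filter, sampling, and
>> aggregation details are simplified):
>> 
>> import org.apache.spark.sql.functions.{avg, sum, count}
>> 
>> // 1-3: read the ORC file, filter, and persist
>> val df = spark.read.orc("/path/to/orc")
>>   .filter($"reading" > 0)                       // placeholder filter
>>   .persist(StorageLevel.MEMORY_AND_DISK)
>> 
>> // 4-5: take ~50k distinct accounts and keep only their records
>> val sampleAccts = df.select("acct_id").distinct().limit(50000)
>> val sampled = df.join(sampleAccts, Seq("acct_id"))
>> 
>> // 6-7: per-account aggregates, joined back onto the records
>> val stats = sampled.groupBy("acct_id")
>>   .agg(avg("reading").as("avg_r"), sum("reading").as("sum_r"),
>>        count("reading").as("cnt_r"))
>> val joined = sampled.join(stats, Seq("acct_id"))
>> 
>> // 8-9: collect the readings per account into a dense vector, write Parquet
>> val vectors = joined.rdd
>>   .map(r => (r.getAs[String]("acct_id"), r.getAs[Double]("reading")))
>>   .groupByKey()
>>   .map { case (acct, vals) => (acct, Vectors.dense(vals.toArray)) }
>> vectors.toDF("acct_id", "features").write.parquet("/path/to/output")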
>> 
>> The above steps worked fine in Spark 1.6, but I am not sure why they run
>> painfully long in Spark 2.0.
>> 
>> I am using Spark 1.6 & Spark 2.0 on HDP 2.5.3.
> 
