Hi, I am reading an ORC file, performing some joins and aggregations, and finally generating a dense vector for analytics.
The code runs in 45 minutes on Spark 1.6 on a 4-node cluster. When the same code is migrated to Spark 2.0 on the same cluster, it takes around 4-5 hours, which is surprising and frustrating. Can anyone take a look and help me figure out what I should change to get at least the same performance in Spark 2.0?

spark-shell --master yarn-client --driver-memory 5G --executor-memory 20G \
  --num-executors 15 --executor-cores 5 --queue grid_analytics \
  --conf spark.yarn.executor.memoryOverhead=5120 \
  --conf spark.executor.heartbeatInterval=1200

import sqlContext.implicits._
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.functions.lit
import org.apache.hadoop._
import org.apache.spark.sql.functions.udf
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.feature.Normalizer

Here are the steps:

1. Read the ORC file.
2. Filter some of the records.
3. Persist the resulting DataFrame.
4. Get the distinct accounts from the DataFrame and take a sample of 50k accounts from the distinct list.
5. Join the above DataFrame with the distinct 50k accounts to pull records for only those 50k accounts.
6. Perform a groupBy to get the avg, mean, sum, and count of readings for the given 50k accounts.
7. Join the DataFrame obtained from the groupBy with the original DataFrame.
8. Convert the resulting DataFrame to an RDD, do a groupByKey(), and generate a dense vector.
9. Convert the RDD back to a DataFrame and store it in a Parquet file.

The above steps worked fine in Spark 1.6, but I am not sure why they run painfully long in Spark 2.0. I am using Spark 1.6 and Spark 2.0 on HDP 2.5.3.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/My-spark-job-runs-faster-in-spark-1-6-and-much-slower-in-spark-2-0-tp28390.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
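For reference, the nine steps above could be sketched roughly as below. This is only an illustrative reconstruction, not the poster's actual code: the paths, column names (acct_id, reading), and the use of limit() for the 50k sample are all assumptions.

```scala
// Illustrative sketch of the pipeline described in the question.
// Paths and column names (acct_id, reading) are made-up placeholders.
import org.apache.spark.storage.StorageLevel
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.{avg, sum, count}
import sqlContext.implicits._

// Steps 1-3: read the ORC file, filter some records, persist the result.
val df = sqlContext.read.orc("/data/readings.orc")
  .filter($"reading".isNotNull)
  .persist(StorageLevel.MEMORY_AND_DISK)

// Steps 4-5: take ~50k distinct accounts and keep only their records.
// (The real code may use sample() instead of limit(); assumed here.)
val accts = df.select($"acct_id").distinct().limit(50000)
val sampled = df.join(accts, Seq("acct_id"))

// Step 6: per-account aggregates over the readings.
val stats = sampled.groupBy($"acct_id").agg(
  avg($"reading").as("avg_reading"),
  sum($"reading").as("sum_reading"),
  count($"reading").as("cnt_reading"))

// Step 7: join the aggregates back with the original DataFrame.
val joined = sampled.join(stats, Seq("acct_id"))

// Step 8: drop to an RDD, groupByKey per account, build a dense vector.
val vectors = joined.rdd
  .map(r => (r.getAs[String]("acct_id"), r.getAs[Double]("reading")))
  .groupByKey()
  .map { case (acct, readings) => (acct, Vectors.dense(readings.toArray)) }

// Step 9: back to a DataFrame, written out as Parquet.
vectors.toDF("acct_id", "features").write.parquet("/data/output.parquet")
```

Note that step 8's groupByKey() shuffles every reading across the network, and the groupBy/join pattern in steps 6-7 is sensitive to the shuffle-partition setting (spark.sql.shuffle.partitions), which is one place the 1.6 and 2.0 runs can legitimately differ.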