Hi John,
The input file has 30 billion rows and is 72 GB in size. The output is expected
to have a reading for each account & day combination for the 50k sampled
accounts, which puts the total output record count at 50k * 365.
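
As a quick sanity check on that output size (assuming exactly 50,000 sampled accounts and 365 days each):

val expectedOutputRows = 50000L * 365L   // = 18,250,000 output records, versus ~30 billion input rows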


On Tue, Feb 14, 2017 at 6:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Can you check in the UI which tasks took most of the time?
>
> Even the 45 minutes looks a little long, given that most of the time you are
> only working with 50k rows.
>
> On 15 Feb 2017, at 00:03, Timur Shenkao <t...@timshenkao.su> wrote:
>
> Hello,
> I'm not sure that's the reason in your case, but check this discussion:
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-td20803.html
>
> On Tue, Feb 14, 2017 at 9:25 PM, anatva <arun.na...@gmail.com> wrote:
>
>> Hi,
>> I am reading an ORC file, performing some joins and aggregations, and finally
>> generating a dense vector to perform analytics.
>>
>> The code runs in 45 minutes on Spark 1.6 on a 4-node cluster. When the same
>> code is migrated to run on Spark 2.0 on the same cluster, it takes around
>> 4-5 hours. It is surprising and frustrating.
>>
>> Can anyone take a look and help me figure out what I should change in order
>> to get at least the same performance in Spark 2.0?
>>
>> spark-shell --master yarn-client --driver-memory 5G --executor-memory 20G \
>>   --num-executors 15 --executor-cores 5 --queue grid_analytics \
>>   --conf spark.yarn.executor.memoryOverhead=5120 \
>>   --conf spark.executor.heartbeatInterval=1200
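>>
>> (A side note on the launch itself, separate from the slowdown: Spark 2.0
>> deprecates the "yarn-client" master URL in favor of --master yarn
>> --deploy-mode client, so the same launch in the 2.0 spelling would look
>> roughly like
>>
>> spark-shell --master yarn --deploy-mode client --driver-memory 5G --executor-memory 20G \
>>   --num-executors 15 --executor-cores 5 --queue grid_analytics \
>>   --conf spark.yarn.executor.memoryOverhead=5120 \
>>   --conf spark.executor.heartbeatInterval=1200
>>
>> with everything else unchanged.)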
>>
>> import sqlContext.implicits._
>> import org.apache.spark.storage.StorageLevel
>> import org.apache.spark.sql.functions.lit
>> import org.apache.hadoop._
>> import org.apache.spark.sql.functions.udf
>> import org.apache.spark.mllib.linalg.{Vector, Vectors}
>> import org.apache.spark.ml.clustering.KMeans
>> import org.apache.spark.ml.feature.StandardScaler
>> import org.apache.spark.ml.feature.Normalizer
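>>
>> (A side note on these imports when moving to 2.0, possibly unrelated to the
>> slowdown: the DataFrame-based spark.ml estimators such as KMeans and
>> StandardScaler expect vectors from org.apache.spark.ml.linalg in Spark 2.x
>> rather than org.apache.spark.mllib.linalg, so if the dense vectors are later
>> fed to those estimators, the vector import may need to become
>>
>> import org.apache.spark.ml.linalg.{Vector, Vectors}
>>
>> instead.)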
>>
>> Here are the steps (a rough sketch in code follows the list):
>> 1. read the ORC file
>> 2. filter some of the records
>> 3. persist the resulting data frame
>> 4. get the distinct accounts from the DF and take a sample of 50k accounts
>> from the distinct list
>> 5. join the above data frame with the distinct 50k accounts to pull records
>> for only those 50k accounts
>> 6. perform a group by to get the avg, mean, sum, count of readings for the
>> given 50k accounts
>> 7. join the DF obtained by the GROUP BY with the original DF
>> 8. convert the resultant DF to an RDD, do a groupByKey(), and generate a
>> DENSE VECTOR
>> 9. convert the RDD back to a DF and store it in a parquet file
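>>
>> A rough sketch of those steps in spark-shell, just to make the flow concrete --
>> the column names ("account_id", "reading"), the filter, the sample fraction,
>> and the paths below are placeholders, not the original code:
>>
>> import org.apache.spark.sql.functions.{col, broadcast, avg, sum, count}
>> import org.apache.spark.storage.StorageLevel
>> import org.apache.spark.mllib.linalg.Vectors
>> import sqlContext.implicits._
>>
>> // steps 1-3: read the ORC file, apply a (placeholder) filter, persist
>> val raw = sqlContext.read.orc("/data/readings.orc")          // placeholder path
>>   .filter(col("reading").isNotNull)                          // placeholder filter
>>   .persist(StorageLevel.MEMORY_AND_DISK_SER)
>>
>> // step 4: distinct accounts, then a ~50k sample
>> val accts = raw.select("account_id").distinct()
>>   .sample(false, 0.001, 42L)                                 // fraction is a placeholder, tuned to land near 50k
>>   .limit(50000)
>>
>> // step 5: keep only the readings for the sampled accounts
>> val sampled = raw.join(broadcast(accts), Seq("account_id"))
>>
>> // step 6: per-account aggregates
>> val stats = sampled.groupBy("account_id").agg(
>>   avg("reading").as("avg_reading"),
>>   sum("reading").as("sum_reading"),
>>   count("reading").as("cnt_reading"))
>>
>> // step 7: join the aggregates back onto the detail rows
>> val joined = sampled.join(stats, Seq("account_id"))
>>
>> // step 8: group the readings per account and build a dense vector
>> val vectors = joined.select("account_id", "reading").rdd
>>   .map(r => (r.getString(0), r.getDouble(1)))
>>   .groupByKey()
>>   .mapValues(rs => Vectors.dense(rs.toArray))
>>
>> // step 9: back to a DF (via the implicits imported above) and out to parquet
>> vectors.toDF("account_id", "features").write.parquet("/tmp/acct_vectors")    // placeholder path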
>>
>> The above steps worked fine in Spark 1.6, but I'm not sure why they run
>> painfully long in Spark 2.0.
>>
>> I am using Spark 1.6 & Spark 2.0 on HDP 2.5.3.
>>
>>
>>
>>
>


-- 
Regards,
Arun Kumar Natva
