Hi,

We were running Logistic Regression in Spark 2.2.X and then tried to see
how it performs in Spark 2.3.X. Now we are facing an issue while running a
Logistic Regression model in Spark 2.3.X on top of YARN (GCP Dataproc):
the treeAggregate step takes an enormous amount of time due to very high
GC activity. I have tuned the GC, tried different cluster sizes, a higher
Spark version (2.4.X), and smaller data, but nothing helps. On average,
the GC time is 100 to 1000 times the processing time per iteration.

The strange part is that in *Spark 2.2 this doesn't happen at all*. Same
code, same cluster sizing, same data in both cases.
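
For context, the job is a vanilla spark.ml LogisticRegression fit. A
minimal sketch of what we run is below; the dataset path, column
assumptions, and hyperparameters are placeholders rather than our exact
values:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.sql.SparkSession

    object LRJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("lr-gc-repro").getOrCreate()

        // Placeholder input; assumed to already have a "features" vector
        // column and a "label" column, e.g. assembled upstream.
        val train = spark.read.parquet("gs://<bucket>/train.parquet")

        val lr = new LogisticRegression()
          .setMaxIter(100)
          .setRegParam(0.01)

        // Each L-BFGS iteration aggregates per-partition gradients via
        // treeAggregate; this is the step where we see the GC blowup on 2.3.X.
        val model = lr.fit(train)
        println(s"Model trained; ${model.coefficients.numNonzeros} nonzero coefficients")

        spark.stop()
      }
    }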

I was wondering if someone could explain this behaviour and help me
resolve it. How can the same code behave so differently in two Spark
versions, especially with the newer one performing worse?

Here are the configs I used:

spark.serializer=org.apache.spark.serializer.KryoSerializer

# GC tuning
spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -Xms9000m -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5

spark.executor.instances=20
spark.executor.cores=1
spark.executor.memory=9010m
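
Assuming the properties above are saved in a file, they are applied via
spark-submit roughly as follows (the properties-file name, class, and jar
are placeholders):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class LRJob \
      --properties-file lr-job.conf \
      lr-job.jar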


Regards,
Dhrub
