Actually, the original data is around 120 GB. If we give each executor more memory, we might need an even bigger cluster to finish training the whole model within the planned time, and that will increase our operating cost. Please correct me if I am wrong here.
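To put rough numbers on that cost concern, here is a minimal sketch of how the memory requested from YARN grows with the executor heap. It assumes Spark's default overhead rule (spark.executor.memoryOverhead = max(384 MiB, 10% of spark.executor.memory)) and ignores the driver and any rounding YARN applies to container sizes. The function names and the two larger heap sizes are hypothetical, chosen only for illustration.

```python
# Rough estimate of the memory requested from YARN for a given executor heap.
# Assumption: default overhead rule, max(384 MiB, 10% of executor memory).
# Not modeled: driver memory and YARN's rounding of container sizes.

MIN_OVERHEAD_MB = 384  # default minimum overhead, in MiB


def container_mb(executor_memory_mb: int) -> int:
    """Memory per executor container: heap plus default off-heap overhead."""
    overhead = max(MIN_OVERHEAD_MB, executor_memory_mb // 10)
    return executor_memory_mb + overhead


def cluster_mb(executor_memory_mb: int, instances: int) -> int:
    """Total memory requested for all executors (driver excluded)."""
    return container_mb(executor_memory_mb) * instances


instances = 20  # spark.executor.instances from the config quoted in this thread

# 9010m is the current setting; 12288m and 16384m are hypothetical alternatives.
for heap_mb in (9010, 12288, 16384):
    total_gib = cluster_mb(heap_mb, instances) / 1024
    print(f"heap={heap_mb}m  container={container_mb(heap_mb)}m  "
          f"total={total_gib:.1f} GiB")
```

This only shows how the request scales. It says nothing about how much heap the job actually needs, which still has to come from the GC logs.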
Nevertheless, can you point out how much memory should be sufficient for each executor? I have already given 9 GB of memory to each of the 20 executors to process the data.

On Mon, Jul 29, 2019 at 7:42 PM Sean Owen <sro...@gmail.com> wrote:

> Could be lots of things. Implementations change, caching may have
> changed, etc. The size of the input doesn't really directly translate
> to heap usage. Here you just need a bit more memory.
>
> On Mon, Jul 29, 2019 at 9:03 AM Dhrubajyoti Hati <dhruba.w...@gmail.com> wrote:
> >
> > Hi Sean,
> >
> > Yeah, I checked the heap; it's almost full. I checked the GC logs in the
> > executors, where I found that GC cycles are kicking in frequently. The
> > Executors tab shows red in the "Total Time/GC Time".
> >
> > Also, the data which I am dealing with is quite small (~4 GB) and the
> > cluster is quite big for such high GC.
> >
> > But what's troubling me is that this issue doesn't occur in Spark 2.2 at all.
> > What could be the reason behind such behaviour?
> >
> > Regards,
> > Dhrub
> >
> > On Mon, Jul 29, 2019 at 6:45 PM Sean Owen <sro...@gmail.com> wrote:
> >>
> >> -dev@
> >>
> >> Yep, high GC activity means '(almost) out of memory'. I don't see that
> >> you've checked heap usage - is it nearly full?
> >> The answer isn't tuning but more heap.
> >> (Sometimes with really big heaps the problem is big pauses, but that's
> >> not the case here.)
> >>
> >> On Mon, Jul 29, 2019 at 1:26 AM Dhrubajyoti Hati <dhruba.w...@gmail.com> wrote:
> >> >
> >> > Hi,
> >> >
> >> > We were running Logistic Regression in Spark 2.2.X and then we tried
> >> > to see how it does in Spark 2.3.X. Now we are facing an issue while
> >> > running a Logistic Regression model in Spark 2.3.X on top of
> >> > Yarn (GCP-Dataproc). The TreeAggregate method takes a huge amount of
> >> > time due to very high GC activity. I have tuned the GC, tried
> >> > differently sized clusters, a higher Spark version (2.4.X), and smaller
> >> > data, but nothing helps. On average, the GC time is 100 to 1000 times
> >> > the processing time per iteration.
> >> >
> >> > The strange part is that in Spark 2.2 this doesn't happen at all. Same
> >> > code, same cluster sizing, same data in both cases.
> >> >
> >> > I was wondering if someone can explain this behaviour and help me to
> >> > resolve this. How can the same code have such different behaviour in two
> >> > Spark versions, especially the higher ones?
> >> >
> >> > Here is the config which I used:
> >> >
> >> > spark.serializer=org.apache.spark.serializer.KryoSerializer
> >> >
> >> > #GC Tuning
> >> >
> >> > spark.executor.extraJavaOptions= -XX:+UseG1GC -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -Xms9000m -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5
> >> >
> >> > spark.executor.instances=20
> >> >
> >> > spark.executor.cores=1
> >> >
> >> > spark.executor.memory=9010m
> >> >
> >> > Regards,
> >> > Dhrub