Re: Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-29 Thread Jörn Franke
I would remove all the GC tuning and add it later once you have found the underlying root cause. Usually more GC means you need to provide more memory, because something has changed (your application, Spark version, etc.). We don't have your full code to give exact advice, but you may want to rethink …
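A minimal sketch of this advice in PySpark, assuming the job is driven through a SparkSession; the memory figures are illustrative placeholders, not tuned values for the cluster discussed here:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("logreg-gc-baseline")
    # No custom GC flags in spark.executor.extraJavaOptions: start from the
    # JVM defaults and only reintroduce tuning once the root cause is known.
    .config("spark.executor.memory", "8g")          # more heap per executor
    .config("spark.executor.memoryOverhead", "1g")  # off-heap headroom
    .getOrCreate()
)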

Re: Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-29 Thread Dhrubajyoti Hati
Actually I didn't have any of the GC tuning in the beginning, and adding it also didn't make any difference. As mentioned earlier, I tried a low number of executors with higher configuration and vice versa. Nothing helped. About the code, it's simple logistic regression, nothing with explicit broadcast …

Re: Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-29 Thread Sean Owen
-dev@ Yep, high GC activity means '(almost) out of memory'. I don't see that you've checked heap usage - is it nearly full? The answer isn't tuning but more heap. (Sometimes with really big heaps the problem is big pauses, but that's not the case here.)

Re: Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-29 Thread Dhrubajyoti Hati
Hi Sean, Yeah I checked the heap, it's almost full. I checked the GC logs in the executors, where I found that GC cycles are kicking in frequently. The Executors tab shows red in the "Total Time/GC Time". Also the data which I am dealing with is quite small (~4 GB) and the cluster is quite big for …
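For context, executor GC logs like the ones mentioned here are typically enabled through spark.executor.extraJavaOptions; a minimal sketch using the Java 8-style GC-logging flags (newer JVMs use -Xlog:gc instead):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("gc-logging-demo")
# Print GC details with timestamps into the executor logs.
conf.set("spark.executor.extraJavaOptions",
         "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
sc = SparkContext(conf=conf)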

Re: Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-29 Thread Sean Owen
Could be lots of things. Implementations change, caching may have changed, etc. The size of the input doesn't really directly translate to heap usage. Here you just need a bit more memory.

Re: Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-29 Thread Dhrubajyoti Hati
Actually the original data is around 120 GB. If we provide higher memory then we might require an even bigger cluster to finish training the whole model within the planned time. And this will affect the cost of operations. Please correct me if I am wrong here. Nevertheless, can you point out how much …

Number of tasks...

2019-07-29 Thread Muthu Jayakumar
Hello there, I have a basic question about how the number of tasks is determined per Spark job. Let's limit the scope of this discussion to Parquet and Spark 2.x. 1. I thought that the number of tasks is proportional to the number of part files that exist. Is this correct? 2. I noticed that for …
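Not an answer to the question itself, but a sketch of how one might observe the relationship for a Parquet read in Spark 2.x: the number of read tasks roughly follows total input size divided by spark.sql.files.maxPartitionBytes (small files are coalesced, large files are split), rather than the raw part-file count. The path below is a placeholder:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("task-count-probe")
    # Target split size for file-based sources; 128 MB is the default.
    .config("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))
    .getOrCreate()
)

# Placeholder path; the partition count of the scan is a good proxy for
# the number of tasks the read stage will launch.
df = spark.read.parquet("hdfs:///data/some_table")
print(df.rdd.getNumPartitions())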

repartitionByRange and number of tasks

2019-07-29 Thread Gourav Sengupta
Hi,
*Hardware and Spark details:*
* Spark 2.4.3
* EMR 30-node cluster with each executor having 4 cores and 15 GB RAM. At 100% allocation, 4 executors are running on each node.
*Question:* When I execute the following code, around 60 partitions are being written out using only 20 tasks, …
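As a point of reference, a minimal sketch of repartitionByRange with an explicit partition count (the column name and paths are placeholders; if the count is omitted, spark.sql.shuffle.partitions, default 200, decides the number of shuffle tasks):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("range-repartition-demo").getOrCreate()

# Placeholder input; substitute the real source from the original code.
df = spark.read.parquet("s3://bucket/input/")

# Ask for 60 range partitions explicitly, keyed on a placeholder column.
ranged = df.repartitionByRange(60, "event_date")

ranged.write.mode("overwrite").parquet("s3://bucket/output/")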

Spark checkpoint problem for python api

2019-07-29 Thread zenglong chen
Hi, my code is below:

from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def test(record_list):
    print(list(record_list))
    return record_list

def functionToCreateConte…
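The preview is cut off mid-definition, but the usual checkpoint-recovery pattern from the Spark Streaming programming guide is StreamingContext.getOrCreate with a setup function. A minimal sketch under that assumption; the checkpoint directory, topic, and broker are placeholders:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

CHECKPOINT_DIR = "hdfs:///tmp/streaming-checkpoint"  # placeholder path

def process(rdd):
    # Placeholder action; the poster's test() helper would go here.
    print(rdd.take(5))

def functionToCreateContext():
    # Build the context, DStreams and output operations inside the factory,
    # so the whole graph can be rebuilt from the checkpoint after a restart.
    conf = SparkConf().setAppName("checkpoint-demo")
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, batchDuration=10)
    stream = KafkaUtils.createDirectStream(
        ssc, ["some_topic"], {"metadata.broker.list": "broker:9092"})
    stream.foreachRDD(process)
    ssc.checkpoint(CHECKPOINT_DIR)
    return ssc

ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, functionToCreateContext)
ssc.start()
ssc.awaitTermination()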