Thank you for the reply. The stream is partitioned by year/month/day/hour, and we read the data once a day, so we are reading 24 partitions.
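For context, the daily read looks roughly like this; the path, format, and filter values below are placeholders (I am assuming Parquet on S3 under year=/month=/day=/hour= directories just for illustration):

    # Hypothetical sketch of the once-a-day read; not our actual job code.
    df = (spark.read
          .parquet("s3://example-bucket/events/")
          .where("year = 2022 AND month = 1 AND day = 24"))
    # A one-day filter matches the 24 hour=HH partition directories.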
" A crude rule of thumb is to have 2-3x as many tasks as cores" thank you very much, I will set this as default. Will this however change, if we also partition the data by year/month/day/hour? If I set: df.repartition(80),write ... partitionBy("year", "month", "day", "hour"), will this cause each hour to have 80 output files? The output data in a "normal" run is very small, so a big partition size would result in a large number of too small files. I am not sure how Glue autoscales itself, but I definitely need to look that up a bit more. One of our jobs actually has a requirement to have only one output-file, so is the only way to achieve that by repartition(1)? As I understand it, this is a major issue in performance. Thank you! On Tue, 25 Jan 2022 at 15:29, Sean Owen <sro...@gmail.com> wrote: > How many partitions does the stream have? With 80 cores, you need at least > 80 tasks to even take advantage of them, so if it's less than 80, at least > .repartition(80). A crude rule of thumb is to have 2-3x as many tasks as > cores, to help even out differences in task size by more finely > distributing the work. You might even go for more. I'd watch the task > length, and as long as the tasks aren't completing in a few seconds or > less, you probably don't have too many. > > This is also a good reason to use autoscaling, so that when not busy you > can (for example) scale down to 1 executor, but under load, scale up to 10 > or 20 machines if needed. That is also a good reason to repartition more, > so that it's possible to take advantage of more parallelism when needed. > > On Tue, Jan 25, 2022 at 7:07 AM Aki Riisiö <aki.rii...@gmail.com> wrote: > >> Hello. >> >> We have a very simple AWS Glue job running with Spark: get some events >> from Kafka stream, do minor transformations, and write to S3. >> >> Recently, there was a change in Kafka topic which suddenly increased our >> data size * 10 and at the same time we were testing with different >> repartition values during df.repartition(n).write ... >> At the time when Kafka started sending an increased volume of data, we >> didn't actually have the repartition value set in our write. >> Suddenly, our Glue job (or save at NativeMethodAccessorImpl.java:0) >> jumped from 2h to 10h. Here are some details of the save stage from SparkUI: >> - Only 5 executors, which can run 16 tasks parallel each >> - 10500 tasks (job is still running...) with medians for duration=2,6min >> and GC time= 2s >> - Input size per executor is 9GB and output is 4,5GB >> - executor memory is 20GB >> >> My question is now that we're trying to find a proper value for >> repartition, what would be the optimal value here? Our data volume was not >> expected to go this high, but there are times when it might be. As this job >> is running in AWS Glue, should I also consider setting the executor amount, >> cores, and memory manually? I think Glue is actually setting those based on >> the Glue job configuration. Yes, this is not probably your concern but >> still, worth a shot :) >> >> Thank you! >> >>