Thank you for the reply.
The stream is partitioned by year/month/day/hour, and we read the data once
a day, so we are reading 24 partitions.

" A crude rule of thumb is to have 2-3x as many tasks as cores" thank you
very much, I will set this as default. Will this however change, if we also
partition the data by year/month/day/hour? If I set:
df.repartition(80),write ... partitionBy("year", "month", "day", "hour"),
will this cause each hour to have 80 output files?
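For reference, this is roughly what the write looks like (a minimal PySpark
sketch; the S3 paths and the reader are placeholders, only the repartition
value and the partition columns are the ones under discussion):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioned-write-sketch").getOrCreate()
    df = spark.read.parquet("s3://example-bucket/input/")    # placeholder input

    (df.repartition(80)                                      # value under discussion
       .write
       .mode("append")
       .partitionBy("year", "month", "day", "hour")
       .parquet("s3://example-bucket/output/"))              # placeholder output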

The output data in a "normal" run is very small, so a high partition count
would result in a large number of very small files.
I am not sure how Glue handles autoscaling, but I definitely need to read up
on that.
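One idea I am considering for the small-files side (just a sketch; the
row-count pass and the thresholds below are assumptions on my part, not
anything we run today) is to pick the repartition value at runtime from the
day's size, so small runs produce only a few files while big runs keep the
parallelism:

    # Sketch only: the 1M-rows-per-file target and the 160-task cap
    # (2x our 80 cores) are illustrative numbers, not tested values.
    approx_rows = df.count()        # extra pass over the data; acceptable for a daily batch
    rows_per_file = 1_000_000
    num_parts = max(1, min(160, -(-approx_rows // rows_per_file)))   # ceiling division

    (df.repartition(num_parts)
       .write
       .mode("append")
       .partitionBy("year", "month", "day", "hour")
       .parquet("s3://example-bucket/output/"))              # placeholder output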

One of our jobs actually has a requirement to produce only one output file;
is repartition(1) the only way to achieve that? As I understand it, this is
a major performance issue.
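For what it's worth, the two variants I have seen for a single file are
below (a sketch with placeholder paths); my understanding is that
coalesce(1) avoids an extra shuffle but can collapse the whole final stage
into one task, while repartition(1) pays for a shuffle but lets the earlier
stages keep their parallelism:

    # Both end up writing a single part file; paths are placeholders.
    df.coalesce(1).write.mode("overwrite").parquet("s3://example-bucket/single/")      # no shuffle
    df.repartition(1).write.mode("overwrite").parquet("s3://example-bucket/single/")   # full shuffle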

Thank you!


On Tue, 25 Jan 2022 at 15:29, Sean Owen <sro...@gmail.com> wrote:

> How many partitions does the stream have? With 80 cores, you need at least
> 80 tasks to even take advantage of them, so if it's less than 80, at least
> .repartition(80). A crude rule of thumb is to have 2-3x as many tasks as
> cores, to help even out differences in task size by more finely
> distributing the work. You might even go for more. I'd watch the task
> length, and as long as the tasks aren't completing in a few seconds or
> less, you probably don't have too many.
>
> This is also a good reason to use autoscaling, so that when not busy you
> can (for example) scale down to 1 executor, but under load, scale up to 10
> or 20 machines if needed. That is also a good reason to repartition more,
> so that it's possible to take advantage of more parallelism when needed.
>
> On Tue, Jan 25, 2022 at 7:07 AM Aki Riisiö <aki.rii...@gmail.com> wrote:
>
>> Hello.
>>
>> We have a very simple AWS Glue job running with Spark: get some events
>> from Kafka stream, do minor transformations, and write to S3.
>>
>> Recently, there was a change in the Kafka topic which suddenly increased
>> our data volume tenfold, and at the same time we were testing different
>> repartition values in df.repartition(n).write ...
>> At the time when Kafka started sending an increased volume of data, we
>> didn't actually have the repartition value set in our write.
>> Suddenly, our Glue job (or save at NativeMethodAccessorImpl.java:0)
>> jumped from 2h to 10h. Here are some details of the save stage from SparkUI:
>> - Only 5 executors, each of which can run 16 tasks in parallel
>> - 10500 tasks (job is still running...) with a median duration of 2.6 min
>> and a median GC time of 2 s
>> - Input size per executor is 9 GB and output is 4.5 GB
>> - Executor memory is 20 GB
>>
>> My question: now that we're trying to find a proper value for
>> repartition, what would be the optimal value here? Our data volume was not
>> expected to grow this much, but there are times when it might. As this job
>> runs in AWS Glue, should I also consider setting the number of executors,
>> cores, and memory manually? I think Glue actually sets those based on the
>> Glue job configuration. Yes, this is probably not your concern, but still,
>> worth a shot :)
>>
>> Thank you!
>>
>>
