Re: Migration to Spark 3.2

2022-01-25 Thread Aurélien Mazoyer
Hello, Sorry for asking twice, but does anyone have any idea what issue I could be facing with this dependency problem :-/? Thank you, Aurelien On Sat, Jan 22, 2022 at 00:49, Aurélien Mazoyer wrote: > Hello, > > I migrated my code to Spark 3.2 and I am facing some issues. When I run my > unit t

Small optimization questions

2022-01-25 Thread Aki Riisiö
Hello. We have a very simple AWS Glue job running with Spark: read some events from a Kafka stream, do minor transformations, and write to S3. Recently, a change in the Kafka topic suddenly increased our data size tenfold, and at the same time we were testing with different repartition value

Re: Small optimization questions

2022-01-25 Thread Sean Owen
How many partitions does the stream have? With 80 cores, you need at least 80 tasks to even take advantage of them, so if there are fewer than 80, call at least .repartition(80). A crude rule of thumb is to have 2-3x as many tasks as cores, to help even out differences in task size by more finely distributing
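
The rule of thumb above can be sketched as a small helper. This is only an illustration: the function name `suggested_partitions` and the default multiplier of 2 are assumptions, not something from the thread, and the right value depends on the workload.

```python
def suggested_partitions(total_cores: int,
                         current_partitions: int,
                         tasks_per_core: int = 2) -> int:
    """Return a partition count following the 2-3x-tasks-per-core rule of thumb.

    Hypothetical helper: it keeps the existing partition count when that is
    already at or above the target, since repartitioning would then only add
    a shuffle without improving parallelism.
    """
    if total_cores <= 0 or current_partitions <= 0 or tasks_per_core <= 0:
        raise ValueError("all arguments must be positive")
    target = total_cores * tasks_per_core
    return max(current_partitions, target)


# 80 cores and 24 input partitions, as in the thread.
print(suggested_partitions(80, 24))      # multiplier 2 (assumed default)
print(suggested_partitions(80, 24, 3))   # multiplier 3

# Hypothetical use with a PySpark DataFrame `df` (not executed here):
# df = df.repartition(suggested_partitions(80, df.rdd.getNumPartitions()))
```

Whether 2x or 3x works better is something to measure on the actual job; the helper only encodes the arithmetic.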

Re: Small optimization questions

2022-01-25 Thread Aki Riisiö
Thank you for the reply. The stream is partitioned by year/month/day/hour, and we read the data once a day, so we are reading 24 partitions. "A crude rule of thumb is to have 2-3x as many tasks as cores": thank you very much, I will set this as the default. Will this change, however, if we also partiti

Re: Small optimization questions

2022-01-25 Thread Sean Owen
Yes, you will end up with 80 partitions, and if you write the result, you end up with 80 files. If you don't have at least 80 partitions, there is no point in having 80 cores. You will probably see that 56 are idle even under load. The partitionBy might end up causing the whole job to have more partitions
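
The figure of 56 idle cores follows from the 24 input partitions mentioned earlier in the thread. A minimal sketch of that arithmetic, assuming each partition becomes one task and each task occupies one core at a time (the function name `idle_cores` is hypothetical):

```python
def idle_cores(total_cores: int, num_partitions: int) -> int:
    """Cores left without work when every partition runs as a single task.

    Assumes one core per task; with more partitions than cores, no core
    is idle and the extra tasks simply queue.
    """
    if total_cores < 0 or num_partitions < 0:
        raise ValueError("arguments must be non-negative")
    return max(total_cores - num_partitions, 0)


# 80 cores, 24 hourly partitions read once a day.
print(idle_cores(80, 24))   # 80 - 24 = 56
# After repartitioning to 80, every core can be given a task.
print(idle_cores(80, 80))   # 0
```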

Re: What happens when a partition that holds data under a task fails

2022-01-25 Thread Siddhesh Kalgaonkar
Thank you so much for the detailed explanation. I was able to recollect a few things about Spark. Thanks for your time once again :) On Mon, Jan 24, 2022 at 2:20 PM Mich Talebzadeh wrote: > Hm, > > I don't see what partition failure means here. > > You can have a node or executor failure etc.