Help with Shuffle Read performance

2022-09-29 Thread Igor Calabria
Hi Everyone, I'm running Spark 3.2 on Kubernetes and have a job with a decently sized shuffle of almost 4 TB. The relevant cluster config is as follows: - 30 executors, 16 physical cores, configured with 32 cores for Spark - 128 GB RAM - shuffle.partitions is 18k, which gives me tasks of around
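For context, a back-of-the-envelope check of the per-task shuffle size implied by these numbers (4 TB total, 18k partitions; the exact per-task figure is cut off above, so this is an estimate):

```python
# Rough estimate of shuffle bytes per task: total shuffle size divided by
# the number of shuffle partitions. Numbers taken from the message above.
total_shuffle_mib = 4 * 1024 * 1024   # ~4 TiB expressed in MiB
shuffle_partitions = 18_000           # shuffle.partitions from the message

mib_per_task = total_shuffle_mib / shuffle_partitions
print(f"~{mib_per_task:.0f} MiB per task")  # roughly 233 MiB per task
```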

Re: Help with Shuffle Read performance

2022-09-29 Thread Tufan Rakshit
That's total nonsense, EMR is total crap; use Kubernetes, I will help you. Can you please share the size of the shuffle file that is getting generated in each task? What's the total number of partitions that you have? What machines are you using? Are you using an SSD? Best, Tufan

Re: Help with Shuffle Read performance

2022-09-29 Thread Igor Calabria
> What's the total number of partitions that you have? 18k. > What machines are you using? Are you using an SSD? A family of r5.4xlarge nodes. Yes, I'm using five gp3 disks, which gives me about 625 MB/s of sustained throughput (which is what I see when writing the shuffle data). > can
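As a sanity check (the per-volume figure is mine, not from the thread): gp3 EBS volumes ship with a 125 MB/s baseline throughput each, so the quoted 625 MB/s is exactly five volumes at baseline:

```python
# gp3 EBS volumes provide 125 MB/s baseline throughput each (AWS default;
# throughput can be provisioned higher per volume for extra cost).
gp3_baseline_mbps = 125
num_disks = 5

aggregate_mbps = gp3_baseline_mbps * num_disks
print(aggregate_mbps)  # 625, matching the observed sustained write throughput
```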

Re: Help with Shuffle Read performance

2022-09-29 Thread Gourav Sengupta
Hi, why not use EMR or Dataproc? Kubernetes does not provide any benefit at all for this scale of work. It is a classic case of over-engineering and over-complication just for the heck of it. Also, I think that if you are in AWS, Redshift Spectrum or Athena for 90% of use cases are way

Re: Help with Shuffle Read performance

2022-09-29 Thread Vladimir Prus
Igor, what exact instance types do you use? Unless you use local instance storage, and have actually configured your Kubernetes and Spark to use it, your 30x30 exchange can run into EBS IOPS limits. You can investigate that by going to an instance, then to the volume, and checking its monitoring
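One way to act on this advice, sketched as a Spark-on-Kubernetes config fragment: volumes whose name starts with `spark-local-dir-` are used as shuffle scratch space instead of the default `emptyDir`. The mount path `/mnt/nvme0` and the volume name are assumptions; check the "Local Storage" section of the Spark-on-Kubernetes docs for your version.

```python
# Hypothetical sketch: point executor shuffle scratch space at a local NVMe
# instance-store mount instead of EBS-backed storage. The host path
# /mnt/nvme0 is an assumption; adjust to where the instance store is mounted.
conf = {
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path":
        "/mnt/nvme0",
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path":
        "/mnt/nvme0",
}
```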

Re: Help with Shuffle Read performance

2022-09-29 Thread Gourav Sengupta
Hi, don't containers ultimately run on systems, and isn't the only advantage of containers that you can get better utilisation of system resources through micro-management of the jobs running in them? Some say that containers have their own binaries, which isolates the environment, but that is a lie, because in a

Spark ML VarianceThresholdSelector Unexpected Results

2022-09-29 Thread 姜鑫
Hi folks, has anyone used VarianceThresholdSelector (see https://spark.apache.org/docs/latest/ml-features.html#variancethresholdselector)? In the doc, an example is given that says `The variance for the 6

Re: Help with Shuffle Read performance

2022-09-29 Thread Leszek Reimus
Hi Everyone, to add my 2 cents here: the advantage of containers, to me, is that they leave the host system pristine and clean, allowing standardized DevOps deployment of hardware for any purpose. Way back, when using bare metal / Ansible, reusing hardware always involved a full reformat of the base

Re: Help with Shuffle Read performance

2022-09-29 Thread Gourav Sengupta
Hi Leszek, spot on; therefore EMR, being created, dynamically scaled up and down, and ephemeral, proves that there is actually no advantage to using containers for large jobs. It is utterly pointless, and I have attended interviews and workshops where no one has ever been able to prove its

Re: Spark ML VarianceThresholdSelector Unexpected Results

2022-09-29 Thread Sean Owen
This is the sample variance, not the population variance (i.e. divide by n-1, not n). I think that's justified, as the data are notionally a sample from a population. On Thu, Sep 29, 2022 at 9:21 PM 姜鑫 wrote: > Hi folks, > > Has anyone used VarianceThresholdSelector refer to >
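A quick illustration of the n-1 vs n distinction (the column values below are made up, not the ones from the Spark docs example):

```python
import statistics

# Hypothetical feature column; the point is the divisor, not the data.
col = [6.0, 7.0, 0.0, 7.0, 6.0, 0.0]
n = len(col)

sample_var = statistics.variance(col)       # divides by n - 1 (what the answer above says Spark ML uses)
population_var = statistics.pvariance(col)  # divides by n

# The two differ by exactly the factor n / (n - 1).
assert abs(sample_var - population_var * n / (n - 1)) < 1e-9
print(sample_var, population_var)
```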

Deploying stage-level scheduling for Spark SQL and how to expose RDD code from Spark SQL?

2022-09-29 Thread Chenghao Lyu
Hi, I am trying to deploy stage-level scheduling for Spark SQL. Since the current stage-level scheduling only supports the RDD APIs, I want to expose the RDD transformation code from my Spark SQL code (written with SQL syntax). Can you provide any pointers on how to do it? Stage-level scheduling:
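For reference, a minimal sketch of the RDD-level stage-level scheduling API (PySpark 3.1+). The resource numbers here are placeholder assumptions, and note the catch the question is about: dropping to `df.rdd` detaches the computation from the SQL optimizer, so the RDD stage no longer benefits from SQL planning.

```python
from pyspark.sql import SparkSession
from pyspark.resource import (
    ResourceProfileBuilder,
    TaskResourceRequests,
    ExecutorResourceRequests,
)

spark = SparkSession.builder.getOrCreate()

# Build a resource profile for the heavy stage; the values are placeholders.
ereqs = ExecutorResourceRequests().cores(8).memory("16g")
treqs = TaskResourceRequests().cpus(2)
profile = ResourceProfileBuilder().require(ereqs).require(treqs).build

# Stage-level scheduling is only honoured on RDD operations, so the SQL
# result must be pulled down to an RDD first (losing further SQL planning).
df = spark.range(1000)
rdd = df.rdd.withResources(profile).map(lambda row: row.id * 2)
```

Stage-level scheduling also requires dynamic allocation on YARN or Kubernetes; it is not available for plain DataFrame/SQL stages as of Spark 3.x.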