Re: Identify bottleneck

2019-12-19 Thread Enrico Minack
The issue is explained in depth here: https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015 On 19.12.19 at 23:33, Chris Teoh wrote: As far as I'm aware it isn't any better. The logic all gets processed by the same engine so to confirm, compare the DAGs generated from

optimising cluster performance

2019-12-19 Thread Sriram Bhamidipati
Hi all, sorry, I forgot to set the subject line correctly earlier. > Hello Experts > I am trying to maximise the resource utilisation on my 3-node Spark > cluster (2 data nodes and 1 driver) so that the job finishes as quickly as possible. I am > trying to create a benchmark so I can recommend an optimal POD

Re: How to estimate the rdd size before the rdd result is written to disk

2019-12-19 Thread Sriram Bhamidipati
Hello Experts, I am trying to maximise the resource utilisation on my 3-node Spark cluster (2 data nodes and 1 driver) so that the job finishes as quickly as possible. I am trying to create a benchmark so I can recommend an optimal POD for the job. 128GB x 16 cores. I have standalone Spark 2.4.0 running. HTOP shows

How to estimate the rdd size before the rdd result is written to disk

2019-12-19 Thread zhangliyun
Hi all: I want to ask how to estimate the size of an RDD (in bytes) before it is saved to disk, because the job takes a long time if the output is very large and the number of output partitions is small. The following steps are what I have come up with for this problem: 1. sample 0.01 's
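
The preview above is cut off after its first step, so the following is only a sketch of one way a 1% sample could be used to estimate output size, not the poster's actual method. It is plain Python with no Spark dependency: it assumes the records are available as a list, uses pickle as a stand-in for whatever serialisation the real output uses, and the function name estimate_total_bytes is hypothetical.

```python
import pickle
import random


def estimate_total_bytes(records, fraction=0.01, seed=42):
    """Estimate the serialised size of `records` from a random sample.

    Assumption: pickle approximates the real output format closely enough
    for a rough estimate. `fraction` is the share of records to sample.
    """
    rng = random.Random(seed)
    # Keep each record with probability `fraction` (Bernoulli sampling).
    sample = [r for r in records if rng.random() < fraction]
    if not sample:
        # Nothing sampled (empty or very small input): no estimate possible.
        return 0
    sample_bytes = sum(len(pickle.dumps(r)) for r in sample)
    # Average bytes per sampled record, scaled up to the full record count.
    return int(sample_bytes / len(sample) * len(records))


if __name__ == "__main__":
    data = [("user-%d" % i, i * 3) for i in range(100_000)]
    print("estimated bytes:", estimate_total_bytes(data))
```

With a real RDD the same arithmetic would apply to a sample taken with rdd.sample(False, 0.01) and a record count from rdd.count(). The estimate can then be divided by a target file size to choose the number of output partitions; it is only as good as the sample is representative, so heavily skewed record sizes would need a larger fraction.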

unsubscribe

2019-12-19 Thread Sethupathi T
unsubscribe

Re: Identify bottleneck

2019-12-19 Thread Chris Teoh
As far as I'm aware it isn't any better. The logic all gets processed by the same engine so to confirm, compare the DAGs generated from both approaches and see if they're identical. On Fri, 20 Dec 2019, 8:56 am ayan guha wrote: > Quick question: Why is it better to use one sql vs multiple

Re: Identify bottleneck

2019-12-19 Thread ayan guha
Quick question: Why is it better to use one SQL statement vs multiple withColumn calls? Isn't everything eventually rewritten by Catalyst? On Wed, 18 Dec 2019 at 9:14 pm, Enrico Minack wrote: > How many withColumn statements do you have? Note that it is better to use > a single select, rather than lots of