Re: Identify bottleneck

2019-12-20 Thread Nicolas Paris
Apparently the "withColumn" issue only applies for hundreds or thousands of calls, which was not the case here (twenty calls). On Fri, Dec 20, 2019 at 08:53:16AM +0100, Enrico Minack wrote: > The issue is explained in depth here: https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea
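The scaling point above can be illustrated with a hypothetical micro-benchmark (not from the thread; all names are mine): each `withColumn` call re-analyses a slightly larger logical plan, so total planning work grows roughly quadratically with the number of calls, which is why twenty calls are harmless while hundreds are not.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

val spark = SparkSession.builder.master("local[1]").appName("withColumn-scaling").getOrCreate()

// Time how long it takes to build and analyse a plan with n chained withColumn calls.
def timePlanMillis(n: Int): Long = {
  val start = System.nanoTime()
  val df = (1 to n).foldLeft(spark.range(10).toDF("id")) { (d, i) =>
    d.withColumn(s"c$i", col("id") + lit(i))
  }
  df.queryExecution.analyzed  // forces analysis without running a job
  (System.nanoTime() - start) / 1000000
}

println(s"20 calls: ${timePlanMillis(20)} ms, 500 calls: ${timePlanMillis(500)} ms")
```

The 500-call run should take disproportionately longer than 25x the 20-call run, matching the behaviour described in the linked article.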

Re: Solved: Identify bottleneck

2019-12-20 Thread Antoine DUBOIS
Sent: Friday 20 December 2019 09:39:49 Subject: Re: Identify bottleneck Cool, thanks! Very helpful On Fri, 20 Dec 2019 at 6:53 pm, Enrico Minack <m...@enrico.minack.dev> wrote: The issue is explained in depth here: https://medium.com/

Re: Identify bottleneck

2019-12-20 Thread ayan guha
Cool, thanks! Very helpful On Fri, 20 Dec 2019 at 6:53 pm, Enrico Minack wrote: > The issue is explained in depth here: > https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015 > > Am 19.12.19 um 23:33 schrieb Chris Teoh: > > As far as I'm aware it isn't any better. The l

Re: Identify bottleneck

2019-12-19 Thread Enrico Minack
The issue is explained in depth here: https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015 On 19.12.19 at 23:33, Chris Teoh wrote: As far as I'm aware it isn't any better. The logic all gets processed by the same engine, so to confirm, compare the DAGs generated from b

Re: Identify bottleneck

2019-12-19 Thread Chris Teoh
As far as I'm aware it isn't any better. The logic all gets processed by the same engine, so to confirm, compare the DAGs generated from both approaches and see if they're identical. On Fri, 20 Dec 2019, 8:56 am ayan guha, wrote: > Quick question: Why is it better to use one sql vs multiple withC
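Chris's plan-comparison suggestion can be sketched like this (the DataFrame definitions are illustrative assumptions, not code from the thread):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

val spark = SparkSession.builder.master("local[*]").appName("compare-plans").getOrCreate()
val base = spark.range(100).toDF("id")

// Same logic expressed two ways: chained withColumn vs a single select.
val viaWithColumn = base
  .withColumn("a", col("id") + lit(1))
  .withColumn("b", col("id") * lit(2))
val viaSelect = base.select(
  col("id"),
  (col("id") + lit(1)).as("a"),
  (col("id") * lit(2)).as("b"))

// Print parsed, analysed, optimised and physical plans for visual comparison.
viaWithColumn.explain(true)
viaSelect.explain(true)

// Programmatic check: sameResult ignores cosmetic differences such as attribute ids.
println(viaWithColumn.queryExecution.optimizedPlan
  .sameResult(viaSelect.queryExecution.optimizedPlan))
```

If the optimised plans are equivalent, runtime performance of the two styles should match; any remaining difference is in plan-construction (driver-side) cost.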

Re: Identify bottleneck

2019-12-19 Thread ayan guha
Quick question: why is it better to use one SQL statement vs multiple withColumn calls? Isn't everything eventually rewritten by Catalyst? On Wed, 18 Dec 2019 at 9:14 pm, Enrico Minack wrote: > How many withColumn statements do you have? Note that it is better to use > a single select, rather than lots of withCo

Re: Identify bottleneck

2019-12-19 Thread Chris Teoh
To: "Chris Teoh", "user @spark" <user@spark.apache.org> Sent: Wednesday 18 December 2019 14:59:12 Subject: Re: Identify bottleneck > I can confirm that the job is able to use multiple cores on multiple nodes at the same time and that I hav

Re: Identify bottleneck

2019-12-18 Thread Antoine DUBOIS
Sent: Wednesday 18 December 2019 14:59:12 Subject: Re: Identify bottleneck I can confirm that the job is able to use multiple cores on multiple nodes at the same time and that I have several tasks running at the same time. Depending on my CSV, it takes from 5 parts up to several hundred parts. Regarding the job runn

Re: Identify bottleneck

2019-12-18 Thread Antoine DUBOIS
didn't have time to let it finish. From: "Enrico Minack" To: "Chris Teoh", "Antoine DUBOIS" Cc: "user @spark" Sent: Wednesday 18 December 2019 14:29:07 Subject: Re: Identify bottleneck Good points, but single-line CSV files are splittable (n
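The splittability point can be checked directly by looking at how many input partitions Spark creates (paths here are placeholders, not from the thread):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("csv-splits").getOrCreate()

// An uncompressed CSV with one record per line can be split into many
// input partitions (roughly one per ~128 MB block by default).
val plain = spark.read.option("header", "true").csv("/path/to/data.csv")
println(plain.rdd.getNumPartitions)

// A gzipped CSV is not splittable: each .gz file becomes a single partition,
// which serialises the read no matter how many cores are available.
val gzipped = spark.read.option("header", "true").csv("/path/to/data.csv.gz")
println(gzipped.rdd.getNumPartitions)
```

This matches the earlier observation of "from 5 parts up to several hundred parts" depending on the input file's size.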

Re: Identify bottleneck

2019-12-18 Thread Enrico Minack
To: user@spark.apache.org, "Antoine DUBOIS" <antoine.dub...@cc.in2p3.fr> Sent: Wednesday 18 December 2019 11:13:38 Subject: Re: Identify bottleneck How many withColumn statements do you have? Note that it is better to use

Re: Identify bottleneck

2019-12-18 Thread Chris Teoh
not > reasonable for maintenance purposes. > I will try on a local instance and let you know. > > Thanks for the help. > > -- > From: "Enrico Minack" > To: user@spark.apache.org, "Antoine DUBOIS" > Sent: Wednesday 18 D

Re: Identify bottleneck

2019-12-18 Thread Antoine DUBOIS
From: "Enrico Minack" To: user@spark.apache.org, "Antoine DUBOIS" Sent: Wednesday 18 December 2019 11:13:38 Subject: Re: Identify bottleneck How many withColumn statements do you have? Note that it is better to use a single select, rather than lots of withColumn. This also makes drops redundant. Readin

Re: Identify bottleneck

2019-12-18 Thread Enrico Minack
How many withColumn statements do you have? Note that it is better to use a single select, rather than lots of withColumn. This also makes drops redundant. Reading 25m CSV lines and writing to Parquet in 5 minutes on 32 cores is really slow. Can you try this on a single machine, i.e. run with "
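A hedged sketch of this advice (column names and paths are made up for illustration): express all derived columns in one `select` instead of chained `withColumn` calls. Because `select` lists exactly the output columns, separate `.drop(...)` calls become redundant. The `local[1]` master runs everything on one core of one machine, as suggested, to get a baseline against the 32-core cluster run.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}

val spark = SparkSession.builder.master("local[1]").appName("etl-baseline").getOrCreate()
val df = spark.read.option("header", "true").csv("/path/to/input.csv")

// Instead of:
//   df.withColumn("a2", ...).withColumn("b2", ...).drop("a").drop("b")
// one select produces a single projection and implicitly drops "a" and "b":
val out = df.select(
  col("a").cast("long").as("a2"),
  to_timestamp(col("b"), "yyyy-MM-dd").as("b2"))

out.write.mode("overwrite").parquet("/path/to/output.parquet")
```

If the single-machine run is not dramatically slower than the cluster run, the bottleneck is likely not CPU parallelism but something else (input splitting, I/O, or plan-construction overhead).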

Identify bottleneck

2019-12-18 Thread Antoine DUBOIS
Hello, I'm working on an ETL based on CSVs describing file systems, transforming them into Parquet so I can work on them easily to extract information. I'm using Mr. Powers' framework Daria to do so. I have quite different inputs and a lot of transformations, and the framework helps organize the code.
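The described pipeline can be sketched roughly as below, using the `.transform` chaining idiom that spark-daria encourages (column names and paths are assumptions, not from the post):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("fs-etl").getOrCreate()

// A reusable transformation: keep all existing columns and add a derived one.
def withSizeMb(df: DataFrame): DataFrame =
  df.select(df.columns.map(col) :+ (col("size_bytes") / 1048576).as("size_mb"): _*)

spark.read
  .option("header", "true")
  .csv("/path/to/fs-listing.csv")
  .transform(withSizeMb)           // chain further transformations here
  .write.mode("overwrite").parquet("/path/to/fs-listing.parquet")
```

Writing each step as a `DataFrame => DataFrame` function keeps the transformations testable in isolation and composable with `.transform`, which is the organizational benefit the author attributes to Daria.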