Re: Identify bottleneck

2019-12-20 Thread Nicolas Paris
Apparently the "withColumn" issue only applies to hundreds or thousands of calls. That was not the case here (twenty calls). On Fri, Dec 20, 2019 at 08:53:16AM +0100, Enrico Minack wrote: > The issue is explained in depth here: https://medium.com/@manuzhang/ >

Re: Solved: Identify bottleneck

2019-12-20 Thread Antoine DUBOIS
Sent: Friday 20 December 2019 09:39:49 Subject: Re: Identify bottleneck Cool, thanks! Very helpful On Fri, 20 Dec 2019 at 6:53 pm, Enrico Minack <m...@enrico.minack.dev> wrote: The issue is explained in depth here: https://medium.com/

Re: Identify bottleneck

2019-12-20 Thread ayan guha
Cool, thanks! Very helpful On Fri, 20 Dec 2019 at 6:53 pm, Enrico Minack wrote: > The issue is explained in depth here: > https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015 > > On 19.12.19 at 23:33, Chris Teoh wrote: > > As far as I'm aware it isn't any better. The

Re: Identify bottleneck

2019-12-19 Thread Enrico Minack
The issue is explained in depth here: https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015 On 19.12.19 at 23:33, Chris Teoh wrote: As far as I'm aware it isn't any better. The logic all gets processed by the same engine so to confirm, compare the DAGs generated from

Re: Identify bottleneck

2019-12-19 Thread Chris Teoh
As far as I'm aware it isn't any better. The logic is all processed by the same engine, so to confirm, compare the DAGs generated from both approaches and see if they're identical. On Fri, 20 Dec 2019, 8:56 am ayan guha wrote: > Quick question: Why is it better to use one sql vs multiple
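
One way to act on the suggestion to compare what both approaches generate is sketched below in PySpark. It is a rough sketch, not something posted in the thread: `capture_plan` and `diff_plans` are hypothetical helper names, and it assumes that `DataFrame.explain()` prints the physical plan to Python's standard output, which holds for classic PySpark sessions as far as I know. Expression IDs such as `#12` differ between otherwise identical plans, so they are stripped before diffing.

```python
import contextlib
import difflib
import io
import re
from typing import List

# Catalyst tags attributes with numeric ids (e.g. size#10L); they vary per DataFrame.
_EXPR_ID = re.compile(r"#\d+")


def capture_plan(df) -> str:
    """Return the text printed by df.explain(), with expression ids stripped.

    Assumption: explain() writes the physical plan via Python's stdout.
    """
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain()
    return _EXPR_ID.sub("#", buf.getvalue()).strip()


def diff_plans(df_a, df_b) -> List[str]:
    """Unified diff of the two normalized plans; an empty list means they match."""
    plan_a = capture_plan(df_a).splitlines()
    plan_b = capture_plan(df_b).splitlines()
    return list(
        difflib.unified_diff(plan_a, plan_b, "approach_a", "approach_b", lineterm="")
    )
```

If `diff_plans(df_from_select, df_from_with_columns)` returns an empty list, both approaches ended up with the same physical plan, so any remaining difference would come from planning time rather than execution.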

Re: Identify bottleneck

2019-12-19 Thread ayan guha
Quick question: Why is it better to use one SQL vs multiple withColumn? Isn't everything eventually rewritten by Catalyst? On Wed, 18 Dec 2019 at 9:14 pm, Enrico Minack wrote: > How many withColumn statements do you have? Note that it is better to use > a single select, rather than lots of

Re: Identify bottleneck

2019-12-18 Thread Antoine DUBOIS
didn't have time to let it finish. From: "Enrico Minack" To: "Chris Teoh", "Antoine DUBOIS" Cc: "user @spark" Sent: Wednesday 18 December 2019 14:29:07 Subject: Re: Identify bottleneck Good points, but single-line CSV files are splittable

Re: Identify bottleneck

2019-12-18 Thread Enrico Minack
pache.org, "Antoine DUBOIS" <antoine.dub...@cc.in2p3.fr> Sent: Wednesday 18 December 2019 11:13:38 Subject: Re: Identify bottleneck How many withColumn statements do you have? Note that it is better to use a s

Re: Identify bottleneck

2019-12-18 Thread Chris Teoh
; reasonable for maintenance purposes. > I will try on a local instance and let you know. > > Thanks for the help. > > > -- > From: "Enrico Minack" > To: user@spark.apache.org, "Antoine DUBOIS" > Sent: Wednesday 18 December 20

Re: Identify bottleneck

2019-12-18 Thread Antoine DUBOIS
To: user@spark.apache.org, "Antoine DUBOIS" Sent: Wednesday 18 December 2019 11:13:38 Subject: Re: Identify bottleneck How many withColumn statements do you have? Note that it is better to use a single select, rather than lots of withColumn. This also makes drops redundant. Reading 25m C

Re: Identify bottleneck

2019-12-18 Thread Enrico Minack
How many withColumn statements do you have? Note that it is better to use a single select, rather than lots of withColumn. This also makes drops redundant. Reading 25m CSV lines and writing to Parquet in 5 minutes on 32 cores is really slow. Can you try this on a single machine, i.e. run wit
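
The single-select advice can be illustrated with a small PySpark sketch. The helper names and any column names passed in are hypothetical, not taken from the original job. Each `withColumn` call adds another projection to the logical plan, whereas one `selectExpr` call declares every output column at once; columns that are not listed are simply not selected, which is why separate drops become unnecessary.

```python
from typing import Dict, List


def build_select_exprs(passthrough: List[str], derived: Dict[str, str]) -> List[str]:
    """Build SQL expression strings for a single selectExpr call.

    passthrough: existing column names to keep unchanged.
    derived: new column name -> SQL expression over the input columns.
    """
    exprs = list(passthrough)
    exprs += [f"{expr} AS {name}" for name, expr in derived.items()]
    return exprs


def transform_with_single_select(df, passthrough: List[str], derived: Dict[str, str]):
    """One projection: keeps only the listed columns and adds the derived ones."""
    return df.selectExpr(*build_select_exprs(passthrough, derived))


def transform_with_many_with_columns(df, derived: Dict[str, str]):
    """Equivalent chain of withColumn calls, one extra projection per call."""
    from pyspark.sql import functions as F  # imported lazily; needs PySpark installed

    for name, expr in derived.items():
        df = df.withColumn(name, F.expr(expr))
    return df
```

One caveat of this sketch: in the `withColumn` chain a later expression may refer to a column added earlier in the chain, while in a single select every expression can only refer to input columns, so dependent expressions would have to be inlined or split across two selects.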

Identify bottleneck

2019-12-18 Thread Antoine DUBOIS
Hello, I'm working on an ETL job that takes CSV files describing file systems and transforms them into Parquet so I can work with them easily to extract information. I'm using Mr. Powers' framework Daria to do so. I have quite different inputs and a lot of transformations, and the framework helps organize the code.