If you use pandas_udfs in 2.4 they should be quite performant (or at least won't suffer the row-at-a-time serialization overhead of plain Python UDFs), so they might be worth looking into.
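
A minimal sketch of a scalar pandas UDF in the Spark 2.4 style, in case it helps. The `normalize` function, the string cleanup it does, and the column name in the usage note are all hypothetical, since the original code isn't shown in this thread. It assumes pyarrow is installed on the cluster.

```python
import pandas as pd


def normalize(values: pd.Series) -> pd.Series:
    # Plain pandas logic: called once per Arrow batch, not once per row.
    # (Hypothetical transformation, stands in for whatever the real job needs.)
    return values.str.strip().str.lower()


def as_pandas_udf():
    # Imported lazily so the pandas logic above can be tried without a Spark install.
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import StringType

    # Spark 2.4 spelling; Spark 3.x prefers Python type hints over PandasUDFType.
    return pandas_udf(normalize, StringType(), PandasUDFType.SCALAR)
```

With a DataFrame `df` that has a string column `value` (an assumed name), usage would be along the lines of `df.withColumn("value_norm", as_pandas_udf()(df["value"]))`.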
I didn't run your code, but one consideration is that the while loop might be making the DAG a lot bigger than it has to be. You might see if defining those columns with list comprehensions forming a single select() statement makes for a smaller DAG.

On Mon, Aug 3, 2020 at 10:06 AM Henrique Oliveira <heso...@gmail.com> wrote:

> Hi Patrick, thank you for your quick response.
> That's exactly what I think. Actually, the result of this processing is an
> intermediate table that is going to be used to generate other views.
> Another approach I'm trying now is to move the "explosion" step to this
> "view generation" step; this way I don't need to explode every column, but
> just those used for the final client.
>
> P.S. I was avoiding UDFs for now because I'm still on Spark 2.4 and the
> Python UDFs I tried had very bad performance, but I will give it a try in
> this case. It can't be worse.
> Thanks again!
>
> On Mon, Aug 3, 2020 at 10:53, Patrick McCarthy <
> pmccar...@dstillery.com> wrote:
>
>> This seems like a very expensive operation. Why do you want to write out
>> all the exploded values? If you just want all combinations of values, could
>> you instead do it at read-time with a UDF or something?
>>
>> On Sat, Aug 1, 2020 at 8:34 PM hesouol <heso...@gmail.com> wrote:
>>
>>> I forgot to add one piece of information. By "can't write" I mean it
>>> keeps processing and nothing happens. The job runs for hours even with
>>> a very small file, and I have to force it to stop.

--
Patrick McCarthy
Senior Data Scientist, Machine Learning Engineering
Dstillery
470 Park Ave South, 17th Floor, NYC 10016
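
For reference, a minimal sketch of the single-select() suggestion made at the top of this message. The original loop isn't shown in the thread, so the `split` transformation, the `_parts` suffix, and the helper names are assumptions for illustration only. It uses `selectExpr`, the SQL-string form of `select()`, so that the expression list is ordinary Python.

```python
def derived_exprs(names, sep=","):
    """Build one SQL expression string per source column (pure Python)."""
    # Hypothetical transformation: split each column on `sep` into an array column.
    return ["split(`{0}`, '{1}') AS `{0}_parts`".format(n, sep) for n in names]


def with_derived(df, names):
    # One projection carrying every derived column, instead of adding
    # one withColumn() projection to the plan on each pass of a loop.
    return df.selectExpr("*", *derived_exprs(names))
```

Calling `with_derived(df, ["a", "b"])` on a DataFrame with string columns `a` and `b` (assumed names) adds `a_parts` and `b_parts` in a single step. Comparing `df.explain()` output for the loop version and this version should show whether the plan actually gets smaller.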