Thank you for both tips, I will definitely try the pandas_udfs. As for changing the select operation: unfortunately it isn't possible to have multiple explode functions in the same select; they must be applied one at a time.
On Mon, Aug 3, 2020 at 11:41 AM, Patrick McCarthy <pmccar...@dstillery.com> wrote:

> If you use pandas_udfs in 2.4 they should be quite performant (or at least
> won't suffer serialization overhead), might be worth looking into.
>
> I didn't run your code but one consideration is that the while loop might
> be making the DAG a lot bigger than it has to be. You might see if defining
> those columns with list comprehensions forming a single select() statement
> makes for a smaller DAG.
>
> On Mon, Aug 3, 2020 at 10:06 AM Henrique Oliveira <heso...@gmail.com>
> wrote:
>
>> Hi Patrick, thank you for your quick response.
>> That's exactly what I think. Actually, the result of this processing is
>> an intermediate table that is going to be used to generate other views.
>> Another approach I'm trying now is to move the "explosion" step into this
>> "view generation" step; this way I don't need to explode every column,
>> just those used by the final client.
>>
>> ps. I was avoiding UDFs for now because I'm still on Spark 2.4 and the
>> Python UDFs I tried had very bad performance, but I will give them a try
>> in this case. It can't be worse.
>> Thanks again!
>>
>> On Mon, Aug 3, 2020 at 10:53 AM, Patrick McCarthy
>> <pmccar...@dstillery.com> wrote:
>>
>>> This seems like a very expensive operation. Why do you want to write out
>>> all the exploded values? If you just want all combinations of values,
>>> could you instead do it at read-time with a UDF or something?
>>>
>>> On Sat, Aug 1, 2020 at 8:34 PM hesouol <heso...@gmail.com> wrote:
>>>
>>>> I forgot to add a piece of information. By "can't write" I mean it
>>>> keeps processing and nothing happens. The job runs for hours even with
>>>> a very small file, and I have to force it to stop.
>>>
>>> --
>>> Patrick McCarthy
>>> Senior Data Scientist, Machine Learning Engineering
>>> Dstillery
>>> 470 Park Ave South, 17th Floor, NYC 10016