Thank you for both tips, I will definitely try the pandas_udfs. As for
changing the select operation, it isn't possible to have multiple explode
functions in the same select; sadly, they must be applied one at a time.
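For anyone following along, the practical effect of that one-at-a-time restriction is that each successive explode multiplies rows, so two exploded array columns yield their cross product. A plain-Python sketch of that row expansion (not Spark code; the column names "a" and "b" are made up):

```python
from itertools import product

# One row with two array columns, "a" and "b" (hypothetical names).
row = {"a": [1, 2], "b": ["x", "y"]}

# Exploding "a" and then "b" -- one at a time, as Spark requires --
# produces the cross product of the two arrays.
exploded = [(a, b) for a, b in product(row["a"], row["b"])]
# exploded == [(1, "x"), (1, "y"), (2, "x"), (2, "y")]
```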

On Mon, Aug 3, 2020 at 11:41 AM Patrick McCarthy <
pmccar...@dstillery.com> wrote:

> If you use pandas_udfs in 2.4 they should be quite performant (or at least
> won't suffer serialization overhead), so it might be worth looking into.
>
> I didn't run your code but one consideration is that the while loop might
> be making the DAG a lot bigger than it has to be. You might see if defining
> those columns with list comprehensions forming a single select() statement
> makes for a smaller DAG.
>
> On Mon, Aug 3, 2020 at 10:06 AM Henrique Oliveira <heso...@gmail.com>
> wrote:
>
>> Hi Patrick, thank you for your quick response.
>> That's exactly what I think. Actually, the result of this processing is
>> an intermediate table that will be used to generate other views.
>> Another approach I'm trying now is to move the "explosion" step into this
>> "view generation" step; that way I don't need to explode every column,
>> only those used by the final client.
>>
>> P.S. I was avoiding UDFs for now because I'm still on Spark 2.4 and the
>> Python UDFs I tried had very bad performance, but I will give it a try in
>> this case. It can't be worse.
>> Thanks again!
>>
>> On Mon, Aug 3, 2020 at 10:53 AM Patrick McCarthy <
>> pmccar...@dstillery.com> wrote:
>>
>>> This seems like a very expensive operation. Why do you want to write out
>>> all the exploded values? If you just want all combinations of values, could
>>> you instead do it at read-time with a UDF or something?
>>>
>>> On Sat, Aug 1, 2020 at 8:34 PM hesouol <heso...@gmail.com> wrote:
>>>
>>>> I forgot to add one piece of information. By "can't write" I mean it
>>>> keeps processing and nothing happens. The job runs for hours even with
>>>> a very small file, and I have to force-stop it.
>>>>
>>>>
>>>>
>>>> --
>>>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>
>>>>
>>>
>>> --
>>>
>>>
>>> *Patrick McCarthy  *
>>>
>>> Senior Data Scientist, Machine Learning Engineering
>>>
>>> Dstillery
>>>
>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>
>>
>
>
