Thank you for both tips, I will definitely try the pandas_udfs. As for
changing the select operation, it's not possible to have multiple explode
functions in the same select; sadly, they must be applied one at a time.
On Mon, Aug 3, 2020 at 11:41 AM Patrick McCarthy <pmccar...@dstillery.com> wrote:
If you use pandas_udfs in 2.4 they should be quite performant (or at least
won't suffer serialization overhead), so it might be worth looking into.
I didn't run your code, but one consideration is that the while loop might
be making the DAG a lot bigger than it has to be. You might see if defining
those
Hi Patrick, thank you for your quick response.
That's exactly what I think. Actually, the result of this processing is an
intermediate table that is going to be used to generate other views.
Another approach I'm trying now is to move the "explosion" step into this
"view generation" step, this
This seems like a very expensive operation. Why do you want to write out
all the exploded values? If you just want all combinations of values, could
you instead do it at read-time with a UDF or something?
On Sat, Aug 1, 2020 at 8:34 PM hesouol wrote:
I forgot to add a piece of information. By "can't write" I mean it keeps
processing and nothing happens. The job runs for hours even with a very
small file, and I have to force it to stop.
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
I have a PySpark method that applies the explode function to every array
column of the DataFrame.
from pyspark.sql.functions import explode_outer

def explode_column(df, column):
    # Replace the array column with its exploded values in the select
    # list, keeping the original column order.
    select_cols = list(df.columns)
    col_position = select_cols.index(column)
    select_cols[col_position] = explode_outer(column).alias(column)
    return df.select(*select_cols)