Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Henrique Oliveira
Thank you for both tips; I will definitely try the pandas_udfs. As for changing the select operation, it's not possible to have multiple explode functions in the same select; sadly, they must be applied one at a time. On Mon, Aug 3, 2020 at 11:41, Patrick McCarthy < pmccar...@dstillery.com>

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Patrick McCarthy
If you use pandas_udfs in 2.4 they should be quite performant (or at least won't suffer serialization overhead), so they might be worth looking into. I didn't run your code, but one consideration is that the while loop might be making the DAG a lot bigger than it has to be. You might see if defining those

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Henrique Oliveira
Hi Patrick, thank you for your quick response. That's exactly what I think. Actually, the result of this processing is an intermediate table that will be used to generate other views. Another approach I'm trying now is to move the "explosion" step into this "view generation" step, this

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Patrick McCarthy
This seems like a very expensive operation. Why do you want to write out all the exploded values? If you just want all combinations of values, could you instead do it at read-time with a UDF or something? On Sat, Aug 1, 2020 at 8:34 PM hesouol wrote: > I forgot to add an information. By "can't
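
To make the cost concern concrete, here is a rough plain-Python sketch (not Spark code) of how many output rows one input row produces when every array column is exploded in turn. The function name, column names, and sample row are all hypothetical; the only assumption carried over from Spark is that explode_outer still emits one row (with a null) for a null or empty array.

```python
from math import prod


def exploded_row_count(row, array_columns):
    """Estimate rows produced from one input row after exploding each listed column."""
    # A null or empty array still yields one output row under explode_outer,
    # so every column contributes a factor of at least 1.
    return prod(max(len(row[c] or []), 1) for c in array_columns)


# Hypothetical sample row with three array columns of lengths 3, 4 and 0.
row = {"id": 1, "tags": ["a", "b", "c"], "scores": [1, 2, 3, 4], "notes": []}
print(exploded_row_count(row, ["tags", "scores", "notes"]))  # 3 * 4 * 1 = 12
```

The count is the product of the array lengths, so a handful of moderately sized array columns can multiply even a very small file into a very large output.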

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-01 Thread hesouol
I forgot to add one piece of information. By "can't write" I mean that the job keeps processing and nothing happens. It runs for hours even with a very small file, and I have to kill it manually. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

[Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-01 Thread Henrique Oliveira
I have a PySpark method that applies the explode function to every Array column of the DataFrame.

    def explode_column(df, column):
        select_cols = list(df.columns)
        col_position = select_cols.index(column)
        select_cols[col_position] = explode_outer(column).alias(column)
        return
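
The post is cut off here, so the rest of the method is not known. As an illustration only, the following plain-Python model (no Spark involved) mimics what exploding one column at a time does to a small set of rows. The helper names and sample data are hypothetical, and the model assumes explode_outer semantics: one output row per array element, and a single row holding None when the array is null or empty.

```python
def explode_outer_rows(rows, column):
    """Plain-Python model of explode_outer applied to one column of list-of-dict rows."""
    out = []
    for row in rows:
        # A null or empty array becomes a single row whose value is None.
        values = row[column] or [None]
        for value in values:
            out.append({**row, column: value})
    return out


def explode_all(rows, array_columns):
    """Explode the listed columns one after another, as the thread describes."""
    for column in array_columns:
        rows = explode_outer_rows(rows, column)
    return rows


# Hypothetical input: two rows, two array columns.
rows = [
    {"id": 1, "a": [1, 2], "b": ["x", "y", "z"]},
    {"id": 2, "a": [], "b": None},
]
result = explode_all(rows, ["a", "b"])
print(len(result))  # 2 * 3 rows for id 1, plus 1 row for id 2
```

Each pass multiplies the row count by the length of the array being exploded, which is one plausible reason a job like this grows so quickly; it does not by itself explain the hang reported in the thread.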