date:20220103

Re: Joining many tables Re: Pyspark debugging best practices

2022-01-03 Thread Sonal Goyal

Hi Andrew, do you think the following would work? Build the data frames by adding a source column (sampleName) to each one. Add any extra columns needed to match quantSchema, then union them. That gives you one data frame with many entries per name, and you can then use windowing functions over them. On Tue, 4 Jan

Joining many tables Re: Pyspark debugging best practices

2022-01-03 Thread Andrew Davidson

Hi David, I need to select one column from each of many files and combine them into a single table. I do not believe union() will work: it appends rows, not columns. As far as I know, join() is the only way to append columns from different data frames. I think you are correct that using lazy evaluation over

Re: Pyspark debugging best practices

2022-01-03 Thread David Diebold

Hello Andy, are you sure you want to perform lots of join operations, and not simple unions? Are you doing inner joins or outer joins? Can you give us a rough idea of your list size and of each individual dataset's size? Having a look at the execution plan would help; maybe the high amount of