Hi Gourav,
The static table is broadcast prior to the join, so the shuffle is primarily
there to avoid OOMEs (out-of-memory errors) during the UDF.
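To make the mechanics concrete, here is a minimal pure-Python sketch (not the job's actual code, and not Spark API) of what a broadcast hash join does: the small static table is shipped to every executor as a hash map, so the large streaming side is joined locally without being shuffled for the join itself. Table contents are made up for illustration.

```python
# Pure-Python sketch of a broadcast hash join (illustrative data only).
from collections import defaultdict

# Small "static" table with a NON-unique key: key "a" appears twice.
static_rows = [("a", 1), ("a", 2), ("b", 3)]

# Build the broadcast hash map once; this is what each executor receives.
broadcast_map = defaultdict(list)
for key, value in static_rows:
    broadcast_map[key].append(value)

# Each streaming input row is joined locally against the map.
stream_rows = [("a", "x"), ("b", "y")]
joined = [(k, payload, v)
          for k, payload in stream_rows
          for v in broadcast_map[k]]

print(joined)  # ("a", ...) fans out to two output records, ("b", ...) to one
```

This also shows why a non-unique key multiplies output: each input row produces one record per matching duplicate in the static table.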
It's not quite a Cartesian product, but yes, the join results in multiple
records per input record. The number of output records varies depending on the
number of duplicates in the static table.
> set spark.hadoop.parquet.block.size in our spark config for writing to Delta.
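For reference, one way that property can be supplied (the value and file name below are illustrative placeholders, not the poster's actual config) is at submit time:

```shell
# Hypothetical sketch: set the Parquet block (row group) size to 16 MiB
# for writes. 16777216 bytes is an example value, not the poster's setting.
spark-submit \
  --conf spark.hadoop.parquet.block.size=16777216 \
  your_streaming_job.py
```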
>
> Adam
>
> On Fri, Feb 11, 2022, 10:15 AM Chris Coutinho
> wrote:
>
>> I tried re-writing the table with the updated block size, but it doesn't
>> appear to have an effect on the row groups:
num_row_groups: 1
format_version: 1.0
serialized_size: 6364
Columns
...
Chris
On Fri, Feb 11, 2022 at 3:37 PM Sean Owen wrote:
> It should just be parquet.block.size indeed.
> df.write.option("parquet.block.size", "16m").parquet(...)
> This is an issue in ho
table with a much smaller row group size which wouldn't be
> ideal performance wise).
>
> Adam
>
> On Fri, Feb 11, 2022 at 7:23 AM Chris Coutinho
> wrote:
>
>> Hello,
>>
>> We have a spark structured streaming job that includes a stream-static
>> join and a Pandas UDF, streaming to/from delta tables.
Hello,
We have a spark structured streaming job that includes a stream-static join
and a Pandas UDF, streaming to/from delta tables. The primary key of the
static table is non-unique, meaning that the streaming join results in
multiple records per input record - in our case a 100x increase.
I'm also curious if this is possible, so while I can't offer a solution
maybe you could try the following.
The driver and executor nodes need to have access to the same
(distributed) file system, so you could try to mount the file system to
your laptop, locally, and then try to submit jobs from there.
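Not offering this as a verified recipe, but the suggestion above might look roughly like the following; the mount point, master URL, and paths are all placeholders, not values from this thread:

```shell
# Hypothetical sketch: the cluster's shared file system is mounted locally
# at /mnt/dfs. Master URL and all paths are placeholders.
spark-submit \
  --master spark://cluster-master:7077 \
  --deploy-mode client \
  /mnt/dfs/jobs/my_job.py
```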