Re: Unable to force small partitions in streaming job without repartitioning

2022-02-12 Thread Chris Coutinho
Hi Gourav, The static table is broadcast prior to the join, so the shuffle is primarily there to avoid OOME during the UDF. It's not quite a Cartesian product, but yes, the join results in multiple records per input record. The number of output records varies depending on the number of duplicates in
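
A minimal sketch of the pattern Chris describes — broadcast the static table, then repartition before a memory-heavy Pandas UDF. The paths, the join key, the partition count, and the placeholder UDF are all illustrative assumptions, not from the thread:

    from pyspark.sql import SparkSession, functions as F
    import pandas as pd

    spark = SparkSession.builder.getOrCreate()

    stream_df = spark.readStream.format("delta").load("/delta/events")  # assumed path
    static_df = spark.read.format("delta").load("/delta/lookup")        # assumed path

    # Broadcasting the small static table avoids a shuffle for the join itself;
    # the explicit repartition afterwards only bounds how much data each task
    # feeds into the memory-heavy Pandas UDF.
    joined = stream_df.join(F.broadcast(static_df), on="key", how="inner")
    joined = joined.repartition(400)  # tune so each partition fits in executor memory

    @F.pandas_udf("double")
    def heavy_udf(v: pd.Series) -> pd.Series:
        return v * 2.0  # placeholder for the real per-record computation

    result = joined.withColumn("out", heavy_udf("value"))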

Re: Unable to force small partitions in streaming job without repartitioning

2022-02-12 Thread Chris Coutinho
set spark.hadoop.parquet.block.size in our Spark config for writing to Delta. > Adam > On Fri, Feb 11, 2022, 10:15 AM Chris Coutinho wrote: >> I tried re-writing the table with the updated block size but it doesn't appear to have an effect on the ro
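
For reference, a sketch of what setting that block size globally might look like; the 16 MB value is only an example:

    from pyspark.sql import SparkSession

    # Settings prefixed with "spark.hadoop." are forwarded to the Hadoop
    # configuration used by the Parquet writer.
    spark = (
        SparkSession.builder
        .config("spark.hadoop.parquet.block.size", str(16 * 1024 * 1024))  # example value
        .getOrCreate()
    )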

Re: Unable to force small partitions in streaming job without repartitioning

2022-02-11 Thread Chris Coutinho
num_row_groups: 1 format_version: 1.0 serialized_size: 6364 Columns ... Chris. On Fri, Feb 11, 2022 at 3:37 PM Sean Owen wrote: > It should just be parquet.block.size indeed. > spark.write.option("parquet.block.size", "16m").parquet(...) > This is an issue in ho
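
The metadata fields quoted above (num_row_groups, format_version, serialized_size) match what pyarrow reports; a sketch of how one might reproduce that check on a written file, with an assumed path:

    import glob
    import pyarrow.parquet as pq

    # Inspect one Parquet part file; num_row_groups should grow as the
    # configured block size shrinks.
    part = glob.glob("/delta/static/*.parquet")[0]  # assumed path
    meta = pq.ParquetFile(part).metadata
    print(meta.num_row_groups, meta.format_version, meta.serialized_size)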

Re: Unable to force small partitions in streaming job without repartitioning

2022-02-11 Thread Chris Coutinho
table with a much smaller row group size, which wouldn't be ideal performance-wise). > Adam > On Fri, Feb 11, 2022 at 7:23 AM Chris Coutinho wrote: >> Hello, >> We have a Spark structured streaming job that includes a stream-static join a
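
A sketch of the rewrite Adam is describing, using the per-write option from Sean's reply; the DataFrame name, paths, and 16 MB value are assumptions:

    # Rewrite the static table with a smaller Parquet block (row group) size
    # so downstream scans produce smaller input partitions; value in bytes.
    static_df = spark.read.format("delta").load("/delta/static")  # assumed path
    (static_df.write
        .option("parquet.block.size", 16 * 1024 * 1024)
        .mode("overwrite")
        .parquet("/delta/static_small_rowgroups"))                # assumed path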

Unable to force small partitions in streaming job without repartitioning

2022-02-11 Thread Chris Coutinho
Hello, We have a Spark structured streaming job that includes a stream-static join and a Pandas UDF, streaming to/from Delta tables. The primary key of the static table is non-unique, meaning that the stream-static join results in multiple records per input record - in our case a 100x increase. The
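
A bare skeleton of the kind of job described — a stream-static join streaming to/from Delta. Paths, trigger, and the join key are assumptions; the checkpoint location is required by structured streaming:

    stream_df = spark.readStream.format("delta").load("/delta/input")  # assumed path
    static_df = spark.read.format("delta").load("/delta/static")       # assumed path

    # Because the static table's key is non-unique, each streaming record can
    # match many static rows, fanning out the output (here ~100x).
    out = stream_df.join(static_df, "key")

    query = (
        out.writeStream
        .format("delta")
        .option("checkpointLocation", "/delta/_checkpoints/job")  # assumed path
        .start("/delta/output")                                   # assumed path
    )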

Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Chris Coutinho
I'm also curious if this is possible, so while I can't offer a solution, maybe you could try the following. The driver and executor nodes need access to the same (distributed) file system, so you could try mounting the file system on your laptop locally and then submitting jobs and/or
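
As a sketch of that idea: run the driver on the laptop against the cluster's master, reading from a path both sides resolve the same way. The master URL, host names, and data URI below are assumptions:

    from pyspark.sql import SparkSession

    # Driver runs locally; executors run on the cluster. Both must resolve the
    # same data location, e.g. via HDFS URIs or an identical mount point.
    spark = (
        SparkSession.builder
        .master("spark://spark-server:7077")  # assumed master URL
        .appName("laptop-driver")
        .getOrCreate()
    )

    df = spark.read.parquet("hdfs://namenode:8020/data/table")  # assumed URI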