Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-11 Thread Mich Talebzadeh
Catching up a bit late on this: I mentioned optimising RocksDB as below in my earlier thread, specifically # Add RocksDB configurations here spark.conf.set("spark.sql.streaming.stateStore.providerClass",

Re: Structured Streaming Process Each Records Individually

2024-01-11 Thread Mich Talebzadeh
Hi, Let us visit the approach, as some fellow members correctly highlighted the use case for Spark Structured Streaming, and two key concepts that I will mention - foreach: A method for applying custom write logic to each individual row in a streaming DataFrame or Dataset. -
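As a rough illustration of the per-row foreach idea mentioned in the preview above, here is a minimal sketch of a writer object with the open/process/close methods that PySpark's `DataStreamWriter.foreach` accepts. The class name `PerRowWriter` and the `handler` callback are hypothetical names chosen for this example, not taken from the thread.

```python
class PerRowWriter:
    """Hypothetical per-row writer for DataStreamWriter.foreach (a sketch only)."""

    def __init__(self, handler):
        # handler is any callable taking one row; on a cluster it runs on the
        # executors, so it must be serializable (assumption for this sketch).
        self.handler = handler

    def open(self, partition_id, epoch_id):
        # Returning True tells Spark to go ahead and process this partition.
        return True

    def process(self, row):
        # Custom write logic applied to each individual row.
        self.handler(row)

    def close(self, error):
        # Called once per partition; error is None when processing succeeded.
        pass


# Attaching it to a streaming DataFrame (requires a running Spark session):
#   query = streaming_df.writeStream.foreach(PerRowWriter(print)).start()
```

The object can be exercised locally without Spark by calling `open`, `process` and `close` by hand, which is convenient for unit-testing the per-row logic.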

Best option to process single kafka stream in parallel: PySpark Vs Dask

2024-01-11 Thread lab22
I am creating a setup to process packets from a single Kafka topic in parallel. For example, I have 3 containers (let's take 4 cores) on one VM, and from 1 Kafka topic stream I create 10 jobs depending on packet source. These packets have a small workload. 1. I can install Dask in each

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-11 Thread Jungtaek Lim
If you use the RocksDB state store provider, you can turn on changelog checkpointing so that a single changelog file is uploaded per partition per batch. With changelog checkpointing disabled, Spark uploads newly created SST files and some log files. If compaction has happened, most SST files have to be re-uploaded.
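A minimal sketch of the settings described above, assuming Spark 3.4 or later, where the changelog checkpointing option is available. The two configuration keys are Spark's own; the helper `apply_state_store_conf` and the constant name are hypothetical, introduced only for this example.

```python
# Spark configuration keys for the RocksDB state store with changelog checkpointing.
ROCKSDB_CHANGELOG_CONF = {
    "spark.sql.streaming.stateStore.providerClass":
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    "spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled": "true",
}


def apply_state_store_conf(spark, conf=None):
    """Hypothetical helper: set each key on a session-like object via spark.conf.set."""
    conf = ROCKSDB_CHANGELOG_CONF if conf is None else conf
    for key, value in conf.items():
        spark.conf.set(key, value)
    return dict(conf)


# Usage with a real session (set before the streaming query starts):
#   apply_state_store_conf(spark)
```

Setting these before starting the query matters, since the state store provider is chosen when the query is initialised.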

Re: [Structured Streaming] Avoid one microbatch delay with multiple stateful operations

2024-01-11 Thread Jungtaek Lim
Hi, The time window is closed and evicted as long as the "eviction watermark" passes the end of the window. The late events watermark only deals with discarding late events from "inputs". We did not introduce additional delay on the work of multiple stateful operators. We just allowed more late events to

Re: Okio Vulnerability in Spark 3.4.1

2024-01-11 Thread Bjørn Jørgensen
[SPARK-46662][K8S][BUILD] Upgrade kubernetes-client to 6.10.0: a new version of kubernetes-client, with okio version 1.17.6, is now merged to master and will be in Spark 4.0. On Tue, 14 Nov 2023 at 15:21, Bjørn Jørgensen wrote: > FYI > I have