Re: [Structured Streaming] Avoid one microbatch delay with multiple stateful operations

2024-01-10 Thread Ant Kutschera
Hi, Do we have any option to make streaming queries with multiple stateful operations output data without waiting for this extra iteration? One of my ideas was to force an empty microbatch to run and propagate the late-events watermark without any new data. While this conceptually works, I didn't find a
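For reference, the settings discussed in this thread can be collected in a short spark-defaults.conf-style fragment (a sketch only; the values shown are the defaults in recent Spark 3.4+ releases, listed here simply to put the relevant knobs in one place):

```
# Allow chaining multiple stateful operators in a single streaming query (Spark 3.4+)
spark.sql.streaming.statefulOperator.allowMultiple   true

# Let Spark schedule empty (no-data) microbatches, so watermarks can still
# advance and evict state even when no new records arrive
spark.sql.streaming.noDataMicroBatches.enabled       true
```

Note that noDataMicroBatches.enabled governs whether Spark may run a batch without input; it is related to, but not the same as, manually forcing an empty microbatch as described above.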

Re: Structured Streaming Process Each Records Individually

2024-01-10 Thread Ant Kutschera
It might be good to first split the stream up into smaller streams, one per type. If ordering of the Kafka records is important, then you could partition them at the source based on the type, but be careful how you configure Spark to read from Kafka as that could also influence ordering. kdf
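The per-type partitioning idea can be illustrated outside of Spark and Kafka entirely. Below is a minimal, self-contained Python sketch (the CRC-based partitioner is a stand-in for Kafka's key-based murmur2 default partitioner, not the real thing): records carrying the same type key always land in the same partition, so per-type ordering survives partitioning.

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(record_type: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministic key -> partition mapping (stand-in for Kafka's
    murmur2-based default partitioner; illustration only)."""
    return zlib.crc32(record_type.encode("utf-8")) % num_partitions

# Records of mixed types arriving in order (type, sequence number)
events = [("order", 1), ("click", 2), ("order", 3), ("click", 4)]

partitions = {p: [] for p in range(NUM_PARTITIONS)}
for rec_type, seq in events:
    partitions[partition_for(rec_type)].append((rec_type, seq))

# Within each partition, every type's records keep their original order
for p, recs in partitions.items():
    order_seqs = [s for t, s in recs if t == "order"]
    assert order_seqs == sorted(order_seqs)
```

The same guarantee is what you get from keying the Kafka producer by type; the caveat in the message above still applies, since Spark's Kafka source reads partitions in parallel.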

Re: Structured Streaming Process Each Records Individually

2024-01-10 Thread Mich Talebzadeh
Use an intermediate work table to land the streaming JSON data in the first place, and then, according to the tag, store the data in the correct table. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

Re: Structured Streaming Process Each Records Individually

2024-01-10 Thread Khalid Mammadov
Use foreachBatch or foreach methods: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch On Wed, 10 Jan 2024, 17:42 PRASHANT L, wrote: > Hi > I have a use case where I need to process json payloads coming from Kafka > using structured

Structured Streaming Process Each Records Individually

2024-01-10 Thread PRASHANT L
Hi, I have a use case where I need to process JSON payloads coming from Kafka using Structured Streaming, but the thing is the JSON can have different formats; the schema is not fixed, and each JSON will have a @type tag, so based on the tag the JSON has to be parsed and loaded into a table with the tag name, and if a
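The routing step itself is just a dispatch on the @type field. A minimal sketch in plain Python, independent of how the payloads arrive from Kafka (the @type tag and the table-per-tag layout come from the question; everything else, including the sample record shapes, is illustrative):

```python
import json
from collections import defaultdict

def route_by_type(payloads):
    """Group raw JSON payloads into per-@type buckets.

    Each bucket maps to a target table named after the tag, as described
    in the question; payloads without a @type tag are collected separately.
    """
    tables = defaultdict(list)
    for raw in payloads:
        record = json.loads(raw)
        tag = record.get("@type", "_unknown")
        tables[tag].append(record)
    return dict(tables)

payloads = [
    '{"@type": "customer", "id": 1, "name": "a"}',
    '{"@type": "order", "id": 10, "total": 99.5}',
    '{"id": 2}',  # no @type tag
]
buckets = route_by_type(payloads)
# buckets: one list per tag -> {"customer": [...], "order": [...], "_unknown": [...]}
```

Inside Spark, the same logic would typically live in a foreachBatch function (as suggested in the reply above), writing each bucket to its tag-named table.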

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Andrzej Zera
Yes, I agree. But apart from maintaining this state internally (in memory, or in memory+disk as in the case of RocksDB), on every trigger it saves some information about this state to a checkpoint location. I'm afraid we can't do much about this checkpointing operation. I'll continue looking for

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Mich Talebzadeh
Hi, You may have a point on scenario 2. Caching Streaming DataFrames: In Spark Streaming, each batch of data is processed incrementally, and it may not fit the typical caching we discussed. Instead, Spark Streaming has its mechanisms to manage and optimize the processing of streaming data. Case

[Structured Streaming] Avoid one microbatch delay with multiple stateful operations

2024-01-10 Thread Andrzej Zera
I'm struggling with the following issue in Spark >=3.4, related to multiple stateful operations. When spark.sql.streaming.statefulOperator.allowMultiple is enabled, Spark keeps track of two types of watermarks: eventTimeWatermarkForEviction and eventTimeWatermarkForLateEvents. Introducing them
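The one-microbatch delay can be shown with a toy simulation: with multiple stateful operators, late records in batch N are filtered using the watermark computed at the end of batch N-1 (eventTimeWatermarkForLateEvents), while state eviction in batch N uses the freshly advanced value (eventTimeWatermarkForEviction). A plain-Python sketch of that bookkeeping (the numbers and the simplified max-event-time-minus-delay update rule are illustrative, not Spark's exact implementation):

```python
def simulate_watermarks(batches, delay):
    """Return (late_events_wm, eviction_wm) per microbatch.

    Simplified model: the watermark advances to max(event time) - delay
    at the end of each batch; the late-events watermark lags one batch.
    """
    wm = 0
    out = []
    for events in batches:
        late_events_wm = wm                # previous batch's watermark
        wm = max(wm, max(events) - delay)  # advanced value used for eviction
        out.append((late_events_wm, wm))
    return out

history = simulate_watermarks([[10, 20], [25, 30], [40]], delay=5)
# The eviction watermark of batch N becomes the late-events watermark
# of batch N+1 -- hence the one-microbatch delay discussed in this thread.
assert all(history[i][1] == history[i + 1][0] for i in range(len(history) - 1))
```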

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Andrzej Zera
Hey, Yes, that's how I understood it (scenario 1). However, I'm not sure if scenario 2 is possible. I think cache on a streaming DataFrame is supported only in foreachBatch (in which it's actually no longer a streaming DF). On Wed, 10 Jan 2024 at 15:01, Mich Talebzadeh wrote: > Hi, > > With

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Mich Talebzadeh
Hi, With regard to your point - Caching: Can you please explain what you mean by caching? I know that when you have batch and streaming sources in a streaming query, then you can try to cache batch ones to save on reads. But I'm not sure if it's what you mean, and I don't know how to apply what

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Andrzej Zera
Thank you very much for your suggestions. Yes, my main concern is checkpointing costs. I went through your suggestions and here are my comments: - Caching: Can you please explain what you mean by caching? I know that when you have batch and streaming sources in a streaming query, then you can try


[apache-spark] documentation on File Metadata _metadata struct

2024-01-10 Thread Jason Horner
All, the only documentation about the File Metadata (hidden _metadata struct) I can seem to find is on the Databricks website https://docs.databricks.com/en/ingestion/file-metadata-column.html#file-metadata-column for reference, here is the struct: _metadata: struct (nullable = false) |-- file_path: