subject:"Spark Structured Streaming"

RE: Clarification on failOnDataLoss Behavior in Spark Structured Streaming with Kafka

2025-07-15 Thread Wolfgang Buchner

ninterruptiblyIfPossible(KafkaDataConsumer.scala:656) at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.get(KafkaDataConsumer.scala:299) at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.next(KafkaBatchPartitionReader.scala:79) ``` Best regards Wolfgang Buchner On 2025/0

RE: Clarification on failOnDataLoss Behavior in Spark Structured Streaming with Kafka

2025-07-15 Thread Wolfgang Buchner

UninterruptiblyIfPossible(KafkaDataConsumer.scala:656) at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.get(KafkaDataConsumer.scala:299) at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.next(KafkaBatchPartitionReader.scala:79) ``` Best regards Wolfgang Buchner On 2025/07/10 10:04

Re: Clarification on failOnDataLoss Behavior in Spark Structured Streaming with Kafka

2025-07-14 Thread Khalid Mammadov

red-streaming-kafka-integration.html >> ): >> >> "latest" for streaming, "earliest" for batch >> >> >> On Thu, 10 Jul 2025, 11:04 Nimrod Ofek, wrote: >> >>> Hi everyone, >>> >>> I'm currently working wit

Re: Clarification on failOnDataLoss Behavior in Spark Structured Streaming with Kafka

2025-07-13 Thread Nimrod Ofek

atest/streaming/structured-streaming-kafka-integration.html > ): > > "latest" for streaming, "earliest" for batch > > > On Thu, 10 Jul 2025, 11:04 Nimrod Ofek, wrote: > >> Hi everyone, >> >> I'm currently working with Spark Structured Streaming

Re: Clarification on failOnDataLoss Behavior in Spark Structured Streaming with Kafka

2025-07-10 Thread Khalid Mammadov

://spark.apache.org/docs/latest/streaming/structured-streaming-kafka-integration.html ): "latest" for streaming, "earliest" for batch On Thu, 10 Jul 2025, 11:04 Nimrod Ofek, wrote: > Hi everyone, > > I'm currently working with Spark Structured Streaming integrated w

Clarification on failOnDataLoss Behavior in Spark Structured Streaming with Kafka

2025-07-10 Thread Nimrod Ofek

Hi everyone, I'm currently working with Spark Structured Streaming integrated with Kafka and had some questions regarding the failOnDataLoss option. The current documentation states: *"Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or
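For readers skimming this thread, here is a minimal sketch of where failOnDataLoss sits among the Kafka source options. The broker address, topic name and the helper `apply_options` are placeholders made up for illustration, not taken from the thread. The defaults shown are the documented ones: startingOffsets is "latest" for streaming queries ("earliest" for batch), as quoted in the replies, and failOnDataLoss is "true".

```python
# Options for spark.readStream.format("kafka"); every value is passed as a string.
kafka_source_options = {
    "kafka.bootstrap.servers": "localhost:9092",  # placeholder broker (assumption)
    "subscribe": "events",                        # placeholder topic (assumption)
    "startingOffsets": "latest",                  # documented default for streaming queries
    "failOnDataLoss": "true",                     # documented default: fail if data may be lost
}


def apply_options(reader, options):
    """Apply each option to a DataStreamReader-like object and return the result.

    Hypothetical helper: it only relies on reader.option(key, value)
    returning a reader, which is how DataStreamReader.option behaves.
    """
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader
```

With a live session this would be used roughly as `apply_options(spark.readStream.format("kafka"), kafka_source_options).load()`.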

[Spark Structured Streaming] Is it possible to enable AQE in some case?

2024-12-19 Thread bluzy

Hi, I'm using Structured Streaming to consume data from Kafka and load it into HDFS. I discovered that AQE (Adaptive Query Execution) is forcibly disabled when using Structured Streaming, as noted in this issue: https://issues.apache.org/jira/browse/SPARK-19873 In my case, the streaming query doe

Re: [Spark Structured Streaming] How to delete old data that was created by Spark Structured Streaming?

2024-12-03 Thread Andrei L

you have two options >> #1 cleanup WAL files (afaik it’s named _metadata folder inside your data >> folder) which requires that SSS job has to be stopped before you are >> cleaning the WAL. >> #2 you can use foreachBatch to write your data but then your SSS will >> not

Re: [Spark Structured Streaming] How to delete old data that was created by Spark Structured Streaming?

2024-12-03 Thread Mich Talebzadeh

to write your data but then your SSS will not > be exactly once but at least once > > Best regards > > On 3 Dec 2024, at 17:07, Дубинкин Егор wrote: > > > Hello Community, > > I need to delete old src data created by Spark Structured Streaming. > Just deleting r

Re: [Spark Structured Streaming] How to delete old data that was created by Spark Structured Streaming?

2024-12-03 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)

write your data but then your SSS will not be exactly once but at least once Best regards > On 3 Dec 2024, at 17:07, Дубинкин Егор wrote: > > > Hello Community, > > I need to delete old src data created by Spark Structured Streaming. > Just deleting relevant folder thro

RE: [Spark Structured Streaming] How to delete old data that was created by Spark Structured Streaming?

2024-12-03 Thread Дубинкин Егор

Forgot to mention: Spark 3.5.2 is used On 2024/12/03 15:05:18 Дубинкин Егор wrote: > Hello Community, > > I need to delete old src data created by Spark Structured Streaming. > Just deleting relevant folder throws an exception while reading batch > dataframe f

[Spark Structured Streaming] How to delete old data that was created by Spark Structured Streaming?

2024-12-03 Thread Дубинкин Егор

Hello Community, I need to delete old src data created by Spark Structured Streaming. Just deleting relevant folder throws an exception while reading batch dataframe from file-system: java.io.FileNotFoundException: File file:/data/avro/year=2020/month=3/day=13/hour=12/part-0-0cc84e65-3f49

Re: Update mode in spark structured streaming

2024-06-15 Thread Mich Talebzadeh

Best to qualify your thoughts with an example By using the foreachBatch function combined with the update output mode in Spark Structured Streaming, you can effectively handle and integrate late-arriving data into your aggregations. This approach will allow you to continuously update your

Update mode in spark structured streaming

2024-06-14 Thread Om Prakash

Hi Team, Hope you all are doing well. I have run into a use case in which I want to do the aggregation in foreachBatch and use update mode for handling late data in structured streaming. Will this approach work in effectively capturing late arriving data in the aggregations? Please help. Thank

Re: [Spark Streaming]: Save the records that are dropped by watermarking in spark structured streaming

2024-05-08 Thread Mich Talebzadeh

nformation provided is correct to the best of my knowledge but of course cannot be guaranteed . It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions (Werner <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun <https://

[Spark Streaming]: Save the records that are dropped by watermarking in spark structured streaming

2024-05-07 Thread Nandha Kumar

Hi Team, We are trying to use *spark structured streaming *for our use case. We will be joining 2 streaming sources(from kafka topic) with watermarks. As time progresses, the records that are prior to the watermark timestamp are removed from the state. For our use case, we want to *store

Re: Building an Event-Driven Real-Time Data Processor with Spark Structured Streaming and API Integration

2024-02-09 Thread Mich Talebzadeh

explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Fri, 9 Feb 2024 at 16:16, Mich Talebzadeh wrote: > Appreciate your thoughts on this, Personally I think Spark Structured > Streaming can be used effectively i

Building an Event-Driven Real-Time Data Processor with Spark Structured Streaming and API Integration

2024-02-09 Thread Mich Talebzadeh

Appreciate your thoughts on this. Personally I think Spark Structured Streaming can be used effectively in an Event Driven Architecture (as well as continuous streaming). From the link here <https://www.linkedin.com/posts/activity-7161748945801617409-v29V?utm_source=share&

Re: Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-09 Thread Mich Talebzadeh

Hi Ashok, Thanks for pointing out the databricks article Scalable Spark Structured Streaming for REST API Destinations | Databricks Blog <https://www.databricks.com/blog/scalable-spark-structured-streaming-rest-api-destinations> I browsed it and it is basically similar to many of us in

Re: Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-09 Thread ashok34...@yahoo.com.INVALID

Hey Mich, Thanks for this introduction on your forthcoming proposal "Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics". I recently came across an article by Databricks with title Scalable Spark Structured Streaming for REST API Destinations.

Re: Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-08 Thread Mich Talebzadeh

destruction. On Mon, 8 Jan 2024 at 19:30, Mich Talebzadeh wrote: > Thought it might be useful to share my idea with fellow forum members. During > the breaks, I worked on the *seamless integration of Spark Structured > Streaming with Flask REST API for real-time data ingestion and analyt

Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-08 Thread Mich Talebzadeh

Thought it might be useful to share my idea with fellow forum members. During the breaks, I worked on the *seamless integration of Spark Structured Streaming with Flask REST API for real-time data ingestion and analytics*. The use case revolves around a scenario where data is generated through

Re: Spark structured streaming tab is missing from spark web UI

2023-11-24 Thread Jungtaek Lim

The feature was added in Spark 3.0. Btw, you may want to check out the EOL date for Apache Spark releases - https://endoflife.date/apache-spark 2.x is already EOLed. On Fri, Nov 24, 2023 at 11:13 PM mallesh j wrote: > Hi Team, > > I am trying to test the performance of a spark streaming applica

[Spark Structured Streaming] Two sink from Single stream

2023-11-15 Thread Subash Prabanantham

Hi Team, I am working on a basic streaming aggregation where I have one file stream source and two write sinks (Hudi table). The only difference is the aggregation performed is different, hence I am using the same spark session to perform both operations. (File Source) --> Agg1 -> DF1 -->

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread Danilo Sousa

Unsubscribe > Em 9 de out. de 2023, à(s) 07:03, Mich Talebzadeh > escreveu: > > Hi, > > Please see my responses below: > > 1) In Spark Structured Streaming does commit mean streaming data has been > delivered to the sink like Snowflake? > > No. a co

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread Mich Talebzadeh

day, 9 October 2023 at 11:04:41 BST, Mich Talebzadeh < > mich.talebza...@gmail.com> wrote: > > > Hi, > > Please see my responses below: > > 1) In Spark Structured Streaming does commit mean streaming data has been > delivered to the sink like Snowflake? > > No.

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread ashok34...@yahoo.com.INVALID

responses below: 1) In Spark Structured Streaming does commit mean streaming data has been delivered to the sink like Snowflake? No. a commit does not refer to data being delivered to a sink like Snowflake or bigQuery. The term commit refers to Spark Structured Streaming (SS) internals. Specifically it

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread Mich Talebzadeh

Hi, Please see my responses below: 1) In Spark Structured Streaming does commit mean streaming data has been delivered to the sink like Snowflake? No. a commit does not refer to data being delivered to a sink like Snowflake or bigQuery. The term commit refers to Spark Structured Streaming (SS

Clarification with Spark Structured Streaming

2023-10-08 Thread ashok34...@yahoo.com.INVALID

Hello team 1) In Spark Structured Streaming does commit mean streaming data has been delivered to the sink like Snowflake? 2) if sinks like Snowflake cannot absorb or digest streaming data in a timely manner, will there be an impact on spark streaming itself? Thanks AK

Re: [Spark Structured Streaming]: Dynamic Scaling of Executors

2023-05-29 Thread Aishwarya Panicker

Hi, Thanks for your response. I understand there is no explicit way to configure dynamic scaling for Spark Structured Streaming as the ticket is still open for that. But is there a way to manage dynamic scaling with the existing Batch Dynamic scaling algorithm as this kicks in when Dynamic

Re: [Spark Structured Streaming]: Dynamic Scaling of Executors

2023-05-25 Thread Mich Talebzadeh

Hi, Autoscaling is not compatible with Spark Structured Streaming <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html> since Spark Structured Streaming currently does not support dynamic allocation (see SPARK-24815: Structured Streaming should support d

[Spark Structured Streaming]: Dynamic Scaling of Executors

2023-05-25 Thread Aishwarya Panicker

Hi Team, I have been working on Spark Structured Streaming and trying to autoscale our application through dynamic allocation. But I couldn't find any documentation or configurations that supports dynamic scaling in Spark Structured Streaming, due to which I had been using Spark Batch

Re: Question related to asynchronously map transformation using java spark structured streaming

2023-03-26 Thread Mich Talebzadeh

Agreed. How does asynchronous communication relate to Spark Structured streaming? In the previous post of yours, you made your Spark to run on the driver in a single JVM. You attempted to increase the number of executors to 3 after submission of the job that (as Sean alluded to) would not work

Re: Question related to asynchronously map transformation using java spark structured streaming

2023-03-26 Thread Sean Owen

What do you mean by asynchronously here? On Sun, Mar 26, 2023, 10:22 AM Emmanouil Kritharakis < kritharakismano...@gmail.com> wrote: > Hello again, > > Do we have any news for the above question? > I would really appreciate it. > > Thank you, > > --

Re: Question related to asynchronously map transformation using java spark structured streaming

2023-03-26 Thread Emmanouil Kritharakis

Hello again, Do we have any news for the above question? I would really appreciate it. Thank you, -- Emmanouil (Manos) Kritharakis Ph.D. candidate in the Department of Computer Science

Question related to asynchronously map transformation using java spark structured streaming

2023-03-14 Thread Emmanouil Kritharakis

Hello, I hope this email finds you well! I have a simple dataflow in which I read from a kafka topic, perform a map transformation and then I write the result to another topic. Based on your documentation here

Re: [Spark Structured Streaming] Could we apply new options of readStream/writeStream without stopping spark application (zero downtime)?

2023-03-09 Thread hueiyuan su

") \ >> .load() \ >> .select(from_json(col("value").cast("string"), >> schema).alias("parsed_value")) >> >> Ok, one secure way of doing it though shutting down the streaming process >> gracefully without loss

Re: [Spark Structured Streaming] Could we apply new options of readStream/writeStream without stopping spark application (zero downtime)?

2023-03-09 Thread Mich Talebzadeh

") \ > .load() \ > .select(from_json(col("value").cast("string"), > schema).alias("parsed_value")) > > Ok, one secure way of doing it though shutting down the streaming process > gracefully without loss of data that

Re: [Spark Structured Streaming] Could we apply new options of readStream/writeStream without stopping spark application (zero downtime)?

2023-03-07 Thread Mich Talebzadeh

").cast("string"), schema).alias("parsed_value")) Ok, one secure way of doing it though shutting down the streaming process gracefully without loss of data that impacts consumers. The other method implies inflight changes as suggested by the topic with zero in

Re: [Spark Structured Streaming] Do spark structured streaming is support sink to AWS Kinesis currently and how to handle if achieve quotas of kinesis?

2023-03-06 Thread Mich Talebzadeh

Spark Structured Streaming can write to anything as long as an appropriate API or JDBC connection exists. I have not tried Kinesis but have you thought about how you want to write it as a Sink? Those quota limitations, much like quotas set by the vendors (say Google on BigQuery writes etc) are

[Spark Structured Streaming] Do spark structured streaming is support sink to AWS Kinesis currently and how to handle if achieve quotas of kinesis?

2023-03-05 Thread hueiyuan su

*Component*: Spark Structured Streaming *Level*: Advanced *Scenario*: How-to *Problems Description* 1. I currently would like to use pyspark structured streaming to write data to Kinesis, but there does not seem to be a corresponding connector I can use. I would confirm

Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-05 Thread Mich Talebzadeh

OK I found a workaround. Basically each stream state is not kept and I have two streams. One is a business topic and the other one created to shut down spark structured streaming gracefully. I was interested to print the value for the most recent batch Id for the business topic called "md

Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Mich Talebzadeh

>> From sendToControl, newtopic batchId is 76 >>> From sendToSink, md, batchId is 563 >>> >>> As a matter of interest, why does a global variable not work? >>> >>> >>> >>>view my Linkedin profile >>> <https://www.linked

Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Mich Talebzadeh

>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >> any loss, damage or destruction of data or any other property which may >> arise from relying on this email's technical content is explicitly >> disclaimed. The author will in no case be liable f

Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Sean Owen

> >> It's the same batch ID already, no? >> Or why not simply put the logic of both in one function? or write one >> function that calls both? >> >> On Sat, Mar 4, 2023 at 2:07 PM Mich Talebzadeh >> wrote: >> >>> >>> This is prob

Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Mich Talebzadeh

ar 4, 2023 at 2:07 PM Mich Talebzadeh > wrote: > >> >> This is probably pretty straight forward but somehow it does not look >> that way >> >> >> >> On Spark Structured Streaming, "foreachBatch" performs custom write >> logic on each mic

Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Sean Owen

> > > > On Spark Structured Streaming, "foreachBatch" performs custom write logic > on each micro-batch through a call function. Example, > > foreachBatch(sendToSink) expects 2 parameters, first: micro-batch as > DataFrame or Dataset and second: unique id for each batch >

How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Mich Talebzadeh

This is probably pretty straight forward but somehow it does not look that way On Spark Structured Streaming, "foreachBatch" performs custom write logic on each micro-batch through a call function. Example, foreachBatch(sendToSink) expects 2 parameters, first: micro-batch as Da
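Since foreachBatch only supplies the micro-batch and its batch id, one common way to hand a callback an extra value is to bind it beforehand. The sketch below does this with functools.partial. The names `send_to_sink` and `sink_name` and the value "md" are hypothetical and chosen for illustration; this is not the workaround the poster eventually used.

```python
from functools import partial


def send_to_sink(df, batch_id, sink_name):
    """Hypothetical handler; foreachBatch supplies only df and batch_id."""
    # A real handler would write df somewhere; this one only reports what it received.
    return f"{sink_name}: batch {batch_id}"


# Bind the extra argument up front so the resulting callable takes exactly (df, batch_id).
handler = partial(send_to_sink, sink_name="md")
```

Assuming a streaming DataFrame `df`, it would be wired in roughly as `df.writeStream.foreachBatch(handler).start()`. A closure or a small class with a `__call__` method would serve the same purpose.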

Re: Graceful shutdown SPARK Structured Streaming

2023-02-20 Thread Mich Talebzadeh

ime.sleep(0.5) >> >> # Okay wait for the stop to happen >> print('Awaiting termination...') >> query.awaitTermination(wait_time) >> ``` >> >> >> I'd also be interested is there is a newer/better way to do this.. so please >>

Re: Graceful shutdown SPARK Structured Streaming

2023-02-19 Thread Bjørn Jørgensen

query.awaitTermination(wait_time) > ``` > > > I'd also be interested is there is a newer/better way to do this.. so please > cc me on updates :) > > > On Thu, May 6, 2021 at 1:08 PM Mich Talebzadeh > wrote: > >> That is a valid question and I am not awa

Re: SPIP: Shutting down spark structured streaming when the streaming process completed current process

2023-02-19 Thread Mich Talebzadeh

t due to my on-going > personal stuff. I'll adjust the JIRA first. > > Thanks, > Dongjoon. > > > On Sat, Feb 18, 2023 at 10:51 AM Mich Talebzadeh < > mich.talebza...@gmail.com> wrote: > >> https://issues.apache.org/jira/browse/SPARK-42485 >> >> >

Re: SPIP: Shutting down spark structured streaming when the streaming process completed current process

2023-02-18 Thread Holden Karau

m> wrote: > >> https://issues.apache.org/jira/browse/SPARK-42485 >> >> >> Spark Structured Streaming is a very useful tool in dealing with Event >> Driven Architecture. In an Event Driven Architecture, there is generally a >> main loop that listens for e

Re: SPIP: Shutting down spark structured streaming when the streaming process completed current process

2023-02-18 Thread Dongjoon Hyun

ongjoon. On Sat, Feb 18, 2023 at 10:51 AM Mich Talebzadeh wrote: > https://issues.apache.org/jira/browse/SPARK-42485 > > > Spark Structured Streaming is a very useful tool in dealing with Event > Driven Architecture. In an Event Driven Architecture, there is generally a > main l

SPIP: Shutting down spark structured streaming when the streaming process completed current process

2023-02-18 Thread Mich Talebzadeh

https://issues.apache.org/jira/browse/SPARK-42485 Spark Structured Streaming is a very useful tool in dealing with Event Driven Architecture. In an Event Driven Architecture, there is generally a main loop that listens for events and then triggers a call-back function when one of those events is

Re: [Spark Structured Streaming] Do spark structured streaming is support sink to AWS Kinesis currently?

2023-02-16 Thread Vikas Kumar

Doesn't directly answer your question but there are ways in scala and pyspark - See if this helps: https://repost.aws/questions/QUP_OJomilTO6oIgvK00VHEA/writing-data-to-kinesis-stream-from-py-spark On Thu, Feb 16, 2023, 8:27 PM hueiyuan su wrote: > *Component*: Spark Structured S

[Spark Structured Streaming] Do spark structured streaming is support sink to AWS Kinesis currently?

2023-02-16 Thread hueiyuan su

*Component*: Spark Structured Streaming *Level*: Advanced *Scenario*: How-to *Problems Description* I would like to use writeStream to send data to AWS Kinesis with Spark Structured Streaming, but I cannot find a related connector jar to use. I want to check whether fully

Re: [Spark Structured Streaming] Could we apply new options of readStream/writeStream without stopping spark application (zero downtime)?

2023-02-15 Thread Jack Goodson

at 5:12 PM, hueiyuan su wrote: > *Component*: Spark Structured Streaming > *Level*: Advanced > *Scenario*: How-to > > - > *Problems Description* > I would like to confirm could we directly apply new options of > readStream/writeStream without stoppi

[Spark Structured Streaming] Could we apply new options of readStream/writeStream without stopping spark application (zero downtime)?

2023-02-15 Thread hueiyuan su

*Component*: Spark Structured Streaming *Level*: Advanced *Scenario*: How-to - *Problems Description* I would like to confirm could we directly apply new options of readStream/writeStream without stopping current running spark structured streaming applications? For example

Re: Graceful shutdown SPARK Structured Streaming

2023-02-08 Thread Brian Wylie

(wait_time) ``` I'd also be interested is there is a newer/better way to do this.. so please cc me on updates :) On Thu, May 6, 2021 at 1:08 PM Mich Talebzadeh wrote: > That is a valid question and I am not aware of any new addition to Spark > Structured Streaming (SSS) in newer

Fwd: Graceful shutdown SPARK Structured Streaming

2023-02-07 Thread Mich Talebzadeh

-- Forwarded message - From: Mich Talebzadeh Date: Thu, 6 May 2021 at 20:07 Subject: Re: Graceful shutdown SPARK Structured Streaming To: ayan guha Cc: Gourav Sengupta , user @spark < user@spark.apache.org> That is a valid question and I am not aware of any new addit

Re: [SPARK STRUCTURED STREAMING] : Rocks DB uses off-heap usage

2022-11-30 Thread Adam Binford

We started hitting this as well, seeing 90+ GB resident memory on a 25 GB heap executor. After a lot of manually testing fixes, I finally figured out the root problem: https://issues.apache.org/jira/browse/SPARK-41339 Starting to work on a PR now to fix. On Mon, Sep 12, 2022 at 10:46 AM Artemis U

Spark Structured Streaming Duplicate in ForEachBatch with BatchId

2022-11-13 Thread Vedant Shirodkar

Hello Spark Team, Greetings! I am writing this mail to get suggestions on the observation below. *Use Case:* Spark Structured Streaming to extract data from Azure Event Hub, process it, and write it to Snowflake Database table using ForEachBatch with Epoch_Id/ Batch_Id passed to the foreach

Re: Spark Structured Streaming - stderr getting filled up

2022-09-19 Thread karan alang

here is the stackoverflow link https://stackoverflow.com/questions/73780259/spark-structured-streaming-stderr-getting-filled-up On Mon, Sep 19, 2022 at 4:41 PM karan alang wrote: > I've created a stackoverflow ticket for this as well > > On Mon, Sep 19, 2022 at 4:37 PM kara

Re: Spark Structured Streaming - stderr getting filled up

2022-09-19 Thread karan alang

I've created a stackoverflow ticket for this as well On Mon, Sep 19, 2022 at 4:37 PM karan alang wrote: > Hello All, > I've a Spark Structured Streaming job on GCP Dataproc - which picks up > data from Kafka, does processing and pushes data back into kafka topics. > >

Spark Structured Streaming - stderr getting filled up

2022-09-19 Thread karan alang

Hello All, I've a Spark Structured Streaming job on GCP Dataproc - which picks up data from Kafka, does processing and pushes data back into kafka topics. Couple of questions : 1. Does Spark put all the log (incl. INFO, WARN etc) into stderr ? What I notice is that stdout is empty, while al

Re: [SPARK STRUCTURED STREAMING] : Rocks DB uses off-heap usage

2022-09-12 Thread Artemis User

The off-heap memory isn't subjected to GC. So the obvious reason is that you have too many states to maintain in your streaming app, and the GC couldn't keep up, and end up with resources but to die. Are you using continuous processing or microbatch in structured streaming? You may want to lo

[SPARK STRUCTURED STREAMING] : Rocks DB uses off-heap usage

2022-09-11 Thread akshit marwah

Hi Team, We are trying to shift from HDFS State Manager to Rocks DB State Manager, but while doing POC we realised it is using much more off-heap space than expected. Because of this, the executors get killed with : *out of** physical memory exception.* Could you please help in understanding, wh

Spark Structured Streaming - unable to change max.poll.records (showing as 1)

2022-09-06 Thread karan alang

Hello All, i've a Spark structured streaming job which reads from Kafka, does processing and puts data into Mongo/Kafka/GCP Buckets (i.e. it is processing heavy) I'm consistently seeing the following warnings: ``` 22/09/06 16:

Re: Spark Structured Streaming -- Cannot consume next messages

2022-07-21 Thread KhajaAsmath Mohammed

> >> I am seeing weird behavior in our spark structured streaming application >> where the offsets are not getting picked up by the streaming job. >> >> If I delete the checkpoint directory and run the job again, I can see the >> data for the first batch but it is n

Re: Spark Structured Streaming -- Cannot consume next messages

2022-07-21 Thread Artemis User

, I am seeing weird behavior in our spark structured streaming application where the offsets are not getting picked up by the streaming job. If I delete the checkpoint directory and run the job again, I can see the data for the first batch but it is not picking up new offsets again from the

Spark Structured Streaming -- Cannot consume next messages

2022-07-21 Thread KhajaAsmath Mohammed

Hi, I am seeing weird behavior in our spark structured streaming application where the offsets are not getting picked up by the streaming job. If I delete the checkpoint directory and run the job again, I can see the data for the first batch but it is not picking up new offsets again from the next

Spark Structured streaming(batch mode) - running dependent jobs concurrently

2022-06-15 Thread karan alang

Hello All, I've a Structured Streaming program running on GCP dataproc which reads data from Kafka every 10 mins, and then does processing. This is a multi-tenant system i.e. the program will read data from multiple customers. In my current code, i'm looping over the customers passing it to the 3

Re: How to gracefully shutdown Spark Structured Streaming

2022-02-26 Thread Gourav Sengupta

Dear Mich, a super duper note of thanks, I had to spend around two weeks to figure this out :) Regards, Gourav Sengupta On Sat, Feb 26, 2022 at 10:43 AM Mich Talebzadeh wrote: > > > On Mon, 26 Apr 2021 at 10:21, Mich Talebzadeh > wrote: > >> >> Spark Structured

Re: How to gracefully shutdown Spark Structured Streaming

2022-02-26 Thread Mich Talebzadeh

On Mon, 26 Apr 2021 at 10:21, Mich Talebzadeh wrote: > > Spark Structured Streaming AKA SSS is a very useful tool in dealing with > Event Driven Architecture. In an Event Driven Architecture, there is > generally a main loop that listens for events and then triggers a call-back >

Re: Spark Structured Streaming org.apache.spark.sql.functions.input_file_name Intermittently Missing FileName

2021-10-12 Thread Alchemist

Looks like somehow related to API unable to send data from executor to driver If I set spark master to local I get these 6 files When spark.master is local& InputReportAndFileName fileName file:///Users/abc/Desktop/test/Streaming/d& InputReportAndFileName fileName file:

Re: Spark Structured Streaming org.apache.spark.sql.functions.input_file_name Intermittently Missing FileName

2021-10-12 Thread Alchemist

Here is Spark's API definition, unable to understand what does it mean to have "unknown" file. We are processing file we will have fileName I have 7 files it can print 3 and miss other 4 /** * Returns the holding file name or empty string if it is unknown. */ def getInp

Spark Structured Streaming org.apache.spark.sql.functions.input_file_name Intermittently Missing FileName

2021-10-11 Thread Alchemist

Hello all, I am trying to extract the file name as follows, but intermittently we are getting an empty file name. Step 1: Get Schema StructType jsonSchema = sparkSession.read() .option("multiLine", true) .json("src/main/resources/sample.json") .schema(); Step 2: Get Input DataSet Dataset inputDS = sparkS

Re: Spark Structured Streaming Continuous Trigger on multiple sinks

2021-09-12 Thread Alex Ott

Just don't call .awaitTermination() because it blocks execution of the next line of code. You can assign the result of .start() to a specific variable, or put them into a list/array. And to wait until one of the streams finishes, use spark.streams.awaitAnyTermination() or something like this (https://s
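The advice above can be sketched as follows: start every writer first, keep the returned query handles, then block once for all of them. The helper name `start_all_then_wait` is made up for illustration. It is written against duck-typed objects, so it assumes only that each writer has `.start()` and that the session exposes `spark.streams.awaitAnyTermination()`, as PySpark does.

```python
def start_all_then_wait(writers, spark):
    """Start every stream before blocking, so no sink waits on another.

    writers: objects with a non-blocking start(), e.g. DataStreamWriter.
    spark:   a session exposing spark.streams.awaitAnyTermination().
    """
    # start() returns a query handle immediately; collect them all.
    queries = [writer.start() for writer in writers]
    # Block here once; this returns when any one of the queries terminates.
    spark.streams.awaitAnyTermination()
    return queries
```

Calling `.start().awaitTermination()` on the first writer instead would block before the second writer is ever started, which is the problem described in the question.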

Spark Structured Streaming Continuous Trigger on multiple sinks

2021-08-25 Thread S

Hello, I have a structured streaming job that needs to be able to write to multiple sinks. We are using *Continuous* Trigger *and not* *Microbatch* Trigger. 1. When we use the foreach method using: *dataset1.writeStream.foreach(kafka ForEachWriter logic).trigger(ContinuousMode).start().awaitTermi

Spark Structured Streaming Dyanamic Allocation

2021-08-11 Thread Zhenyu Hu

Hey folks: does Spark Structured Streaming have any plans for dynamic scaling? Currently Spark only has a dynamic scaling mechanism for batch jobs

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread ayan guha

gt;>> from the customer balance, >>>> - and the "refunded" means I must give the transaction amount back to >>>> the customer balance >>>> >>>> So, technically, we cannot process something that is not "AUTHORIZED" >&

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Mich Talebzadeh

cess a refund for a transaction that has NOT >>> been PROCESSED yet. >>> >>> >>> *You have an authorisation, then the actual transaction and maybe a >>>> refund some time in the future. You want to proceed with a transaction only >>>>

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Bruno Oliveira

t has NOT >> been PROCESSED yet. >> >> >> *You have an authorisation, then the actual transaction and maybe a >>> refund some time in the future. You want to proceed with a transaction only >>> if you've seen the auth but in an eventually consiste

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Mich Talebzadeh

t >> always happen.* > > > That's absolutely the case! So, yes, That's correct. > > *You are asking in the case of receiving the transaction before the auth >> how to retry later? * > > > Yeah! I'm struggling for days on how to solve with Spark St

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Bruno Oliveira

ase of receiving the transaction before the auth > how to retry later? * Yeah! I'm struggling for days on how to solve with Spark Structured Streaming... *Right now you are discarding those transactions that didn't match so you > instead would need to persist them somewhere and e

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Sebastian Piu

ic_value.status").alias("status")). \ >>>> writeStream. \ >>>> outputMode('append'). \ >>>> option("truncate", "false"

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Bruno Oliveira

queryName(config['MDVariables']['newtopic']). \ >>> start() >>> >>> result = streamingDataFrame.select( \ >>> col("parsed_value.rowkey").alias("rowkey"

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Mich Talebzadeh

outputMode('append'). \ >> option("truncate", "false"). \ >> *foreachBatch(sendToSink). \* >> trigger(processingTime='30 seconds'). \ >> option('

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Bruno Oliveira

cessingTime='30 seconds'). \ > option('checkpointLocation', checkpoint_path). \ > queryName(config['MDVariables']['topic']). \ > start() > print(result) > > ex

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Mich Talebzadeh

igQuery batch table s.writeTableToBQ(df, "append", config['MDVariables']['targetDataset'],config['MDVariables']['targetTable']) df.unpersist() print(f"""wrote to DB"

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Bruno Oliveira

Hello! Sure thing! I'm reading them *separately*; both are apps written with Scala + Spark Structured Streaming. I feel like I missed some details in my original thread (sorry, it was past 4 AM) and it was getting frustrating. Please let me try to clarify some points: *Transactions Cr

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Mich Talebzadeh

Can you please clarify if you are reading these two topics separately or within the same scala or python script in Spark Structured Streaming? HTH

[Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Bruno Oliveira

transactions-created, - transaction-processed - Even though the schema is not exactly the same, they all share a correlation_id, which is their "transaction_id" So, long story short, I've got 2 consumers, one for each topic, and all I wanna do is sink them in a chain order. I

Re: Does Rollups work with spark structured streaming with state.

2021-06-17 Thread Mich Talebzadeh

>> | foo|null|2| Count when word is foo >> | foo| 1|1| When word is foo and num is 1 >> | foo| 2|1| When word is foo and num is 2 >> +----+----+-----+ >> >> >> So rollup() returns a subset of the rows returned by cube(). From the >>

Re: Does Rollups work with spark structured streaming with state.

2021-06-17 Thread Amit Joshi

+ > > > So rollup() returns a subset of the rows returned by cube(). From the > above, rollup returns 6 rows whereas cube returns 8 rows. Here are the > missing rows. > > +----+----+-----+ > |word| num|count| > +----+----+-----+ > |null| 1|1| Word is null an

Re: Does Rollups work with spark structured streaming with state.

2021-06-17 Thread Mich Talebzadeh

re are the missing rows. +----+----+-----+ |word| num|count| +----+----+-----+ |null| 1|1| Word is null and num is 1 |null| 2|3| Word is null and num is 2 +----+----+-----+ Now back to Spark Structured Streaming (SSS), we have basic aggregations """
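
The rollup/cube comparison quoted in this thread can be reproduced without Spark. The following is a minimal plain-Python sketch, not Spark's implementation: the input rows are an assumption reconstructed from the counts visible in the snippets, and the helper names (`aggregate`, `rollup_sets`, `cube_sets`) are hypothetical. It treats rollup as counting over the prefixes of the column list and cube as counting over every subset of the columns.

```python
from collections import Counter
from itertools import combinations

# Assumed input (word, num), chosen so the counts match the rows quoted in the thread.
rows = [("foo", 1), ("foo", 2), ("bar", 2), ("bar", 2)]


def aggregate(rows, grouping_sets):
    """Count rows per grouping set; columns left out of a set are shown as None."""
    counts = Counter()
    for keep in grouping_sets:
        for row in rows:
            key = tuple(v if i in keep else None for i, v in enumerate(row))
            counts[key] += 1
    return dict(counts)


def rollup_sets(n):
    # Prefixes of the column list: (), (0,), (0, 1), ...
    return [tuple(range(k)) for k in range(n + 1)]


def cube_sets(n):
    # Every subset of the column indexes.
    return [c for k in range(n + 1) for c in combinations(range(n), k)]


rollup = aggregate(rows, rollup_sets(2))
cube = aggregate(rows, cube_sets(2))

# Rows that cube produces but rollup does not.
missing = {key: count for key, count in cube.items() if key not in rollup}

if __name__ == "__main__":
    print(len(rollup), len(cube))
    print(missing)
```

With this assumed input, `missing` holds only the groupings where word is null and num is set, which are the two rows the snippet lists as absent from rollup. The sketch uses None both for "column not grouped" and would confuse it with a genuine null value in the data, so it only holds for inputs without nulls.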

Re: Does Rollups work with spark structured streaming with state.

2021-06-16 Thread Amit Joshi

y need to store the state to update the count. So spark structured streaming states will come into picture. As now with batch programming, we can do it with > df.rollup(col1,col2).count But if I try to use it with spark structured streaming state, will it store the state of all the group

Re: Does Rollups work with spark structured streaming with state.

2021-06-16 Thread Mich Talebzadeh

16:37, Amit Joshi wrote: > Appreciate if someone could give some pointers in the question below. > > -- Forwarded message - > From: Amit Joshi > Date: Tue, Jun 15, 2021 at 12:19 PM > Subject: [Spark]Does Rollups work with spark structured streaming with > state.

Fwd: Does Rollups work with spark structured streaming with state.

2021-06-16 Thread Amit Joshi

Appreciate if someone could give some pointers in the question below. -- Forwarded message - From: Amit Joshi Date: Tue, Jun 15, 2021 at 12:19 PM Subject: [Spark]Does Rollups work with spark structured streaming with state. To: spark-user Hi Spark-Users, Hope you are all

Does Rollups work with spark structured streaming with state.

2021-06-14 Thread Amit Joshi

Hi Spark-Users, Hope you are all doing well. Recently I was looking into rollup operations in spark. As we know state based aggregation is supported in spark structured streaming. I was wondering if rollup operations are also supported? Like the state of previous aggregation on the rollups are
