Re: Moving to Spark 3x from Spark2

2022-09-01 Thread Martin Andersson
You should check the release notes and upgrade instructions.

From: rajat kumar 
Sent: Thursday, September 1, 2022 12:44
To: user @spark 
Subject: Moving to Spark 3x from Spark2


EXTERNAL SENDER. Do not click links or open attachments unless you recognize 
the sender and know the content is safe. DO NOT provide your username or 
password.


Hello Members,

We want to move to Spark 3 from Spark2.4 .

Are there any changes we need to do at code level which can break the existing 
code?

Will it work by simply changing the version of spark & scala ?

Regards
Rajat


Question regarding checkpointing with kafka structured streaming

2022-08-22 Thread Martin Andersson
I was looking around for some documentation regarding how checkpointing (or 
rather, delivery semantics) is done when consuming from kafka with structured 
streaming and I stumbled across this old documentation (that still somehow 
exists in latest versions) at 
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#checkpoints.

This page (which I assume is from around the time of Spark 2.4?) describes that 
storing offsets using checkpoiting is the least reliable method and goes 
further into how to use kafka or an external storage to commit offsets.

It also says
If you enable Spark checkpointing, offsets will be stored in the checkpoint. 
(...) Furthermore, you cannot recover from a checkpoint if your application 
code has changed.

This all leaves me with several questions:

  1.  Is the above quote still true for Spark 3, that the checkpoint will break 
if you change the code? (I know it does if you change the subscribe pattern)

  2.  Why was the option to manually commit offsets asynchronous to kafka 
removed when it was deemed more reliable than checkpointing? Not to mention 
that storing offsets in kafka allows you to use all the tools offered with 
kafka to easily reset/rewind offsets on specific topics, which doesn't seem to 
be possible when using checkpoints.

  3.  From a user perspective, storing offsets in kafka offers more features. 
From a developer perspective, having to re-implement offset storage across 
several output systems (such as HDFS, AWS S3 and other object storages) seems 
like a lot of unnecessary work and re-inventing the wheel.
Is the discussion leading up to the decision to only support storing offsets 
with checkpointing documented anywhere, perhaps in a jira?

Thanks for your time