Re: Feature to restart Spark job from previous failure point

2023-09-05 Thread Mich Talebzadeh
Hi Dipayan,

You ought to maintain data source consistency and minimise changes upstream.
Spark is not a Swiss Army knife :)

Anyhow, we already do this in Spark Structured Streaming with the concept
of checkpointing. You can do so by implementing:


   - Checkpointing
   - Stateful processing in Spark
   - A retry mechanism


In PySpark you can use

spark.sparkContext.setCheckpointDir("hdfs://...")  # set the checkpoint directory first
rdd.checkpoint()  # checkpointing an RDD
or
# "checkpointing" a DataFrame by persisting it to durable storage
dataframe.write.option("path", "hdfs://...").mode("overwrite").saveAsTable("checkpointed_table")
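
For the Structured Streaming case mentioned above, a minimal sketch would be
something like the below (the paths are hypothetical and the rate source is
used purely for illustration; swap in your own source and sink):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint_demo").getOrCreate()

# Hypothetical streaming source; replace with Kafka, files, etc.
events = spark.readStream.format("rate").load()

# checkpointLocation is what lets a restarted query resume where it left off
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs://.../output")              # hypothetical output path
         .option("checkpointLocation", "hdfs://.../chk")   # hypothetical checkpoint dir
         .start())

query.awaitTermination()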

Retry mechanism

Something like the sketch below (the body of the try block is left as a
placeholder):

def myfunction(input_file_path, checkpoint_directory, max_retries):
    retries = 0
    while retries < max_retries:
        try:
            # ... do the actual work here (read input, process, write/checkpoint output) ...
            break  # success, stop retrying
        except Exception as e:
            print(f"Error: {str(e)}")
            retries += 1
            if retries < max_retries:
                print(f"Retrying... (Retry {retries}/{max_retries})")
            else:
                print("Max retries reached. Exiting.")
                break
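
A hypothetical invocation (the paths are made up for illustration) could then
be:

myfunction("hdfs://.../input", "hdfs://.../checkpoints", max_retries=3)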

Remember that checkpointing incurs I/O and is expensive! You can use cloud
buckets for checkpointing as well.
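
For example (the bucket name is hypothetical), something like:

spark.sparkContext.setCheckpointDir("gs://my-bucket/spark-checkpoints")  # or s3a://..., abfss://...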

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 5 Sept 2023 at 10:12, Dipayan Dev  wrote:

> Hi Team,
>
> One of the biggest pain points we're facing is when Spark reads upstream
> partition data and, during an action, the upstream also gets refreshed and
> the application fails with a 'File not exists' error. It could happen that
> the job has already spent a reasonable amount of time, and re-running the
> entire application is unwanted.
>
> I know the general solution to this is to handle how the upstream is
> managing the data, but is there a way to tackle this problem from the Spark
> application side? One approach I was thinking of is to at least save some
> state of the operations done by the Spark job till that point, and on a
> retry, resume the operation from that point?
>
>
>
> With Best Regards,
>
> Dipayan Dev
>


Feature to restart Spark job from previous failure point

2023-09-04 Thread Dipayan Dev
Hi Team,

One of the biggest pain points we're facing is when Spark reads upstream
partition data and, during an action, the upstream also gets refreshed and
the application fails with a 'File not exists' error. It could happen that
the job has already spent a reasonable amount of time, and re-running the
entire application is unwanted.

I know the general solution to this is to handle how the upstream is
managing the data, but is there a way to tackle this problem from the Spark
application side? One approach I was thinking of is to at least save some
state of the operations done by the Spark job till that point, and on a
retry, resume the operation from that point?



With Best Regards,

Dipayan Dev