Re: Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread David Morin
Thanks, Jungtaek.
OK, I got it. I'll test it and check whether the loss of efficiency is acceptable.


Re: Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread Jungtaek Lim
Please refer to my previous answer:
https://lists.apache.org/thread.html/r7dfc9e47cd9651fb974f97dde756013fd0b90e49d4f6382d7a3d68f7%40%3Cuser.spark.apache.org%3E
We should probably add this to the Structured Streaming guide doc. It wasn't needed
before, because checkpointing to S3 simply didn't work with the eventually consistent
model; now it works, but it is very inefficient.
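
For illustration, a minimal PySpark sketch of checkpointing directly to S3 over
s3a:// (bucket and paths are placeholders; it assumes the matching hadoop-aws and
AWS SDK jars are on the classpath and credentials are available to the cluster):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-checkpoint-demo").getOrCreate()

# Toy source; the relevant part is the checkpoint URI.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (stream.writeStream
         .format("parquet")
         .option("path", "s3a://my-bucket/output/query6")
         # Checkpointing straight to S3 needs consistent listings; it avoids
         # running HDFS, but the state store and commit logs write (and
         # rename) many small files per micro-batch, hence the inefficiency.
         .option("checkpointLocation", "s3a://my-bucket/checkpoints/query6")
         .start())

query.awaitTermination()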


Re: Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread David Morin
Does it work with the standard AWS S3 solution and its new consistency model?


Re: Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread David Morin
Thanks.
My Spark applications run on nodes based on Docker images, but in standalone mode
(1 driver, n workers).
Can we use S3 directly with a consistency add-on like S3Guard (s3a) or AWS
Consistent View?
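
In case it helps frame the question, this is roughly what I imagine the S3Guard
setup would look like from PySpark (table, region and bucket names are made up,
and I assume the matching hadoop-aws / AWS SDK jars and credentials would be
needed):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("query6")
         # S3Guard keeps S3 listing metadata in DynamoDB so that listings
         # are consistent (only relevant on Hadoop/S3A builds that ship it).
         .config("spark.hadoop.fs.s3a.metadatastore.impl",
                 "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
         .config("spark.hadoop.fs.s3a.s3guard.ddb.table", "my-s3guard-table")
         .config("spark.hadoop.fs.s3a.s3guard.ddb.region", "eu-west-1")
         .getOrCreate())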


Re: Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread Lalwani, Jayesh
Yes. It is necessary to have a distributed file system, because all the workers
need to read and write the checkpoint. The distributed file system has to be
immediately consistent: when one node writes to it, the other nodes should be able
to read it immediately.

The solutions/workarounds depend on where you are hosting your Spark application.
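
As a minimal illustration (the paths below are placeholders, not a recommendation
for any particular storage):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query6").getOrCreate()

# Toy source; what matters is where the checkpoint lives.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

query = (stream.writeStream
         .format("console")
         # The checkpoint must be on storage that every driver and executor
         # container can read and write with immediate consistency, e.g. an
         # HDFS path or a volume mounted at the same path on every node,
         # rather than a container-local file:/ directory.
         .option("checkpointLocation",
                 "hdfs://namenode:8020/checkpoints/query6")
         .start())

query.awaitTermination()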

From: David Morin 
Date: Wednesday, December 23, 2020 at 11:08 AM
To: "user@spark.apache.org" 
Subject: [EXTERNAL] Spark 3.0.1 Structured streaming - checkpoints fail


Hello,

I have an issue with my PySpark job related to checkpointing.

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 3 in stage 16997.0 failed 4 times, most recent failure: Lost task 3.3 in 
stage 16997.0 (TID 206609, 10.XXX, executor 4): 
java.lang.IllegalStateException: Error reading delta file 
file:/opt/spark/workdir/query6/checkpointlocation/state/0/3/1.delta of 
HDFSStateStoreProvider[id = (op=0,part=3),dir = 
file:/opt/spark/workdir/query6/checkpointlocation/state/0/3]: 
file:/opt/spark/workdir/query6/checkpointlocation/state/0/3/1.delta does not 
exist

This job is based on Spark 3.0.1 and Structured Streaming.
This Spark cluster (1 driver and 6 executors) runs without HDFS, and we would
prefer not to manage an HDFS cluster if possible.
Is it necessary to have a distributed filesystem? What are the different
solutions/workarounds?

Thanks in advance
David
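
For reference, a stripped-down sketch of the kind of stateful query that produces
the state layout in the error above (the source and options here are guesses, not
the actual job). Any stateful operator persists per-partition state under
<checkpointLocation>/state/<operatorId>/<partitionId>/<version>.delta, which
matches the state/0/3/1.delta path in the stack trace; with a container-local
file:/ checkpoint, a task retried on a different executor cannot read that file
back.

from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("query6").getOrCreate()

# A streaming aggregation forces Spark to keep per-partition state files under
# <checkpointLocation>/state/<operatorId>/<partitionId>/<version>.delta
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
counts = stream.groupBy(window("timestamp", "1 minute")).count()

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         # A file:/ path is local to whichever container wrote it, so a task
         # rescheduled on another executor cannot find the delta file.
         .option("checkpointLocation",
                 "file:/opt/spark/workdir/query6/checkpointlocation")
         .start())

query.awaitTermination()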