Re: Query around Spark Checkpoints

2020-09-29 Thread Jungtaek Lim
Sorry, I'm not familiar with Delta Lake. You may get a better answer from
the Delta Lake mailing list.

One thing is clear: stateful processing is an essential feature of almost
every streaming framework. If you're struggling with the state feature and
looking for a workaround, then something is probably going wrong. Please
feel free to share the details.

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: Query around Spark Checkpoints

2020-09-29 Thread Bryan Jeffrey
Jungtaek,

How would you contrast stateful streaming with checkpoint vs. the idea of
writing updates to a Delta Lake table, and then using the Delta Lake table
as a streaming source for our state stream?

Thank you,

Bryan


Re: Query around Spark Checkpoints

2020-09-28 Thread Debabrata Ghosh
Thank you, Jungtaek and Amit! This is very helpful indeed!

Cheers,

Debu


Re: Query around Spark Checkpoints

2020-09-27 Thread Jungtaek Lim
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CheckpointFileManager.scala

You would need to implement CheckpointFileManager yourself, and it is
tightly integrated with the HDFS API (the parameter and return types of
its methods mostly come from HDFS). That doesn't mean it's impossible to
implement CheckpointFileManager against something that is not a
filesystem, but it would be non-trivial to implement all of its
functionality and make it work seamlessly.

The required consistency guarantees are documented in the Javadoc of
CheckpointFileManager. Please read through it and evaluate whether your
target storage can fulfill those requirements.
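
To give a rough feel for the kind of guarantee involved: as far as I
understand the Javadoc, the key requirement is that a checkpoint file
becomes visible atomically, so readers never see a partially written file.
The sketch below is not Spark code. It is a standalone Python illustration
of the common write-to-temp-then-rename approach on a local filesystem,
and the function name create_atomic is hypothetical.

```python
import os
import tempfile


def create_atomic(path: str, data: bytes, overwrite: bool = False) -> None:
    """Write `data` to `path` so that no partially written file is visible.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    directory = os.path.dirname(os.path.abspath(path))

    # NOTE: this existence check is not race-free. A real implementation
    # needs the storage itself to offer an atomic create-if-absent.
    if not overwrite and os.path.exists(path):
        raise FileExistsError(path)

    # Stage the content in the same directory, so the final rename stays
    # on one filesystem and can be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before publishing
        # os.replace swaps the file in as a single step, replacing any
        # existing file at `path`.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the staged file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A store that cannot offer an equivalent "all or nothing" publish step (for
example, one where a reader may observe half of a write) would need extra
work before it could back a checkpoint.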

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: Query around Spark Checkpoints

2020-09-27 Thread Amit Joshi
Hi,

As far as I know, it depends on whether you are using Spark Streaming or
Structured Streaming.
In Spark Streaming you can write your own code to checkpoint.
But in the case of Structured Streaming, it should be a file location.
But the main question is why you want to checkpoint in NoSQL, as it is
eventually consistent.


Regards
Amit

On Sunday, September 27, 2020, Debabrata Ghosh 
wrote:

> Hi,
> I had a query around Spark checkpoints: can I store the checkpoints
> in NoSQL or Kafka instead of a filesystem?
>
> Regards,
>
> Debu
>