Re: Query around Spark Checkpoints
Sorry, I have no idea about Delta Lake. You may get a better answer from the Delta Lake mailing list.

One thing is clear: stateful processing is an essential feature of almost every streaming framework. If you're struggling with something around the state feature and trying to find a workaround, then something is probably going wrong. Please feel free to share it.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Wed, Sep 30, 2020 at 1:14 AM, Bryan Jeffrey wrote:

> Jungtaek,
>
> How would you contrast stateful streaming with checkpoint vs. the idea of
> writing updates to a Delta Lake table, and then using the Delta Lake table
> as a streaming source for our state stream?
>
> Thank you,
>
> Bryan
Re: Query around Spark Checkpoints
Jungtaek,

How would you contrast stateful streaming with checkpoint vs. the idea of writing updates to a Delta Lake table, and then using the Delta Lake table as a streaming source for our state stream?

Thank you,

Bryan
Re: Query around Spark Checkpoints
Thank you, Jungtaek and Amit! This is very helpful indeed!

Cheers,

Debu
Re: Query around Spark Checkpoints
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CheckpointFileManager.scala

You would need to implement CheckpointFileManager yourself. It is tightly integrated with HDFS (the parameters and return types of its methods mostly come from HDFS). That doesn't mean it's impossible to implement CheckpointFileManager against a non-filesystem store, but it would be non-trivial to override all of its functionality and make it work seamlessly.

The required consistency is documented in the javadoc of CheckpointFileManager. Please read through it and evaluate whether your target storage can fulfill the requirements.

Thanks,
Jungtaek Lim (HeartSaVioR)
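To make the consistency point more concrete: as far as I understand the javadoc, the key requirement is that a checkpoint file becomes visible only once it is complete, so a reader never sees a partial file. Below is a minimal Python sketch of that idea on a local filesystem, using a temp file plus rename. It is not Spark's implementation; the function name `write_atomically` and the `.tmp-` prefix are made up for illustration, and a target store that cannot offer an equivalent "all or nothing" write would be hard to fit behind CheckpointFileManager.

```python
import os
import tempfile


def write_atomically(path: str, data: bytes, overwrite: bool = True) -> None:
    """Write data so that readers see either the complete file or no file.

    Hypothetical helper for illustration only; not part of Spark.
    """
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)

    # Note: this existence check is not race-free. A real implementation
    # would need the store itself to reject the second writer.
    if not overwrite and os.path.exists(path):
        raise FileExistsError(path)

    # Stage the bytes in a temp file in the same directory, so that the
    # final rename does not cross filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace swaps the file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, remove the staged file so no partial data remains.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

An eventually consistent store has no direct counterpart to the final rename step, which is the part to check against the javadoc.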
Re: Query around Spark Checkpoints
Hi,

As far as I know, it depends on whether you are using Spark Streaming or Structured Streaming. In Spark Streaming you can write your own code to checkpoint, but in Structured Streaming the checkpoint should be a file location.

But the main question is why you want to checkpoint in NoSQL, given that it is eventually consistent.

Regards,
Amit

On Sunday, September 27, 2020, Debabrata Ghosh wrote:

> Hi,
> I had a query around Spark checkpoints - can I store the checkpoints
> in NoSQL or Kafka instead of a filesystem?
>
> Regards,
>
> Debu
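For the Structured Streaming case, a rough PySpark sketch of "the checkpoint should be a file location" might look like the following. The function name `start_rate_query` and the example path are assumptions for illustration; the `rate` source, `console` sink, and `checkpointLocation` option are standard Structured Streaming names, and `spark` is expected to be an existing SparkSession.

```python
def start_rate_query(spark, checkpoint_dir):
    """Start a toy streaming query that checkpoints to checkpoint_dir.

    spark: an existing pyspark.sql.SparkSession (created by the caller).
    checkpoint_dir: a path on an HDFS-compatible filesystem,
        for example "hdfs:///tmp/ckpt" (a hypothetical location).
    """
    # The built-in "rate" source generates rows continuously for testing.
    events = spark.readStream.format("rate").load()

    # The checkpoint is configured as a path option on the writer; there
    # is no option here to point it at a NoSQL store or a Kafka topic.
    return (
        events.writeStream
        .format("console")
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

Called as `start_rate_query(spark, "hdfs:///tmp/ckpt")`, this returns the running query handle; restarting with the same directory is what lets the query resume from its checkpoint.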