Hi Andras,

A general suggestion is to use Structured Streaming instead of DStreams,
because it provides several things out of the box (stateful streaming,
etc.).
Kafka 0.8 is super old and deprecated (no security, for example). Do you
have a specific reason to use it?
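As a rough illustration (not a drop-in replacement), the same pipeline could look something like this in Structured Streaming. This is a sketch only: it assumes Spark 2.4+ with the spark-sql-kafka-0-10 connector on the classpath and a Kafka 0.10+ broker, and `kafka_hosts`, `topic_listen`, and the checkpoint path are placeholders borrowed from your snippet:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Placeholder values -- substitute your own
kafka_hosts = "host1:9092,host2:9092"
topic_listen = "my_topic"

spark = SparkSession.builder.appName("streaming_app").getOrCreate()

# Read from Kafka as a streaming DataFrame
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kafka_hosts)
      .option("subscribe", topic_listen)
      .option("startingOffsets", "latest")   # roughly analogous to "largest"
      .load())

# Kafka values arrive as bytes; cast to string, like your row[1] map
lines = df.select(col("value").cast("string"))

# Streaming aggregations are kept as managed state by the engine, so there
# is no need to union each batch with the previous result or to checkpoint
# the lineage by hand; checkpointing is configured once per query.
counts = lines.groupBy("value").count()

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/path/to/checkpoint")  # placeholder
         .start())
query.awaitTermination()
```

The key difference from the DStream version is that the running aggregate is maintained for you in the state store, with the checkpoint location handling both offsets and state.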

BR,
G


On Thu, Sep 3, 2020 at 11:41 AM András Kolbert <kolbertand...@gmail.com>
wrote:

> Hi All,
>
> I have a Spark streaming application (2.4.4, Kafka 0.8 >> so Spark Direct
> Streaming) running just fine.
>
> I create a context in the following way:
>
> ssc = StreamingContext(sc, 60)
> opts = {"metadata.broker.list": kafka_hosts,
>         "auto.offset.reset": "largest",
>         "group.id": run_type}
> kvs = KafkaUtils.createDirectStream(ssc, [topic_listen], opts)
> kvs.checkpoint(120)
>
> lines = kvs.map(lambda row: row[1])
> lines.foreachRDD(streaming_app)
> ssc.checkpoint(checkpoint)
>
> The streaming app at a high level does this:
>
>    - processes incoming batch
>    - unions to the dataframe from the previous batch and aggregates them
>
> Currently, I use checkpointing explicitly (df = df.checkpoint()) to
> truncate the lineage. However, this is quite an expensive operation, and I
> was wondering if there is a better way to do this.
>
> I tried disabling this explicit checkpointing, since I already have
> periodic checkpointing (kvs.checkpoint(120)), so I thought the lineage
> would be truncated at that checkpointed RDD. In reality that is not the
> case, and processing time keeps increasing over time.
>
> Am I doing something inherently wrong? Is there a better way of doing this?
>
> Thanks
> Andras
>
