Ashish,

In the current release, HDFS writes are asynchronous, so I'm wondering if you
could elaborate on how much latency you are observing both with and without
checkpointing (i.e. after your changes to make operators stateless).
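
For comparison, the checkpoint agent can be swapped at the DAG level, so you
could measure both variants side by side. A rough sketch of what I mean (the
checkpoint path, class name, and constructor arguments below are assumptions
and may differ slightly by release):

import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.common.util.AsyncFSStorageAgent;
import com.datatorrent.common.util.FSStorageAgent;

public class CheckpointLatencyApp implements StreamingApplication {
  @Override
  public void populateDAG(DAG dag, Configuration conf) {
    String checkpointPath = "/user/apex/checkpoints"; // placeholder path

    // Asynchronous agent: the HDFS write happens off the operator's
    // processing thread, so it should add less latency per checkpoint.
    dag.setAttribute(OperatorContext.STORAGE_AGENT,
        new AsyncFSStorageAgent(checkpointPath, conf));

    // For a synchronous baseline, swap in FSStorageAgent instead:
    // dag.setAttribute(OperatorContext.STORAGE_AGENT,
    //     new FSStorageAgent(checkpointPath, conf));

    // ... operators and streams go here ...
  }
}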

Any information on how much non-transient data is being checkpointed in each
operator would also be useful. There is an effort under way to implement
incremental checkpointing, which should improve things when there is a lot of
state but very little of it changes from window to window.
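
In the meantime, checkpoint size can usually be trimmed by marking rebuildable
fields transient, since only non-transient fields are serialized at checkpoint
time. A minimal sketch (the operator and field names are made up for
illustration; adjust imports to your Apex version):

import java.util.HashMap;
import java.util.Map;

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.common.util.BaseOperator;

public class LookupOperator extends BaseOperator {
  // Checkpointed: the small piece of state that must survive a failure.
  private long lastProcessedId;

  // Not checkpointed: a large cache that can be rebuilt after recovery.
  private transient Map<String, Long> cache;

  @Override
  public void setup(OperatorContext context) {
    // Transient fields are null after recovery and must be re-initialized here.
    cache = new HashMap<String, Long>();
  }
}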

Ram


On Wed, Dec 2, 2015 at 8:51 AM, Ashish Tadose <[email protected]>
wrote:

> Hi All,
>
> Currently the Apex engine provides operator checkpointing in HDFS (with
> HDFS-backed StorageAgents, i.e. FSStorageAgent & AsyncFSStorageAgent).
>
> We have observed that for applications with a large number of operator
> instances, HDFS checkpointing introduces latency in the DAG, which degrades
> overall application performance.
> To resolve this we had to review all operators in the DAG and make a few of
> them stateless.
>
> As operator checkpointing is critical functionality of the Apex streaming
> platform for ensuring fault-tolerant behavior, the platform should also
> provide alternate StorageAgents that work seamlessly with large applications
> that require exactly-once semantics.
>
> HDFS read/write latency doesn't improve beyond a certain point because of
> disk I/O & staging writes. An alternate strategy that checkpoints to a
> fault-tolerant distributed in-memory grid would ensure application stability
> and performance are not impacted.
>
> I have developed an in-memory storage agent which I would like to contribute
> as an alternate StorageAgent for checkpointing (see the interface sketch
> below this message).
>
> Thanks,
> Ashish
>
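
For reference, the contract such an agent implements is
com.datatorrent.api.StorageAgent, which maps (operatorId, windowId) pairs to
saved operator state. Below is a minimal, purely illustrative in-memory
sketch, not the contributed code; a production agent would back the map with
a replicated, fault-tolerant in-memory grid rather than a single
process-local map.

import java.io.IOException;
import java.io.Serializable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import com.datatorrent.api.StorageAgent;

public class InMemoryStorageAgent implements StorageAgent, Serializable {
  // (operatorId, windowId) -> checkpointed operator state
  private final ConcurrentMap<Integer, ConcurrentMap<Long, Object>> store =
      new ConcurrentHashMap<>();

  @Override
  public void save(Object object, int operatorId, long windowId) throws IOException {
    store.computeIfAbsent(operatorId, k -> new ConcurrentHashMap<>())
         .put(windowId, object);
  }

  @Override
  public Object load(int operatorId, long windowId) throws IOException {
    Map<Long, Object> windows = store.get(operatorId);
    if (windows == null || !windows.containsKey(windowId)) {
      throw new IOException("No checkpoint for operator " + operatorId
          + " at window " + windowId);
    }
    return windows.get(windowId);
  }

  @Override
  public void delete(int operatorId, long windowId) throws IOException {
    Map<Long, Object> windows = store.get(operatorId);
    if (windows != null) {
      windows.remove(windowId);
    }
  }

  @Override
  public long[] getWindowIds(int operatorId) throws IOException {
    Map<Long, Object> windows = store.get(operatorId);
    if (windows == null) {
      return new long[0];
    }
    return windows.keySet().stream().mapToLong(Long::longValue).toArray();
  }
}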
