Re: Operator checkpointing in distributed in-memory store

Ashish Tadose Wed, 02 Dec 2015 09:34:29 -0800

The application uses a large number of in-memory dimension store partitions
to hold high-cardinality aggregated data, and many intermediate operators
also keep non-transient cached data for reference lookups.


The application had more than 1000 partitions in total, so a lot of
operators have to checkpoint, which in turn causes frequent HDFS write,
rename and delete operations that became the bottleneck.

The application requires exactly-once semantics with idempotent operators,
which I suppose cannot be achieved through async FS writes; please correct
me if I'm wrong here.

Also, the application computes streaming aggregations over high-cardinality
incoming data streams, and the reference caches are updated frequently, so
I am not sure how much incremental checkpointing will help here.

Regardless of this specific application, I strongly think it would be good
to have a StorageAgent backed by a distributed in-memory store as an
alternative in the platform.
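
To make the idea concrete, here is a minimal sketch of what such an agent
could look like. It assumes a contract with save/load/delete/getWindowIds
operations keyed by operator id and window id. The names CheckpointStore and
InMemoryCheckpointStore are hypothetical and are not the Apex API, and a
local concurrent map only stands in for a real distributed, fault-tolerant
grid.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical contract for a checkpoint store (an assumption for this sketch).
interface CheckpointStore {
    void save(Object state, int operatorId, long windowId) throws IOException;
    Object load(int operatorId, long windowId) throws IOException;
    void delete(int operatorId, long windowId) throws IOException;
    long[] getWindowIds(int operatorId) throws IOException;
}

// Keeps serialized operator state in memory instead of writing files.
class InMemoryCheckpointStore implements CheckpointStore {
    // operatorId -> (windowId -> serialized state).
    // A local map is used here as a placeholder for a distributed grid.
    private final Map<Integer, ConcurrentSkipListMap<Long, byte[]>> grid =
            new ConcurrentHashMap<>();

    @Override
    public void save(Object state, int operatorId, long windowId) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(state);
        }
        // Saving the same (operatorId, windowId) again overwrites the entry.
        grid.computeIfAbsent(operatorId, id -> new ConcurrentSkipListMap<>())
                .put(windowId, bytes.toByteArray());
    }

    @Override
    public Object load(int operatorId, long windowId) throws IOException {
        ConcurrentSkipListMap<Long, byte[]> windows = grid.get(operatorId);
        byte[] data = (windows == null) ? null : windows.get(windowId);
        if (data == null) {
            throw new IOException("No checkpoint for operator " + operatorId
                    + ", window " + windowId);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        }
    }

    @Override
    public void delete(int operatorId, long windowId) {
        ConcurrentSkipListMap<Long, byte[]> windows = grid.get(operatorId);
        if (windows != null) {
            windows.remove(windowId);
        }
    }

    @Override
    public long[] getWindowIds(int operatorId) {
        ConcurrentSkipListMap<Long, byte[]> windows = grid.get(operatorId);
        if (windows == null) {
            return new long[0];
        }
        // The skip list keeps window ids in ascending order.
        return windows.keySet().stream().mapToLong(Long::longValue).toArray();
    }
}
```

In this sketch a checkpoint becomes a single put keyed by (operatorId,
windowId), which replaces the write, rename and delete sequence on HDFS;
durability would then depend entirely on the replication guarantees of the
grid behind it.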

Ashish



On Wed, Dec 2, 2015 at 10:35 PM, Munagala Ramanath <[email protected]>
wrote:

> Ashish,
>
> In the current release, the HDFS writes are asynchronous so I'm wondering
> if you could elaborate on how much latency you are observing both with and
> without checkpointing (i.e. after your changes to make operators
> stateless).
>
> Also, any information on how much non-transient data is being checkpointed
> in each operator would be useful. There is an effort under way to
> implement incremental checkpointing, which should improve things when
> there is a lot of state but very little that changes from window to window.
>
> Ram
>
>
> On Wed, Dec 2, 2015 at 8:51 AM, Ashish Tadose <[email protected]>
> wrote:
>
> > Hi All,
> >
> > Currently the Apex engine provides operator checkpointing in HDFS (with
> > HDFS-backed StorageAgents, i.e. FSStorageAgent & AsyncFSStorageAgent).
> >
> > We have observed that for applications with a large number of operator
> > instances, HDFS checkpointing introduces latency in the DAG, which
> > degrades overall application performance.
> > To resolve this we had to review all operators in the DAG and make a few
> > operators stateless.
> >
> > As operator checkpointing is critical functionality of the Apex streaming
> > platform for ensuring fault-tolerant behavior, the platform should also
> > provide alternate StorageAgents that work seamlessly with large
> > applications that require exactly-once semantics.
> >
> > HDFS read/write latency does not improve beyond a certain point because
> > of disk I/O and staging writes. Having an alternate strategy that
> > checkpoints to a fault-tolerant distributed in-memory grid would ensure
> > that application stability and performance are not impacted.
> >
> > I have developed an in-memory storage agent which I would like to
> > contribute as an alternate StorageAgent for checkpointing.
> >
> > Thanks,
> > Ashish
> >
>
