Ashish,

Two more questions for you: which in-memory stores did you evaluate? Are
they YARN compatible?

Thank you for your contribution.

Sandesh
On Wed, Dec 2, 2015 at 10:53 AM Gaurav Gupta <[email protected]> wrote:

> Ashish,
>
> I have a couple of questions:
>
> 1. Are there standard APIs for distributed in-memory stores, or is this
> implementation specific to one particular tool?
> 2. Will the in-memory store compete with DataTorrent apps for cluster
> resources (memory/CPU)?
> 3. What is the purging policy? Who is responsible for cleaning up the
> resources of completed/failed/aborted applications? This becomes
> important when you want to launch an application using a previous
> application id.
>
> Thanks
> - Gaurav
>
> On Dec 2, 2015, at 10:07 AM, Ashish Tadose <[email protected]> wrote:
>
>> Thanks Gaurav,
>>
>> I have finished a baseline implementation of the StorageAgent and have
>> also tested it with demo applications by explicitly specifying it in
>> the DAG configuration as below, and it works fine.
>>
>> dag.setAttribute(OperatorContext.STORAGE_AGENT, agent);
>>
>> I also had to make some changes to StramClient to pass additional
>> information such as the applicationId, as it isn't passed currently.
>>
>> I am going to create a JIRA task for this feature and will document
>> the design & implementation strategy there.
>>
>> Thx,
>> Ashish
>>
>> On Wed, Dec 2, 2015 at 11:26 PM, Gaurav Gupta <[email protected]> wrote:
>>
>>> Just to add, you can plug in your storage agent using the attribute
>>> STORAGE_AGENT (
>>> https://www.datatorrent.com/docs/apidocs/com/datatorrent/api/Context.OperatorContext.html#STORAGE_AGENT
>>> )
>>>
>>> Thanks
>>> - Gaurav
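The attribute accepts any implementation of com.datatorrent.api.StorageAgent
(save/load/delete/getWindowIds). As a minimal sketch of what such an agent
can look like: the class name InMemoryStorageAgent is hypothetical, and the
ConcurrentHashMap below is only a stand-in for a client to a distributed
in-memory grid; a real agent would also serialize the state (e.g. with Kryo,
as FSStorageAgent does) rather than hold live object references.

import java.io.IOException;
import java.io.Serializable;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

import com.datatorrent.api.StorageAgent;

public class InMemoryStorageAgent implements StorageAgent, Serializable
{
  // Stand-in for a distributed in-memory grid client.
  private final ConcurrentHashMap<String, Object> store =
      new ConcurrentHashMap<String, Object>();

  private static String key(int operatorId, long windowId)
  {
    return operatorId + "/" + windowId;
  }

  @Override
  public void save(Object object, int operatorId, long windowId) throws IOException
  {
    store.put(key(operatorId, windowId), object);
  }

  @Override
  public Object load(int operatorId, long windowId) throws IOException
  {
    Object state = store.get(key(operatorId, windowId));
    if (state == null) {
      throw new IOException("No checkpoint for operator " + operatorId
          + ", window " + windowId);
    }
    return state;
  }

  @Override
  public void delete(int operatorId, long windowId) throws IOException
  {
    store.remove(key(operatorId, windowId));
  }

  @Override
  public long[] getWindowIds(int operatorId) throws IOException
  {
    // Scan the keys for this operator's checkpointed windows; a
    // grid-backed agent would keep a per-operator index instead.
    TreeSet<Long> ids = new TreeSet<Long>();
    String prefix = operatorId + "/";
    for (String k : store.keySet()) {
      if (k.startsWith(prefix)) {
        ids.add(Long.parseLong(k.substring(prefix.length())));
      }
    }
    long[] result = new long[ids.size()];
    int i = 0;
    for (Long id : ids) {
      result[i++] = id;
    }
    return result;
  }
}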
>>> On Dec 2, 2015, at 9:51 AM, Gaurav Gupta <[email protected]> wrote:
>>>
>>>> Ashish,
>>>>
>>>> You are right that exactly-once semantics can't be achieved through
>>>> async FS writes.
>>>> Did you try the new StorageAgent with your application? If yes, do
>>>> you have any numbers to compare?
>>>>
>>>> Thanks
>>>> - Gaurav
>>>>
>>>> On Dec 2, 2015, at 9:33 AM, Ashish Tadose <[email protected]> wrote:
>>>>
>>>>> The application uses a large number of in-memory dimension store
>>>>> partitions to hold high-cardinality aggregated data, and many
>>>>> intermediate operators also keep cached data for reference lookups,
>>>>> which is non-transient.
>>>>>
>>>>> Total application partitions were more than 1000, which means a lot
>>>>> of operators to checkpoint and, in turn, a lot of frequent HDFS
>>>>> write, rename & delete operations, which became the bottleneck.
>>>>>
>>>>> The application requires exactly-once semantics with idempotent
>>>>> operators, which I suppose cannot be achieved through async FS
>>>>> writes; please correct me if I'm wrong here.
>>>>>
>>>>> Also, the application computes streaming aggregations of
>>>>> high-cardinality incoming data streams and the reference caches are
>>>>> updated frequently, so I am not sure how much incremental
>>>>> checkpointing will help here.
>>>>>
>>>>> Beyond this specific application, I strongly think it would be good
>>>>> to have a StorageAgent backed by a distributed in-memory store as an
>>>>> alternative in the platform.
>>>>>
>>>>> Ashish
>>>>>
>>>>> On Wed, Dec 2, 2015 at 10:35 PM, Munagala Ramanath <[email protected]> wrote:
>>>>>
>>>>>> Ashish,
>>>>>>
>>>>>> In the current release, the HDFS writes are asynchronous, so I'm
>>>>>> wondering if you could elaborate on how much latency you are
>>>>>> observing both with and without checkpointing (i.e. after your
>>>>>> changes to make operators stateless).
>>>>>>
>>>>>> Also, any information on how much non-transient data is being
>>>>>> checkpointed in each operator would be useful. There is an effort
>>>>>> under way to implement incremental checkpointing, which should
>>>>>> improve things when there is a lot of state but very little that
>>>>>> changes from window to window.
>>>>>>
>>>>>> Ram
>>>>>>
>>>>>> On Wed, Dec 2, 2015 at 8:51 AM, Ashish Tadose <[email protected]> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> Currently the Apex engine provides operator checkpointing in HDFS
>>>>>>> (with the HDFS-backed StorageAgents, i.e. FSStorageAgent &
>>>>>>> AsyncFSStorageAgent).
>>>>>>>
>>>>>>> We have observed that for applications with a large number of
>>>>>>> operator instances, HDFS checkpointing introduces latency in the
>>>>>>> DAG which degrades overall application performance. To resolve
>>>>>>> this we had to review all operators in the DAG and make a few
>>>>>>> operators stateless.
>>>>>>>
>>>>>>> As operator checkpointing is critical functionality of the Apex
>>>>>>> streaming platform to ensure fault-tolerant behavior, the platform
>>>>>>> should also provide alternate StorageAgents which work seamlessly
>>>>>>> with large applications that require exactly-once semantics.
>>>>>>>
>>>>>>> HDFS read/write latency is limited and doesn't improve beyond a
>>>>>>> certain point because of disk I/O & staging writes. Having an
>>>>>>> alternate strategy that checkpoints into a fault-tolerant
>>>>>>> distributed in-memory grid would ensure application stability and
>>>>>>> performance are not impacted.
>>>>>>>
>>>>>>> I have developed an in-memory storage agent which I would like to
>>>>>>> contribute as an alternate StorageAgent for checkpointing.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ashish
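For reference, such an agent is plugged in from the application's
populateDAG via the dag.setAttribute call shown upthread. A minimal usage
sketch, assuming the hypothetical InMemoryStorageAgent above (the Application
class here is illustrative, not the contributor's actual code):

import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;

public class Application implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Replace the default HDFS-backed agent for all operators in this DAG.
    dag.setAttribute(OperatorContext.STORAGE_AGENT, new InMemoryStorageAgent());

    // ... add operators and streams as usual ...
  }
}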
