Ashish,

Sounds good and looking forward to it.
Thanks
- Gaurav

> On Dec 2, 2015, at 10:07 AM, Ashish Tadose <[email protected]> wrote:
>
> Thanks Gaurav,
>
> I have finished baseline implementations of StorageAgent and also tested it
> with demo applications by explicitly specifying it in the DAG configuration as
> below, and it works fine.
>
> dag.setAttribute(OperatorContext.STORAGE_AGENT, agent);
>
> I also had to make some changes to StramClient to pass additional
> information such as the applicationId, as it doesn't pass that currently.
>
> I am going to create a JIRA task for this feature and will document the
> design & implementation strategy there.
>
> Thx,
> Ashish
>
>
> On Wed, Dec 2, 2015 at 11:26 PM, Gaurav Gupta <[email protected]> wrote:
>
>> Just to add, you can plug in your storage agent using the attribute
>> STORAGE_AGENT (
>> https://www.datatorrent.com/docs/apidocs/com/datatorrent/api/Context.OperatorContext.html#STORAGE_AGENT
>> )
>>
>> Thanks
>> - Gaurav
>>
>>> On Dec 2, 2015, at 9:51 AM, Gaurav Gupta <[email protected]> wrote:
>>>
>>> Ashish,
>>>
>>> You are right that exactly-once semantics can't be achieved through
>>> async FS writes.
>>> Did you try the new StorageAgent with your application? If yes, do you
>>> have any numbers to compare?
>>>
>>> Thanks
>>> - Gaurav
>>>
>>>> On Dec 2, 2015, at 9:33 AM, Ashish Tadose <[email protected]> wrote:
>>>>
>>>> The application uses a large number of in-memory dimension store
>>>> partitions to hold high-cardinality aggregated data, and many
>>>> intermediate operators also keep cache data for reference lookups,
>>>> which is non-transient.
>>>>
>>>> Total application partitions were more than 1000, which causes many
>>>> operators to checkpoint and in turn a lot of frequent HDFS write,
>>>> rename & delete operations, which became a bottleneck.
>>>>
>>>> The application requires exactly-once semantics with idempotent
>>>> operators, which I suppose cannot be achieved through async FS writes;
>>>> please correct me if I'm wrong here.
>>>>
>>>> Also, the application computes streaming aggregations over
>>>> high-cardinality incoming data streams, and reference caches are
>>>> updated frequently, so I'm not sure how much incremental checkpointing
>>>> will help here.
>>>>
>>>> Beyond this specific application, I strongly think it would be good to
>>>> have a StorageAgent backed by a distributed in-memory store as an
>>>> alternative in the platform.
>>>>
>>>> Ashish
>>>>
>>>>
>>>> On Wed, Dec 2, 2015 at 10:35 PM, Munagala Ramanath <[email protected]> wrote:
>>>>
>>>>> Ashish,
>>>>>
>>>>> In the current release, the HDFS writes are asynchronous, so I'm
>>>>> wondering if you could elaborate on how much latency you are observing
>>>>> both with and without checkpointing (i.e. after your changes to make
>>>>> operators stateless).
>>>>>
>>>>> Also, any information on how much non-transient data is being
>>>>> checkpointed in each operator would be useful. There is an effort
>>>>> under way to implement incremental checkpointing, which should improve
>>>>> things when there is a lot of state but very little that changes from
>>>>> window to window.
>>>>>
>>>>> Ram
>>>>>
>>>>>
>>>>> On Wed, Dec 2, 2015 at 8:51 AM, Ashish Tadose <[email protected]> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> Currently the Apex engine provides operator checkpointing in HDFS
>>>>>> (with HDFS-backed StorageAgents, i.e. FSStorageAgent &
>>>>>> AsyncFSStorageAgent).
>>>>>>
>>>>>> We have observed that for applications having a large number of
>>>>>> operator instances, HDFS checkpointing introduces latency in the DAG
>>>>>> which degrades overall application performance.
>>>>>> To resolve this we had to review all operators in the DAG and make a
>>>>>> few operators stateless.
>>>>>>
>>>>>> As operator checkpointing is critical functionality of the Apex
>>>>>> streaming platform to ensure fault-tolerant behavior, the platform
>>>>>> should also provide alternate StorageAgents which work seamlessly
>>>>>> with large applications that require exactly-once semantics.
>>>>>>
>>>>>> HDFS read/write latency is limited and doesn't improve beyond a
>>>>>> certain point because of disk I/O & staging writes. Having an
>>>>>> alternate strategy that checkpoints into a fault-tolerant distributed
>>>>>> in-memory grid would ensure application stability and performance are
>>>>>> not impacted.
>>>>>>
>>>>>> I have developed an in-memory storage agent which I would like to
>>>>>> contribute as an alternate StorageAgent for checkpointing.
>>>>>>
>>>>>> Thanks,
>>>>>> Ashish
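For readers following the thread, here is a minimal sketch of what an in-memory StorageAgent could look like. The interface below mirrors the shape of Apex's com.datatorrent.api.StorageAgent contract (save/load/delete/getWindowIds), but it is redeclared locally so the snippet compiles on its own; the InMemoryStorageAgent class and its storage scheme are hypothetical illustrations, not the implementation discussed in the thread. A production agent would replicate state across a fault-tolerant in-memory grid rather than the local heap, otherwise checkpoints are lost when the container fails.

```java
import java.io.IOException;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Local stand-in mirroring the com.datatorrent.api.StorageAgent contract so
// this sketch is self-contained; a real agent implements the platform interface.
interface StorageAgent {
  void save(Object object, int operatorId, long windowId) throws IOException;
  Object load(int operatorId, long windowId) throws IOException;
  void delete(int operatorId, long windowId) throws IOException;
  long[] getWindowIds(int operatorId) throws IOException;
}

// Hypothetical in-memory agent: checkpoints are kept per operator in a map
// keyed by windowId. The TreeMap keeps window ids ordered, which makes it
// easy to find the latest checkpoint or purge committed windows.
public class InMemoryStorageAgent implements StorageAgent {
  private final ConcurrentMap<Integer, TreeMap<Long, Object>> store =
      new ConcurrentHashMap<>();

  private TreeMap<Long, Object> checkpoints(int operatorId) {
    return store.computeIfAbsent(operatorId, id -> new TreeMap<>());
  }

  @Override
  public void save(Object object, int operatorId, long windowId) {
    synchronized (store) {
      checkpoints(operatorId).put(windowId, object);
    }
  }

  @Override
  public Object load(int operatorId, long windowId) {
    synchronized (store) {
      // Returns null if no checkpoint exists for this operator/window.
      return checkpoints(operatorId).get(windowId);
    }
  }

  @Override
  public void delete(int operatorId, long windowId) {
    synchronized (store) {
      checkpoints(operatorId).remove(windowId);
    }
  }

  @Override
  public long[] getWindowIds(int operatorId) {
    synchronized (store) {
      return checkpoints(operatorId).keySet().stream()
          .mapToLong(Long::longValue).toArray();
    }
  }
}
```

Such an agent would then be wired in exactly as described above, via `dag.setAttribute(OperatorContext.STORAGE_AGENT, new InMemoryStorageAgent())`.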
