Excellent, thanks! Definitely looks like old records are not getting evicted. You have not yet created a JIRA for this, correct?
Thanks
-Mark

> On Mar 9, 2017, at 10:24 AM, Joe Gresock <[email protected]> wrote:
>
> Good instinct -- here's what I get:
>
> nifi-app.log:2017-03-09 15:03:00,670 INFO [Distributed Cache Server
> Communications Thread: ac907dec-49a4-439e-99f5-1558f2358d87]
> org.wali.MinimalLockingWriteAheadLog
> org.wali.MinimalLockingWriteAheadLog@40569408 checkpointed with *4262902*
> Records and 0 Swap Files in 256302 milliseconds (Stop-the-world time = 1378
> milliseconds, Clear Edit Logs time = 19 millis), max Transaction ID 4263237
>
> Looks like it's over 4.2 million records now.
>
> On Thu, Mar 9, 2017 at 3:13 PM, Mark Payne <[email protected]> wrote:
>
>> Joe,
>>
>> That definitely sounds like a bug causing the eviction to not happen.
>> Can you grep your logs for the phrase "checkpointed with"? You should
>> have a line that tells you how many records were written to the
>> snapshot. You will certainly see a few of these types of messages,
>> though, because you have one for the FlowFile Repository, one for Local
>> State Management, and another for the DistributedMapCacheServer. I am
>> curious to see whether you also see a log message indicating 3 million+
>> records.
>>
>> Thanks
>> -Mark
>>
>>> On Mar 8, 2017, at 7:13 PM, Joe Gresock <[email protected]> wrote:
>>>
>>> Looking through PersistentMapCache and SimpleMapCache, it seems like
>>> lots of these records should have been evicted by now. We're up to 3.1
>>> million records on disk in the snapshot file. My understanding is that
>>> when wali.checkpoint() is called, it collapses all the DELETE records
>>> in the journaled log and removes them before writing the snapshot
>>> file. Is that accurate?
>>>
>>> I feel like something is not going quite right with the eviction
>>> process. I am using 1.1.1, though, and I have noticed that
>>> PersistentMapCache has changed in [1], so I might apply that patch and
>>> try some more experiments.
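The collapse-on-checkpoint behavior Joe asks about can be sketched as follows. This is a simplified model, not the actual org.wali implementation; the record type, enum, and method names here are invented for illustration only:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of a write-ahead log checkpoint collapsing journal records.
// Not the real org.wali code; all names are invented for illustration.
public class CheckpointSketch {

    enum Type { UPDATE, DELETE }

    record JournalEntry(Type type, String key, String value) {}

    // Replaying the journal into a map collapses repeated updates and
    // drops deleted keys, so only live records reach the snapshot.
    static Map<String, String> checkpoint(List<JournalEntry> journal) {
        Map<String, String> snapshot = new LinkedHashMap<>();
        for (JournalEntry e : journal) {
            if (e.type() == Type.UPDATE) {
                snapshot.put(e.key(), e.value());
            } else {
                snapshot.remove(e.key());
            }
        }
        return snapshot;
    }

    public static void main(String[] args) {
        List<JournalEntry> journal = new ArrayList<>();
        journal.add(new JournalEntry(Type.UPDATE, "a", "1"));
        journal.add(new JournalEntry(Type.UPDATE, "b", "2"));
        journal.add(new JournalEntry(Type.DELETE, "a", null));
        // Only "b" survives the collapse; a deleted/evicted "a" should
        // not appear in the snapshot.
        System.out.println(checkpoint(journal).keySet()); // prints [b]
    }
}
```

If this model holds, a snapshot that keeps growing would mean the DELETE records are never being written to the journal in the first place, rather than that the collapse is failing.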
>>>
>>> Would anyone be willing to try to replicate this behavior in NiFi
>>> 1.1.1? You should be able to do it as follows:
>>>
>>> Services:
>>> DistributedMapCacheServer, maximum cache entries = 100,000, FIFO
>>> eviction, persistence directory specified
>>> DistributedMapCacheClientService, pointed at the same host and port
>>>
>>> Flow:
>>> GenerateFlowFile (random 1KB binary files in batches of 10, 10
>>> scheduled threads) -> HashContent (md5) into hash.value ->
>>> DetectDuplicate with identifier = ${hash.value}, description = ., no
>>> age-off, your cache client selected, cache identifier = true
>>>
>>> This should cause the snapshot file to exceed 100,000 keys pretty
>>> quickly, and as far as I can tell, the count never goes back down.
>>> This in itself is not a problem, but when the cache gets really big,
>>> it tends to crash our cluster when NiFi reloads it into memory.
>>>
>>> [1] https://issues.apache.org/jira/browse/NIFI-3214
>>>
>>> On Wed, Mar 8, 2017 at 11:06 AM, Joe Gresock <[email protected]> wrote:
>>>
>>>> Thanks Bryan, I'll start looking through PersistentMapCache. This
>>>> morning I checked back and the snapshot file now has 2.9 million keys
>>>> in it.
>>>>
>>>> On Tue, Mar 7, 2017 at 4:39 PM, Bryan Bende <[email protected]> wrote:
>>>>
>>>>> Joe,
>>>>>
>>>>> I'm not that familiar with the persistence part of the DMCS,
>>>>> although I do know that it uses the same write-ahead log that the
>>>>> flow file repo uses.
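For reference, the in-memory FIFO bound that the replication flow above exercises can be modeled with a LinkedHashMap. This is a simplification for illustration, not the actual SimpleMapCache code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Rough model of an in-memory cache with a FIFO eviction bound, like
// the "maximum cache entries = 100,000" setting in the flow above.
// Illustration only, not the actual SimpleMapCache implementation.
public class FifoCacheSketch {

    static <K, V> Map<K, V> fifoCache(int maxEntries) {
        // accessOrder=false keeps insertion order, i.e. FIFO eviction:
        // the eldest inserted entry is dropped once the bound is hit.
        return new LinkedHashMap<K, V>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public static void main(String[] args) {
        Map<Integer, String> cache = fifoCache(100_000);
        for (int i = 0; i < 150_000; i++) {
            cache.put(i, "x");
        }
        // In memory the cache never exceeds the bound. If the persisted
        // snapshot keeps growing past it, the evictions are not being
        // reflected on disk.
        System.out.println(cache.size()); // prints 100000
    }
}
```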
>>>>>
>>>>> The code for PersistentMapCache is here:
>>>>> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-services/nifi-distributed-cache-services-bundle/nifi-distributed-cache-server/src/main/java/org/apache/nifi/distributed/cache/server/map/PersistentMapCache.java
>>>>>
>>>>> It looks like the WAL is checkpointed during puts here:
>>>>>
>>>>> final long modCount = modifications.getAndIncrement();
>>>>> if (modCount > 0 && modCount % 100000 == 0) {
>>>>>     wali.checkpoint();
>>>>> }
>>>>>
>>>>> And during deletes here:
>>>>>
>>>>> final long modCount = modifications.getAndIncrement();
>>>>> if (modCount > 0 && modCount % 1000 == 0) {
>>>>>     wali.checkpoint();
>>>>> }
>>>>>
>>>>> Not sure whether it was intentional that put operations checkpoint
>>>>> every 100k modifications while deletes checkpoint every 1k.
>>>>>
>>>>> Maybe Mark or others could shed some light on why the snapshot is
>>>>> reaching 3GB in size.
>>>>>
>>>>> -Bryan
>>>>>
>>>>> On Tue, Mar 7, 2017 at 7:07 AM, Joe Gresock <[email protected]> wrote:
>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> Is there a technical description of how the
>>>>>> DistributedMapCacheServer (DMCS) persistence works? I've noticed
>>>>>> the following on our cluster:
>>>>>>
>>>>>> - I have the DMCS configured on port 4557 as FIFO with a maximum of
>>>>>> 100,000 entries, and have specified a persistence directory.
>>>>>> - I am using DetectDuplicate with the DMCS, and the individual key
>>>>>> length is 80 bytes, with a description length of 1 byte. By my
>>>>>> count, this should result in a pure data size of 7.7MB.
>>>>>> - I notice that the snapshot file in the persistence directory
>>>>>> appears to continue growing past the 100,000 limit, though this may
>>>>>> be expected depending on the implementation.
>>>>>> Since I know that the key will contain "json", I can run the
>>>>>> following command to count the number of possible keys in the
>>>>>> snapshot file (though I'm not sure this is a good way of measuring
>>>>>> how many keys are actually cached):
>>>>>>
>>>>>> grep -oa json snapshot | wc -l
>>>>>>
>>>>>> - When the snapshot file reaches around 3GB, the DMCS has a hard
>>>>>> time staying up, and frequently becomes unreachable (netstat -tulpn
>>>>>> | grep 4557 shows nothing). At this point, in order to restore
>>>>>> functionality I delete the persistence directory and let it start
>>>>>> over.
>>>>>>
>>>>>> So my main questions are:
>>>>>> - How are the snapshot and partition files structured, and how can
>>>>>> I estimate how many keys are actually cached at a given time?
>>>>>> - Is the described behavior indicative of the cache exceeding the
>>>>>> configured maximum number of keys?
>>>>>>
>>>>>> Thanks,
>>>>>> Joe
>>>>>>
>>>>>> --
>>>>>> I know what it is to be in need, and I know what it is to have
>>>>>> plenty. I have learned the secret of being content in any and every
>>>>>> situation, whether well fed or hungry, whether living in plenty or
>>>>>> in want. I can do all this through him who gives me strength.
>>>>>> *-Philippians 4:12-13*
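The put/delete checkpoint conditions Bryan quotes earlier in the thread can be modeled to see how differently the two cadences behave. The modCount check and the 100,000/1,000 intervals come from the quoted snippet; everything else here is a hypothetical sketch, not NiFi code:

```java
// Model of the checkpoint cadence quoted from PersistentMapCache: a
// checkpoint fires whenever the modification counter satisfies
// modCount > 0 && modCount % interval == 0, where the interval is
// 100,000 for puts and 1,000 for deletes. Illustration only.
public class CadenceSketch {

    // Count how many checkpoints fire across 'ops' modifications for a
    // given interval.
    static long checkpointsFor(long ops, long interval) {
        long checkpoints = 0;
        for (long modCount = 0; modCount < ops; modCount++) {
            if (modCount > 0 && modCount % interval == 0) {
                checkpoints++;
            }
        }
        return checkpoints;
    }

    public static void main(String[] args) {
        // A put-heavy workload of 1M modifications checkpoints only 9
        // times, while the same volume of deletes would checkpoint 999
        // times, so deletions get flushed far more aggressively.
        System.out.println(checkpointsFor(1_000_000, 100_000)); // prints 9
        System.out.println(checkpointsFor(1_000_000, 1_000));   // prints 999
    }
}
```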
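As a sanity check on the size estimate in the original question, the arithmetic works out if "7.7MB" is read as MiB. The helper below is hypothetical, not part of NiFi:

```java
// Quick check of the "pure data size" estimate from the original
// question: 100,000 entries, 80-byte keys, 1-byte descriptions.
public class SizeEstimate {

    static double pureDataMiB(long entries, int keyBytes, int valueBytes) {
        return entries * (double) (keyBytes + valueBytes) / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // 100,000 * 81 bytes = 8,100,000 bytes, or about 7.7 MiB,
        // matching the 7.7MB figure in the question.
        System.out.printf("%.1f MiB%n", pureDataMiB(100_000, 80, 1)); // 7.7 MiB
    }
}
```

So a 3GB snapshot implies on the order of hundreds of times more records on disk than the configured 100,000-entry bound, consistent with the 3.1 million and 4.2 million counts reported above plus per-record overhead.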
