Excellent, thanks! Definitely looks like old records are not getting evicted. You have not yet created a JIRA for this, correct?
Thanks
-Mark

> On Mar 9, 2017, at 10:24 AM, Joe Gresock <[email protected]> wrote:
>
> Good instinct -- here's what I get:
>
> nifi-app.log:2017-03-09 15:03:00,670 INFO [Distributed Cache Server
> Communications Thread: ac907dec-49a4-439e-99f5-1558f2358d87]
> org.wali.MinimalLockingWriteAheadLog
> org.wali.MinimalLockingWriteAheadLog@40569408 checkpointed with *4262902*
> Records and 0 Swap Files in 256302 milliseconds (Stop-the-world time = 1378
> milliseconds, Clear Edit Logs time = 19 millis), max Transaction ID 4263237
>
> Looks like it's over 4.2 million records now.
>
> On Thu, Mar 9, 2017 at 3:13 PM, Mark Payne <[email protected]> wrote:
>
>> Joe,
>>
>> That definitely sounds like a bug causing the eviction to not happen.
>> Can you grep your logs for the phrase "checkpointed with"? You should
>> have a line that tells you how many records were written to the
>> snapshot. You will certainly see a few of these types of messages,
>> though, because you have one for the FlowFile Repository, one for Local
>> State Management, and another for the DistributedMapCacheServer. I am
>> curious to see whether you also see a log message indicating 3 million+
>> records.
>>
>> Thanks
>> -Mark
>>
>>> On Mar 8, 2017, at 7:13 PM, Joe Gresock <[email protected]> wrote:
>>>
>>> Looking through PersistentMapCache and SimpleMapCache, it seems like
>>> lots of these records should have been evicted by now. We're up to 3.1
>>> million records on disk in the snapshot file. My understanding is that
>>> when wali.checkpoint() is called, it collapses all the DELETE records
>>> in the journaled log and removes them before writing the snapshot
>>> file. Is that accurate?
>>>
>>> I feel like something is not going quite right with the eviction
>>> process. I am using 1.1.1, though, and I have noticed that
>>> PersistentMapCache has changed in [1], so I might apply that patch and
>>> try some more experiments.
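The collapse-on-checkpoint behavior Joe asks about can be sketched as follows. This is a simplified model, not the actual org.wali implementation; the record type, enum, and method names here are invented for illustration only:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of a write-ahead log checkpoint collapsing journal records.
// Not the real org.wali code; all names are invented for illustration.
public class CheckpointSketch {

    enum Type { UPDATE, DELETE }

    record JournalEntry(Type type, String key, String value) {}

    // Replaying the journal into a map collapses repeated updates and
    // drops deleted keys, so only live records reach the snapshot.
    static Map<String, String> checkpoint(List<JournalEntry> journal) {
        Map<String, String> snapshot = new LinkedHashMap<>();
        for (JournalEntry e : journal) {
            if (e.type() == Type.UPDATE) {
                snapshot.put(e.key(), e.value());
            } else {
                snapshot.remove(e.key());
            }
        }
        return snapshot;
    }

    public static void main(String[] args) {
        List<JournalEntry> journal = new ArrayList<>();
        journal.add(new JournalEntry(Type.UPDATE, "a", "1"));
        journal.add(new JournalEntry(Type.UPDATE, "b", "2"));
        journal.add(new JournalEntry(Type.DELETE, "a", null));
        // Only "b" survives the collapse; a deleted/evicted "a" should
        // not appear in the snapshot.
        System.out.println(checkpoint(journal).keySet()); // prints [b]
    }
}
```

If this model holds, a snapshot that keeps growing would mean the DELETE records are never being written to the journal in the first place, rather than that the collapse is failing.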
>>>
>>> Would anyone be willing to try to replicate this behavior in NiFi
>>> 1.1.1? You should be able to do it as follows:
>>>
>>> Services:
>>> DistributedMapCacheServer, maximum cache entries = 100,000, FIFO
>>> eviction, persistence directory specified
>>> DistributedMapCacheClientService, pointed at the same host and port
>>>
>>> Flow:
>>> GenerateFlowFile (random 1KB binary files in batches of 10, 10
>>> scheduled threads) -> HashContent (md5) into hash.value ->
>>> DetectDuplicate with identifier = ${hash.value}, description = ., no
>>> age-off, your cache client selected, cache identifier = true
>>>
>>> This should cause the snapshot file to exceed 100,000 keys pretty
>>> quickly, and as far as I can tell, the count never goes back down.
>>> This in itself is not a problem, but when the cache gets really big,
>>> it tends to crash our cluster when NiFi reloads it into memory.
>>>
>>> [1] https://issues.apache.org/jira/browse/NIFI-3214
>>>
>>> On Wed, Mar 8, 2017 at 11:06 AM, Joe Gresock <[email protected]> wrote:
>>>
>>>> Thanks Bryan, I'll start looking through PersistentMapCache. This
>>>> morning I checked back and the snapshot file now has 2.9 million keys
>>>> in it.
>>>>
>>>> On Tue, Mar 7, 2017 at 4:39 PM, Bryan Bende <[email protected]> wrote:
>>>>
>>>>> Joe,
>>>>>
>>>>> I'm not that familiar with the persistence part of the DMCS,
>>>>> although I do know that it uses the same write-ahead log that the
>>>>> flow file repo uses.
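For reference, the in-memory FIFO bound that the replication flow above exercises can be modeled with a LinkedHashMap. This is a simplification for illustration, not the actual SimpleMapCache code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Rough model of an in-memory cache with a FIFO eviction bound, like
// the "maximum cache entries = 100,000" setting in the flow above.
// Illustration only, not the actual SimpleMapCache implementation.
public class FifoCacheSketch {

    static <K, V> Map<K, V> fifoCache(int maxEntries) {
        // accessOrder=false keeps insertion order, i.e. FIFO eviction:
        // the eldest inserted entry is dropped once the bound is hit.
        return new LinkedHashMap<K, V>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public static void main(String[] args) {
        Map<Integer, String> cache = fifoCache(100_000);
        for (int i = 0; i < 150_000; i++) {
            cache.put(i, "x");
        }
        // In memory the cache never exceeds the bound. If the persisted
        // snapshot keeps growing past it, the evictions are not being
        // reflected on disk.
        System.out.println(cache.size()); // prints 100000
    }
}
```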
>>>>>
>>>>> The code for PersistentMapCache is here:
>>>>> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-services/nifi-distributed-cache-services-bundle/nifi-distributed-cache-server/src/main/java/org/apache/nifi/distributed/cache/server/map/PersistentMapCache.java
>>>>>
>>>>> It looks like the WAL is checkpointed during puts here:
>>>>>
>>>>> final long modCount = modifications.getAndIncrement();
>>>>> if (modCount > 0 && modCount % 100000 == 0) {
>>>>>     wali.checkpoint();
>>>>> }
>>>>>
>>>>> And during deletes here:
>>>>>
>>>>> final long modCount = modifications.getAndIncrement();
>>>>> if (modCount > 0 && modCount % 1000 == 0) {
>>>>>     wali.checkpoint();
>>>>> }
>>>>>
>>>>> Not sure whether it was intentional that put operations checkpoint
>>>>> every 100k modifications while deletes checkpoint every 1k.
>>>>>
>>>>> Maybe Mark or others could shed some light on why the snapshot is
>>>>> reaching 3GB in size.
>>>>>
>>>>> -Bryan
>>>>>
>>>>> On Tue, Mar 7, 2017 at 7:07 AM, Joe Gresock <[email protected]> wrote:
>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> Is there a technical description of how the
>>>>>> DistributedMapCacheServer (DMCS) persistence works? I've noticed
>>>>>> the following on our cluster:
>>>>>>
>>>>>> - I have the DMCS configured on port 4557 as FIFO with a maximum of
>>>>>> 100,000 entries, and have specified a persistence directory.
>>>>>> - I am using DetectDuplicate with the DMCS, and the individual key
>>>>>> length is 80 bytes, with a description length of 1 byte. By my
>>>>>> count, this should result in a pure data size of 7.7MB.
>>>>>> - I notice that the snapshot file in the persistence directory
>>>>>> appears to continue growing past the 100,000 limit, though this may
>>>>>> be expected depending on the implementation.
>>>>>> Since I know that the key will contain "json", I can run the
>>>>>> following command to count the number of possible keys in the
>>>>>> snapshot file (though I'm not sure this is a good way of measuring
>>>>>> how many keys are actually cached):
>>>>>>
>>>>>> grep -oa json snapshot | wc -l
>>>>>>
>>>>>> - When the snapshot file reaches around 3GB, the DMCS has a hard
>>>>>> time staying up, and frequently becomes unreachable (netstat -tulpn
>>>>>> | grep 4557 shows nothing). At this point, in order to restore
>>>>>> functionality I delete the persistence directory and let it start
>>>>>> over.
>>>>>>
>>>>>> So my main questions are:
>>>>>> - How are the snapshot and partition files structured, and how can
>>>>>> I estimate how many keys are actually cached at a given time?
>>>>>> - Is the described behavior indicative of the cache exceeding the
>>>>>> configured maximum number of keys?
>>>>>>
>>>>>> Thanks,
>>>>>> Joe
>>>>>>
>>>>>> --
>>>>>> I know what it is to be in need, and I know what it is to have
>>>>>> plenty. I have learned the secret of being content in any and every
>>>>>> situation, whether well fed or hungry, whether living in plenty or
>>>>>> in want. I can do all this through him who gives me strength.
>>>>>> *-Philippians 4:12-13*
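The put/delete checkpoint conditions Bryan quotes earlier in the thread can be modeled to see how differently the two cadences behave. The modCount check and the 100,000/1,000 intervals come from the quoted snippet; everything else here is a hypothetical sketch, not NiFi code:

```java
// Model of the checkpoint cadence quoted from PersistentMapCache: a
// checkpoint fires whenever the modification counter satisfies
// modCount > 0 && modCount % interval == 0, where the interval is
// 100,000 for puts and 1,000 for deletes. Illustration only.
public class CadenceSketch {

    // Count how many checkpoints fire across 'ops' modifications for a
    // given interval.
    static long checkpointsFor(long ops, long interval) {
        long checkpoints = 0;
        for (long modCount = 0; modCount < ops; modCount++) {
            if (modCount > 0 && modCount % interval == 0) {
                checkpoints++;
            }
        }
        return checkpoints;
    }

    public static void main(String[] args) {
        // A put-heavy workload of 1M modifications checkpoints only 9
        // times, while the same volume of deletes would checkpoint 999
        // times, so deletions get flushed far more aggressively.
        System.out.println(checkpointsFor(1_000_000, 100_000)); // prints 9
        System.out.println(checkpointsFor(1_000_000, 1_000));   // prints 999
    }
}
```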
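As a sanity check on the size estimate in the original question, the arithmetic works out if "7.7MB" is read as MiB. The helper below is hypothetical, not part of NiFi:

```java
// Quick check of the "pure data size" estimate from the original
// question: 100,000 entries, 80-byte keys, 1-byte descriptions.
public class SizeEstimate {

    static double pureDataMiB(long entries, int keyBytes, int valueBytes) {
        return entries * (double) (keyBytes + valueBytes) / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // 100,000 * 81 bytes = 8,100,000 bytes, or about 7.7 MiB,
        // matching the 7.7MB figure in the question.
        System.out.printf("%.1f MiB%n", pureDataMiB(100_000, 80, 1)); // 7.7 MiB
    }
}
```

So a 3GB snapshot implies on the order of hundreds of times more records on disk than the configured 100,000-entry bound, consistent with the 3.1 million and 4.2 million counts reported above plus per-record overhead.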
