On 12/9/19 7:51 PM, Rob Herring wrote: > On Mon, Dec 9, 2019 at 7:35 AM Sebastian Andrzej Siewior > <bige...@linutronix.de> wrote: >> >> On 2019-12-05 20:01:41 [-0600], Frank Rowand wrote: >>> Is there a memory usage issue for the systems that led to this thread? >> >> No, no memory issue led to this thread. I was just testing my patch and >> I assumed that I did something wrong in the counting/lock drop/lock >> acquire/allocate path because the array was hardly used. So I started to >> look deeper… >> Once I figured out everything was fine, I was curious if everyone is >> aware of the different phandle creation by dtc vs POWER. And I posted >> the mail in the thread. >> Once you confirmed that everything is "known / not an issue" I was ready >> to take off [0]. >> >> Later more replies came in such as one mail [1] from Rob describing the >> original reason with 814 phandles. _Here_ I was just surprised that 1024 >> were used over 64 entries for a benefit of 60ms. I understand that this >> is low concern for you because that memory is released if modules are >> not enabled. I usually see that module support is left enabled. >> >> However, Rob suggested / asked about the fixed size array (this is how I >> understood it): >> |And yes, as mentioned earlier I don't like the complexity. I didn't >> |from the start and I'm I'm still of the opinion we should have a >> |fixed or 1 time sized true cache (i.e. smaller than total # of >> |phandles). That would solve the RT memory allocation and locking issue >> |too. >> >> so I attempted to ask if we should have the fixed size array maybe >> with the hash_32() instead the mask. This would make my other patch >> obsolete because the fixed size array should not have a RT issue. The >> hash_32() part here would address the POWER issue where the cache is >> currently not used efficiently. >> >> If you want instead to keep things as-is then this is okay from my side. >> If you want to keep this cache off on POWER then I could contribute a >> patch doing so. > > It turns out there's actually a bug in the current implementation. If > we have multiple phandles with the same mask, then we leak node > references if we miss in the cache and re-assign the cache entry.
Aaargh. Patch sent. > Easily fixed I suppose, but holding a ref count for a cached entry > seems wrong. That means we never have a ref count of 0 on every node > with a phandle. It will go to zero when the cache is freed. My memory is that we free the cache as part of removing an overlay. I'll verify whether my memory is correct. -Frank > > I've done some more experiments with the performance. I've come to the > conclusion that just measuring boot time is not a good way at least if > the time is not a significant percentage of the total. I couldn't get > any measurable data. I'm using a RK3399 Rock960 board. It has 190 > phandles. I get about 1500 calls to of_find_node_by_phandle() during > boot. Note that about the first 300 are before we have any timekeeping > (the prior measurements also would not account for this). Those calls > have no cache in the current implementation and are cached in my > implementation. > > no cache: 20124 us > current cache: 819 us > > new cache (64 entry): 4342 us > new cache (128 entry): 2875 us > new cache (256 entry): 1235 us > > To get some idea on the time spent before timekeeping is up, if we > take the avg miss time is ~17us (20124/1200), then we're spending > about ~5ms on lookups before the cache is enabled. I'd estimate the > new cache takes ~400us before timekeeping is up as there's 11 misses > early. > >>From these numbers, it seems the miss rate has a significant impact on > performance for the different sizes. But taken from the original 20+ > ms, they all look like good improvement. > > Rob >