Re: [ccache] Duplicate object files in the ccache - possible optimization?

Frank Klotz Mon, 07 Nov 2011 18:03:58 -0800

 On 11/07/2011 10:55 AM, Justin Lebar wrote:

Looking at a ccache with about 40,000 .o files in it (created with direct
mode turned on); of the 55 largest files, I found 11 pairs and one triplet
of identical object files.  That's almost 25% of redundant storage that
could have been avoided by looking at the preprocessed hash when there is no
hit in direct mode.

It's much more interesting to look at the whole cache, I think.


$ find -name '*.o' -type f | wc -l
39312
jlebar@turing:~/.ccache$ find -name '*.o' -type f | xargs -P16 sha1sum
| cut -f 1 -d ' ' | sort | uniq -d | wc -l
1230

So it looks like there's some duplication on my machine, but not a
ton.  I'd be curious if you got significantly different numbers.

Hi Justin,

Here's what I got:

> find -name '*.o' -type f | wc -l
43507

13087

I would say the difference is significant - you got 3%, while I got 30%.

This is in a project environment where we have fairly fast churn in anumber of widely-used header files, and I am guessing that the changesin those files often consist of addition of or changes to macros andconstants that are not used in all the files where the header files getincluded.

Now I am more than ready to agree that there is room for improvement inthe design and implementation of the header file layouts here, and weare working on that; but at the same time it looks to me like a fairlystraightforward enhancement to ccache could give us some performanceboost here. Since ccache already knows how to do preprocessed hashes, Ithink all I am looking to do is to have it use that existing code whenthe direct mode hash doesn't get a hit. Then put both hash names on theresulting object file, and voila: better hit rates and more unique filesin the cache.

It would be interesting to know if many others see anything at allsimilar to the numbers I have here (i.e., lots of people could use theenhancement I am suggesting), of if there is something unique about thisenvironment that leads to this much duplication in the cache.


Thanks,
Frank

On Mon, Nov 7, 2011 at 12:49 PM, Frank Klotz
<frank.kl...@alcatel-lucent.com>  wrote:

  Hi Martin,

Thanks for your responses.

s/index hash/direct mode hash/g

Apologies - I had a brain burp and was using the wrong terminology.

That aside, however, with  the advent of direct mode, there ARE two hashes
possible for any given object file - the direct mode hash (hashing all the
sources that go into the compilation) and the preprocessed hash (hashing the
result of running all those sources through the preprocessor).  And any time
there is a cache miss, ccache has computed both those hashes, hasn't it?
  (Or maybe not - if not, see discussion below.)  And it appears to me that
in many cases, the resulting object file occurs twice in the cache, once
under each hash.  And currently, those two occurrences are two separate
files, which could be combined into a single inode with two hard-linked
directory entries.

Or am I confused about how direct mode interacts with preprocessed mode?  If
running in direct mode, does ccache never compute the preprocessed hash?  If
not, it obviously could, and I would recommend that it should.  Why?
  Because when changes are made to a widely-used header file, it very
commonly occurs that those changes only actually modify the preprocessor
output of a small subset of the sources that include that header file, while
many other sources don't use the changes (say, definition of new macros or
new constants), so end up with the same preprocessed output, and the same
object file, even though the input header files and direct mode hash did
change).  In that case, ccache could still find hits in the cache with the
preprocessed mode, even if it's a miss with the direct mode hash.  If ccache
does not get a direct mode hit, it certainly will have to RUN the
preprocessor to recompile the file - how much extra cost to compute the
preprocessed hash, look it up (to avoid recompiling if it is found with THAT
hash), and if a compile is still needed, store the resulting object file
inode with 2 directory entries rather than just one?

The way I read the doc about how direct mode works, I thought it would
compute the direct mode hash, and if no hit, "fall back to preprocessed
mode".  I thought that meant it would compute the preprocessed hash and look
for that too.   Is that incorrect - does it only compute ONE hash in all
cases - a direct mode hash if running in direct mode and a preprocessed hash
if not in direct mode?  If so, then let's modify my suggested enhancement to
be that in direct mode, calculate and use the preprocessed hash whenever
there is no hit with direct mode, and create hard links using all computed
hashes to the one single object file inode that eventually exists in the
ccache.  I don't think direct mode and preprocessed mode HAVE to be mutually
exclusive - when direct mode gets a miss, preprocessed mode can still often
provide a hit.

And if no preprocessed hash gets computed/stored when running in direct
mode, then I suspect that the reason I see so many pairs of identical object
files in my ccache is because of the situation I describe above, where a
header file change has triggered a direct mode hash miss, but preprocessing
the sources has resulted in an identical preprocessed file which was then
passed to the compiler which produced an identical object file.  But ccache
didn't KNOW that they were identical, because it didn't compute the
preprocessed hash.

Looking at a ccache with about 40,000 .o files in it (created with direct
mode turned on); of the 55 largest files, I found 11 pairs and one triplet
of identical object files.  That's almost 25% of redundant storage that
could have been avoided by looking at the preprocessed hash when there is no
hit in direct mode.

Thanks,
Frank


On 11/07/2011 12:53 AM, Martin Pool wrote:

On 5 November 2011 11:12, Frank Klotz<frank.kl...@alcatel-lucent.com>
  wrote:

  I used ccache at my previous employer, and was very convinced of its
value.
  Now that I have started a new job, I am in the process of trying to
bring
the new shop on board with ccache, so I have been doing lots of test runs
and looking at things.  Here is one thing I am thinking could add some
value.

Looking through the ccache, I find many pairs of files which have
different
names (different hashes), but exactly identical content.  This actually
makes sense, as each file would have an index hash and a preprocessed
hash,
and since ccache needs to be able to find a match on either, then both
need
to be in the cache.

What is an index hash?

  (Actually, thinking about it, I'm a little surprised
that there are any files in the ccache that DON'T appear twice -
shouldn't
EVERY compilation have 2 hashes?)

I don't understand why you would expect that.

It seems like you expect there is another indirection layer by which
ccache tries to find jobs that produce identical output.  I don't
think there is one at present.  I don't think this would happen very
often in reality, except perhaps for trivial cases like compiling
empty files, and that's not so important to accelerate, and it will
not use up much disk space.

If you're getting duplicated cache files due to for instance doing
builds in different directories or from different trees that produce
identical output you could change the ccache options to make it less
stringent.

But it seems to me that it would make a lot of sense to store the data of
these 2 files only once, by hard-linking the 2 names to the same inode.
  (For filesystems that support hard links, of course!)  Every time ccache
does an actual compilation and stores a file in the cache, it should
store
it under hard links for BOTH hashes - the indexed hash and the
proprocessed
hash.  And if it gets a hash miss on the indexed hash but a hit on the
preprocessed hash, then it should add the missed index hash as a hard
link
to the file found.  So a given file (inode) in the cache could actually
be
referenced by MANY directory entries: one preprocessed hash, and multiple
index hashes for various different combinations of source files and
header
files which end up producing the same output when passed through the
preprocessor.

This mail is the first time google has heard of "ccache indexed hash"...

This could increase the storage efficiency of the ccache.

Of course, since not every filesystem supports hard links, the simplest
solution was of course just to have multiple file copies.  So I guess
adding
code to do this would require some way to determine if the filesystem the
cache is on can in fact support hardlinks.

If you think this sounds like a good idea, but don't have bandwidth to do
it, I would be willing to give it a try.  Any hints on where to start
would
of course be welcome.

Thanks,
Frank Klotz
_______________________________________________
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache

_______________________________________________
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache


_______________________________________________
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache

Re: [ccache] Duplicate object files in the ccache - possible optimization?

Reply via email to