Re: [ccache] Duplicate object files in the ccache - possible optimization?

Frank Klotz Tue, 08 Nov 2011 14:58:02 -0800

 On 11/08/2011 02:24 PM, Joel Rosdahl wrote:

On 7 November 2011 18:49, Frank Klotz<frank.kl...@alcatel-lucent.com>  wrote:

[...] That aside, however, with the advent of direct mode, there ARE two
hashes possible for any given object file - the direct mode hash (hashing all
the sources that go into the compilation) and the preprocessed hash (hashing
the result of running all those sources through the preprocessor).

Well, yes and no. There's only one hash for a given object file: the hash of
the output of the preprocessor. This hash is used to look up the object file in
the cache (i.e., the object file is named after the hash).


Then, for the direct mode, there is one hash for each combination of source
code files (i.e., the file to compile and all its include files) and compiler
flags that results in the same preprocessor output. The mapping between
different source code hashes and the resulting preprocessor hashes is stored in
.manifest files in the cache. A manifest file is looked up using (and thus
named after) a hash of only the main file and compilation flags.

And any time there is a cache miss, ccache has computed both those hashes,
hasn't it?

As mentioned above, it starts by computing a hash of the input source file and
the command line options. It then looks up the manifest file, continues hashing
include file sets found in the manifest and compares them with the actual
include files. If there's a match, the object file name (i.e., the preprocessor
hash) can be read in the manifest.
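To make the two-level lookup described above concrete, here is a minimal sketch in Python. It is not ccache's implementation: the names `digest`, `manifest_key` and `direct_lookup`, the in-memory dictionary standing in for the .manifest files, and the use of SHA-256 are all assumptions made for illustration only.

```python
import hashlib


def digest(data):
    # Stand-in hash function (assumption); ccache's real hash is not modelled here.
    return hashlib.sha256(data).hexdigest()


def manifest_key(main_source, flags):
    # A manifest is named after a hash of only the main file and the flags.
    return digest(main_source + b"\0" + " ".join(flags).encode())


def direct_lookup(manifests, main_source, flags, read_file):
    """Return the preprocessor hash (the object file name) on a hit, else None."""
    entries = manifests.get(manifest_key(main_source, flags), [])
    for include_hashes, preprocessor_hash in entries:
        # An entry matches if every recorded include file still has the
        # content hash that was recorded when the entry was created.
        if all(digest(read_file(path)) == h for path, h in include_hashes.items()):
            return preprocessor_hash
    return None


# Hypothetical example data: one header, one main file, one manifest entry.
files = {"a.h": b"#define N 1\n"}
main = b'#include "a.h"\nint x = N;\n'
manifests = {
    manifest_key(main, ["-O2"]): [({"a.h": digest(files["a.h"])}, "pp-hash-1")]
}
print(direct_lookup(manifests, main, ["-O2"], files.__getitem__))  # pp-hash-1
```

If the header content or the flags change, the lookup returns None, and in this sketch the caller would fall back to running the preprocessor and hashing its output.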

This is documented in the manual under "The direct mode":
http://ccache.samba.org/manual.html#_the_direct_mode

If it's hard to understand, I would be happy for any suggestions on how to
improve it. :-)

Umm, well, the fact that I didn't get it doesn't mean there is a problem
with the documentation - maybe just that I am not too good at
understanding it!

I guess I would ask/suggest that it be made clearer that the 'data
structure called “manifest”' is just another file in the cache, named
with its hash and the suffix ".manifest"; and also that the "references
to cached compilation results" in the manifest files ARE the
preprocessor hashes (that is, if in fact they ARE - I'm still not 100%
sure.)

It's good to know that any object file stored in the cache IS
identified/named by the hash of its preprocessor output - the direct
mode is just a quick way to decide that the given set of source files
would get the same preprocessor output if cpp were actually run. (Am I
starting to get it now?)

[...] And it appears to me that in many cases, the resulting object file
occurs twice in the cache, once under each hash.

Well, the object file is only stored once for a given preprocessor hash.

And currently, those two occurrences are two separate files, which could be
combined into a single inode with two hard-linked directory entries.

If there are multiple object files in the cache with the same content, then
that's because different preprocessor outputs have resulted in identical object
files.

Hmmm. Shouldn't that be hard to do? Evidently it's not, given that 30%
of the files in my cache have twins (or triplets or whatever). Ok, so
it's not so hard as that - while unused macros and constants are dropped
during preprocessing, unused structure definitions and other language
constructs cannot be, so I guess it is not so hard after all to create
different preprocessed files which generate identical .o files.

And of course in that case, ccache itself has no way of knowing that the
resultant files are identical.

  I can imagine two ways of storing identical object files only once:

- Introduce an object file store indexed by the object file hash. Entries in
   the manifest files would then refer directly to those file names and
   the files would also be stored under their preprocessor hash name. However,
   on a cache miss, there will be an extra performance penalty since the hash
   of the object file needs to be calculated as well. That's probably
   measurably bad.
- Or: Create a compaction tool which can be run on the cache once in a while.
   I think a good search engine term for this would be "data deduplication".
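
The second idea, an occasional compaction pass, could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the function name `dedupe`, the `skip_suffixes` parameter and the choice of SHA-256 are hypothetical, and it skips ".manifest" files simply because those are not object files. It replaces byte-identical files with hard links to a single inode.

```python
import hashlib
import os
from collections import defaultdict


def dedupe(cache_dir, skip_suffixes=(".manifest",)):
    """Hard-link files with identical content; return how many were relinked."""
    by_content = defaultdict(list)
    for root, _dirs, names in os.walk(cache_dir):
        for name in names:
            if name.endswith(skip_suffixes):
                continue
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                by_content[hashlib.sha256(f.read()).hexdigest()].append(path)

    relinked = 0
    for paths in by_content.values():
        keeper = paths[0]
        for dup in paths[1:]:
            if os.path.samefile(keeper, dup):
                continue  # already share one inode
            tmp = dup + ".dedupe-tmp"
            os.link(keeper, tmp)  # create the new directory entry first
            os.replace(tmp, dup)  # then atomically swap it over the duplicate
            relinked += 1
    return relinked
```

A real tool would also have to consider concurrent access to the cache and how the cache's own size accounting and cleanup should treat files that share an inode; the sketch ignores both.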


Agreed.

Thanks!
Frank

-- Joel


_______________________________________________
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache
