dexonsmith added reviewers: jansvoboda11, Bigcheese. dexonsmith added a subscriber: Bigcheese. dexonsmith added a comment.
This looks much cleaner; thanks for the update! I'm not sure this has decided whether it's writing special case code for building a CompilerInvocation's module context hash, or creating (and using) a new generalized hash builder. If the latter, I think it'd be worth splitting the hash builder out and landing it separately (in LLVM), then adopting it here; I think that's what you were doing with `stable_hash_code` (although that had the downsides of reifying city hash and not reusing the Hasher tools we have). This could take a few rounds of review to get the design right / get consensus on... not sure what you're able to invest. A few design questions I think you'd want to think about: - What's the interaction between hash_value and (e.g.) updateHash? (It'd be nice if implementing `updateHash` gave you an implementation of `hash_value` for free.) - Should `updateHash` be templated on the HasherT, or use type erasure? (E.g., a stripped down `HashBuilder` type could have a single member that was `llvm::unique_function<void(ArrayRef<uint8_t>)>`, and all the `updateHash()` functions are just free functions) - Should the hash builder reference a HasherT, or own it? - What are the API affordances for hashing structure? (E.g., ensuring that calls to updateHash with "a" and "bc" give a different hash than "ab" and "c".) On the other hand, if you're not going to have time to generalize this, I'd suggest simplifying the code a bit, merging the "detail" subclass into the builder type. I think ModuleContextHashBuilder might be a better name than ModuleHash; I think in clangBasic, rather than clangSerialization, since it could be used for things other than serialization and doesn't really have any dependencies there. In either case, there's a decent chunk of new code, and I think you should add some unit test coverage (a new file in llvm/unittests/Support/, in a separate patch, and/or clang/unittests/ (depending on direction)). A have a bunch of comments inline, some which would only apply depending on which direction you're going with this. @Bigcheese and @jansvoboda11 , any thoughts? ================ Comment at: clang/include/clang/Serialization/ModuleHash.h:19-22 +/// Wrapper interface around a hash type. +/// The only methods expected from the hash type are: +/// * a default constructor +/// * void update(llvm::ArrayRef<uint8_t>) ---------------- I think this builder is an interesting idea, and could be generally useful. - It'd fit better in llvm/Support I think than hidden here. - Taking the hasher by reference might be more flexible. Clients can ask the builder to do some of the hashing, but do some on their own. ================ Comment at: clang/include/clang/Serialization/ModuleHash.h:36 +/// ``` +template <typename HashValueT> class Hash { +private: ---------------- I don't think the names `HashValueT` and `Hash` are quite right (I tend to think of "hash" as a value as well). - Maybe HasherT instead of HashValueT? - Maybe HashBuilder instead of Hash? Nit: I think (?) we more often use `class` for type template parameters than `typename`. ================ Comment at: clang/include/clang/Serialization/ModuleHash.h:43 +public: + HashValueT HashValue; + ---------------- For a builder interface, it might be more convenient to take the hasher by reference. ================ Comment at: clang/include/clang/Serialization/ModuleHash.h:47-51 + template <typename InputIteratorT> + void updateRange(InputIteratorT first, InputIteratorT last) { + for (auto It = first; It != last; ++It) + update(*It); + } ---------------- Because this doesn't hash the distance of the range, it won't see a difference between `{{0}, {1, 2}}` and `{{0, 1}, {2}}`. Sometimes that's fine, but sometimes it isn't. I worry about generalizing this... unless the onus is on the caller to hash the size when appropriate? ================ Comment at: clang/include/clang/Serialization/ModuleHash.h:54-57 + template <typename T> + struct is_hashable_data + : std::integral_constant<bool, llvm::is_integral_or_enum<T>::value || + std::is_pointer<T>::value> {}; ---------------- I'm not sure we want it to be easy to hash pointers by-value. That's appropriate for an in-memory hash table, but not much else. ================ Comment at: clang/include/clang/Serialization/ModuleHash.h:77-83 + struct update_impl<T, + typename std::enable_if<is_hashable_data<T>::value>::type> + : update_impl<llvm::ArrayRef<uint8_t>> { + update_impl(Hash &H, T value) + : update_impl<llvm::ArrayRef<uint8_t>>( + H, llvm::ArrayRef<uint8_t>(reinterpret_cast<uint8_t *>(&value), + sizeof(value))) {} ---------------- Is there a way of reducing the boilerplate for these `update_impl` functions, which are currently types? Could they maybe be free functions? If you hit trouble with using std::enable_if_t in a function template, I seem to remember needing to use something like this in the past: ``` lang=c++ std::enable_if_t<whatever<T>::value, bool> = false ``` ================ Comment at: clang/include/clang/Serialization/ModuleHash.h:96-104 + // Forward `llvm::StringRef` to `llvm::ArrayRef<uint8_t>`. + template <> + struct update_impl<llvm::StringRef> : update_impl<llvm::ArrayRef<uint8_t>> { + update_impl(Hash &H, llvm::StringRef Value) + : update_impl<llvm::ArrayRef<uint8_t>>( + H, llvm::ArrayRef<uint8_t>( + reinterpret_cast<const uint8_t *>(Value.data()), ---------------- I wonder if this should also hash the size of the StringRef. Maybe it should also be called `updateRange`. MD5/SHA1/etc. hashers have the perspective that they're seeing a flat stream of bytes, and I think this streams in the bytes from the string directly. I think the right way to model hashing the context hash is as a serialization, and serializing a string ref requires serializing the size before the content, so that the deserializer knows how big the string is. ================ Comment at: clang/lib/Frontend/CompilerInvocation.cpp:4483-4484 // Start the signature with the compiler version. - // FIXME: We'd rather use something more cryptographically sound than - // CityHash, but this will do for now. - hash_code code = hash_value(getClangFullRepositoryVersion()); ---------------- I'm not sure MD5 is cryptographically sound either, so it probably makes sense to leave this FIXME in / update it. ================ Comment at: clang/lib/Frontend/CompilerInvocation.cpp:4580 - return llvm::APInt(64, code).toString(36, /*Signed=*/false); + return llvm::APInt(64, Hash.getValue()).toString(36, /*Signed=*/false); } ---------------- Do we just want 64 bits, or do we want the full hash? ================ Comment at: clang/lib/Serialization/ModuleFileExtension.cpp:16-18 -llvm::hash_code ModuleFileExtension::hashExtension(llvm::hash_code Code) const { - return Code; -} ---------------- Seems like this could stay in the `.cpp`, not sure if moving it to the header is important (or could be done separately to make the functionality changes more obvious). ================ Comment at: llvm/include/llvm/Support/VersionTuple.h:164-167 +template <typename HashT> void updateHash(HashT &Hash, const VersionTuple &V) { + Hash.update({V.getMajor(), V.getMinor().getValueOr(0), + V.getSubminor().getValueOr(0), V.getBuild().getValueOr(0)}); +} ---------------- Could this be declared as a friend, similar to `hash_value`? Note that if the new builder code doesn't live in `llvm/`, probably this `updateHash` free function should be an implementation detail in the CompilerInvocation / ModuleHash code. Repository: rG LLVM Github Monorepo CHANGES SINCE LAST ACTION https://reviews.llvm.org/D102943/new/ https://reviews.llvm.org/D102943 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits