[clang] [WIP][C++20][Modules] Lazily, but fully load 'HeaderFileInfo' table into memory. (PR #140867)

2025-05-21 Thread Michael Park via cfe-commits

mpark wrote:

> > It turns out to be kind of a pain to use with HeaderFileInfoTrait. More 
> > importantly, we can't really afford to key by the internal key type since 
> > that only hashes on the size of the file and the hash collision gets pretty 
> > bad. Moreover, the merging and condensing strategy seems rather simple.
> 
> Maybe we can use hash value of the input as the key?

By input, do you mean the file path? If so, as far as I understand we don't 
hash based that because different file paths will hash differently but they may 
actually be the same entity after all via symlinks.

https://github.com/llvm/llvm-project/pull/140867
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [WIP][C++20][Modules] Lazily, but fully load 'HeaderFileInfo' table into memory. (PR #140867)

2025-05-21 Thread Chuanqi Xu via cfe-commits


@@ -654,6 +654,10 @@ class ASTReader
   /// Map from the TU to its lexical contents from each module file.
   std::vector> TULexicalDecls;
 
+  unsigned HeaderFileInfoIdx = 0;

ChuanqiXu9 wrote:

And also, this seems unsafe since ModuleManager can remove modules technically. 
Maybe it is better to insert a bool in ModuleFile to mark if its header info is 
loaded. 

https://github.com/llvm/llvm-project/pull/140867
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [WIP][C++20][Modules] Lazily, but fully load 'HeaderFileInfo' table into memory. (PR #140867)

2025-05-21 Thread Chuanqi Xu via cfe-commits


@@ -6831,43 +6832,60 @@ std::optional 
ASTReader::isPreprocessedEntityInFileID(unsigned Index,
 return false;
 }
 
-namespace {
-
-  /// Visitor used to search for information about a header file.
-  class HeaderFileInfoVisitor {
-  FileEntryRef FE;
-std::optional HFI;
-
-  public:
-explicit HeaderFileInfoVisitor(FileEntryRef FE) : FE(FE) {}
-
-bool operator()(ModuleFile &M) {
-  HeaderFileInfoLookupTable *Table
-= static_cast(M.HeaderFileInfoTable);
-  if (!Table)
-return false;
+static void mergeHeaderFileInfoModuleBits(HeaderFileInfo &HFI,
+  bool isModuleHeader,
+  bool isTextualModuleHeader) {
+  HFI.isModuleHeader |= isModuleHeader;
+  if (HFI.isModuleHeader)
+HFI.isTextualModuleHeader = false;
+  else
+HFI.isTextualModuleHeader |= isTextualModuleHeader;
+}
 
-  // Look in the on-disk hash table for an entry for this file name.
-  HeaderFileInfoLookupTable::iterator Pos = Table->find(FE);
-  if (Pos == Table->end())
-return false;
+/// Merge the header file info provided by \p OtherHFI into the current
+/// header file info (\p HFI)
+static void mergeHeaderFileInfo(HeaderFileInfo &HFI,
+const HeaderFileInfo &OtherHFI) {
+  assert(OtherHFI.External && "expected to merge external HFI");
 
-  HFI = *Pos;
-  return true;
-}
+  HFI.isImport |= OtherHFI.isImport;
+  HFI.isPragmaOnce |= OtherHFI.isPragmaOnce;
+  mergeHeaderFileInfoModuleBits(HFI, OtherHFI.isModuleHeader,
+OtherHFI.isTextualModuleHeader);
 
-std::optional getHeaderFileInfo() const { return HFI; }
-  };
+  if (!HFI.LazyControllingMacro.isValid())
+HFI.LazyControllingMacro = OtherHFI.LazyControllingMacro;
 
-} // namespace
+  HFI.DirInfo = OtherHFI.DirInfo;
+  HFI.External = (!HFI.IsValid || HFI.External);
+  HFI.IsValid = true;
+}
 
 HeaderFileInfo ASTReader::GetHeaderFileInfo(FileEntryRef FE) {
-  HeaderFileInfoVisitor Visitor(FE);
-  ModuleMgr.visit(Visitor);
-  if (std::optional HFI = Visitor.getHeaderFileInfo())
-  return *HFI;
-
-  return HeaderFileInfo();
+  for (auto Iter = ModuleMgr.begin() + HeaderFileInfoIdx, End = 
ModuleMgr.end();
+   Iter != End; ++Iter) {
+if (auto *Table = static_cast(
+Iter->HeaderFileInfoTable)) {
+  auto &Info = Table->getInfoObj();
+  for (auto Iter = Table->data_begin(), End = Table->data_end();
+   Iter != End; ++Iter) {
+const auto *Item = Iter.getItem();
+// Determine the length of the key and the data.
+const auto &[KeyLen, DataLen] =
+HeaderFileInfoTrait::ReadKeyDataLength(Item);

ChuanqiXu9 wrote:

nit: maybe it is better to wrap such logics in ASTReaderInternals.h

https://github.com/llvm/llvm-project/pull/140867
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [WIP][C++20][Modules] Lazily, but fully load 'HeaderFileInfo' table into memory. (PR #140867)

2025-05-21 Thread Chuanqi Xu via cfe-commits


@@ -654,6 +654,10 @@ class ASTReader
   /// Map from the TU to its lexical contents from each module file.
   std::vector> TULexicalDecls;
 
+  unsigned HeaderFileInfoIdx = 0;

ChuanqiXu9 wrote:

nit: Unloaded header file info idx.

And also I feel the name is odd. It is index for module files, not header file 
infos.

https://github.com/llvm/llvm-project/pull/140867
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [WIP][C++20][Modules] Lazily, but fully load 'HeaderFileInfo' table into memory. (PR #140867)

2025-05-21 Thread Chuanqi Xu via cfe-commits

https://github.com/ChuanqiXu9 commented:

>  It turns out to be kind of a pain to use with HeaderFileInfoTrait. More 
> importantly, we can't really afford to key by the internal key type since 
> that only hashes on the size of the file and the hash collision gets pretty 
> bad. Moreover, the merging and condensing strategy seems rather simple.

Maybe we can use hash value of the input as the key?

https://github.com/llvm/llvm-project/pull/140867
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [WIP][C++20][Modules] Lazily, but fully load 'HeaderFileInfo' table into memory. (PR #140867)

2025-05-21 Thread Chuanqi Xu via cfe-commits

https://github.com/ChuanqiXu9 edited 
https://github.com/llvm/llvm-project/pull/140867
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [WIP][C++20][Modules] Lazily, but fully load 'HeaderFileInfo' table into memory. (PR #140867)

2025-05-21 Thread Jan Svoboda via cfe-commits

jansvoboda11 wrote:

FWIW I tried running this patch through my `clang-scan-deps` benchmark on an 
internal Apple project and saw 16% increase in instruction count.

https://github.com/llvm/llvm-project/pull/140867
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [WIP][C++20][Modules] Lazily, but fully load 'HeaderFileInfo' table into memory. (PR #140867)

2025-05-21 Thread Michael Park via cfe-commits

https://github.com/mpark edited https://github.com/llvm/llvm-project/pull/140867
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [WIP][C++20][Modules] Lazily, but fully load 'HeaderFileInfo' table into memory. (PR #140867)

2025-05-21 Thread Michael Park via cfe-commits

https://github.com/mpark updated 
https://github.com/llvm/llvm-project/pull/140867

>From a82059f4e0953e327353f383c9728ef4a7a6eaac Mon Sep 17 00:00:00 2001
From: Michael Park 
Date: Tue, 20 May 2025 17:18:36 -0700
Subject: [PATCH] Lazily, but fully load 'HeaderFileInfo' table into memory.

---
 clang/include/clang/Serialization/ASTReader.h |  4 +
 clang/lib/Serialization/ASTReader.cpp | 80 ---
 clang/lib/Serialization/ASTReaderInternals.h  |  3 +-
 3 files changed, 54 insertions(+), 33 deletions(-)

diff --git a/clang/include/clang/Serialization/ASTReader.h 
b/clang/include/clang/Serialization/ASTReader.h
index 57b0266af26bb..9b63ad1a7cde7 100644
--- a/clang/include/clang/Serialization/ASTReader.h
+++ b/clang/include/clang/Serialization/ASTReader.h
@@ -654,6 +654,10 @@ class ASTReader
   /// Map from the TU to its lexical contents from each module file.
   std::vector> TULexicalDecls;
 
+  unsigned HeaderFileInfoIdx = 0;
+
+  llvm::DenseMap HeaderFileInfoLookup;
+
   /// Map from a DeclContext to its lookup tables.
   llvm::DenseMap Lookups;
diff --git a/clang/lib/Serialization/ASTReader.cpp 
b/clang/lib/Serialization/ASTReader.cpp
index d068f5e163176..f90f4b9798e35 100644
--- a/clang/lib/Serialization/ASTReader.cpp
+++ b/clang/lib/Serialization/ASTReader.cpp
@@ -4155,6 +4155,7 @@ llvm::Error ASTReader::ReadASTBlock(ModuleFile &F,
   if (Record[0]) {
 F.HeaderFileInfoTable = HeaderFileInfoLookupTable::Create(
 (const unsigned char *)F.HeaderFileInfoTableData + Record[0],
+(const unsigned char *)F.HeaderFileInfoTableData + 
sizeof(uint32_t),
 (const unsigned char *)F.HeaderFileInfoTableData,
 HeaderFileInfoTrait(*this, F));
 
@@ -6831,43 +6832,60 @@ std::optional 
ASTReader::isPreprocessedEntityInFileID(unsigned Index,
 return false;
 }
 
-namespace {
-
-  /// Visitor used to search for information about a header file.
-  class HeaderFileInfoVisitor {
-  FileEntryRef FE;
-std::optional HFI;
-
-  public:
-explicit HeaderFileInfoVisitor(FileEntryRef FE) : FE(FE) {}
-
-bool operator()(ModuleFile &M) {
-  HeaderFileInfoLookupTable *Table
-= static_cast(M.HeaderFileInfoTable);
-  if (!Table)
-return false;
+static void mergeHeaderFileInfoModuleBits(HeaderFileInfo &HFI,
+  bool isModuleHeader,
+  bool isTextualModuleHeader) {
+  HFI.isModuleHeader |= isModuleHeader;
+  if (HFI.isModuleHeader)
+HFI.isTextualModuleHeader = false;
+  else
+HFI.isTextualModuleHeader |= isTextualModuleHeader;
+}
 
-  // Look in the on-disk hash table for an entry for this file name.
-  HeaderFileInfoLookupTable::iterator Pos = Table->find(FE);
-  if (Pos == Table->end())
-return false;
+/// Merge the header file info provided by \p OtherHFI into the current
+/// header file info (\p HFI)
+static void mergeHeaderFileInfo(HeaderFileInfo &HFI,
+const HeaderFileInfo &OtherHFI) {
+  assert(OtherHFI.External && "expected to merge external HFI");
 
-  HFI = *Pos;
-  return true;
-}
+  HFI.isImport |= OtherHFI.isImport;
+  HFI.isPragmaOnce |= OtherHFI.isPragmaOnce;
+  mergeHeaderFileInfoModuleBits(HFI, OtherHFI.isModuleHeader,
+OtherHFI.isTextualModuleHeader);
 
-std::optional getHeaderFileInfo() const { return HFI; }
-  };
+  if (!HFI.LazyControllingMacro.isValid())
+HFI.LazyControllingMacro = OtherHFI.LazyControllingMacro;
 
-} // namespace
+  HFI.DirInfo = OtherHFI.DirInfo;
+  HFI.External = (!HFI.IsValid || HFI.External);
+  HFI.IsValid = true;
+}
 
 HeaderFileInfo ASTReader::GetHeaderFileInfo(FileEntryRef FE) {
-  HeaderFileInfoVisitor Visitor(FE);
-  ModuleMgr.visit(Visitor);
-  if (std::optional HFI = Visitor.getHeaderFileInfo())
-  return *HFI;
-
-  return HeaderFileInfo();
+  for (auto Iter = ModuleMgr.begin() + HeaderFileInfoIdx, End = 
ModuleMgr.end();
+   Iter != End; ++Iter) {
+if (auto *Table = static_cast(
+Iter->HeaderFileInfoTable)) {
+  auto &Info = Table->getInfoObj();
+  for (auto Iter = Table->data_begin(), End = Table->data_end();
+   Iter != End; ++Iter) {
+const auto *Item = Iter.getItem();
+// Determine the length of the key and the data.
+const auto &[KeyLen, DataLen] =
+HeaderFileInfoTrait::ReadKeyDataLength(Item);
+// Read the key.
+const auto &Key = Info.ReadKey(Item, KeyLen);
+if (auto EKey = Info.getFile(Key)) {
+  auto Data = Info.ReadData(Key, Item + KeyLen, DataLen);
+  auto [Iter, Inserted] = HeaderFileInfoLookup.try_emplace(*EKey, 
Data);
+  if (!Inserted)
+mergeHeaderFileInfo(Iter->second, Data);
+}
+  }
+}
+  }
+  HeaderFileInfoIdx = ModuleMgr.size();
+  return HeaderFileInfoLookup.lookup(FE);
 }
 
 void ASTReader::ReadPragmaDiagnosticMappings(D

[clang] [WIP][C++20][Modules] Lazily, but fully load 'HeaderFileInfo' table into memory. (PR #140867)

2025-05-21 Thread Michael Park via cfe-commits

https://github.com/mpark edited https://github.com/llvm/llvm-project/pull/140867
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [WIP][C++20][Modules] Lazily, but fully load 'HeaderFileInfo' table into memory. (PR #140867)

2025-05-21 Thread Michael Park via cfe-commits

https://github.com/mpark updated 
https://github.com/llvm/llvm-project/pull/140867

>From 1013ab2297935cc3a7eccf8a812984d9b454a27f Mon Sep 17 00:00:00 2001
From: Michael Park 
Date: Tue, 20 May 2025 17:18:36 -0700
Subject: [PATCH] Lazily, but fully load 'HeaderFileInfo' table into memory.

---
 clang/include/clang/Serialization/ASTReader.h |  4 +
 clang/lib/Serialization/ASTReader.cpp | 80 ---
 clang/lib/Serialization/ASTReaderInternals.h  |  3 +-
 3 files changed, 54 insertions(+), 33 deletions(-)

diff --git a/clang/include/clang/Serialization/ASTReader.h 
b/clang/include/clang/Serialization/ASTReader.h
index 57b0266af26bb..9b63ad1a7cde7 100644
--- a/clang/include/clang/Serialization/ASTReader.h
+++ b/clang/include/clang/Serialization/ASTReader.h
@@ -654,6 +654,10 @@ class ASTReader
   /// Map from the TU to its lexical contents from each module file.
   std::vector> TULexicalDecls;
 
+  unsigned HeaderFileInfoIdx = 0;
+
+  llvm::DenseMap HeaderFileInfoLookup;
+
   /// Map from a DeclContext to its lookup tables.
   llvm::DenseMap Lookups;
diff --git a/clang/lib/Serialization/ASTReader.cpp 
b/clang/lib/Serialization/ASTReader.cpp
index d068f5e163176..93766be18f775 100644
--- a/clang/lib/Serialization/ASTReader.cpp
+++ b/clang/lib/Serialization/ASTReader.cpp
@@ -4155,6 +4155,7 @@ llvm::Error ASTReader::ReadASTBlock(ModuleFile &F,
   if (Record[0]) {
 F.HeaderFileInfoTable = HeaderFileInfoLookupTable::Create(
 (const unsigned char *)F.HeaderFileInfoTableData + Record[0],
+(const unsigned char *)F.HeaderFileInfoTableData + 
sizeof(uint32_t),
 (const unsigned char *)F.HeaderFileInfoTableData,
 HeaderFileInfoTrait(*this, F));
 
@@ -6831,43 +6832,60 @@ std::optional 
ASTReader::isPreprocessedEntityInFileID(unsigned Index,
 return false;
 }
 
-namespace {
-
-  /// Visitor used to search for information about a header file.
-  class HeaderFileInfoVisitor {
-  FileEntryRef FE;
-std::optional HFI;
-
-  public:
-explicit HeaderFileInfoVisitor(FileEntryRef FE) : FE(FE) {}
-
-bool operator()(ModuleFile &M) {
-  HeaderFileInfoLookupTable *Table
-= static_cast(M.HeaderFileInfoTable);
-  if (!Table)
-return false;
+static void mergeHeaderFileInfoModuleBits(HeaderFileInfo &HFI,
+  bool isModuleHeader,
+  bool isTextualModuleHeader) {
+  HFI.isModuleHeader |= isModuleHeader;
+  if (HFI.isModuleHeader)
+HFI.isTextualModuleHeader = false;
+  else
+HFI.isTextualModuleHeader |= isTextualModuleHeader;
+}
 
-  // Look in the on-disk hash table for an entry for this file name.
-  HeaderFileInfoLookupTable::iterator Pos = Table->find(FE);
-  if (Pos == Table->end())
-return false;
+/// Merge the header file info provided by \p OtherHFI into the current
+/// header file info (\p HFI)
+static void mergeHeaderFileInfo(HeaderFileInfo &HFI,
+const HeaderFileInfo &OtherHFI) {
+  assert(OtherHFI.External && "expected to merge external HFI");
 
-  HFI = *Pos;
-  return true;
-}
+  HFI.isImport |= OtherHFI.isImport;
+  HFI.isPragmaOnce |= OtherHFI.isPragmaOnce;
+  mergeHeaderFileInfoModuleBits(HFI, OtherHFI.isModuleHeader,
+OtherHFI.isTextualModuleHeader);
 
-std::optional getHeaderFileInfo() const { return HFI; }
-  };
+  if (!HFI.LazyControllingMacro.isValid())
+HFI.LazyControllingMacro = OtherHFI.LazyControllingMacro;
 
-} // namespace
+  HFI.DirInfo = OtherHFI.DirInfo;
+  HFI.External = (!HFI.IsValid || HFI.External);
+  HFI.IsValid = true;
+}
 
 HeaderFileInfo ASTReader::GetHeaderFileInfo(FileEntryRef FE) {
-  HeaderFileInfoVisitor Visitor(FE);
-  ModuleMgr.visit(Visitor);
-  if (std::optional HFI = Visitor.getHeaderFileInfo())
-  return *HFI;
-
-  return HeaderFileInfo();
+  for (auto Iter = ModuleMgr.begin() + HeaderFileInfoIdx, End = 
ModuleMgr.end();
+   Iter != End; ++Iter) {
+if (auto *Table = static_cast(
+Iter->HeaderFileInfoTable)) {
+  auto &Info = Table->getInfoObj();
+  for (auto Iter = Table->data_begin(), End = Table->data_end();
+   Iter != End; ++Iter) {
+const auto *Item = Iter.getItem();
+// Determine the length of the key and the data.
+const auto& [KeyLen, DataLen] =
+HeaderFileInfoTrait::ReadKeyDataLength(Item);
+// Read the key.
+const auto &Key = Info.ReadKey(Item, KeyLen);
+if (auto EKey = Info.getFile(Key)) {
+  auto Data = Info.ReadData(Key, Item + KeyLen, DataLen);
+  auto [Iter, Inserted] = HeaderFileInfoLookup.try_emplace(*EKey, 
Data);
+  if (!Inserted)
+mergeHeaderFileInfo(Iter->second, Data);
+}
+  }
+}
+  }
+  HeaderFileInfoIdx = ModuleMgr.size();
+  return HeaderFileInfoLookup.lookup(FE);
 }
 
 void ASTReader::ReadPragmaDiagnosticMappings(D

[clang] [WIP][C++20][Modules] Lazily, but fully load 'HeaderFileInfo' table into memory. (PR #140867)

2025-05-21 Thread via cfe-commits

github-actions[bot] wrote:




:warning: C/C++ code formatter, clang-format found issues in your code. 
:warning:



You can test this locally with the following command:


``bash
git-clang-format --diff HEAD~1 HEAD --extensions h,cpp -- 
clang/include/clang/Serialization/ASTReader.h 
clang/lib/Serialization/ASTReader.cpp 
clang/lib/Serialization/ASTReaderInternals.h
``





View the diff from clang-format here.


``diff
diff --git a/clang/lib/Serialization/ASTReader.cpp 
b/clang/lib/Serialization/ASTReader.cpp
index 15ea6b25d..f90f4b979 100644
--- a/clang/lib/Serialization/ASTReader.cpp
+++ b/clang/lib/Serialization/ASTReader.cpp
@@ -6864,14 +6864,15 @@ static void mergeHeaderFileInfo(HeaderFileInfo &HFI,
 HeaderFileInfo ASTReader::GetHeaderFileInfo(FileEntryRef FE) {
   for (auto Iter = ModuleMgr.begin() + HeaderFileInfoIdx, End = 
ModuleMgr.end();
Iter != End; ++Iter) {
-if (auto *Table =
-static_cast(Iter->HeaderFileInfoTable)) {
-  auto& Info = Table->getInfoObj();
+if (auto *Table = static_cast(
+Iter->HeaderFileInfoTable)) {
+  auto &Info = Table->getInfoObj();
   for (auto Iter = Table->data_begin(), End = Table->data_end();
Iter != End; ++Iter) {
 const auto *Item = Iter.getItem();
 // Determine the length of the key and the data.
-const auto& [KeyLen, DataLen] = 
HeaderFileInfoTrait::ReadKeyDataLength(Item);
+const auto &[KeyLen, DataLen] =
+HeaderFileInfoTrait::ReadKeyDataLength(Item);
 // Read the key.
 const auto &Key = Info.ReadKey(Item, KeyLen);
 if (auto EKey = Info.getFile(Key)) {

``




https://github.com/llvm/llvm-project/pull/140867
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [WIP][C++20][Modules] Lazily, but fully load 'HeaderFileInfo' table into memory. (PR #140867)

2025-05-21 Thread Michael Park via cfe-commits

https://github.com/mpark converted_to_draft 
https://github.com/llvm/llvm-project/pull/140867
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [WIP][C++20][Modules] Lazily, but fully load 'HeaderFileInfo' table into memory. (PR #140867)

2025-05-21 Thread via cfe-commits

llvmbot wrote:




@llvm/pr-subscribers-clang

Author: Michael Park (mpark)


Changes

This is a WIP prototype in trying to solve #140860.

The approach explored here is to fully load the on-disk hash tables in each of 
the PCMs into memory, so that we perform `N` (# of includes) hash table 
look-ups, rather than `N` (# of includes) * `M` (# of loaded modules) look-ups.

The behavior is still somewhat lazy, where the loading doesn't happen until the 
first `GetHeaderFileInfo` request. We keep track of the modules that have 
already been processed, so that only newly added modules get added to the 
in-memory hash table.

The following test case takes **14s** before this change, and **0.18s** with 
the change.

```bash
#/usr/bin/env bash

CLANG=${CLANG:-clang++}
NUM_MODULES=${NUM_MODULES:-1000}
NUM_HEADERS=${NUM_HEADERS:-1}

mkdir build && cd build

for i in $(seq 1 $NUM_MODULES); do > m${i}.h; done
seq 1 $NUM_MODULES | xargs -I {} -P $(nproc) bash -c "$CLANG -std=c++20 
-fmodule-header m{}.h"

for i in $(seq 1 $NUM_HEADERS); do > h${i}.h; done

> a.cpp
for i in $(seq 1 $NUM_MODULES); do echo "import \"m${i}.h\";" >> a.cpp; 
done
for i in $(seq 1 $NUM_HEADERS); do echo "#include \"h${i}.h\"" >> a.cpp; 
done
echo "int main() {}" >> a.cpp

time $CLANG -std=c++20 -Wno-experimental-header-units -E a.cpp -o /dev/null \
$(for i in $(seq 1 $NUM_MODULES); do echo -n "-fmodule-file=m${i}.pcm "; 
done)
```

Theoretically, this would do more work in scenarios where there are only a few 
modules, each with large header search tables. I tested this out with another 
script. This time, it produces 10 header units, each with 10,000 header 
includes in each of them. The main file `import`s the 10 header units. This 
loads 100,010 entries into the in-memory hash table, but still only takes 
**0.15s**. However, I'm not sure whether this would be acceptable, and if not, 
what might be a better strategy here.

```
#/usr/bin/env bash

CLANG=${CLANG:-clang++}
NUM_MODULES=${NUM_MODULES:-10}
NUM_HEADERS=${NUM_HEADERS:-1}

mkdir build && cd build

for i in $(seq 1 $NUM_MODULES)
do
> m${i}.h
for j in $(seq 1 $NUM_HEADERS)
do
> h${i}-${j}.h
echo "#include \"h${i}-${j}.h\"" >> m${i}.h
done
done
seq 1 $NUM_MODULES | xargs -I {} -P $(nproc) bash -c "$CLANG -std=c++20 
-fmodule-header m{}.h"

> a.cpp
for i in $(seq 1 $NUM_MODULES); do echo "import \"m${i}.h\";" >> a.cpp; 
done

time $CLANG -std=c++20 -Wno-experimental-header-units -E a.cpp -o /dev/null \
$(for i in $(seq 1 $NUM_MODULES); do echo -n "-fmodule-file=m${i}.pcm "; 
done)
```

## Other Considerations
- I've also tried doing the preloading at load time, the way we do for source 
locations and interesting identifiers in 
[ASTReader::ReadAST](https://github.com/llvm/llvm-project/blob/dec214d5c638302b92fa1cfe0582531cd58a7f6d/clang/lib/Serialization/ASTReader.cpp#L4800).
 This works just as well, it's slightly more eager than the approach taken 
here, but the effect is similar.
- I tried using 
[`MultiOnDiskHashTable`](https://github.com/llvm/llvm-project/blob/dec214d5c638302b92fa1cfe0582531cd58a7f6d/clang/lib/Serialization/MultiOnDiskHashTable.h#L40)
 to pull together all of the hash tables, since it would handle merging and 
loading into memory for performance reasons. It turns out to be kind of a pain 
to use with `HeaderFileInfoTrait`. More importantly, we can't really afford to 
key by the internal key type since that only hashes on the size of the file and 
the hash collision gets pretty bad. Moreover, the merging and condensing 
strategy seems rather simple.

---
Full diff: https://github.com/llvm/llvm-project/pull/140867.diff


3 Files Affected:

- (modified) clang/include/clang/Serialization/ASTReader.h (+4) 
- (modified) clang/lib/Serialization/ASTReader.cpp (+48-31) 
- (modified) clang/lib/Serialization/ASTReaderInternals.h (+1-2) 


``diff
diff --git a/clang/include/clang/Serialization/ASTReader.h 
b/clang/include/clang/Serialization/ASTReader.h
index 57b0266af26bb..9b63ad1a7cde7 100644
--- a/clang/include/clang/Serialization/ASTReader.h
+++ b/clang/include/clang/Serialization/ASTReader.h
@@ -654,6 +654,10 @@ class ASTReader
   /// Map from the TU to its lexical contents from each module file.
   std::vector> TULexicalDecls;
 
+  unsigned HeaderFileInfoIdx = 0;
+
+  llvm::DenseMap HeaderFileInfoLookup;
+
   /// Map from a DeclContext to its lookup tables.
   llvm::DenseMap Lookups;
diff --git a/clang/lib/Serialization/ASTReader.cpp 
b/clang/lib/Serialization/ASTReader.cpp
index d068f5e163176..15ea6b25d1d92 100644
--- a/clang/lib/Serialization/ASTReader.cpp
+++ b/clang/lib/Serialization/ASTReader.cpp
@@ -4155,6 +4155,7 @@ llvm::Error ASTReader::ReadASTBlock(ModuleFile &F,
   if (Record[0]) {
 F.HeaderFileInfoTable = HeaderFileInfoLookupTable::Create(
 (const unsigned char *)F.HeaderFileInfoTableData + Record[0],
+(const unsigned char *)F.HeaderFileInfoTableData + 
sizeof(uint32_t)

[clang] [WIP][C++20][Modules] Lazily, but fully load 'HeaderFileInfo' table into memory. (PR #140867)

2025-05-21 Thread Michael Park via cfe-commits

https://github.com/mpark created 
https://github.com/llvm/llvm-project/pull/140867

This is a WIP prototype in trying to solve #140860.

The approach explored here is to fully load the on-disk hash tables in each of 
the PCMs into memory, so that we perform `N` (# of includes) hash table 
look-ups, rather than `N` (# of includes) * `M` (# of loaded modules) look-ups.

The behavior is still somewhat lazy, where the loading doesn't happen until the 
first `GetHeaderFileInfo` request. We keep track of the modules that have 
already been processed, so that only newly added modules get added to the 
in-memory hash table.

The following test case takes **14s** before this change, and **0.18s** with 
the change.

```bash
#/usr/bin/env bash

CLANG=${CLANG:-clang++}
NUM_MODULES=${NUM_MODULES:-1000}
NUM_HEADERS=${NUM_HEADERS:-1}

mkdir build && cd build

for i in $(seq 1 $NUM_MODULES); do > m${i}.h; done
seq 1 $NUM_MODULES | xargs -I {} -P $(nproc) bash -c "$CLANG -std=c++20 
-fmodule-header m{}.h"

for i in $(seq 1 $NUM_HEADERS); do > h${i}.h; done

> a.cpp
for i in $(seq 1 $NUM_MODULES); do echo "import \"m${i}.h\";" >> a.cpp; done
for i in $(seq 1 $NUM_HEADERS); do echo "#include \"h${i}.h\"" >> a.cpp; done
echo "int main() {}" >> a.cpp

time $CLANG -std=c++20 -Wno-experimental-header-units -E a.cpp -o /dev/null \
$(for i in $(seq 1 $NUM_MODULES); do echo -n "-fmodule-file=m${i}.pcm "; 
done)
```

Theoretically, this would do more work in scenarios where there are only a few 
modules, each with large header search tables. I tested this out with another 
script. This time, it produces 10 header units, each with 10,000 header 
includes in each of them. The main file `import`s the 10 header units. This 
loads 100,010 entries into the in-memory hash table, but still only takes 
**0.15s**. However, I'm not sure whether this would be acceptable, and if not, 
what might be a better strategy here.

```
#/usr/bin/env bash

CLANG=${CLANG:-clang++}
NUM_MODULES=${NUM_MODULES:-10}
NUM_HEADERS=${NUM_HEADERS:-1}

mkdir build && cd build

for i in $(seq 1 $NUM_MODULES)
do
> m${i}.h
for j in $(seq 1 $NUM_HEADERS)
do
> h${i}-${j}.h
echo "#include \"h${i}-${j}.h\"" >> m${i}.h
done
done
seq 1 $NUM_MODULES | xargs -I {} -P $(nproc) bash -c "$CLANG -std=c++20 
-fmodule-header m{}.h"

> a.cpp
for i in $(seq 1 $NUM_MODULES); do echo "import \"m${i}.h\";" >> a.cpp; done

time $CLANG -std=c++20 -Wno-experimental-header-units -E a.cpp -o /dev/null \
$(for i in $(seq 1 $NUM_MODULES); do echo -n "-fmodule-file=m${i}.pcm "; 
done)
```

## Other Considerations
- I've also tried doing the preloading at load time, the way we do for source 
locations and interesting identifiers in 
[ASTReader::ReadAST](https://github.com/llvm/llvm-project/blob/dec214d5c638302b92fa1cfe0582531cd58a7f6d/clang/lib/Serialization/ASTReader.cpp#L4800).
 This works just as well, it's slightly more eager than the approach taken 
here, but the effect is similar.
- I tried using 
[`MultiOnDiskHashTable`](https://github.com/llvm/llvm-project/blob/dec214d5c638302b92fa1cfe0582531cd58a7f6d/clang/lib/Serialization/MultiOnDiskHashTable.h#L40)
 to pull together all of the hash tables, since it would handle merging and 
loading into memory for performance reasons. It turns out to be kind of a pain 
to use with `HeaderFileInfoTrait`. More importantly, we can't really afford to 
key by the internal key type since that only hashes on the size of the file and 
the hash collision gets pretty bad. Moreover, the merging and condensing 
strategy seems rather simple.

>From eed4b121bfcf9fee63538710b1958a3044358c63 Mon Sep 17 00:00:00 2001
From: Michael Park 
Date: Tue, 20 May 2025 17:18:36 -0700
Subject: [PATCH] Lazily, but fully load 'HeaderFileInfo' table into memory.

---
 clang/include/clang/Serialization/ASTReader.h |  4 +
 clang/lib/Serialization/ASTReader.cpp | 79 +++
 clang/lib/Serialization/ASTReaderInternals.h  |  3 +-
 3 files changed, 53 insertions(+), 33 deletions(-)

diff --git a/clang/include/clang/Serialization/ASTReader.h 
b/clang/include/clang/Serialization/ASTReader.h
index 57b0266af26bb..9b63ad1a7cde7 100644
--- a/clang/include/clang/Serialization/ASTReader.h
+++ b/clang/include/clang/Serialization/ASTReader.h
@@ -654,6 +654,10 @@ class ASTReader
   /// Map from the TU to its lexical contents from each module file.
   std::vector> TULexicalDecls;
 
+  unsigned HeaderFileInfoIdx = 0;
+
+  llvm::DenseMap HeaderFileInfoLookup;
+
   /// Map from a DeclContext to its lookup tables.
   llvm::DenseMap Lookups;
diff --git a/clang/lib/Serialization/ASTReader.cpp 
b/clang/lib/Serialization/ASTReader.cpp
index d068f5e163176..15ea6b25d1d92 100644
--- a/clang/lib/Serialization/ASTReader.cpp
+++ b/clang/lib/Serialization/ASTReader.cpp
@@ -4155,6 +4155,7 @@ llvm::Error ASTReader::ReadASTBlock(ModuleFile &F,
   if (Record[0]) {
 F.HeaderFileInfoTable = HeaderFileInfoLookupTable::Create(
 (co