[llvm-branch-commits] [llvm] [IR2Vec] Scale embeddings once in vocab analysis instead of repetitive scaling (PR #143986)

2025-06-23 Thread Mircea Trofin via llvm-branch-commits

https://github.com/mtrofin approved this pull request.

doc nit, otherwise lgtm

https://github.com/llvm/llvm-project/pull/143986
___
llvm-branch-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] [IR2Vec] Scale embeddings once in vocab analysis instead of repetitive scaling (PR #143986)

2025-06-23 Thread Mircea Trofin via llvm-branch-commits

https://github.com/mtrofin edited 
https://github.com/llvm/llvm-project/pull/143986


[llvm-branch-commits] [llvm] [IR2Vec] Scale embeddings once in vocab analysis instead of repetitive scaling (PR #143986)

2025-06-23 Thread Mircea Trofin via llvm-branch-commits


@@ -448,7 +448,10 @@ downstream tasks, including ML-guided compiler optimizations.
 
 The core components are:
   - **Vocabulary**: A mapping from IR entities (opcodes, types, etc.) to their
-vector representations. This is managed by ``IR2VecVocabAnalysis``.
+vector representations. This is managed by ``IR2VecVocabAnalysis``. The 
+vocabulary (.json file) contains three sections -- Opcodes, Types, and 
+Arguments, each containing the representations of the corresponding 
+entities.

mtrofin wrote:

document that the sections are mandatory, but the order in which they appear 
isn't

https://github.com/llvm/llvm-project/pull/143986


[llvm-branch-commits] [llvm] [IR2Vec] Scale embeddings once in vocab analysis instead of repetitive scaling (PR #143986)

2025-06-13 Thread S. VenkataKeerthy via llvm-branch-commits


@@ -259,32 +306,40 @@ Error IR2VecVocabAnalysis::readVocabulary() {
     return createFileError(VocabFile, BufOrError.getError());
 
   auto Content = BufOrError.get()->getBuffer();
-  json::Path::Root Path("");
+
   Expected<json::Value> ParsedVocabValue = json::parse(Content);
   if (!ParsedVocabValue)
     return ParsedVocabValue.takeError();
 
-  bool Res = json::fromJSON(*ParsedVocabValue, Vocabulary, Path);
-  if (!Res)
-    return createStringError(errc::illegal_byte_sequence,
-                             "Unable to parse the vocabulary");
+  ir2vec::Vocab OpcodeVocab, TypeVocab, ArgVocab;
+  unsigned OpcodeDim, TypeDim, ArgDim;
+  if (auto Err = parseVocabSection("Opcodes", *ParsedVocabValue, OpcodeVocab,

svkeerthy wrote:

Correct. Will put it in the doc.

https://github.com/llvm/llvm-project/pull/143986


[llvm-branch-commits] [llvm] [IR2Vec] Scale embeddings once in vocab analysis instead of repetitive scaling (PR #143986)

2025-06-13 Thread S. VenkataKeerthy via llvm-branch-commits


@@ -104,7 +106,10 @@ MODULE_PASS("lower-ifunc", LowerIFuncPass())
 MODULE_PASS("simplify-type-tests", SimplifyTypeTestsPass())
 MODULE_PASS("lowertypetests", LowerTypeTestsPass())
 MODULE_PASS("fatlto-cleanup", FatLtoCleanup())
-MODULE_PASS("pgo-force-function-attrs", PGOForceFunctionAttrsPass(PGOOpt ? PGOOpt->ColdOptType : PGOOptions::ColdFuncOpt::Default))
+MODULE_PASS("pgo-force-function-attrs",
+PGOForceFunctionAttrsPass(PGOOpt

svkeerthy wrote:

Yeah, will do. Missed the unrelated formatting changes.

https://github.com/llvm/llvm-project/pull/143986


[llvm-branch-commits] [llvm] [IR2Vec] Scale embeddings once in vocab analysis instead of repetitive scaling (PR #143986)

2025-06-13 Thread Mircea Trofin via llvm-branch-commits


@@ -104,7 +106,10 @@ MODULE_PASS("lower-ifunc", LowerIFuncPass())
 MODULE_PASS("simplify-type-tests", SimplifyTypeTestsPass())
 MODULE_PASS("lowertypetests", LowerTypeTestsPass())
 MODULE_PASS("fatlto-cleanup", FatLtoCleanup())
-MODULE_PASS("pgo-force-function-attrs", PGOForceFunctionAttrsPass(PGOOpt ? PGOOpt->ColdOptType : PGOOptions::ColdFuncOpt::Default))
+MODULE_PASS("pgo-force-function-attrs",
+PGOForceFunctionAttrsPass(PGOOpt

mtrofin wrote:

can you make the unrelated stylistic changes to this file in a separate patch?

https://github.com/llvm/llvm-project/pull/143986


[llvm-branch-commits] [llvm] [IR2Vec] Scale embeddings once in vocab analysis instead of repetitive scaling (PR #143986)

2025-06-13 Thread Mircea Trofin via llvm-branch-commits

https://github.com/mtrofin edited 
https://github.com/llvm/llvm-project/pull/143986


[llvm-branch-commits] [llvm] [IR2Vec] Scale embeddings once in vocab analysis instead of repetitive scaling (PR #143986)

2025-06-13 Thread Mircea Trofin via llvm-branch-commits


@@ -259,32 +306,40 @@ Error IR2VecVocabAnalysis::readVocabulary() {
     return createFileError(VocabFile, BufOrError.getError());
 
   auto Content = BufOrError.get()->getBuffer();
-  json::Path::Root Path("");
+
   Expected<json::Value> ParsedVocabValue = json::parse(Content);
   if (!ParsedVocabValue)
     return ParsedVocabValue.takeError();
 
-  bool Res = json::fromJSON(*ParsedVocabValue, Vocabulary, Path);
-  if (!Res)
-    return createStringError(errc::illegal_byte_sequence,
-                             "Unable to parse the vocabulary");
+  ir2vec::Vocab OpcodeVocab, TypeVocab, ArgVocab;
+  unsigned OpcodeDim, TypeDim, ArgDim;
+  if (auto Err = parseVocabSection("Opcodes", *ParsedVocabValue, OpcodeVocab,

mtrofin wrote:

This changes the format, best to also update the doc.

Also, this means the sections must all be present, even if empty, correct? 
SGTM, just something worth spelling out.
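
A minimal sketch of the requirement discussed here, not the code from the patch: all three sections must be present, an empty section still counts as present, and the order in the file does not matter because lookup is by key. `ParsedVocab`, `Section`, `checkSections`, and the error text are hypothetical stand-ins; the real code works on `json::Value`.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed vocabulary file:
// section name -> (entity name -> embedding vector).
using Section = std::map<std::string, std::vector<double>>;
using ParsedVocab = std::map<std::string, Section>;

// Returns an error message if a mandatory section is absent, and
// std::nullopt otherwise. Sections are found by key, so their order in
// the file is irrelevant, and an empty section still counts as present.
std::optional<std::string> checkSections(const ParsedVocab &Vocab) {
  for (const char *Key : {"Opcodes", "Types", "Arguments"})
    if (Vocab.find(Key) == Vocab.end())
      return std::string("Missing '") + Key + "' section in vocabulary";
  return std::nullopt;
}
```

With this shape, a file that lists `Types` before `Opcodes` passes, while a file that omits `Arguments` entirely is rejected.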

https://github.com/llvm/llvm-project/pull/143986


[llvm-branch-commits] [llvm] [IR2Vec] Scale embeddings once in vocab analysis instead of repetitive scaling (PR #143986)

2025-06-13 Thread Mircea Trofin via llvm-branch-commits


@@ -259,32 +306,40 @@ Error IR2VecVocabAnalysis::readVocabulary() {
     return createFileError(VocabFile, BufOrError.getError());
 
   auto Content = BufOrError.get()->getBuffer();
-  json::Path::Root Path("");
+
   Expected<json::Value> ParsedVocabValue = json::parse(Content);
   if (!ParsedVocabValue)
     return ParsedVocabValue.takeError();
 
-  bool Res = json::fromJSON(*ParsedVocabValue, Vocabulary, Path);
-  if (!Res)
-    return createStringError(errc::illegal_byte_sequence,
-                             "Unable to parse the vocabulary");
+  ir2vec::Vocab OpcodeVocab, TypeVocab, ArgVocab;
+  unsigned OpcodeDim, TypeDim, ArgDim;

mtrofin wrote:

Initialize at declaration
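
A small illustration of why this matters, under assumed names: when a value is filled in through an out-parameter only on the success path, an uninitialized local is indeterminate on the failure path. `parseSection` and `opcodeDimension` are hypothetical and only mimic the shape of the code under review.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a section parser: it writes Dim only when the
// section has at least one entry and leaves it untouched otherwise.
bool parseSection(const std::vector<std::vector<double>> &Entries,
                  unsigned &Dim) {
  if (Entries.empty())
    return false;
  Dim = static_cast<unsigned>(Entries.front().size());
  return true;
}

// Returns the dimension of the section, or 0 if parsing failed.
unsigned opcodeDimension(const std::vector<std::vector<double>> &Opcodes) {
  unsigned OpcodeDim = 0; // initialized at declaration
  (void)parseSection(Opcodes, OpcodeDim);
  return OpcodeDim; // well-defined even when parseSection returned false
}
```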

https://github.com/llvm/llvm-project/pull/143986


[llvm-branch-commits] [llvm] [IR2Vec] Scale embeddings once in vocab analysis instead of repetitive scaling (PR #143986)

2025-06-13 Thread Mircea Trofin via llvm-branch-commits


@@ -234,6 +237,8 @@ class IR2VecVocabResult {
 class IR2VecVocabAnalysis : public AnalysisInfoMixin<IR2VecVocabAnalysis> {
   ir2vec::Vocab Vocabulary;
   Error readVocabulary();
+  Error parseVocabSection(const char *Key, const json::Value ParsedVocabValue,

mtrofin wrote:

s/const char*/StringRef

s/const json::Value/const json::Value&
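
To illustrate the second suggestion: a `const` by-value parameter still copies its argument on every call, while a `const` reference does not. The sketch below uses `std::string_view` as a standard-library stand-in for `StringRef` and a hypothetical copy-counting `BigValue` in place of `json::Value`; none of these names come from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for a large parsed value; counts how often it is
// copy-constructed.
struct BigValue {
  static inline int Copies = 0;
  BigValue() = default;
  BigValue(const BigValue &) { ++Copies; }
};

// By-value parameter: each call copy-constructs Value from the argument.
std::size_t lookupByValue(std::string_view Key, const BigValue Value) {
  (void)Value;
  return Key.size();
}

// By-const-reference parameter: the caller's object is used directly.
std::size_t lookupByRef(std::string_view Key, const BigValue &Value) {
  (void)Value;
  return Key.size();
}
```

Calling `lookupByValue("Opcodes", V)` increments `BigValue::Copies`; calling `lookupByRef("Opcodes", V)` leaves it unchanged.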

https://github.com/llvm/llvm-project/pull/143986


[llvm-branch-commits] [llvm] [IR2Vec] Scale embeddings once in vocab analysis instead of repetitive scaling (PR #143986)

2025-06-12 Thread S. VenkataKeerthy via llvm-branch-commits

https://github.com/svkeerthy edited 
https://github.com/llvm/llvm-project/pull/143986


[llvm-branch-commits] [llvm] [IR2Vec] Scale embeddings once in vocab analysis instead of repetitive scaling (PR #143986)

2025-06-12 Thread S. VenkataKeerthy via llvm-branch-commits

https://github.com/svkeerthy commented:

@albertcohen - Please have a look. I am not able to add you as reviewer.

https://github.com/llvm/llvm-project/pull/143986


[llvm-branch-commits] [llvm] [IR2Vec] Scale embeddings once in vocab analysis instead of repetitive scaling (PR #143986)

2025-06-12 Thread via llvm-branch-commits

llvmbot wrote:




@llvm/pr-subscribers-llvm-analysis

Author: S. VenkataKeerthy (svkeerthy)


Changes

This PR scales opcodes, types, and args once in `IR2VecVocabAnalysis`, so that
they need not be rescaled each time embeddings are computed. It also refactors
the vocabulary to explicitly define three sections (Opcodes, Types, and
Arguments) used for computing embeddings.

(Tracking issue - #141817 ; partly fixes - #141832)
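
The idea can be sketched as follows, under stated assumptions: `Embedding`, `Vocab`, `scaleVocab`, and `sumEntities` are simplified stand-ins built on the standard library, and the weight passed in is arbitrary. The vocabulary is multiplied by its weight once when it is loaded, so later uses only add pre-scaled vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Embedding = std::vector<double>;
using Vocab = std::map<std::string, Embedding>;

// Multiply every vector in the vocabulary by Weight, once, up front.
void scaleVocab(Vocab &V, double Weight) {
  for (auto &Entry : V)
    for (double &Elem : Entry.second)
      Elem *= Weight;
}

// Sum the (already scaled) vectors for the given keys. No multiplication
// happens here, however often an entity is used.
Embedding sumEntities(const Vocab &V, const std::vector<std::string> &Keys,
                      std::size_t Dim) {
  Embedding Result(Dim, 0.0);
  for (const auto &Key : Keys) {
    const Embedding &E = V.at(Key);
    assert(E.size() >= Dim && "Vocabulary entry is shorter than Dim");
    for (std::size_t I = 0; I < Dim; ++I)
      Result[I] += E[I];
  }
  return Result;
}
```

An entity that appears many times in a function then costs one multiplication per element at load time instead of one per occurrence.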


---

Patch is 149.98 KiB, truncated to 20.00 KiB below, full version: 
https://github.com/llvm/llvm-project/pull/143986.diff


16 Files Affected:

- (modified) llvm/include/llvm/Analysis/IR2Vec.h (+15-1) 
- (modified) llvm/lib/Analysis/IR2Vec.cpp (+102-39) 
- (modified) llvm/lib/Analysis/models/seedEmbeddingVocab75D.json (+70-63) 
- (modified) llvm/lib/Passes/PassRegistry.def (+24-17) 
- (added) llvm/test/Analysis/IR2Vec/Inputs/dummy_2D_vocab.json (+11) 
- (modified) llvm/test/Analysis/IR2Vec/Inputs/dummy_3D_vocab.json (+13-5) 
- (modified) llvm/test/Analysis/IR2Vec/Inputs/dummy_5D_vocab.json (+15-9) 
- (added) llvm/test/Analysis/IR2Vec/Inputs/incorrect_vocab1.json (+11) 
- (added) llvm/test/Analysis/IR2Vec/Inputs/incorrect_vocab2.json (+12) 
- (added) llvm/test/Analysis/IR2Vec/Inputs/incorrect_vocab3.json (+12) 
- (added) llvm/test/Analysis/IR2Vec/Inputs/incorrect_vocab4.json (+16) 
- (modified) llvm/test/Analysis/IR2Vec/basic.ll (+13-1) 
- (added) llvm/test/Analysis/IR2Vec/dbg-inst.ll (+13) 
- (added) llvm/test/Analysis/IR2Vec/unreachable.ll (+42) 
- (added) llvm/test/Analysis/IR2Vec/vocab-test.ll (+20) 
- (modified) llvm/unittests/Analysis/IR2VecTest.cpp (+6) 


``diff
diff --git a/llvm/include/llvm/Analysis/IR2Vec.h b/llvm/include/llvm/Analysis/IR2Vec.h
index de67955d85d7c..f1aaf4cd2e013 100644
--- a/llvm/include/llvm/Analysis/IR2Vec.h
+++ b/llvm/include/llvm/Analysis/IR2Vec.h
@@ -108,6 +108,7 @@ struct Embedding {
   /// Arithmetic operators
   Embedding &operator+=(const Embedding &RHS);
   Embedding &operator-=(const Embedding &RHS);
+  Embedding &operator*=(double Factor);
 
   /// Adds Src Embedding scaled by Factor with the called Embedding.
   /// Called_Embedding += Src * Factor
@@ -116,6 +117,8 @@ struct Embedding {
   /// Returns true if the embedding is approximately equal to the RHS embedding
   /// within the specified tolerance.
   bool approximatelyEquals(const Embedding &RHS, double Tolerance = 1e-6) const;
+
+  void print(raw_ostream &OS) const;
 };
 
 using InstEmbeddingsMap = DenseMap<const Instruction *, Embedding>;
@@ -234,6 +237,8 @@ class IR2VecVocabResult {
 class IR2VecVocabAnalysis : public AnalysisInfoMixin<IR2VecVocabAnalysis> {
   ir2vec::Vocab Vocabulary;
   Error readVocabulary();
+  Error parseVocabSection(const char *Key, const json::Value ParsedVocabValue,
+                          ir2vec::Vocab &TargetVocab, unsigned &Dim);
   void emitError(Error Err, LLVMContext &Ctx);
 
 public:
@@ -249,7 +254,6 @@ class IR2VecVocabAnalysis : public AnalysisInfoMixin<IR2VecVocabAnalysis> {
 /// functions.
 class IR2VecPrinterPass : public PassInfoMixin<IR2VecPrinterPass> {
   raw_ostream &OS;
-  void printVector(const ir2vec::Embedding &Vec) const;
 
 public:
   explicit IR2VecPrinterPass(raw_ostream &OS) : OS(OS) {}
@@ -257,6 +261,16 @@ class IR2VecPrinterPass : public PassInfoMixin<IR2VecPrinterPass> {
   static bool isRequired() { return true; }
 };
 
+/// This pass prints the embeddings in the vocabulary
+class IR2VecVocabPrinterPass : public PassInfoMixin<IR2VecVocabPrinterPass> {
+  raw_ostream &OS;
+
+public:
+  explicit IR2VecVocabPrinterPass(raw_ostream &OS) : OS(OS) {}
+  PreservedAnalyses run(Module &M, ModuleAnalysisManager &MAM);
+  static bool isRequired() { return true; }
+};
+
 } // namespace llvm
 
 #endif // LLVM_ANALYSIS_IR2VEC_H
diff --git a/llvm/lib/Analysis/IR2Vec.cpp b/llvm/lib/Analysis/IR2Vec.cpp
index fa38c35796a0e..f51d3252d6606 100644
--- a/llvm/lib/Analysis/IR2Vec.cpp
+++ b/llvm/lib/Analysis/IR2Vec.cpp
@@ -85,6 +85,12 @@ Embedding &Embedding::operator-=(const Embedding &RHS) {
   return *this;
 }
 
+Embedding &Embedding::operator*=(double Factor) {
+  std::transform(this->begin(), this->end(), this->begin(),
+                 [Factor](double Elem) { return Elem * Factor; });
+  return *this;
+}
+
 Embedding &Embedding::scaleAndAdd(const Embedding &Src, float Factor) {
   assert(this->size() == Src.size() && "Vectors must have the same dimension");
   for (size_t Itr = 0; Itr < this->size(); ++Itr)
@@ -101,6 +107,13 @@ bool Embedding::approximatelyEquals(const Embedding &RHS,
   return true;
 }
 
+void Embedding::print(raw_ostream &OS) const {
+  OS << " [";
+  for (const auto &Elem : Data)
+    OS << " " << format("%.2f", Elem) << " ";
+  OS << "]\n";
+}
+
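 // ==----------------------------------------------------------------------===//
 // Embedder and its subclasses
 //===----------------------------------------------------------------------===//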
@@ -196,18 +209,12 @@ void SymbolicEmbedder::computeEmbeddings(const BasicBlock &BB) const {
   for (const auto &I : BB.instructionsWithoutDebug()) {
     Embedding InstVector(Dimension, 0);
 
-const