https://github.com/svkeerthy updated https://github.com/llvm/llvm-project/pull/149214
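Note for readers who want to consume the triplet output introduced by this patch: the following is a minimal Python sketch and is not part of the PR. It assumes the format stated in the llvm-ir2vec.cpp header comment in the patch: a MAX_RELATION=N header followed by tab-separated head, tail, relation lines, with relation 0 = Type, 1 = Next, and 2+N = ArgN. The names parse_triplets and relation_names are hypothetical helpers introduced only for illustration.

```python
from typing import List, Set, Tuple

Triplet = Tuple[int, int, int]


def parse_triplets(output: str) -> Tuple[Set[Triplet], int]:
    """Parse `llvm-ir2vec --mode=triplets` output (assumed format, see above)."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    max_relation = 1  # Type (0) and Next (1) are always defined

    # The first line is expected to be the MAX_RELATION=N metadata header.
    if lines and lines[0].startswith("MAX_RELATION="):
        max_relation = int(lines[0].split("=", 1)[1])
        lines = lines[1:]

    triplets: Set[Triplet] = set()
    for ln in lines:
        # Each remaining line is: head <tab> tail <tab> relation
        head, tail, relation = (int(field) for field in ln.split())
        triplets.add((head, tail, relation))
    return triplets, max_relation


def relation_names(max_relation: int) -> List[str]:
    """Map relation IDs 0..max_relation to names: Type, Next, Arg0, Arg1, ..."""
    names = ["Type", "Next"]
    names += [f"Arg{i - 2}" for i in range(2, max_relation + 1)]
    return names


# Sample taken from the first expected lines of triplets.ll in this patch.
sample = "MAX_RELATION=3\n12\t79\t0\n12\t91\t2\n12\t91\t3\n12\t0\t1\n"
triplets, max_rel = parse_triplets(sample)
print(sorted(triplets))
print(relation_names(max_rel))
```

Storing the triplets in a set removes duplicates, which is also what the generateTriplets.py script in patch 1/2 does before writing train2id.txt.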
>From db6db83e5ee2ce1503bd041cbb975b36c0fc59c9 Mon Sep 17 00:00:00 2001 From: svkeerthy <venkatakeer...@google.com> Date: Wed, 16 Jul 2025 22:03:56 +0000 Subject: [PATCH 1/2] revamp-triplet-gen --- llvm/docs/CommandGuide/llvm-ir2vec.rst | 79 ++++- llvm/test/tools/llvm-ir2vec/entities.ll | 95 ++++++ llvm/test/tools/llvm-ir2vec/triplets.ll | 51 ++- llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp | 204 ++++++++---- .../mlgo-utils/IR2Vec/generateTriplets.py | 291 ++++++++++++++++++ 5 files changed, 627 insertions(+), 93 deletions(-) create mode 100644 llvm/test/tools/llvm-ir2vec/entities.ll create mode 100644 llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py diff --git a/llvm/docs/CommandGuide/llvm-ir2vec.rst b/llvm/docs/CommandGuide/llvm-ir2vec.rst index 13fe4996b968f..56ece4f509f6e 100644 --- a/llvm/docs/CommandGuide/llvm-ir2vec.rst +++ b/llvm/docs/CommandGuide/llvm-ir2vec.rst @@ -13,17 +13,21 @@ DESCRIPTION :program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It generates IR2Vec embeddings for LLVM IR and supports triplet generation -for vocabulary training. It provides two main operation modes: +for vocabulary training. It provides three main operation modes: -1. **Triplet Mode**: Generates triplets (opcode, type, operands) for vocabulary +1. **Triplet Mode**: Generates numeric triplets in train2id format for vocabulary training from LLVM IR. -2. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary +2. **Entity Mode**: Generates entity mapping files (entity2id.txt) for vocabulary + training. + +3. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary at different granularity levels (instruction, basic block, or function). The tool is designed to facilitate machine learning applications that work with LLVM IR by converting the IR into numerical representations that can be used by -ML models. +ML models. 
The triplet mode generates numeric IDs directly instead of string +triplets, streamlining the training data preparation workflow. .. note:: @@ -34,18 +38,46 @@ ML models. OPERATION MODES --------------- +Triplet Generation and Entity Mapping Modes are used for preparing +vocabulary and training data for knowledge graph embeddings. The Embedding Mode +is used for generating embeddings from LLVM IR using a pre-trained vocabulary. + +The Seed Embedding Vocabulary of IR2Vec is trained on a large corpus of LLVM IR +by modeling the relationships between opcodes, types, and operands as a knowledge +graph. For this purpose, Triplet Generation and Entity Mapping Modes generate +triplets and entity mappings in the standard format used for knowledge graph +embedding training (see +<https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch?tab=readme-ov-file#data-format> +for details). + Triplet Generation Mode ~~~~~~~~~~~~~~~~~~~~~~~ -In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts triplets -consisting of opcodes, types, and operands. These triplets can be used to train -vocabularies for embedding generation. +In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts numeric +triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets +are generated in train2id format. The tool outputs numeric IDs directly using +the ir2vec::Vocabulary mapping infrastructure, eliminating the need for +string-to-ID preprocessing. + +Usage: + +.. code-block:: bash + + llvm-ir2vec --mode=triplets input.bc -o triplets_train2id.txt + +Entity Mapping Generation Mode +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In entity mode, :program:`llvm-ir2vec` generates the entity mappings supported by +IR2Vec in entity2id format. This mode outputs all supported entities (opcodes, +types, and operands) with their corresponding numeric IDs, and is not specific to +any particular LLVM IR file. Usage: .. 
code-block:: bash - llvm-ir2vec --mode=triplets input.bc -o triplets.txt + llvm-ir2vec --mode=entities -o entity2id.txt Embedding Generation Mode ~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -67,6 +99,7 @@ OPTIONS Specify the operation mode. Valid values are: * ``triplets`` - Generate triplets for vocabulary training + * ``entities`` - Generate entity mappings for vocabulary training * ``embeddings`` - Generate embeddings using trained vocabulary (default) .. option:: --level=<level> @@ -115,7 +148,7 @@ OPTIONS ``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``, ``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding - mode. These options are ignored in triplet mode. + mode. These options are ignored in triplet and entity modes. INPUT FILE FORMAT ----------------- @@ -129,14 +162,34 @@ OUTPUT FORMAT Triplet Mode Output ~~~~~~~~~~~~~~~~~~~ -In triplet mode, the output consists of lines containing space-separated triplets: +In triplet mode, the output consists of numeric triplets in train2id format with +a metadata header. The format is: + +.. code-block:: text + + MAX_RELATION=<max_relation_id> + <head_entity_id> <tail_entity_id> <relation_id> + <head_entity_id> <tail_entity_id> <relation_id> + ... + +Each line after the metadata header represents one instruction relationship, +with tab-separated numeric IDs for head entity, tail entity, and relation. The +metadata header (MAX_RELATION) gives the highest relation ID used, for post-processing and training setup. + +Entity Mode Output +~~~~~~~~~~~~~~~~~~ + +In entity mode, the output consists of entity mappings in the format: .. code-block:: text - <opcode> <type> <operand1> <operand2> ... + <total_entities> + <entity_string> <numeric_id> + <entity_string> <numeric_id> + ... -Each line represents the information of one instruction, with the opcode, type, -and operands. 
+The first line contains the total number of entities, followed by one entity +mapping per line with tab-separated entity string and numeric ID. Embedding Mode Output ~~~~~~~~~~~~~~~~~~~~~ diff --git a/llvm/test/tools/llvm-ir2vec/entities.ll b/llvm/test/tools/llvm-ir2vec/entities.ll new file mode 100644 index 0000000000000..57c3d6fa6d6c4 --- /dev/null +++ b/llvm/test/tools/llvm-ir2vec/entities.ll @@ -0,0 +1,95 @@ +; RUN: llvm-ir2vec --mode=entities | FileCheck %s + +CHECK: 92 +CHECK-NEXT: Ret 0 +CHECK-NEXT: Br 1 +CHECK-NEXT: Switch 2 +CHECK-NEXT: IndirectBr 3 +CHECK-NEXT: Invoke 4 +CHECK-NEXT: Resume 5 +CHECK-NEXT: Unreachable 6 +CHECK-NEXT: CleanupRet 7 +CHECK-NEXT: CatchRet 8 +CHECK-NEXT: CatchSwitch 9 +CHECK-NEXT: CallBr 10 +CHECK-NEXT: FNeg 11 +CHECK-NEXT: Add 12 +CHECK-NEXT: FAdd 13 +CHECK-NEXT: Sub 14 +CHECK-NEXT: FSub 15 +CHECK-NEXT: Mul 16 +CHECK-NEXT: FMul 17 +CHECK-NEXT: UDiv 18 +CHECK-NEXT: SDiv 19 +CHECK-NEXT: FDiv 20 +CHECK-NEXT: URem 21 +CHECK-NEXT: SRem 22 +CHECK-NEXT: FRem 23 +CHECK-NEXT: Shl 24 +CHECK-NEXT: LShr 25 +CHECK-NEXT: AShr 26 +CHECK-NEXT: And 27 +CHECK-NEXT: Or 28 +CHECK-NEXT: Xor 29 +CHECK-NEXT: Alloca 30 +CHECK-NEXT: Load 31 +CHECK-NEXT: Store 32 +CHECK-NEXT: GetElementPtr 33 +CHECK-NEXT: Fence 34 +CHECK-NEXT: AtomicCmpXchg 35 +CHECK-NEXT: AtomicRMW 36 +CHECK-NEXT: Trunc 37 +CHECK-NEXT: ZExt 38 +CHECK-NEXT: SExt 39 +CHECK-NEXT: FPToUI 40 +CHECK-NEXT: FPToSI 41 +CHECK-NEXT: UIToFP 42 +CHECK-NEXT: SIToFP 43 +CHECK-NEXT: FPTrunc 44 +CHECK-NEXT: FPExt 45 +CHECK-NEXT: PtrToInt 46 +CHECK-NEXT: IntToPtr 47 +CHECK-NEXT: BitCast 48 +CHECK-NEXT: AddrSpaceCast 49 +CHECK-NEXT: CleanupPad 50 +CHECK-NEXT: CatchPad 51 +CHECK-NEXT: ICmp 52 +CHECK-NEXT: FCmp 53 +CHECK-NEXT: PHI 54 +CHECK-NEXT: Call 55 +CHECK-NEXT: Select 56 +CHECK-NEXT: UserOp1 57 +CHECK-NEXT: UserOp2 58 +CHECK-NEXT: VAArg 59 +CHECK-NEXT: ExtractElement 60 +CHECK-NEXT: InsertElement 61 +CHECK-NEXT: ShuffleVector 62 +CHECK-NEXT: ExtractValue 63 +CHECK-NEXT: InsertValue 64 +CHECK-NEXT: 
LandingPad 65 +CHECK-NEXT: Freeze 66 +CHECK-NEXT: FloatTy 67 +CHECK-NEXT: FloatTy 68 +CHECK-NEXT: FloatTy 69 +CHECK-NEXT: FloatTy 70 +CHECK-NEXT: FloatTy 71 +CHECK-NEXT: FloatTy 72 +CHECK-NEXT: FloatTy 73 +CHECK-NEXT: VoidTy 74 +CHECK-NEXT: LabelTy 75 +CHECK-NEXT: MetadataTy 76 +CHECK-NEXT: UnknownTy 77 +CHECK-NEXT: TokenTy 78 +CHECK-NEXT: IntegerTy 79 +CHECK-NEXT: FunctionTy 80 +CHECK-NEXT: PointerTy 81 +CHECK-NEXT: StructTy 82 +CHECK-NEXT: ArrayTy 83 +CHECK-NEXT: VectorTy 84 +CHECK-NEXT: VectorTy 85 +CHECK-NEXT: PointerTy 86 +CHECK-NEXT: UnknownTy 87 +CHECK-NEXT: Function 88 +CHECK-NEXT: Pointer 89 +CHECK-NEXT: Constant 90 +CHECK-NEXT: Variable 91 diff --git a/llvm/test/tools/llvm-ir2vec/triplets.ll b/llvm/test/tools/llvm-ir2vec/triplets.ll index d1ef5b388e258..dcd1dc9afb478 100644 --- a/llvm/test/tools/llvm-ir2vec/triplets.ll +++ b/llvm/test/tools/llvm-ir2vec/triplets.ll @@ -24,15 +24,42 @@ entry: ret i32 %result } -; TRIPLETS: Add IntegerTy Variable Variable -; TRIPLETS-NEXT: Ret VoidTy Variable -; TRIPLETS-NEXT: Mul IntegerTy Variable Variable -; TRIPLETS-NEXT: Ret VoidTy Variable -; TRIPLETS-NEXT: Alloca PointerTy Constant -; TRIPLETS-NEXT: Alloca PointerTy Constant -; TRIPLETS-NEXT: Store VoidTy Variable Pointer -; TRIPLETS-NEXT: Store VoidTy Variable Pointer -; TRIPLETS-NEXT: Load IntegerTy Pointer -; TRIPLETS-NEXT: Load IntegerTy Pointer -; TRIPLETS-NEXT: Add IntegerTy Variable Variable -; TRIPLETS-NEXT: Ret VoidTy Variable +; TRIPLETS: MAX_RELATION=3 +; TRIPLETS-NEXT: 12 79 0 +; TRIPLETS-NEXT: 12 91 2 +; TRIPLETS-NEXT: 12 91 3 +; TRIPLETS-NEXT: 12 0 1 +; TRIPLETS-NEXT: 0 74 0 +; TRIPLETS-NEXT: 0 91 2 +; TRIPLETS-NEXT: 16 79 0 +; TRIPLETS-NEXT: 16 91 2 +; TRIPLETS-NEXT: 16 91 3 +; TRIPLETS-NEXT: 16 0 1 +; TRIPLETS-NEXT: 0 74 0 +; TRIPLETS-NEXT: 0 91 2 +; TRIPLETS-NEXT: 30 81 0 +; TRIPLETS-NEXT: 30 90 2 +; TRIPLETS-NEXT: 30 30 1 +; TRIPLETS-NEXT: 30 81 0 +; TRIPLETS-NEXT: 30 90 2 +; TRIPLETS-NEXT: 30 32 1 +; TRIPLETS-NEXT: 32 74 0 +; TRIPLETS-NEXT: 32 91 2 
+; TRIPLETS-NEXT: 32 89 3 +; TRIPLETS-NEXT: 32 32 1 +; TRIPLETS-NEXT: 32 74 0 +; TRIPLETS-NEXT: 32 91 2 +; TRIPLETS-NEXT: 32 89 3 +; TRIPLETS-NEXT: 32 31 1 +; TRIPLETS-NEXT: 31 79 0 +; TRIPLETS-NEXT: 31 89 2 +; TRIPLETS-NEXT: 31 31 1 +; TRIPLETS-NEXT: 31 79 0 +; TRIPLETS-NEXT: 31 89 2 +; TRIPLETS-NEXT: 31 12 1 +; TRIPLETS-NEXT: 12 79 0 +; TRIPLETS-NEXT: 12 91 2 +; TRIPLETS-NEXT: 12 91 3 +; TRIPLETS-NEXT: 12 0 1 +; TRIPLETS-NEXT: 0 74 0 +; TRIPLETS-NEXT: 0 91 2 diff --git a/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp b/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp index e3aa7bd1b3b1e..24ff278967d8b 100644 --- a/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp +++ b/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp @@ -9,13 +9,20 @@ /// \file /// This file implements the IR2Vec embedding generation tool. /// -/// This tool provides two main functionalities: +/// This tool provides three main modes: /// /// 1. Triplet Generation Mode (--mode=triplets): -/// Generates triplets (opcode, type, operands) for vocabulary training. -/// Usage: llvm-ir2vec --mode=triplets input.bc -o triplets.txt +/// Generates numeric triplets (head, tail, relation) for vocabulary +/// training. Output format: MAX_RELATION=N header followed by +/// head\ttail\trelation lines. Relations: 0=Type, 1=Next, 2+=Arg0,Arg1,... +/// Usage: llvm-ir2vec --mode=triplets input.bc -o train2id.txt /// -/// 2. Embedding Generation Mode (--mode=embeddings): +/// 2. Entities Generation Mode (--mode=entities): +/// Generates entity mappings for vocabulary training. +/// Output format: <total_entities> header followed by entity\tid lines. +/// Usage: llvm-ir2vec --mode=entities -o entity2id.txt +/// +/// 3. Embedding Generation Mode (--mode=embeddings): /// Generates IR2Vec embeddings using a trained vocabulary. 
/// Usage: llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=vocab.json /// --level=func input.bc -o embeddings.txt Levels: --level=inst @@ -60,16 +67,19 @@ static cl::opt<std::string> OutputFilename("o", cl::desc("Output filename"), enum ToolMode { TripletMode, // Generate triplets for vocabulary training + EntityMode, // Generate entity mappings for vocabulary training EmbeddingMode // Generate embeddings using trained vocabulary }; -static cl::opt<ToolMode> - Mode("mode", cl::desc("Tool operation mode:"), - cl::values(clEnumValN(TripletMode, "triplets", - "Generate triplets for vocabulary training"), - clEnumValN(EmbeddingMode, "embeddings", - "Generate embeddings using trained vocabulary")), - cl::init(EmbeddingMode), cl::cat(IR2VecToolCategory)); +static cl::opt<ToolMode> Mode( + "mode", cl::desc("Tool operation mode:"), + cl::values(clEnumValN(TripletMode, "triplets", + "Generate triplets for vocabulary training"), + clEnumValN(EntityMode, "entities", + "Generate entity mappings for vocabulary training"), + clEnumValN(EmbeddingMode, "embeddings", + "Generate embeddings using trained vocabulary")), + cl::init(EmbeddingMode), cl::cat(IR2VecToolCategory)); static cl::opt<std::string> FunctionName("function", cl::desc("Process specific function only"), @@ -94,6 +104,13 @@ static cl::opt<EmbeddingLevel> namespace { +/// Relation types for triplet generation +enum RelationType { + TypeRelation = 0, ///< Instruction to type relationship + NextRelation = 1, ///< Sequential instruction relationship + ArgRelation = 2 ///< Instruction to operand relationship (ArgRelation + N) +}; + /// Helper class for collecting IR triplets and generating embeddings class IR2VecTool { private: @@ -115,25 +132,96 @@ class IR2VecTool { return Vocab->isValid(); } - /// Generate triplets for the entire module + /// Generate triplets for the module + /// Output format: MAX_RELATION=N header followed by relationships void generateTriplets(raw_ostream &OS) const { - for (const Function &F : 
M) - generateTriplets(F, OS); + unsigned MaxRelation = NextRelation; // Track maximum relation ID + std::string Relationships; + raw_string_ostream RelOS(Relationships); + + for (const Function &F : M) { + unsigned FuncMaxRelation = generateTriplets(F, RelOS); + MaxRelation = std::max(MaxRelation, FuncMaxRelation); + } + + RelOS.flush(); + + // Write metadata header followed by relationships + OS << "MAX_RELATION=" << MaxRelation << '\n'; + OS << Relationships; } /// Generate triplets for a single function - void generateTriplets(const Function &F, raw_ostream &OS) const { + /// Returns the maximum relation ID used in this function + unsigned generateTriplets(const Function &F, raw_ostream &OS) const { if (F.isDeclaration()) - return; + return 0; + + unsigned MaxRelation = 1; + unsigned PrevOpcode = 0; + bool HasPrevOpcode = false; + + for (const BasicBlock &BB : F) { + for (const auto &I : BB.instructionsWithoutDebug()) { + unsigned Opcode = Vocabulary::getNumericID(I.getOpcode()); + unsigned TypeID = Vocabulary::getNumericID(I.getType()->getTypeID()); + + // Add "Next" relationship with previous instruction + if (HasPrevOpcode) { + OS << PrevOpcode << '\t' << Opcode << '\t' << NextRelation << '\n'; + LLVM_DEBUG(dbgs() + << Vocabulary::getVocabKeyForOpcode(PrevOpcode + 1) << '\t' + << Vocabulary::getVocabKeyForOpcode(Opcode + 1) << '\t' + << "Next\n"); + } - std::string LocalOutput; - raw_string_ostream LocalOS(LocalOutput); + // Add "Type" relationship + OS << Opcode << '\t' << TypeID << '\t' << TypeRelation << '\n'; + LLVM_DEBUG( + dbgs() << Vocabulary::getVocabKeyForOpcode(Opcode + 1) << '\t' + << Vocabulary::getVocabKeyForTypeID(I.getType()->getTypeID()) + << '\t' << "Type\n"); + + // Add "Arg" relationships + unsigned ArgIndex = 0; + for (const Use &U : I.operands()) { + unsigned OperandID = Vocabulary::getNumericID(U.get()); + unsigned RelationID = ArgRelation + ArgIndex; + OS << Opcode << '\t' << OperandID << '\t' << RelationID << '\n'; + + LLVM_DEBUG({ + 
StringRef OperandStr = Vocabulary::getVocabKeyForOperandKind( + Vocabulary::getOperandKind(U.get())); + dbgs() << Vocabulary::getVocabKeyForOpcode(Opcode + 1) << '\t' + << OperandStr << '\t' << "Arg" << ArgIndex << '\n'; + }); + + ArgIndex++; + } + // Only update MaxRelation if there were operands + if (ArgIndex > 0) { + MaxRelation = std::max(MaxRelation, ArgRelation + ArgIndex - 1); + } + PrevOpcode = Opcode; + HasPrevOpcode = true; + } + } - for (const BasicBlock &BB : F) - traverseBasicBlock(BB, LocalOS); + return MaxRelation; + } - LocalOS.flush(); - OS << LocalOutput; + /// Dump entity ID to string mappings + static void generateEntityMappings(raw_ostream &OS) { + // FIXME: Currently, the generated entity mappings are not one-to-one; + // Multiple TypeIDs map to same string key (Like Half, BFloat, etc. map to + // FloatTy). This would hinder learning good seed embeddings. + // We should fix this in the future by ensuring unique string keys either by + // post-processing here without changing the mapping in ir2vec::Vocabulary, + // or by changing the Vocabulary generation logic to ensure unique keys. 
+ auto EntityLen = Vocabulary::expectedSize(); + OS << EntityLen << "\n"; + for (unsigned EntityID = 0; EntityID < EntityLen; ++EntityID) + OS << Vocabulary::getStringKey(EntityID) << '\t' << EntityID << '\n'; } /// Generate embeddings for the entire module @@ -197,31 +285,6 @@ class IR2VecTool { } } } - -private: - /// Process a single basic block for triplet generation - void traverseBasicBlock(const BasicBlock &BB, raw_string_ostream &OS) const { - // Consider only non-debug and non-pseudo instructions - for (const auto &I : BB.instructionsWithoutDebug()) { - StringRef OpcStr = Vocabulary::getVocabKeyForOpcode(I.getOpcode()); - StringRef TypeStr = - Vocabulary::getVocabKeyForTypeID(I.getType()->getTypeID()); - - OS << '\n' << OpcStr << ' ' << TypeStr << ' '; - - LLVM_DEBUG({ - I.print(dbgs()); - dbgs() << "\n"; - I.getType()->print(dbgs()); - dbgs() << " Type\n"; - }); - - for (const Use &U : I.operands()) - OS << Vocabulary::getVocabKeyForOperandKind( - Vocabulary::getOperandKind(U.get())) - << ' '; - } - } }; Error processModule(Module &M, raw_ostream &OS) { @@ -249,18 +312,7 @@ Error processModule(Module &M, raw_ostream &OS) { Tool.generateEmbeddings(OS); } } else { - // Triplet generation mode - no vocabulary needed - if (!FunctionName.empty()) - // Process single function - if (const Function *F = M.getFunction(FunctionName)) - Tool.generateTriplets(*F, OS); - else - return createStringError(errc::invalid_argument, - "Function '%s' not found", - FunctionName.c_str()); - else - // Process all functions - Tool.generateTriplets(OS); + Tool.generateTriplets(OS); } return Error::success(); } @@ -283,9 +335,32 @@ int main(int argc, char **argv) { "See https://llvm.org/docs/CommandGuide/llvm-ir2vec.html for more " "information.\n"); + // Validate input file requirement + if (InputFilename.empty() && Mode != EntityMode) { + errs() << "Error: Input file (.bc/.ll) or stdin (-) is required\n"; + return 1; + } + // Validate command line options - if (Mode == 
TripletMode && Level.getNumOccurrences() > 0) - errs() << "Warning: --level option is ignored in triplet mode\n"; + if (Mode != EmbeddingMode) { + if (Level.getNumOccurrences() > 0) + errs() << "Warning: --level option is ignored\n"; + if (FunctionName.getNumOccurrences() > 0) + errs() << "Warning: --function option is ignored\n"; + } + + std::error_code EC; + raw_fd_ostream OS(OutputFilename, EC); + if (EC) { + errs() << "Error opening output file: " << EC.message() << "\n"; + return 1; + } + + if (Mode == EntityMode) { + // Just dump entity mappings without processing any IR + IR2VecTool::generateEntityMappings(OS); + return 0; + } // Parse the input LLVM IR file or stdin SMDiagnostic Err; @@ -300,13 +375,6 @@ int main(int argc, char **argv) { return 1; } - std::error_code EC; - raw_fd_ostream OS(OutputFilename, EC); - if (EC) { - errs() << "Error opening output file: " << EC.message() << "\n"; - return 1; - } - if (Error Err = processModule(*M, OS)) { handleAllErrors(std::move(Err), [&](const ErrorInfoBase &EIB) { errs() << "Error: " << EIB.message() << "\n"; diff --git a/llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py b/llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py new file mode 100644 index 0000000000000..0858d10ce0138 --- /dev/null +++ b/llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py @@ -0,0 +1,291 @@ +# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +# See https://llvm.org/LICENSE.txt for license information. +# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +"""IR2Vec Triplet Generator + +Generates IR2Vec triplets by applying random optimization levels to LLVM IR files +and extracting triplets using llvm-ir2vec. Automatically generates preprocessed +files: entity2id.txt, relation2id.txt, and train2id.txt. 
+ +Usage: + python generateTriplets.py <llvm_build_dir> <num_optimizations> <ll_file_list> <output_dir> +""" + +import argparse +import logging +import os +import random +import subprocess +import sys +from concurrent.futures import ThreadPoolExecutor, as_completed +from pathlib import Path +from typing import List, Set, Tuple + +# Configuration +OPT_LEVELS = ["O0", "O1", "O2", "O3", "Os", "Oz"] +DEFAULT_MAX_WORKERS = 100 + +logger = logging.getLogger(__name__) + + +class TripletResult: + """Result from processing a single LLVM IR file""" + + __slots__ = ["triplets", "max_relation"] + + def __init__(self, triplets: Set[str], max_relation: int): + self.triplets = triplets + self.max_relation = max_relation + + +class IR2VecTripletGenerator: + """Main class for generating IR2Vec triplets""" + + def __init__( + self, + llvm_build_dir: Path, + num_optimizations: int, + output_dir: Path, + max_workers: int = DEFAULT_MAX_WORKERS, + ): + self.llvm_build_dir = llvm_build_dir + self.num_optimizations = num_optimizations + self.output_dir = output_dir + self.max_workers = max_workers + + # Tool paths + self.opt_binary = os.path.join(llvm_build_dir, "bin", "opt") + self.ir2vec_binary = os.path.join(llvm_build_dir, "bin", "llvm-ir2vec") + + self._validate_setup() + + def _validate_setup(self): + """Validate that all required tools and paths exist""" + if not self.llvm_build_dir.exists(): + raise FileNotFoundError( + f"LLVM build directory not found: {self.llvm_build_dir}" + ) + + if not os.path.isfile(self.opt_binary) or not os.access( + self.opt_binary, os.X_OK + ): + raise FileNotFoundError( + f"opt binary not found or not executable: {self.opt_binary}" + ) + + if not os.path.isfile(self.ir2vec_binary) or not os.access( + self.ir2vec_binary, os.X_OK + ): + raise FileNotFoundError( + f"llvm-ir2vec binary not found or not executable: {self.ir2vec_binary}" + ) + + if not (1 <= self.num_optimizations <= len(OPT_LEVELS)): + raise ValueError( + f"Number of optimizations must be 
between 1-{len(OPT_LEVELS)}" + ) + + self.output_dir.mkdir(parents=True, exist_ok=True) + + def _select_optimization_levels(self) -> List[str]: + """Select unique random optimization levels""" + return random.sample(OPT_LEVELS, self.num_optimizations) + + def _process_single_file(self, input_file: Path) -> TripletResult: + """Process a single LLVM IR file with multiple optimization levels""" + all_triplets = set() + max_relation = 1 + opt_levels = self._select_optimization_levels() + + for opt_level in opt_levels: + try: + triplets, file_max_relation = self._run_pipeline(input_file, opt_level) + if triplets: + all_triplets.update(triplets) + max_relation = max(max_relation, file_max_relation) + logger.debug( + f"Generated {len(triplets)} triplets for {input_file} with {opt_level}" + ) + except Exception as e: + logger.warning(f"Error processing {input_file} with {opt_level}: {e}") + + return TripletResult(all_triplets, max_relation) + + def _run_pipeline(self, input_file: Path, opt_level: str) -> Tuple[Set[str], int]: + """Run opt | llvm-ir2vec pipeline elegantly.""" + pipeline_cmd = ( + f'"{self.opt_binary}" -{opt_level} "{input_file}" -o - | ' + f'"{self.ir2vec_binary}" --mode=triplets - -o -' + ) + + try: + result = subprocess.run( + pipeline_cmd, shell=True, capture_output=True, text=True, check=True + ) + return self._parse_triplet_output(result.stdout) + except subprocess.CalledProcessError: + return set(), 1 + + def _parse_triplet_output(self, output: str) -> Tuple[Set[str], int]: + """Parse triplet output and extract max relation""" + if not output.strip(): + return set(), 1 + + lines = output.strip().split("\n") + max_relation = 1 + + # Extract max relation from metadata line + if lines and lines[0].startswith("MAX_RELATION="): + max_relation = int(lines[0].split("=")[1]) + lines = lines[1:] + + # Remove duplicate triplets by converting to a set + return set(lines), max_relation + + def generate_triplets(self, file_list: Path) -> None: + """Main method to 
generate triplets from a list of LLVM IR files""" + input_files = self._read_file_list(file_list) + logger.info( + f"Processing {len(input_files)} files with {self.num_optimizations} " + f"optimization levels using {self.max_workers} workers" + ) + + all_triplets = set() + global_max_relation = 1 + + with ThreadPoolExecutor(max_workers=self.max_workers) as executor: + future_to_file = { + executor.submit(self._process_single_file, file): file + for file in input_files + } + + for future in as_completed(future_to_file): + try: + result = future.result() + all_triplets.update(result.triplets) + global_max_relation = max(global_max_relation, result.max_relation) + except Exception as e: + file_path = future_to_file[future] + logger.error(f"Error processing {file_path}: {e}") + + self._generate_output_files(all_triplets, global_max_relation) + logger.info("Processing completed successfully") + + def _read_file_list(self, file_list: Path) -> List[Path]: + """Read and validate the list of input files""" + input_files = [] + with open(file_list, "r") as f: + for line_num, line in enumerate(f, 1): + if line := line.strip(): + file_path = Path(line) + if file_path.exists(): + input_files.append(file_path) + else: + logger.warning(f"File not found (line {line_num}): {file_path}") + + if not input_files: + raise ValueError("No valid input files found") + return input_files + + def _generate_output_files(self, all_triplets: Set[str], max_relation: int) -> None: + """Generate the final output files""" + logger.info(f"Generating output files with {len(all_triplets)} unique triplets") + + # Write all output files -- train2id.txt, entity2id.txt, relation2id.txt + train2id_file = os.path.join(self.output_dir, "train2id.txt") + entity2id_file = os.path.join(self.output_dir, "entity2id.txt") + relation2id_file = os.path.join(self.output_dir, "relation2id.txt") + + with open(train2id_file, "w") as f: + f.write(f"{len(all_triplets)}\n") + f.writelines(f"{triplet}\n" for triplet in 
all_triplets) + + self._generate_entity2id(entity2id_file) + self._generate_relation2id(relation2id_file, max_relation) + + def _generate_entity2id(self, output_file: Path) -> None: + """Generate entity2id.txt using llvm-ir2vec""" + subprocess.run( + [str(self.ir2vec_binary), "--mode=entities", "-o", str(output_file)], + check=True, + capture_output=True, + ) + + def _generate_relation2id(self, output_file: Path, max_relation: int) -> None: + """Generate relation2id.txt from max relation""" + max_relation = max(max_relation, 1) # At least Type and Next relations + num_relations = max_relation + 1 + + with open(output_file, "w") as f: + f.write(f"{num_relations}\n") + f.write("Type\t0\n") + f.write("Next\t1\n") + f.writelines(f"Arg{i-2}\t{i}\n" for i in range(2, num_relations)) + + +def main(): + """Main entry point""" + parser = argparse.ArgumentParser( + description="Generate IR2Vec triplets from LLVM IR files", + formatter_class=argparse.RawDescriptionHelpFormatter, + ) + + parser.add_argument( + "llvm_build_dir", type=Path, help="Path to LLVM build directory" + ) + parser.add_argument( + "num_optimizations", + type=int, + help="Number of optimization levels to apply (1-6)", + ) + parser.add_argument( + "ll_file_list", + type=Path, + help="File containing list of LLVM IR files to process", + ) + parser.add_argument( + "output_dir", type=Path, help="Output directory for generated files" + ) + parser.add_argument( + "-j", + "--max-workers", + type=int, + default=DEFAULT_MAX_WORKERS, + help=f"Maximum number of parallel workers (default: {DEFAULT_MAX_WORKERS})", + ) + parser.add_argument( + "-v", "--verbose", action="store_true", help="Enable debug logging" + ) + parser.add_argument( + "-q", "--quiet", action="store_true", help="Suppress all output except errors" + ) + + args = parser.parse_args() + + # Configure logging + level = ( + logging.ERROR + if args.quiet + else (logging.DEBUG if args.verbose else logging.INFO) + ) + logging.basicConfig( + level=level, + 
format="[%(asctime)s] %(levelname)s: %(message)s", + datefmt="%H:%M:%S", + ) + + try: + generator = IR2VecTripletGenerator( + args.llvm_build_dir, + args.num_optimizations, + args.output_dir, + args.max_workers, + ) + generator.generate_triplets(args.ll_file_list) + except Exception as e: + logger.error(f"Error: {e}") + sys.exit(1) + + +if __name__ == "__main__": + main() >From 3f8c21f103225716659ed7de8031767ae17bf52b Mon Sep 17 00:00:00 2001 From: svkeerthy <venkatakeer...@google.com> Date: Wed, 16 Jul 2025 23:42:39 +0000 Subject: [PATCH 2/2] Remove generateTriplets.py to move to next PR --- .../mlgo-utils/IR2Vec/generateTriplets.py | 291 ------------------ 1 file changed, 291 deletions(-) delete mode 100644 llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py diff --git a/llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py b/llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py deleted file mode 100644 index 0858d10ce0138..0000000000000 --- a/llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py +++ /dev/null @@ -1,291 +0,0 @@ -# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. -# See https://llvm.org/LICENSE.txt for license information. -# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception -"""IR2Vec Triplet Generator - -Generates IR2Vec triplets by applying random optimization levels to LLVM IR files -and extracting triplets using llvm-ir2vec. Automatically generates preprocessed -files: entity2id.txt, relation2id.txt, and train2id.txt. 
- -Usage: - python generateTriplets.py <llvm_build_dir> <num_optimizations> <ll_file_list> <output_dir> -""" - -import argparse -import logging -import os -import random -import subprocess -import sys -from concurrent.futures import ThreadPoolExecutor, as_completed -from pathlib import Path -from typing import List, Set, Tuple - -# Configuration -OPT_LEVELS = ["O0", "O1", "O2", "O3", "Os", "Oz"] -DEFAULT_MAX_WORKERS = 100 - -logger = logging.getLogger(__name__) - - -class TripletResult: - """Result from processing a single LLVM IR file""" - - __slots__ = ["triplets", "max_relation"] - - def __init__(self, triplets: Set[str], max_relation: int): - self.triplets = triplets - self.max_relation = max_relation - - -class IR2VecTripletGenerator: - """Main class for generating IR2Vec triplets""" - - def __init__( - self, - llvm_build_dir: Path, - num_optimizations: int, - output_dir: Path, - max_workers: int = DEFAULT_MAX_WORKERS, - ): - self.llvm_build_dir = llvm_build_dir - self.num_optimizations = num_optimizations - self.output_dir = output_dir - self.max_workers = max_workers - - # Tool paths - self.opt_binary = os.path.join(llvm_build_dir, "bin", "opt") - self.ir2vec_binary = os.path.join(llvm_build_dir, "bin", "llvm-ir2vec") - - self._validate_setup() - - def _validate_setup(self): - """Validate that all required tools and paths exist""" - if not self.llvm_build_dir.exists(): - raise FileNotFoundError( - f"LLVM build directory not found: {self.llvm_build_dir}" - ) - - if not os.path.isfile(self.opt_binary) or not os.access( - self.opt_binary, os.X_OK - ): - raise FileNotFoundError( - f"opt binary not found or not executable: {self.opt_binary}" - ) - - if not os.path.isfile(self.ir2vec_binary) or not os.access( - self.ir2vec_binary, os.X_OK - ): - raise FileNotFoundError( - f"llvm-ir2vec binary not found or not executable: {self.ir2vec_binary}" - ) - - if not (1 <= self.num_optimizations <= len(OPT_LEVELS)): - raise ValueError( - f"Number of optimizations must be 
between 1-{len(OPT_LEVELS)}"
-            )
-
-        self.output_dir.mkdir(parents=True, exist_ok=True)
-
-    def _select_optimization_levels(self) -> List[str]:
-        """Select unique random optimization levels"""
-        return random.sample(OPT_LEVELS, self.num_optimizations)
-
-    def _process_single_file(self, input_file: Path) -> TripletResult:
-        """Process a single LLVM IR file with multiple optimization levels"""
-        all_triplets = set()
-        max_relation = 1
-        opt_levels = self._select_optimization_levels()
-
-        for opt_level in opt_levels:
-            try:
-                triplets, file_max_relation = self._run_pipeline(input_file, opt_level)
-                if triplets:
-                    all_triplets.update(triplets)
-                    max_relation = max(max_relation, file_max_relation)
-                    logger.debug(
-                        f"Generated {len(triplets)} triplets for {input_file} with {opt_level}"
-                    )
-            except Exception as e:
-                logger.warning(f"Error processing {input_file} with {opt_level}: {e}")
-
-        return TripletResult(all_triplets, max_relation)
-
-    def _run_pipeline(self, input_file: Path, opt_level: str) -> Tuple[Set[str], int]:
-        """Run opt | llvm-ir2vec pipeline elegantly."""
-        pipeline_cmd = (
-            f'"{self.opt_binary}" -{opt_level} "{input_file}" -o - | '
-            f'"{self.ir2vec_binary}" --mode=triplets - -o -'
-        )
-
-        try:
-            result = subprocess.run(
-                pipeline_cmd, shell=True, capture_output=True, text=True, check=True
-            )
-            return self._parse_triplet_output(result.stdout)
-        except subprocess.CalledProcessError:
-            return set(), 1
-
-    def _parse_triplet_output(self, output: str) -> Tuple[Set[str], int]:
-        """Parse triplet output and extract max relation"""
-        if not output.strip():
-            return set(), 1
-
-        lines = output.strip().split("\n")
-        max_relation = 1
-
-        # Extract max relation from metadata line
-        if lines and lines[0].startswith("MAX_RELATION="):
-            max_relation = int(lines[0].split("=")[1])
-            lines = lines[1:]
-
-        # Remove duplicate triplets by converting to a set
-        return set(lines), max_relation
-
-    def generate_triplets(self, file_list: Path) -> None:
-        """Main method to generate triplets from a list of LLVM IR files"""
-        input_files = self._read_file_list(file_list)
-        logger.info(
-            f"Processing {len(input_files)} files with {self.num_optimizations} "
-            f"optimization levels using {self.max_workers} workers"
-        )
-
-        all_triplets = set()
-        global_max_relation = 1
-
-        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
-            future_to_file = {
-                executor.submit(self._process_single_file, file): file
-                for file in input_files
-            }
-
-            for future in as_completed(future_to_file):
-                try:
-                    result = future.result()
-                    all_triplets.update(result.triplets)
-                    global_max_relation = max(global_max_relation, result.max_relation)
-                except Exception as e:
-                    file_path = future_to_file[future]
-                    logger.error(f"Error processing {file_path}: {e}")
-
-        self._generate_output_files(all_triplets, global_max_relation)
-        logger.info("Processing completed successfully")
-
-    def _read_file_list(self, file_list: Path) -> List[Path]:
-        """Read and validate the list of input files"""
-        input_files = []
-        with open(file_list, "r") as f:
-            for line_num, line in enumerate(f, 1):
-                if line := line.strip():
-                    file_path = Path(line)
-                    if file_path.exists():
-                        input_files.append(file_path)
-                    else:
-                        logger.warning(f"File not found (line {line_num}): {file_path}")
-
-        if not input_files:
-            raise ValueError("No valid input files found")
-        return input_files
-
-    def _generate_output_files(self, all_triplets: Set[str], max_relation: int) -> None:
-        """Generate the final output files"""
-        logger.info(f"Generating output files with {len(all_triplets)} unique triplets")
-
-        # Write all output files -- train2id.txt, entity2id.txt, relation2id.txt
-        train2id_file = os.path.join(self.output_dir, "train2id.txt")
-        entity2id_file = os.path.join(self.output_dir, "entity2id.txt")
-        relation2id_file = os.path.join(self.output_dir, "relation2id.txt")
-
-        with open(train2id_file, "w") as f:
-            f.write(f"{len(all_triplets)}\n")
-            f.writelines(f"{triplet}\n" for triplet in all_triplets)
-
-        self._generate_entity2id(entity2id_file)
-        self._generate_relation2id(relation2id_file, max_relation)
-
-    def _generate_entity2id(self, output_file: Path) -> None:
-        """Generate entity2id.txt using llvm-ir2vec"""
-        subprocess.run(
-            [str(self.ir2vec_binary), "--mode=entities", "-o", str(output_file)],
-            check=True,
-            capture_output=True,
-        )
-
-    def _generate_relation2id(self, output_file: Path, max_relation: int) -> None:
-        """Generate relation2id.txt from max relation"""
-        max_relation = max(max_relation, 1)  # At least Type and Next relations
-        num_relations = max_relation + 1
-
-        with open(output_file, "w") as f:
-            f.write(f"{num_relations}\n")
-            f.write("Type\t0\n")
-            f.write("Next\t1\n")
-            f.writelines(f"Arg{i-2}\t{i}\n" for i in range(2, num_relations))
-
-
-def main():
-    """Main entry point"""
-    parser = argparse.ArgumentParser(
-        description="Generate IR2Vec triplets from LLVM IR files",
-        formatter_class=argparse.RawDescriptionHelpFormatter,
-    )
-
-    parser.add_argument(
-        "llvm_build_dir", type=Path, help="Path to LLVM build directory"
-    )
-    parser.add_argument(
-        "num_optimizations",
-        type=int,
-        help="Number of optimization levels to apply (1-6)",
-    )
-    parser.add_argument(
-        "ll_file_list",
-        type=Path,
-        help="File containing list of LLVM IR files to process",
-    )
-    parser.add_argument(
-        "output_dir", type=Path, help="Output directory for generated files"
-    )
-    parser.add_argument(
-        "-j",
-        "--max-workers",
-        type=int,
-        default=DEFAULT_MAX_WORKERS,
-        help=f"Maximum number of parallel workers (default: {DEFAULT_MAX_WORKERS})",
-    )
-    parser.add_argument(
-        "-v", "--verbose", action="store_true", help="Enable debug logging"
-    )
-    parser.add_argument(
-        "-q", "--quiet", action="store_true", help="Suppress all output except errors"
-    )
-
-    args = parser.parse_args()
-
-    # Configure logging
-    level = (
-        logging.ERROR
-        if args.quiet
-        else (logging.DEBUG if args.verbose else logging.INFO)
-    )
-    logging.basicConfig(
-        level=level,
-        format="[%(asctime)s] %(levelname)s: %(message)s",
-        datefmt="%H:%M:%S",
-    )
-
-    try:
-        generator = IR2VecTripletGenerator(
-            args.llvm_build_dir,
-            args.num_optimizations,
-            args.output_dir,
-            args.max_workers,
-        )
-        generator.generate_triplets(args.ll_file_list)
-    except Exception as e:
-        logger.error(f"Error: {e}")
-        sys.exit(1)
-
-
-if __name__ == "__main__":
-    main()
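Note on the two data-shaping steps in the script above: the output parsing and the relation2id.txt layout can be exercised without an LLVM build. The following is a minimal standalone sketch, not the code from the patch. It assumes, as the script's parser does, that triplet output may begin with a MAX_RELATION=<n> line. The function names `parse_triplet_output` and `relation2id_lines` and the sample triplet values are hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple


def parse_triplet_output(output: str) -> Tuple[Set[str], int]:
    """Split triplet output into unique triplet lines and the max relation ID."""
    lines = output.strip().splitlines()
    if not lines:
        return set(), 1

    # Default of 1 mirrors the script: Type (0) and Next (1) always exist.
    max_relation = 1
    if lines[0].startswith("MAX_RELATION="):
        max_relation = int(lines[0].split("=", 1)[1])
        lines = lines[1:]

    # A set drops duplicate triplet lines.
    return set(lines), max_relation


def relation2id_lines(max_relation: int) -> List[str]:
    """Build relation2id.txt content: a count line, then name<TAB>id rows."""
    num_relations = max(max_relation, 1) + 1
    lines = [str(num_relations), "Type\t0", "Next\t1"]
    # Relation IDs from 2 upward are operand positions: Arg0, Arg1, ...
    lines.extend(f"Arg{i - 2}\t{i}" for i in range(2, num_relations))
    return lines


# Made-up sample output (assumption): one header line, one duplicated triplet.
sample = "MAX_RELATION=3\n10\t20\t0\n10\t30\t2\n10\t20\t0\n"
triplets, max_rel = parse_triplet_output(sample)
print(sorted(triplets), max_rel)
print(relation2id_lines(max_rel))
```

Returning the rows as a list instead of writing a file keeps the layout easy to check in isolation; the script itself writes the same rows directly to relation2id.txt.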