This patch series aims to optimize performance for recursively parsing header files in Coccinelle.
Coccinelle's C parsing subsystem has an option called --recursive-includes to recursively parse header files. This is used for type inference/annotation. Previously, using --recursive-includes on the entire Linux kernel source code would take far too long. On my computer with the following specs, - Processor: AMD Ryzen 5 3550H - RAM: 8 GB it would take close to 7 hours to complete. The optimization that this patch series implements reduces that time to 1 hour. The following is a high-level description of what has been implemented: - As header files are recursively parsed, they are scanned for the following: - fields of structs/unions/enums - typedefs - function prototypes - global variables The names of the above are stored in a "name cache", i.e. a hashtable to map the name to the files it is declared in. - A dependency graph is built to determine dependencies between all the files in the codebase. - In the type annotation phase of the C subsystem, if a function call, struct/union field or identifier is encountered, the type of which is not known to the annoter, the name cache is checked for the name. - The name cache gives a list of files that the name is declared/defined in. These files are cross checked with the dependency graph to determine if any of these are reachable by the file that the annoter is working on. - If a reachable header file is found, that file is parsed and the type associated to the name is returned. Different approaches that were attempted to alleviate this issue, and the problems with each are as follows: - Caching the most recently used files: A LRU cache to store ASTs of the most recently encountered header files. The problem with this approach is the amount of memory it takes to cache the header file ASTs. - Caching the most troublesome files: A pseudo-LFU cache to store files that cumulatively take the longest to parse, and thus bloat the time taken. The problem with this approach is the amount of memory it takes to cache the header file ASTs. - Skipping unparsable locations in header files: Skipping top-level items in a header file that cannot be parsed. This approach does not produce even close to the amount of optimization needed. The next step from here would be: - Maintain a small but persistent cache of header files in groups of directories. Leverage multiprocessing for parsing these header files. - Leverage multiprocessing to parse header files initially for name extraction. - Performing some initial matching with the semantic patch to determine if a C file matches. If matches are found, call the annoter and recursively parse header files for type annotation. - Recursively parse all header files only once and build a large type environment. Use the dependency graph to determine reachability. This has potential memory usage issues though. Changes in v2: -------------- - Change occurences of 'begin' and 'match' on the same line with something else to the next line for better readability. Makefile | 2 parsing_c/includes_cache.ml | 286 +++++++++++++++++++++++++++++++++++++++++++ parsing_c/includes_cache.mli | 47 +++++++ parsing_c/parse_c.ml | 27 +++- parsing_c/type_annoter_c.ml | 130 ++++++++++++++++--- 5 files changed, 466 insertions(+), 26 deletions(-) _______________________________________________ Cocci mailing list Cocci@systeme.lip6.fr https://systeme.lip6.fr/mailman/listinfo/cocci