[Cocci] [PATCH v2 0/3] parsing_c: Optimize recursive header file parsing

Jaskaran Singh Wed, 09 Sep 2020 23:19:23 -0700

This patch series aims to optimize performance for recursively parsing
header files in Coccinelle.


Coccinelle's C parsing subsystem has an option called --recursive-includes
to recursively parse header files. This is used for type
inference/annotation.

Previously, using --recursive-includes on the entire Linux kernel source
code would take far too long. On my computer with the following specs,
        - Processor: AMD Ryzen 5 3550H
        - RAM: 8 GB
it would take close to 7 hours to complete.  The optimization that this
patch series implements reduces that time to 1 hour.

The following is a high-level description of what has been implemented:
- As header files are recursively parsed, they are scanned for the
  following:
        - fields of structs/unions/enums
        - typedefs
        - function prototypes
        - global variables
  The names of the above are stored in a "name cache", i.e. a hashtable to
  map the name to the files it is declared in.
- A dependency graph is built to determine dependencies between all the
  files in the codebase.
- In the type annotation phase of the C subsystem, if a function call,
  struct/union field or identifier is encountered, the type of which is
  not known to the annoter, the name cache is checked for the name.
- The name cache gives a list of files that the name is declared/defined
  in.  These files are cross checked with the dependency graph to
  determine if any of these are reachable by the file that the annoter is
  working on.
- If a reachable header file is found, that file is parsed and the type
  associated to the name is returned.

Different approaches that were attempted to alleviate this issue, and the
problems with each are as follows:
- Caching the most recently used files: A LRU cache to store ASTs of the
  most recently encountered header files. The problem with this approach
  is the amount of memory it takes to cache the header file ASTs.
- Caching the most troublesome files: A pseudo-LFU cache to store files
  that cumulatively take the longest to parse, and thus bloat the time
  taken. The problem with this approach is the amount of memory it takes
  to cache the header file ASTs.
- Skipping unparsable locations in header files: Skipping top-level items
  in a header file that cannot be parsed. This approach does not produce
  even close to the amount of optimization needed.

The next step from here would be:
- Maintain a small but persistent cache of header files in groups of
  directories. Leverage multiprocessing for parsing these header files.
- Leverage multiprocessing to parse header files initially for name
  extraction.
- Performing some initial matching with the semantic patch to determine if
  a C file matches. If matches are found, call the annoter and recursively
  parse header files for type annotation.
- Recursively parse all header files only once and build a large type
  environment. Use the dependency graph to determine reachability. This
  has potential memory usage issues though.


Changes in v2:
--------------
- Change occurences of 'begin' and 'match' on the same line with something else
  to the next line for better readability.


 Makefile                     |    2 
 parsing_c/includes_cache.ml  |  286 +++++++++++++++++++++++++++++++++++++++++++
 parsing_c/includes_cache.mli |   47 +++++++
 parsing_c/parse_c.ml         |   27 +++-
 parsing_c/type_annoter_c.ml  |  130 ++++++++++++++++---
 5 files changed, 466 insertions(+), 26 deletions(-)


_______________________________________________
Cocci mailing list
[email protected]
https://systeme.lip6.fr/mailman/listinfo/cocci

[Cocci] [PATCH v2 0/3] parsing_c: Optimize recursive header file parsing

Reply via email to