Hi,
gfortran has a few long-standing bugs wrt module handling. The more
fundamental, and also more difficult to fix, issue is that we re-read
and re-parse module files every time a USE statement is encountered,
instead of once per translation unit. See PR 25708. Another issue, PR
40958, is that module files can be quite big which exacerbates the PR
25708 issues.
The attached patch fixes PR 40958 by compressing the module files with
zlib and storing them in the gzip format (RFC 1952). I chose zlib
because it's a) ubiquitous and b) there's already a copy of zlib in
the GCC source tree, so this doesn't introduce any further build
dependencies. Since the mod files with the patch are in the gzip
format, one can use tools like zcat, zless, zgrep, zdiff etc. to
inspect the uncompressed contents easily (one can also use gunzip if
one first copies the module file to a temporary file with .gz
extension).
However, there's a couple of issues related to seeking in gzip files
(gzseek() instead of fseek() which is currently used). One is fixed by
the patch, the other is a potentially serious performance issue.
First, for a writable gzip file, seeking backwards is not allowed.
Currently when writing a module file, we first write a placeholder for
the MD5, then write the actual module content while updating the MD5
sum in memory as we go, and finally we seek back and write the final
MD5 value. However, the gzip file format contains a solution, 8 bytes
from the end of the file a CRC32 checksum of the (uncompressed)
content is stored. So the patch rips out the MD5 machinery, and
instead compares these CRC32 checksums to determine whether to replace
an existing module file or not (from the command line, one can check
the CRC32 with 'zcat -l -v filename'). As a result, the module version
number has been bumped as well.
The second issue that the patch doesn't address in any way, is that
while seeking on a gzip file in read mode is allowed, from zlib.h: If
the file is opened for reading, this function is emulated but can be
extremely slow.. Unfortunately, when reading a module file we do seek
back and forth in it. Based on a brief inspection of the code, most if
not all of these seeks are for a very short distance (typically peek a
few bytes ahead in the stream, then seek back), and if the gzseek()
function is somewhat clever about seeking within the read buffer, this
might not be so slow after all. OTOH, if every gzseek() call means
restarting the inflation from the beginning of the file, the impact
could be quite bad.
The patch passes regression testing except for one failure,
module_md5_1.f90 which should be removed. Based on some quick testing,
the size of module files are reduced by a factor of 5 or thereabouts.
I haven't checked performance, in particular one would need to check
the second issue described above for some of those testcases
generating large module files. I think there was some single-file
version of cp2k somewhere that could be used for this, or are there
other appropriate tests somewhere that aren't too difficult to set up?
So at the moment, I'm not proposing this patch for inclusion, consider
it a RFC. Especially appropriate benchmark results and/or pointers to
easy-to-set-up testcases are appreciated.
In case the seeking in read mode is an issue, I suspect it wouldn't be
too hard to fix the parsing to not require it, but I think that would
push the patch more towards 4.8 material.
--
Janne Blomqvist
diff --git a/gcc/fortran/gfortran.h b/gcc/fortran/gfortran.h
index 17ebd58..d6152b3 100644
--- a/gcc/fortran/gfortran.h
+++ b/gcc/fortran/gfortran.h
@@ -29,6 +29,9 @@ along with GCC; see the file COPYING3. If not see
multiple header files. Besides, Microsoft's winnt.h was 250k last
time I looked, so by comparison this is perfectly reasonable. */
+#include config.h
+#include system.h
+
/* Declarations common to the front-end and library are put in
libgfortran/libgfortran_frontend.h */
#include libgfortran.h
@@ -38,6 +41,7 @@ along with GCC; see the file COPYING3. If not see
#include coretypes.h
#include input.h
#include splay-tree.h
+#include zlib.h
/* Major control parameters. */
@@ -2345,7 +2349,8 @@ void gfc_add_include_path (const char *, bool, bool);
void gfc_add_intrinsic_modules_path (const char *);
void gfc_release_include_path (void);
FILE *gfc_open_included_file (const char *, bool, bool);
-FILE *gfc_open_intrinsic_module (const char *);
+gzFile gfc_gzopen_included_file (const char *, bool, bool);
+gzFile gfc_open_intrinsic_module (const char *);
int gfc_at_end (void);
int gfc_at_eof (void);
diff --git a/gcc/fortran/module.c b/gcc/fortran/module.c
index 62f7598..9fa8c97 100644
--- a/gcc/fortran/module.c
+++ b/gcc/fortran/module.c
@@ -72,15 +72,15 @@ along with GCC; see the file COPYING3. If not see
#include arith.h
#include match.h
#include parse.h /* FIXME */
-#include md5.h
#include constructor.h
#include cpp.h
+#include zlib.h
#define