Hi Mark, Mark H Weaver <m...@netris.org> writes:
> Hi Maxim, > > Thanks for working on this. I found a problem with this patch, > and I also have a suggestion. Please see below. > > Maxim Cournoyer <maxim.courno...@gmail.com> writes: > >> From 9891e428eae0ed24e0d61862b3f5e298606b79eb Mon Sep 17 00:00:00 2001 >> From: Maxim Cournoyer <maxim.courno...@gmail.com> >> Date: Sun, 14 Jan 2018 20:31:33 -0500 >> Subject: [PATCH] utils: Prevent substitute from crashing on files containing >> NUL chars. >> >> Fixes issue #30116. >> >> * guix/build/utils.scm (substitute): Add condition to skip lines containing >> the NUL character. >> --- >> guix/build/utils.scm | 44 ++++++++++++++++++++++++++------------------ >> 1 file changed, 26 insertions(+), 18 deletions(-) >> >> diff --git a/guix/build/utils.scm b/guix/build/utils.scm >> index 7391307c8..975f4e70a 100644 >> --- a/guix/build/utils.scm >> +++ b/guix/build/utils.scm >> @@ -3,6 +3,7 @@ >> ;;; Copyright © 2013 Andreas Enge <andr...@enge.fr> >> ;;; Copyright © 2013 Nikita Karetnikov <nik...@karetnikov.org> >> ;;; Copyright © 2015 Mark H Weaver <m...@netris.org> >> +;;; Copyright © 2018 Maxim Cournoyer <maxim.courno...@gmail.com> >> ;;; >> ;;; This file is part of GNU Guix. >> ;;; >> @@ -621,28 +622,35 @@ PROC as (PROC LINE MATCHES); PROC must return the line >> that will be written as >> a substitution of the original line. Be careful about using '$' to match >> the >> end of a line; by itself it won't match the terminating newline of a line." >> (let ((rx+proc (map (match-lambda >> - (((? regexp? pattern) . proc) >> - (cons pattern proc)) >> - ((pattern . proc) >> - (cons (make-regexp pattern regexp/extended) >> - proc))) >> + (((? regexp? pattern) . proc) >> + (cons pattern proc)) >> + ((pattern . proc) >> + (cons (make-regexp pattern regexp/extended) >> + proc))) >> pattern+procs))) >> (with-atomic-file-replacement file >> (lambda (in out) >> (let loop ((line (read-line in 'concat))) >> - (if (eof-object? line) >> - #t >> - (let ((line (fold (lambda (r+p line) >> - (match r+p >> - ((regexp . proc) >> - (match (list-matches regexp line) >> - ((and m+ (_ _ ...)) >> - (proc line m+)) >> - (_ line))))) >> - line >> - rx+proc))) >> - (display line out) >> - (loop (read-line in 'concat))))))))) >> + (cond >> + ((eof-object? line) >> + #t) >> + ((string-contains line (make-string 1 #\nul)) >> + ;; The regexp functions of the GNU C library (which Guile uses) >> + ;; cannot deal with NUL characters, so skip to the next line. >> + (format #t "skipping line with NUL characters: ~s\n" line) >> + (loop (read-line in 'concat))) > > This code will unconditionally *delete* all lines that contain NULs. > > This will happen because the lines with NULs are not being written to > the output file, which will replace the original file when this loop > reaches EOF. So, any lines that are not copied to the output will be > lost. > > To preserve the lines with NULs, you should call (display line out) > before calling 'loop'. Good observation! I agree that we should keep limit the effect of ignoring NULs only to the substitution. > Also, please use (string-index line #\nul) to check for NULs instead of > 'string-contains'. It should be more efficient. OK! > >> + (else >> + (let ((line (fold (lambda (r+p line) >> + (match r+p >> + ((regexp . proc) >> + (match (list-matches regexp line) >> + ((and m+ (_ _ ...)) >> + (proc line m+)) >> + (_ line))))) >> + line >> + rx+proc))) >> + (display line out) >> + (loop (read-line in 'concat)))))))))) > > With the changes suggested above, I would have no objection to pushing > this to core-updates. However, it occurs to me that we could handle the > NUL case in a better way: > > Since the C regex functions that we use cannot handle NUL bytes, we > could use a different code point to represent NUL during those > operations. We could choose a code point from one of the Unicode > Private Use Areas <https://en.wikipedia.org/wiki/Private_Use_Areas> that > does not occur in the string. > > Let NUL* be the code point which will represent NUL bytes. First > replace all NULs with NUL*s, then perform the substitutions, and finally > replace all ALT*s with NULs before writing to the output. Do I understand this transformation as NULs -> NUL*s and back from NUL*s -> NULs correctly? I'm not sure how NUL*s became ALT*s in your explanation. > As an important optimization, we should avoid performing these extra > operations unless (string-index line #\nul) finds a NUL. OK. > We could then perform these extra substitutions with simple, inefficient > code. Maybe something like this (untested): > > (with-atomic-file-replacement file > (lambda (in out) > (let loop ((line (read-line in 'concat))) > (if (eof-object? line) > #t > (let* ((nul* (or (and (string-index line #\nul) > (unused-private-use-code-point line)) > #\nul)) > (line* (replace-char #\nul nul* line)) > (line1* (fold (lambda (r+p line) > (match r+p > ((regexp . proc) > (match (list-matches regexp line) > ((and m+ (_ _ ...)) > (proc line m+)) > (_ line))))) > line* > rx+proc)) > (line1 (replace-char nul* #\nul line1*))) > (display line1 out) > (loop (read-line in 'concat))))))))) > > > Where the following additional private procedures would be added to > (guix build utils) above the definition for 'substitute': > > (define (unused-private-use-code-point s) > "Find a code point within a Unicode Private Use Area that is not > present in S, and return the corresponding character object. If one > cannot be found, return false." > (define (scan lo hi) > (and (<= lo hi) > (let ((c (integer->char lo))) > (if (string-index s c) > (scan (+ lo 1) hi) > c)))) > (or (scan #xE000 #xF8FF) > (scan #xF0000 #xFFFFD) > (scan #x100000 #x10FFFD))) > > (define (replace-char c1 c2 s) > "Return a string which is equal to S except with all instances of C1 > replaced by C2. If C1 and C2 are equal, return S." > (if (char=? c1 c2) > s > (string-map (lambda (c) > (if (char=? c c1) > c2 > c)) > s))) > > What do you think? It raises the complexity level a bit for something which doesn't seem to be a very common scenario, but otherwise seems a very elegant workaround. It seems to me that your implementation is already pretty complete. I'll try write a test for validating it and report back. Thank you for sharing your ideas! Maxim