Hi all,
Here's one more patch to bring Irregex completely up to date with the
just-released 0.9.10 version. It fixes an issue where "bol" was handled
inconsistently around newline characters, so that successive matches in
the same string could overlap.
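For illustration, here is a small sketch of the behaviour the patch is meant to guarantee. It simply mirrors the two test cases added to tests/test-irregex.scm below and assumes a CHICKEN 5 build with this patch applied (hence the (chicken irregex) import). The names bol-lines and eol-blahs are only for this example.

```scheme
(import (chicken irregex))

;; Every line should be extracted exactly once, even though the input
;; starts with a newline and each match consumes its trailing newline.
(define bol-lines
  (irregex-extract '(: bol (+ alpha) newline) "\nfoo\nbar\nbaz\n"))

;; The mirror case: each match consumes its leading newline and must
;; end at eol without overlapping the next match.
(define eol-blahs
  (irregex-extract '(: newline "blah" eol) "\nblah\nblah\nblah\n"))

(write bol-lines) (newline)
(write eol-blahs) (newline)
```

According to the new tests, the first call should yield ("foo\n" "bar\n" "baz\n") and the second ("\nblah" "\nblah" "\nblah").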
Cheers,
Peter
From efe932f4fa7afbc56865d33edfbf6836c34ce919 Mon Sep 17 00:00:00 2001
From: Peter Bex
Date: Tue, 6 Jul 2021 15:15:34 +0200
Subject: [PATCH] Bump irregex to upstream commit 29334af, bringing us to
version 0.9.10
This fixes upstream ticket #25, where inconsistent handling caused
newlines to overlap with "bol" in situations where a string matches
multiple times.
---
NEWS | 6 ++++--
irregex-core.scm | 16 +++++++++++-----
tests/test-irregex.scm | 4 ++++
3 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/NEWS b/NEWS
index 53a40f0f..2e254e48 100644
--- a/NEWS
+++ b/NEWS
@@ -6,15 +6,17 @@
- Fixed a bug where optimisations for `irregex-match?` would cause
runtime errors due to the inlined specialisations not being
fully-expanded (see #1690).
- - Irregex has been updated to upstream 0.9.9, which fixes behaviour
+ - Irregex has been updated to upstream 0.9.10, which fixes behaviour
of irregex-replace/all with positive lookbehind so all matches are
replaced instead of only the first (reported by Kay Rhodes), and
a regression regarding replacing empty matches which was introduced
by the fixes in 0.9.7 (reported by Sandra Snan). Also, the
http-url shorthand now allows any top-level domain and the old
"top-level-domain" now also supports "edu" (fixed by Sandra Snan).
- Finally, a problem was fixed with capturing groups inside a kleene
+ Also, a problem was fixed with capturing groups inside a kleene
star, which could sometimes return incorrect parts of the match.
+ Finally, "bol" handling was fixed to handle newlines consistently
+ so that multiple matches don't overlap (reported by Sandra Snan).
- current-milliseconds has been deprecated in favor of the name
current-process-milliseconds, to avoid confusion due to naming
of current-milliseconds versus current-seconds, which do something
diff --git a/irregex-core.scm b/irregex-core.scm
index f86b7992..55e9a6c0 100644
--- a/irregex-core.scm
+++ b/irregex-core.scm
@@ -30,6 +30,10 @@
History
+;; 0.9.10: 2021/07/06 - fixes for submatches under kleene star, empty seqs
+;; in alternations, and bol in folds for backtracking
+;; matcher (thanks John Clements and snan for reporting
+;; and Peter Bex for fixing)
;; 0.9.9: 2021/05/14 - more comprehensive fix for repeated empty matches
;; 0.9.8: 2020/07/13 - fix irregex-replace/all with look-behind patterns
;; 0.9.7: 2019/12/31 - more intuitive handling of empty matches in -fold,
@@ -3508,9 +3512,10 @@
(fail
((bol)
(lambda (cnk init src str i end matches fail)
- (if (or (and (eq? src (car init)) (eqv? i (cdr init)))
- (and (> i ((chunker-get-start cnk) src))
- (eqv? #\newline (string-ref str (- i 1)))))
+ (if (let ((ch (if (> i ((chunker-get-start cnk) src))
+ (string-ref str (- i 1))
+ (chunker-prev-char cnk init src))))
+ (or (not ch) (eqv? #\newline ch)))
(next cnk init src str i end matches fail)
(fail
((bow)
@@ -3908,13 +3913,14 @@
matches)))
(if (not m)
(finish from acc)
- (let ((j (%irregex-match-end-index m 0))
+ (let ((j-start (%irregex-match-start-index m 0))
+ (j (%irregex-match-end-index m 0))
(acc (kons from m acc)))
(irregex-reset-matches! matches)
(cond
((flag-set? (irregex-flags irx) ~consumer?)
(finish j acc))
- ((= j i)
+ ((= j j-start)
;; skip one char forward if we match the empty string
(lp (list str j end) j (+ j 1) acc))
(else
diff --git a/tests/test-irregex.scm b/tests/test-irregex.scm
index 5cf5b685..0888f09b 100644
--- a/tests/test-irregex.scm
+++ b/tests/test-irregex.scm
@@ -451,6 +451,10 @@
(irregex-extract (irregex "[aeiou]*") "foobarbaz"))
(test-equal '("Line 1\n" "Line 2\n" "Line 3")
(irregex-split 'bol "Line 1\nLine 2\nLine 3"))
+ (test-equal '("foo\n" "bar\n" "baz\n")
+ (irregex-extract '(: bol (+ alpha) newline) "\nfoo\nbar\nbaz\n"))
+ (test-equal '("\nblah" "\nblah" "\nblah")
+ (irregex-extract '(: newline "blah" eol) "\nblah\nblah\nblah\n"))
)
--
2.20.1