On Tuesday 24 February 2004 03.07, John Meacham wrote:
> Inspired by an idea by Andrew Pang and an old project of mine, I decided
> to fill out a reusable regular expression library which is similar to
> Perl's, but much more expressive.
> ...

Hi,

Thanks! I am grateful for your efforts, because I have long missed some typical text processing functionality in Haskell. (Besides a more complete regular expression library, I think many Haskellers miss string constructors like an 'official' version of printf/format and maybe also some sort of 'here documents'.)

Below follow some of my thoughts regarding a complete regex library, in the hope that they will be of some inspiration.

1. Replacement

A regex library must contain functions for replacement with regular expressions. One might think this is trivial to implement given a match function, but there are some tricky choices to be made regarding empty matches (this also applies to splitting a string into fields with a regexp). There are also questions about the interface of these functions. In my own Text.Regex wrapper I have the following functions:

    data Match = Match {before :: String, after :: String, groups :: [String]}
    ...
    substWithPat  :: Regex -> String -> (Int -> Bool) -> String -> (String, Int)
    substWithFun  :: Regex -> (Match -> String) -> (Int -> Bool) -> String -> (String, Int)
    substWithFunM :: (Monad m) => Regex -> (Match -> m String) -> (Int -> Bool) -> String -> m (String, Int)

A call to 'substWithPat pat rpat mode str' replaces matches of 'pat' in 'str' by 'rpat' and returns the resulting string and the number of replacements done. The replacement pattern 'rpat' can contain backreferences of the form \m, where \m refers to the mth subgroup in the corresponding match (\0 refers to the entire match).

The call replaces only 'replaceable matches'. A match m is replaceable if it is the nth match, (mode n) is true, and m is either the first match, a proper (non-empty) match, or an empty match succeeding an empty match. This scheme gives results which conform to the replace functionality in several other regex libraries, e.g. in Tcl, Python and Perl.
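
The replaceability rule above can be sketched without any regex library. This is only an illustration under stated assumptions, not the wrapper's actual code: 'Matcher' is a hypothetical stand-in for a compiled regex that reports the length of a match anchored at the head of the remaining input, 'substSketch', 'manyUnderscores' and 'someUnderscores' are made-up names, backreferences are left out, and n is assumed to count every match found, replaceable or not.

```haskell
-- Hypothetical stand-in for a compiled regex: given the remaining input,
-- return the length of a match anchored at its head, if there is one.
type Matcher = String -> Maybe Int

-- What kind of match, if any, was found most recently during the scan.
data Prev = NoMatch | EmptyMatch | ProperMatch
  deriving Eq

-- Replace 'replaceable' matches by a fixed string and count the replacements.
-- Assumption: n numbers all matches, including the ones that are skipped.
substSketch :: Matcher -> String -> (Int -> Bool) -> String -> (String, Int)
substSketch match rep mode = go 1 NoMatch
  where
    go :: Int -> Prev -> String -> (String, Int)
    go n prev s = case match s of
      -- No match here: copy one character and try again at the next position.
      Nothing -> advance (go n prev) s
      Just len ->
        let (matched, rest) = splitAt len s
            isEmpty = len == 0
            -- The rule from the text: first match, proper match,
            -- or an empty match whose predecessor was also empty.
            replaceable =
              mode n && (prev == NoMatch || not isEmpty || prev == EmptyMatch)
            out = if replaceable then rep else matched
            (more, count) =
              if isEmpty
                then advance (go (n + 1) EmptyMatch) rest
                else go (n + 1) ProperMatch rest
        in (out ++ more, count + fromEnum replaceable)

    -- Copy one character through unchanged so the scan always makes progress.
    advance :: (String -> (String, Int)) -> String -> (String, Int)
    advance _ [] = ([], 0)
    advance k (c : cs) = let (o, m) = k cs in (c : o, m)

-- Stand-in for the pattern "_*": always matches, possibly with length 0.
manyUnderscores :: Matcher
manyUnderscores = Just . length . takeWhile (== '_')

-- Stand-in for the pattern "_+": matches only a non-empty run.
someUnderscores :: Matcher
someUnderscores s = case length (takeWhile (== '_') s) of
  0 -> Nothing
  k -> Just k
```

With these definitions, 'substSketch manyUnderscores "_" (const True) "awk"' evaluates to ("_a_w_k_", 4), and 'substSketch someUnderscores "-" (== 2) "a_b_c"' evaluates to ("a_b-c", 1). Whether an empty match at the very end of the input counts as replaceable is one of the choices this sketch makes by reading the rule literally; the wrapper itself may differ there.
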
For example, replacing matches of "_*" by "_" in "awk" gives "_a_w_k_", and replacing matches of "_*" by "_" in "sed_and_awk" gives "_s_e_d_a_n_d_a_w_k". (Compare the discussion in 'Mastering Regular Expressions', O'Reilly, pages 187-188.)

The functions substWithFun and substWithFunM are obvious variations on the substWithPat function.

2. Constructing regular expressions

There is the well-known problem that the backslash is used both as a string escape character and as a regexp operator. I know of three approaches to the problem:

a) Bite the bullet and write regexps like "\\\\" in order to match a single backslash (e.g. as in Emacs Lisp).

b) Use a language extension for 'raw' strings where the backslash is not interpreted (e.g. /regex/ in awk, r"regex" in Python and {regex} in Tcl).

c) Use a different operator than the backslash in regular expressions. This has the benefit of not demanding a language extension, but the drawback of being nonstandard.

There is also the problem of inserting string values into regular expressions. Appending with ++ is not particularly convenient with complicated regexps, because the result can be rather unreadable. I suppose we have to wait for a standard implementation of printf in Template Haskell for this problem.

Cheers
Per

_______________________________________________
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell