Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell
On Tue, 24 Feb 2004 07:18:58 -0800 (PST), Hal Daume III <[EMAIL PROTECTED]> wrote: >just as another sample point... > >i write 99% of my code in either haskell or perl. haskell tends to be for >the longer programs, perl tends to be for the shorter ones, though the >decision is primarily made for only one reason: > > - if the overhead to write the string processing code in haskell >is outweighed by the overall length of the program, use haskell. >otherwise, use perl. > >i would be very very happy to abandon perl all together, but, for the most >part, this isn't a niche haskell has been able to fit well in to yet. Another sample point: I hacked together a perl script to do a particular task in about 30 minutes, including fixing algorithmic issues with the problem I wanted to solve. I then thought I'd try porting it to Haskell; I started out by doing the really dumb conversion of mutable variables to IORefs, hashes to FiniteMaps, and Perl regular expressions to Text.Regex (i.e. GNU extended regexps). I'd forgotten about this thread at the time, otherwise I might have tried one of the cleverer options. Some observations: (1) It took me several hours to get it working. Mostly this was because debugging was difficult - firstly, I got an unhelpful type error message from GHC followed by problems with making the code I developed with GHC 5 work with GHC 6 so I could show someone else the problem. Then I had a syntax error in one of my regular expressions, which led to a run-time error with no information about which regular expression the error was in or where the error was. Finally debugging semantic problems with the regular expressions wasn't very pleasant. (2) The code ran three times as slowly. Profiling it suggests that the time is being wasted in the regexp matches; quite possibly the main cost is in marshalling Haskell strings to C strings. The comments in Text.Regex.Posix suggest a PackedString interface should be provided; I should try making one and seeing if things are better. (3) The code was twice as long. Mostly this was for obvious reasons; the translation of mutable variables to IORefs leads to some overhead in reading from them, and perl has nice syntax for manipulating hashes. I don't really have any point, except that it would be nice if it hadn't turned out that Perl was clearly the better choice :-/ Ganesh ___ Haskell mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell
Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell
On Tue, Feb 24, 2004 at 05:24:41PM -, Simon Marlow wrote: > [Graham Klyne wrote:] > > I recently ran into some problems porting some Haskell code > > to Windows > > because it used the Text.Regex library, which is dependent on > > a Unix-only system. > > Text.Regex works fine on Windows, at least on GHC. It seems that Hugs > doesn't make it available, perhaps because it requires an auxiliary C > regex engine, which we provide as part of GHC. It may be that this code > isn't as portable as the rest of Hugs, which is why they don't provide > it, but I'll let the Huge folks comment on that. It looks possible. The only awkwardness I can see is telling the Hugs package converter where to find regex.h ___ Haskell mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell
RE: [Haskell] regular expression syntax - perl ain't got nothin on haskell
At 17:24 24/02/04 +, Simon Marlow wrote: > At 18:07 23/02/04 -0800, John Meacham wrote: > >It provides regular and monadic versions, a very overloaded > and useful > >interface, as well as extensibility. although currently the only > >instance is based on Text.Regex, it generalizes to matching lists of > >arbitrary type, not just strings, and also leaves the door open for > >compile-time checked and optimized regular expressions via template > >Haskell. > > I have a concern here that I hope you won't see as a negative > response to > your ideas... > > I recently ran into some problems porting some Haskell code > to Windows > because it used the Text.Regex library, which is dependent on > a Unix-only system. Text.Regex works fine on Windows, at least on GHC. It seems that Hugs doesn't make it available, perhaps because it requires an auxiliary C regex engine, which we provide as part of GHC. It may be that this code isn't as portable as the rest of Hugs, which is why they don't provide it, but I'll let the Huge folks comment on that. Ah, yes, maybe so. The last communication I have on this is: http://www.haskell.org//pipermail/hugs-bugs/2004-January/001465.html #g Graham Klyne For email: http://www.ninebynine.org/#Contact ___ Haskell mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell
Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell
On Tue, Feb 24, 2004 at 09:01:46AM +, Graham Klyne wrote: > I recently ran into some problems porting some Haskell code to Windows > because it used the Text.Regex library, which is dependent on a Unix-only > system. If Haskell is to embrace the use of Regexes to a level comparable > with (say) Perl, I think it is important that the underlying library is > available on all platforms on which Haskell is supported, or I fear that a > large swathe of Haskell support libraries will be unavailable to programs > running on Windows. > > So my point would be that, to be most useful to the Haskell community, your > effort should also be supported by a Text.Regex implementation (or > alternative) for Windows systems. My code mainly defines a re-usable syntax and includes a 'sample' implementation based on Text.Regex. I expect there will be other more interesting implementations such as a pure haskell version and a PCRE one. I am sure something that works on windows will be available if the library becomes at all popular. Perhaps implementations isn't the best word, since their features might vary wildly. (PCRE vs. posix) for example. John -- --- John Meacham - California Institute of Technology, Alum. - [EMAIL PROTECTED] --- ___ Haskell mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell
Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell
I have updated my code as well as made a simple template haskell version. A Wikified homepage can be found at http://haskell.org/hawiki/RegexSyntax I have found that the syntax combines really really really well with pattern guards. f s | Just 3 <- s =~~ "i/john" = print "exactly 3 johns found." | False <- s =~ "ba*r" = print "no baaars found" | Just a <- s =~~ "([a-z]*)/(.*)/([a-z]*)" = print $ concat [a!1 ++ a!3, " are the flags. ", a!2, " is the expression"] f _ = print "nothing found!" I am still tweaking the code so expect it to change without notice over the next few days. the template haskell version can be used by importing RegexTH then doing s =~ $(mkRE "i/john") right now, all it does is check syntax and give an error if the regular expression is invalid at compile time, but in the future, it will probably build optimized matchers at compile time. A problem is still that lots of type annotations are needed to resolve the overloading, but it is still okay to use, if anyone has any suggestions on mitigating this, then that would be good. something i have found to be useful, is to use the minimal return type necessary for the task at hand, since any unmatched values will probably need type annotations. John -- --- John Meacham - California Institute of Technology, Alum. - [EMAIL PROTECTED] --- ___ Haskell mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell
Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell
Hal Daume III wrote: > p.s., certainly this is at least somewhat unique to me, but almost all of > the data i work with is unstructured text for two reasons. first, that's > how it naturally comes. second, to throw xml or some other scheme on to > it will balloon the data sizes to unmanagable amounts, with little gain. There's a pretty big gap between *unstructured* text and e.g. XML. Most of what fits into that gap is essentially structured text. If you're performing some kind of processing on the text, the odds are that it does actually have some degree of structure to it. My experience of code which does ad-hoc text processing using regexps or similar is that a lot of it only handles a subset of what it ought to, and that subset is typically defined by the nature of the technique. Some examples of this issue are code which attempts to: + match C-style string literals, but falls down on an embedded \" sequence; + match code tokens, but matches the same sequence of characters when they occur inside string literals; + process email headers, but falls down on folded headers; + process HTML, but falls down in more ways than I could possibly list. Except in the most trivial cases, to process text *reliably* you usually need to at least tokenise it and process the token stream. And anything which has a more complex structure usually needs to operate (at least conceptually) on a parse tree. Regexps certainly have their place, although that's primarily in writing tokenisers. IMHO, try to do everything (or, at least, too much) using s/pattern/replacement/ constructs seems to be a favourite recipe for buggy code. Case in point: the regular occurrence of cross-site scripting, SQL injection, printf() and similar issues on lists such as BugTraq. -- Glynn Clements <[EMAIL PROTECTED]> ___ Haskell mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell
RE: [Haskell] regular expression syntax - perl ain't got nothin on haskell
> At 18:07 23/02/04 -0800, John Meacham wrote: > >It provides regular and monadic versions, a very overloaded > and useful > >interface, as well as extensibility. although currently the only > >instance is based on Text.Regex, it generalizes to matching lists of > >arbitrary type, not just strings, and also leaves the door open for > >compile-time checked and optimized regular expressions via template > >Haskell. > > I have a concern here that I hope you won't see as a negative > response to > your ideas... > > I recently ran into some problems porting some Haskell code > to Windows > because it used the Text.Regex library, which is dependent on > a Unix-only system. Text.Regex works fine on Windows, at least on GHC. It seems that Hugs doesn't make it available, perhaps because it requires an auxiliary C regex engine, which we provide as part of GHC. It may be that this code isn't as portable as the rest of Hugs, which is why they don't provide it, but I'll let the Huge folks comment on that. Cheers, Simon ___ Haskell mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell
Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell
At 16:33 24-2-04 +0100, Johannes Waldmann wrote: Per Larsson wrote: But I can't see why the haskell user shouldn't also have access to concise text processing notations, e.g. regular expressions and printf, I was not implying it should be forbidden, rather I meant to give a reason why text processing seems to be less common in typical Haskell programs. > to be used in, e.g. short script-like programs? sure. but why do we need text processing in scripting? because the usual OS shells have `string' as their only data type. we better change that :-) anyone for a fully typed Haskell shell? I am sorry, but I couldn't resist answering your question: have a look at Arjen van Weelden's Famke system [1,2], who has done exactly that (in Clean): Regards, Peter Achten [1] Arjen van Weelden and Rinus Plasmeijer. Towards a Strongly Typed Functional Operating System. In Peña, R. ed. Proceedings 14th International Workshop on the Implementation of Functional Languages, IFL 2002, Selected Papers, Madrid, Spain, September 16-18, 2002, Springer Verlag, LNCS 2670. ftp://ftp.cs.kun.nl/pub/Clean/papers/2003/vWeA2003-Famke.pdf [2] Arjen van Weelden and Rinus Plasmeijer. A Functional Shell that Dynamically Combines Compiled Code. Submitted to Selected Papers Review of Proceedings 15th International Workshop on the Implementation of Functional Languages, IFL 2003, Edinburgh, Scotland, September 8-10, 2003 ftp://ftp.cs.kun.nl/pub/Clean/papers/2003/vWeA2003-Esther.pdf ___ Haskell mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell
Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell
Per Larsson wrote: But I can't see why the haskell user shouldn't also have access to concise text processing notations, e.g. regular expressions and printf, I was not implying it should be forbidden, rather I meant to give a reason why text processing seems to be less common in typical Haskell programs. > to be used in, e.g. short script-like programs? sure. but why do we need text processing in scripting? because the usual OS shells have `string' as their only data type. we better change that :-) anyone for a fully typed Haskell shell? -- -- Johannes Waldmann, Tel/Fax: (0341) 3076 6479 / 6480 -- -- http://www.imn.htwk-leipzig.de/~waldmann/ - ___ Haskell mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell
Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell
just as another sample point... i write 99% of my code in either haskell or perl. haskell tends to be for the longer programs, perl tends to be for the shorter ones, though the decision is primarily made for only one reason: - if the overhead to write the string processing code in haskell is outweighed by the overall length of the program, use haskell. otherwise, use perl. i would be very very happy to abandon perl all together, but, for the most part, this isn't a niche haskell has been able to fit well in to yet. - hal p.s., certainly this is at least somewhat unique to me, but almost all of the data i work with is unstructured text for two reasons. first, that's how it naturally comes. second, to throw xml or some other scheme on to it will balloon the data sizes to unmanagable amounts, with little gain. On Tue, 24 Feb 2004, Johannes Waldmann wrote: > Per Larsson wrote: > > > .. I have since long missed some > > typical text processing functionality in haskell. > > it is often the case that people process "text" > only because they have no better (structured and typed) way of > representing their data... > -- Hal Daume III | [EMAIL PROTECTED] "Arrest this man, he talks in maths." | www.isi.edu/~hdaume ___ Haskell mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell
Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell
On Tuesday 24 February 2004 15.25, Johannes Waldmann wrote: > Per Larsson wrote: > > .. I have since long missed some > > > > typical text processing functionality in haskell. > > it is often the case that people process "text" > only because they have no better (structured and typed) way of > representing their data... I agree with your point in the general case. For serious parsing and pretty printing tasks there are great haskell libraries like Parsec and PrettyPrint. But I can't see why the haskell user shouldn't also have access to concise text processing notations, e.g. regular expressions and printf, to be used in, e.g. short script-like programs? Cheers Per ___ Haskell mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell
Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell
Per Larsson wrote: > .. I have since long missed some typical text processing functionality in haskell. it is often the case that people process "text" only because they have no better (structured and typed) way of representing their data... -- -- Johannes Waldmann, Tel/Fax: (0341) 3076 6479 / 6480 -- -- http://www.imn.htwk-leipzig.de/~waldmann/ - ___ Haskell mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell
Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell
On Tuesday 24 February 2004 03.07, John Meacham wrote: > Inspired by an idea by Andrew Pang and an old project of mine, I decided > to fill out a reusable regular expression library which is similar to > Perl's, but much more expressive. > ... Hi, Thanks! I am grateful of your efforts because I have since long missed some typical text processing functionality in haskell. (Besides a more complete regular expression library I think many haskellers miss string constructors like an 'official' version of printf/format and maybe also some sort of 'here documents'.) Below follows some of my thoughts regarding a complete regex library, in the hope that this will be of any inspiration. 1. Replacement A regex library must contain functions for replacement with regular expressions. One could think this is trivial to implement given a match function, but there are some tricky choices to be made regarding empty matches (this also applies to splitting a string into fields with a regexp). Also there is questions about the interface of these functions. In my own Text.Regex wrapper I have the functions. data Match = Match {before :: String, after :: String, groups :: [String]} ... substWithPat :: Rexex -> String -> (Int -> Bool) -> String -> (String,Int) substWithFun :: Regex -> (Match->String) -> (Int->Bool) -> String -> (String,Int) substWithFunM :: (Monad m) => Regex -> (Match -> m String) -> (Int->Bool) -> String -> m (String,Int) Where a call to 'substWithPat pat rpat mode str' replaces matches of 'pat' in 'str' by 'rpat' and returns the resulting string and the number of replacements done. The 'rpat' replace pattern can contain backreferences on the form \m where \m refers to the mth subgroup in the corresponding match (\0 refers to the entire match). The call replaces only 'replaceable matches'. A match m is replaceable if its the nth match and (mode n) is true and, m is either the first match, a proper match or an empty match succeding an empty match. This schema gives results which are conformant with replace functionality in several other regex libraries, e.g. in Tcl, Python and Perl. For example, replacing matches of "_*" by "_" in "awk" gives "_a_w_k_", and replacing matches of "_*" by "_" in "sed_and_awk" gives "_s_e_d_a_n_d_a_w_k". (Compare the discussion in 'Mastering Regular Expressions', O'Reilly, pages 187-188.) The functions substWithFun and substWithFunM are obvious variations on the substWithPat function. 2. Constructing regular expressions. There is the well known problem that the backslash is used both as a string escape character and a regexp operator. I know of three approaches to the problem: a) Bite the bullet and, e.g. write regexps like "" in order to match a single backslash (e.g. as in emacs lisp). b) Use a language extensions for 'raw' strings where the backslash is not interpreted (e.g. /regex/ in awk, r"regex" in python and {regex} in Tcl). c) Use a different operator than the backslash in regular expressions, this has the benefit of not demanding a language extension, but is nonstandard on the negative side. There is also the problem with inserting string values in regular expressions. Appending with ++ is not particular convenient with complicated regexps because the result can be rather unreadable. I suppose we have to wait for a standard implementation of printf in template haskell for this problem. Cheers Per ___ Haskell mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell
Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell
At 18:07 23/02/04 -0800, John Meacham wrote: It provides regular and monadic versions, a very overloaded and useful interface, as well as extensibility. although currently the only instance is based on Text.Regex, it generalizes to matching lists of arbitrary type, not just strings, and also leaves the door open for compile-time checked and optimized regular expressions via template Haskell. I have a concern here that I hope you won't see as a negative response to your ideas... I recently ran into some problems porting some Haskell code to Windows because it used the Text.Regex library, which is dependent on a Unix-only system. If Haskell is to embrace the use of Regexes to a level comparable with (say) Perl, I think it is important that the underlying library is available on all platforms on which Haskell is supported, or I fear that a large swathe of Haskell support libraries will be unavailable to programs running on Windows. So my point would be that, to be most useful to the Haskell community, your effort should also be supported by a Text.Regex implementation (or alternative) for Windows systems. #g Graham Klyne For email: http://www.ninebynine.org/#Contact ___ Haskell mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell
Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell
Excuse me, I meant Andre Pang. not Andrew Pang. It's not enough I steal his good idea, but then I get his name wrong. :) John -- --- John Meacham - California Institute of Technology, Alum. - [EMAIL PROTECTED] --- ___ Haskell mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell
[Haskell] regular expression syntax - perl ain't got nothin on haskell
Inspired by an idea by Andrew Pang and an old project of mine, I decided to fill out a reusable regular expression library which is similar to Perl's, but much more expressive. It provides regular and monadic versions, a very overloaded and useful interface, as well as extensibility. although currently the only instance is based on Text.Regex, it generalizes to matching lists of arbitrary type, not just strings, and also leaves the door open for compile-time checked and optimized regular expressions via template Haskell. so, does this seem interesting? I am really enjoying the =~ syntax already in projects, and the monadic version is great for concise ad-hoc parsers. my next steps are to 1. finish my template Haskell regex compiler, not only to optimize at compile time, but allow matching of arbitrary types. not just strings. (perhaps reusing some code from the old Regular Expressions in Haskell project) 2. make a pcre binding. hook it in. and possible ideas for the future are * remove restriction to lists, so tree-like structures can be matched. perhaps grabbing default value from Monoid. The only reason I don't have this now is I didn't want to enable overlapping-instances, things are already complicated enough :) * add further optimized versions of match*, perhaps split up class for when there is no clear concept of the preceding part and postceeding parts around a match.. * develop a substitution syntax * implement pcre in pure Haskell although, even now, it seems quite useful. new versions will appear at http://repetae.net/john/computer/haskell/RegexSyntax.hs plus, this is something to brag about to Perl people, they can only do things differently based on scalar vs. list context. our =~ does 10 different things typesafely and probably some more I missed :) {-# OPTIONS -fglasgow-exts #-} module RegexSyntax(RegexLike(..), RegexLikeImp(..), RegexContext(..), (!~)) where import Array import Text.Regex import Maybe {- basic usage: > string =~ "regular expression" returns different things depending on context type - what it evaluates to --- Int - number of times the regular expression matches String - matching portion of string (String,String,String) - (text before match, matching text, text after match) [Either String String] - list of matching and nonmatching strings, if concated, the original string results. Left = notmatching, Right = matching. Bool - whether the string matches () - always returns () (useful in monad context, see below) [String] - list of matches Array Int String - list of substring matches for first match (String, Array Int String) - full matching text and substring matches [(String, Array Int String)] - all matches, full match plus substrings [Array Int String] - all substrings from all matches also, there is the monadic version (=~~) which always behaves exactly the same as (=~) except when the match fails, instead of returning a default value, the monad fails. s !~ re = not (s =~ re) for convinience regular expressions: these may be strings, which are interpreted as regular expressions, or Regex's from the Text.Regex module. or any other instance of the RegexLike class. when using strings, you may prefix the regex by "i/" for a case-insensitive match and "s/" to treat the string as a single line. (or both as "si/") A leading "/" is ignored, other than these cases "/" is not special. advanced features: not just strings can be matched, but rather lists of anything a matcher is defined for. RegexLikeImp data class can be used for in-place code generated by template haskell for compile-time checked regular expresions -} class RegexLike r a | r -> a where matchOnce :: r -> [a] -> Maybe ([a],[a],[a],Array Int [a]) matchTest :: r -> [a] -> Bool matchAll :: r -> [a] -> [Either [a] ([a],Array Int [a])] matchShow :: r -> String -- for error messages matchTest r xs = isJust (matchOnce r xs) matchAll r xs = case matchOnce r xs of Nothing -> pn xs [] Just (p,m,rest,as) -> pn p (Right (m,as):matchAll r rest) where pn x = if null x then id else (Left x:) matchShow _ = "Unknown" instance RegexLike Regex Char where matchOnce re xs = fmap f (matchRegexAll re xs) where f (x,y,z,ls) = (x,y,z,listArray (1,length ls) ls) matchShow _ = "Regex" instance RegexLike String Char where matchOnce re xs = fmap f (matchRegexAll (mr re) xs) where f (x,y,z,ls) = (x,y,z,listArray (1,length ls) ls) mr ('i':'/':re) = mkRegexWithOpts re True False mr ('s':'/':re) = mkRegexWithOpts re False True mr ('i':'s':'/':re) = mkRegexWithOpts re False False mr ('s':'i':'/':re) = mkRegexWithOpts re False False mr ('/':re) = mkRegex re mr (re) = mkRegex re matchShow re = re class RegexContext x a where (=~) :: RegexLike r x => [x] -> r -> a (=~~) :: (Monad m, RegexLike r x) => [x] -> r -> m