Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell

2004-03-08 Thread Ganesh Sittampalam
On Tue, 24 Feb 2004 07:18:58 -0800 (PST), Hal Daume III [EMAIL PROTECTED]
wrote:

just as another sample point...

i write 99% of my code in either haskell or perl.  haskell tends to be for 
the longer programs, perl tends to be for the shorter ones, though the 
decision is primarily made for only one reason:

  - if the overhead to write the string processing code in haskell
is outweighed by the overall length of the program, use haskell.
otherwise, use perl.

i would be very very happy to abandon perl all together, but, for the most 
part, this isn't a niche haskell has been able to fit well in to yet.

Another sample point:

I hacked together a perl script to do a particular task in about 30 minutes,
including fixing algorithmic issues with the problem I wanted to solve.

I then thought I'd try porting it to Haskell; I started out by doing the
really dumb conversion of mutable variables to IORefs, hashes to FiniteMaps,
and Perl regular expressions to Text.Regex (i.e. GNU extended regexps). I'd
forgotten about this thread at the time, otherwise I might have tried one of
the cleverer options.

Some observations:

(1) It took me several hours to get it working. Mostly this was because
debugging was difficult - firstly, I got an unhelpful type error message
from GHC followed by problems with making the code I developed with GHC 5
work with GHC 6 so I could show someone else the problem. Then I had a
syntax error in one of my regular expressions, which led to a run-time error
with no information about which regular expression the error was in or where
the error was. Finally debugging semantic problems with the regular
expressions wasn't very pleasant.

(2) The code ran three times as slowly. Profiling it suggests that the time
is being wasted in the regexp matches; quite possibly the main cost is in
marshalling Haskell strings to C strings. The comments in Text.Regex.Posix
suggest a PackedString interface should be provided; I should try making one
and seeing if things are better.

(3) The code was twice as long. Mostly this was for obvious reasons; the
translation of mutable variables to IORefs leads to some overhead in reading
from them, and perl has nice syntax for manipulating hashes.

I don't really have any point, except that it would be nice if it hadn't
turned out that Perl was clearly the better choice :-/

Ganesh
___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell


Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell

2004-02-26 Thread Ross Paterson
On Tue, Feb 24, 2004 at 05:24:41PM -, Simon Marlow wrote:
 [Graham Klyne wrote:]
  I recently ran into some problems porting some Haskell code 
  to Windows 
  because it used the Text.Regex library, which is dependent on 
  a Unix-only system.
 
 Text.Regex works fine on Windows, at least on GHC.  It seems that Hugs
 doesn't make it available, perhaps because it requires an auxiliary C
 regex engine, which we provide as part of GHC.  It may be that this code
 isn't as portable as the rest of Hugs, which is why they don't provide
 it, but I'll let the Huge folks comment on that.

It looks possible.  The only awkwardness I can see is telling the
Hugs package converter where to find regex.h
___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell


RE: [Haskell] regular expression syntax - perl ain't got nothin on haskell

2004-02-25 Thread Graham Klyne
At 17:24 24/02/04 +, Simon Marlow wrote:
 At 18:07 23/02/04 -0800, John Meacham wrote:
 It provides regular and monadic versions, a very overloaded
 and useful
 interface, as well as extensibility. although currently the only
 instance  is based on Text.Regex, it generalizes to matching lists of
 arbitrary type, not just strings, and also leaves the door open for
 compile-time checked and optimized regular expressions via template
 Haskell.

 I have a concern here that I hope you won't see as a negative
 response to
 your ideas...

 I recently ran into some problems porting some Haskell code
 to Windows
 because it used the Text.Regex library, which is dependent on
 a Unix-only system.
Text.Regex works fine on Windows, at least on GHC.  It seems that Hugs
doesn't make it available, perhaps because it requires an auxiliary C
regex engine, which we provide as part of GHC.  It may be that this code
isn't as portable as the rest of Hugs, which is why they don't provide
it, but I'll let the Huge folks comment on that.
Ah, yes, maybe so.  The last communication I have on this is:
  http://www.haskell.org//pipermail/hugs-bugs/2004-January/001465.html
#g


Graham Klyne
For email:
http://www.ninebynine.org/#Contact
___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell


Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell

2004-02-24 Thread Per Larsson
On Tuesday 24 February 2004 03.07, John Meacham wrote:
 Inspired by an idea by Andrew Pang and an old project of mine, I decided
 to fill out a reusable regular expression library which is similar to
 Perl's, but much more expressive.
  ...

Hi,
Thanks! I am grateful of your efforts because I have since long missed some 
typical text processing functionality in haskell. (Besides a more complete 
regular expression library I think many haskellers miss string constructors 
like an 'official' version of printf/format  and maybe also some sort of 
'here documents'.) Below follows some of my thoughts regarding a complete 
regex library, in the hope that this will be of any inspiration.

1. Replacement
A regex library must contain functions for replacement with regular 
expressions. One could think this is trivial to implement given a match 
function, but there are some tricky choices to be made regarding empty 
matches (this also applies to splitting a string into fields with a regexp). 
Also there is questions about the interface of these functions.
In my own Text.Regex wrapper I have the functions.

  data Match = Match {before :: String, after :: String, groups :: [String]}
  ...
  substWithPat :: Rexex - String - (Int - Bool) - String - (String,Int)
  substWithFun :: Regex - (Match-String) - (Int-Bool) - 
   String - (String,Int)
  substWithFunM :: (Monad m) = Regex - (Match - m String) - 
 (Int-Bool) - String - m (String,Int)

Where a call to 'substWithPat pat rpat mode str' replaces matches
of 'pat' in 'str' by 'rpat' and returns the resulting string and the number of 
replacements done. The 'rpat' replace pattern can contain backreferences on 
the form \m where \m refers to the mth subgroup in the corresponding match 
(\0 refers to the entire match). The call replaces only 'replaceable 
matches'.  A match m is replaceable if its the nth match and (mode n) is true 
and,  m is either the first match, a proper match or an empty match succeding 
an empty match. This schema gives results which are conformant with replace 
functionality in several other regex libraries, e.g. in Tcl, Python and Perl. 
For example, replacing matches of _* by _ in awk gives _a_w_k_, and 
replacing matches of _* by _ in sed_and_awk gives _s_e_d_a_n_d_a_w_k.  
(Compare the discussion in 'Mastering Regular Expressions', O'Reilly, pages 
187-188.) The functions substWithFun and substWithFunM are obvious variations 
on the substWithPat function.

2. Constructing regular expressions.
There is the well known problem that the backslash is used both as a string 
escape character and a regexp operator. I know of three approaches to the 
problem:
 a) Bite the bullet and, e.g. write regexps like  in order to match a
 
single backslash (e.g. as in emacs lisp).
b) Use a language extensions for 'raw' strings where the backslash is not 
interpreted (e.g. /regex/ in awk, rregex in python and {regex} in Tcl).
c) Use a different operator than the backslash in regular expressions, this 
has the benefit of not demanding a language extension, but is nonstandard on 
the negative side. 
There is also the problem with inserting string values in regular expressions. 
Appending with ++ is not particular convenient with complicated regexps 
because the result can be rather unreadable. I suppose we have to wait for a
standard implementation of printf in template haskell for this problem.

Cheers
Per









___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell


Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell

2004-02-24 Thread Johannes Waldmann
Per Larsson wrote:

 .. I have since long missed some
typical text processing functionality in haskell.
it is often the case that people process text
only because they have no better (structured and typed) way of
representing their data...
--
-- Johannes Waldmann,  Tel/Fax: (0341) 3076 6479 / 6480 --
-- http://www.imn.htwk-leipzig.de/~waldmann/ -
___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell


Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell

2004-02-24 Thread Per Larsson
On Tuesday 24 February 2004 15.25, Johannes Waldmann wrote:
 Per Larsson wrote:
   .. I have since long missed some
 
  typical text processing functionality in haskell.

 it is often the case that people process text
 only because they have no better (structured and typed) way of
 representing their data...

I agree with your point in the general case. For serious parsing and pretty 
printing tasks there are great haskell libraries like Parsec and PrettyPrint.
But I can't see why the haskell user shouldn't also have access to concise 
text processing notations, e.g. regular expressions and printf, to be used 
in, e.g. short script-like programs?

Cheers 
Per 

___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell


Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell

2004-02-24 Thread Hal Daume III
just as another sample point...

i write 99% of my code in either haskell or perl.  haskell tends to be for 
the longer programs, perl tends to be for the shorter ones, though the 
decision is primarily made for only one reason:

  - if the overhead to write the string processing code in haskell
is outweighed by the overall length of the program, use haskell.
otherwise, use perl.

i would be very very happy to abandon perl all together, but, for the most 
part, this isn't a niche haskell has been able to fit well in to yet.

 - hal

p.s., certainly this is at least somewhat unique to me, but almost all of 
the data i work with is unstructured text for two reasons.  first, that's 
how it naturally comes.  second, to throw xml or some other scheme on to 
it will balloon the data sizes to unmanagable amounts, with little gain.

On Tue, 24 Feb 2004, Johannes Waldmann wrote:

 Per Larsson wrote:
 
   .. I have since long missed some
  typical text processing functionality in haskell.
 
 it is often the case that people process text
 only because they have no better (structured and typed) way of
 representing their data...
 

-- 
 Hal Daume III   | [EMAIL PROTECTED]
 Arrest this man, he talks in maths.   | www.isi.edu/~hdaume

___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell


Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell

2004-02-24 Thread Johannes Waldmann
Per Larsson wrote:

But I can't see why the haskell user shouldn't also have access to concise 
text processing notations, e.g. regular expressions and printf, 
I was not implying it should be forbidden,
rather I meant to give a reason why text processing
seems to be less common in typical Haskell programs.
 to be used in, e.g. short script-like programs?

sure. but why do we need text processing in scripting?
because the usual OS shells have `string' as their only data type.
we better change that :-) anyone for a fully typed Haskell shell?
--
-- Johannes Waldmann,  Tel/Fax: (0341) 3076 6479 / 6480 --
-- http://www.imn.htwk-leipzig.de/~waldmann/ -
___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell


Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell

2004-02-24 Thread Peter Achten
At 16:33 24-2-04 +0100, Johannes Waldmann wrote:
Per Larsson wrote:

But I can't see why the haskell user shouldn't also have access to 
concise text processing notations, e.g. regular expressions and printf,
I was not implying it should be forbidden,
rather I meant to give a reason why text processing
seems to be less common in typical Haskell programs.
 to be used in, e.g. short script-like programs?

sure. but why do we need text processing in scripting?
because the usual OS shells have `string' as their only data type.
we better change that :-) anyone for a fully typed Haskell shell?
I am sorry, but I couldn't resist answering your question: have a look at 
Arjen van Weelden's Famke system [1,2], who has done exactly that (in Clean):

Regards,
Peter Achten
[1] Arjen van Weelden and Rinus Plasmeijer. Towards a Strongly Typed 
Functional Operating System. In Peña, R. ed. Proceedings 14th International 
Workshop on the Implementation of Functional Languages, IFL 2002, Selected 
Papers, Madrid, Spain, September 16-18, 2002, Springer Verlag, LNCS 2670.
ftp://ftp.cs.kun.nl/pub/Clean/papers/2003/vWeA2003-Famke.pdf

[2] Arjen van Weelden and Rinus Plasmeijer. A Functional Shell that 
Dynamically Combines Compiled Code. Submitted to Selected Papers Review of 
Proceedings 15th International Workshop on the Implementation of Functional 
Languages, IFL 2003, Edinburgh, Scotland, September 8-10, 2003
ftp://ftp.cs.kun.nl/pub/Clean/papers/2003/vWeA2003-Esther.pdf

___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell


RE: [Haskell] regular expression syntax - perl ain't got nothin on haskell

2004-02-24 Thread Simon Marlow
 
 At 18:07 23/02/04 -0800, John Meacham wrote:
 It provides regular and monadic versions, a very overloaded 
 and useful
 interface, as well as extensibility. although currently the only
 instance  is based on Text.Regex, it generalizes to matching lists of
 arbitrary type, not just strings, and also leaves the door open for
 compile-time checked and optimized regular expressions via template
 Haskell.
 
 I have a concern here that I hope you won't see as a negative 
 response to 
 your ideas...
 
 I recently ran into some problems porting some Haskell code 
 to Windows 
 because it used the Text.Regex library, which is dependent on 
 a Unix-only system.

Text.Regex works fine on Windows, at least on GHC.  It seems that Hugs
doesn't make it available, perhaps because it requires an auxiliary C
regex engine, which we provide as part of GHC.  It may be that this code
isn't as portable as the rest of Hugs, which is why they don't provide
it, but I'll let the Huge folks comment on that.

Cheers,
Simon
___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell


Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell

2004-02-24 Thread Glynn Clements

Hal Daume III wrote:

 p.s., certainly this is at least somewhat unique to me, but almost all of 
 the data i work with is unstructured text for two reasons.  first, that's 
 how it naturally comes.  second, to throw xml or some other scheme on to 
 it will balloon the data sizes to unmanagable amounts, with little gain.

There's a pretty big gap between *unstructured* text and e.g. XML. 
Most of what fits into that gap is essentially structured text.

If you're performing some kind of processing on the text, the odds are
that it does actually have some degree of structure to it.

My experience of code which does ad-hoc text processing using regexps
or similar is that a lot of it only handles a subset of what it ought
to, and that subset is typically defined by the nature of the
technique. Some examples of this issue are code which attempts to:

+ match C-style string literals, but falls down on an embedded \
sequence;

+ match code tokens, but matches the same sequence of characters when
they occur inside string literals;

+ process email headers, but falls down on folded headers;

+ process HTML, but falls down in more ways than I could possibly
list.

Except in the most trivial cases, to process text *reliably* you
usually need to at least tokenise it and process the token stream. And
anything which has a more complex structure usually needs to operate
(at least conceptually) on a parse tree.

Regexps certainly have their place, although that's primarily in
writing tokenisers. IMHO, try to do everything (or, at least, too
much) using s/pattern/replacement/ constructs seems to be a favourite
recipe for buggy code.

Case in point: the regular occurrence of cross-site scripting, SQL
injection, printf() and similar issues on lists such as BugTraq.

-- 
Glynn Clements [EMAIL PROTECTED]
___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell


Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell

2004-02-24 Thread John Meacham
I have updated my code as well as made a simple template haskell
version.

A Wikified homepage can be found at
 http://haskell.org/hawiki/RegexSyntax

I have found that the syntax combines really really really well with
pattern guards.

f s  | Just 3 - s =~~ i/john = print exactly 3 johns found.
 | False - s =~ ba*r = print no baaars found
 | Just a - s =~~ ([a-z]*)/(.*)/([a-z]*) = print $ 
  concat [a!1 ++ a!3,  are the flags. , a!2,  is the expression]
f _ = print nothing found!

I am still tweaking the code so expect it to change without notice over
the next few days.

the template haskell version can be used by importing RegexTH then doing
 s =~ $(mkRE i/john) 
right now, all it does is check syntax and give an error if the regular
expression is invalid at compile time, but in the future, it will
probably build optimized matchers at compile time. 

A problem is still that lots of type annotations are needed to resolve
the overloading, but it is still okay to use, if anyone has any
suggestions on mitigating this, then that would be good. something i
have found to be useful, is to use the minimal return type necessary for
the task at hand, since any unmatched values will probably need type
annotations.

John

-- 
---
John Meacham - California Institute of Technology, Alum. - [EMAIL PROTECTED]
---
___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell


Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell

2004-02-24 Thread John Meacham
On Tue, Feb 24, 2004 at 09:01:46AM +, Graham Klyne wrote:
 I recently ran into some problems porting some Haskell code to Windows 
 because it used the Text.Regex library, which is dependent on a Unix-only 
 system.  If Haskell is to embrace the use of Regexes to a level comparable 
 with (say) Perl, I think it is important that the underlying library is 
 available on all platforms on which Haskell is supported, or I fear that a 
 large swathe of Haskell support libraries will be unavailable to programs 
 running on Windows.
 
 So my point would be that, to be most useful to the Haskell community, your 
 effort should also be supported by a Text.Regex implementation (or 
 alternative) for Windows systems.

My code mainly defines a re-usable syntax and includes a 'sample'
implementation based on Text.Regex. I expect there will be other more
interesting implementations such as a pure haskell version and a PCRE
one. I am sure something that works on windows will be available if the
library becomes at all popular.

Perhaps implementations isn't the best word, since their features might
vary wildly. (PCRE vs. posix) for example.
John

-- 
---
John Meacham - California Institute of Technology, Alum. - [EMAIL PROTECTED]
---
___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell


[Haskell] regular expression syntax - perl ain't got nothin on haskell

2004-02-23 Thread John Meacham
Inspired by an idea by Andrew Pang and an old project of mine, I decided
to fill out a reusable regular expression library which is similar to
Perl's, but much more expressive.

It provides regular and monadic versions, a very overloaded and useful
interface, as well as extensibility. although currently the only
instance  is based on Text.Regex, it generalizes to matching lists of
arbitrary type, not just strings, and also leaves the door open for
compile-time checked and optimized regular expressions via template
Haskell.

so, does this seem interesting? I am really enjoying the =~ syntax
already in projects, and the monadic version is great for concise ad-hoc
parsers. 

my next steps are to 
1. finish my template Haskell regex compiler, not only to optimize at
compile time, but allow matching of arbitrary types. not just strings.
(perhaps reusing some code from the old Regular Expressions in Haskell
project)
2. make a pcre binding. hook it in.

and possible ideas for the future are
 * remove restriction to lists, so tree-like structures can be matched.
perhaps grabbing default value from Monoid. The only reason I don't
have this now is I didn't want to enable overlapping-instances,
things are already complicated enough :)
 * add further optimized versions of match*, perhaps split up class for 
when there is no clear concept of the preceding part and postceeding
parts around a match..
 * develop a substitution syntax
 * implement pcre in pure Haskell

although, even now, it seems quite useful.

new versions will appear at
http://repetae.net/john/computer/haskell/RegexSyntax.hs

plus, this is something to brag about to Perl people, they can only do
things differently based on scalar vs. list context. our =~ does 10
different things typesafely and probably some more I missed :)

{-# OPTIONS -fglasgow-exts  #-}

module RegexSyntax(RegexLike(..), RegexLikeImp(..), RegexContext(..), (!~)) where

import Array
import Text.Regex
import Maybe

{-

basic usage:

 string =~ regular expression 
returns different things depending on context

type - what it evaluates to
---
Int - number of times the regular expression matches
String -  matching portion of string
(String,String,String) - (text before match, matching text, text after match)
[Either String String] - list of matching and nonmatching strings, if concated,
  the original string results.  Left = notmatching, Right = matching.
Bool - whether the string matches
() - always returns () (useful in monad context, see below)
[String] - list of matches
Array Int String - list of substring matches for first match 
(String, Array Int String) - full matching text and substring matches
[(String, Array Int String)] - all matches, full match plus substrings
[Array Int String] - all substrings from all matches

also, there is the monadic version (=~~) which always behaves exactly the same
as (=~) except when the match fails, instead of returning a default value, the
monad fails. 

s !~ re = not (s =~ re) for convinience

regular expressions:

these may be strings, which are interpreted as regular expressions, or Regex's
from the Text.Regex module. or any other instance of the RegexLike class. 

when using strings, you may prefix the regex by i/ for a case-insensitive
match and s/ to treat the string as a single line. (or both as si/)
A leading / is ignored, other than these cases / is not special.

advanced features:

not just strings can be matched, but rather lists of anything a matcher is
defined for.  RegexLikeImp data class can be used for in-place code generated
by template haskell for compile-time checked regular expresions


-}

class RegexLike r a | r - a where
matchOnce :: r  - [a] - Maybe ([a],[a],[a],Array Int [a])
matchTest :: r  - [a] - Bool
matchAll  :: r  - [a] - [Either [a] ([a],Array Int [a])]
matchShow :: r - String  -- for error messages
matchTest r xs = isJust (matchOnce r xs)
matchAll r xs = case matchOnce r xs of
Nothing - pn xs []
Just (p,m,rest,as) - pn p (Right (m,as):matchAll r rest)
  where pn x = if null x then id else (Left x:) 
matchShow _ = Unknown


instance RegexLike Regex Char where 
matchOnce re xs = fmap f (matchRegexAll re xs) where
f (x,y,z,ls) = (x,y,z,listArray (1,length ls) ls)
matchShow _ = Regex

instance RegexLike String Char where 
matchOnce re xs = fmap f (matchRegexAll (mr re) xs) where
f (x,y,z,ls) = (x,y,z,listArray (1,length ls) ls)
mr ('i':'/':re) = mkRegexWithOpts re True False
mr ('s':'/':re) = mkRegexWithOpts re False True
mr ('i':'s':'/':re)  = mkRegexWithOpts re False False
mr ('s':'i':'/':re)  = mkRegexWithOpts re False False
mr ('/':re) = mkRegex re 
mr (re) = mkRegex re 
matchShow re = re


class RegexContext x a where
(=~) :: RegexLike r x = [x] - r - a
(=~~) :: (Monad m, RegexLike r x) = [x] - r - m a
-- s =~~ re = return (s 

Re: [Haskell] regular expression syntax - perl ain't got nothin on haskell

2004-02-23 Thread John Meacham
Excuse me, I meant Andre Pang. not Andrew Pang. It's not enough I steal
his good idea, but then I get his name wrong.  :)
John
-- 
---
John Meacham - California Institute of Technology, Alum. - [EMAIL PROTECTED]
---
___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell