Re: [Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)

Neil Mitchell Mon, 15 Nov 2010 08:46:59 -0800

> I've been working on a project that requires me to do screen scraping.


If you are screen scraping HTML, I think tagsoup is a very good choice.
Using tagsoup means that you have a real HTML 5 compliant parser
underneath, and then you can use whatever technique you wish to split
up the page text - and regular expressions/parsec might be a
reasonable choice. I've written lots of screen scraping stuff with
tagsoup, and it's usually very easy - the manual even walks you
through a couple of examples:
http://community.haskell.org/~ndm/darcs/tagsoup/tagsoup.htm
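
As a rough sketch of the workflow described above (parse with tagsoup
first, then pick the page apart however you like), the following pulls
the text out of a page's <title> element. The name pageTitle is
hypothetical, chosen for illustration; the tagsoup functions used are
parseTags, sections, innerText, (~==) and (~/=).

```haskell
import Text.HTML.TagSoup (parseTags, sections, innerText, (~==), (~/=))

-- Hypothetical helper: text inside the first <title> element,
-- or Nothing if the page has no <title> at all.
pageTitle :: String -> Maybe String
pageTitle html =
  case sections (~== "<title>") (parseTags html) of
    -- 'sec' starts at the opening tag; keep everything up to the close tag.
    (sec : _) -> Just (innerText (takeWhile (~/= "</title>") sec))
    []        -> Nothing

main :: IO ()
main = print (pageTitle "<html><head><title>Hello</title></head></html>")
```

Once the text has been isolated like this, a regular expression or a
small parsec parser can be applied to the resulting String.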

> He's very experienced, and comes from
> a Perl perspective. I let him into what I was doing, and he opined I
> should be using pcre.

When all you have is a hammer, everything looks like a thumb.
Structured manipulation of algebraic data types is trivial in Haskell,
and much less natural in Perl, so they use different techniques in
different places.
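
To make the point about algebraic data types concrete: tagsoup hands
you the page as a plain list of Tag values, so ordinary pattern
matching does most of the work. A minimal sketch, assuming you want
the text that immediately follows each opening <b> tag (boldTexts is a
hypothetical name, not part of tagsoup):

```haskell
import Text.HTML.TagSoup (Tag(..), parseTags)

-- Hypothetical helper: the text node directly after each <b> open tag.
boldTexts :: [Tag String] -> [String]
boldTexts (TagOpen "b" _ : TagText txt : rest) = txt : boldTexts rest
boldTexts (_ : rest)                           = boldTexts rest
boldTexts []                                   = []

main :: IO ()
main = mapM_ putStrLn
         (boldTexts (parseTags "<p>one <b>two</b> three <b>four</b></p>"))
```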

> So now I'm second guessing my choices. Why do
> people choose not to use regex for uri parsing?

If you mean HTML parsing, then it's because it's a nightmare to get
right, and people on the web do all kinds of crazy stuff. A correct
regular expression to match an HTML tag is a lot of work. Given that
it's a solved problem, why go to all that effort? It is possible to do
with regular expressions, but not pleasant.
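
For comparison, here is a sketch of collecting link targets with
tagsoup instead of a hand-written tag regex. Variations such as
upper-case tag names or a different attribute order are left to the
parser. The name links is hypothetical; canonicalizeTags is used on
the assumption that you want tag and attribute names lower-cased
before matching.

```haskell
import Text.HTML.TagSoup
  (parseTags, canonicalizeTags, fromAttrib, isTagOpenName)

-- Hypothetical helper: the href of every <a> tag that has one.
links :: String -> [String]
links html =
  [ href
  | t <- canonicalizeTags (parseTags html)  -- normalise name casing first
  , isTagOpenName "a" t
  , let href = fromAttrib "href" t          -- "" when the attribute is absent
  , not (null href)
  ]

main :: IO ()
main = mapM_ putStrLn
         (links "<A CLASS=\"x\" HREF=\"/a\">one</A> <a href=\"/b\">two</a>")
```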

Thanks, Neil
