[Haskell-cafe] Stripping text of xml tags and special symbols

2008-08-05 Thread Pieter Laeremans
Hi,
I've got a lot of files which I need to process in order to make them
indexable by sphinx.
The files contain the data of a website with a custom Perl-based CMS.
Unfortunately they sometimes contain xml/html tags like <i>.

And since most of the texts are in Dutch and some are in French, they also
contain a lot of special characters like ë, é, ...

I'm trying to replace the custom Perl-based CMS with a Haskell one.  And
I would like to add search capability. Since someone wrote sphinx
bindings a few weeks ago, I thought I'd try that.

But transforming the files into something that sphinx accepts seems a
challenge.  Most special character problems seem to go away when I use
encodeString (Codec.Binary.UTF8.String)
on the indexable data.

But the sphinx indexer complains that the xml isn't valid.  When I look at
the errors, this seems to be due to some documents containing html that is
not well-formed.
I would like to use a programmatic solution to this problem.
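
For the tag-stripping part, a minimal sketch in Haskell might look like the
following. It is only an illustration: the name stripTags is hypothetical, and
the approach is deliberately naive (it knows nothing about comments, CDATA
sections, or a '>' inside an attribute value), so it is not a substitute for a
real HTML parser.

```haskell
-- Naive tag stripper: drops everything from a '<' up to and including
-- the next '>'. An unterminated '<' causes the rest of the input to be
-- dropped, which may or may not be what you want for broken markup.
stripTags :: String -> String
stripTags [] = []
stripTags ('<' : rest) = stripTags (drop 1 (dropWhile (/= '>') rest))
stripTags (c : rest) = c : stripTags rest
```

For example, stripTags "a <i>b</i> c" yields "a b c".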

And is there some Haskell function which converts special tokens like & ->
&amp; and é -> &eacute; ?

thanks in advance,

Pieter



-- 
Pieter Laeremans [EMAIL PROTECTED]

The future is here. It's just not evenly distributed yet. W. Gibson
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Stripping text of xml tags and special symbols

2008-08-05 Thread Jeremy Shaw
At Tue, 5 Aug 2008 23:21:43 +0200,
Pieter Laeremans wrote:

> And is there some Haskell function which converts special tokens like & ->
> &amp; and é -> &eacute; ?

By default, xml only has 5 predefined entities: quot, amp, apos, lt,
and gt. Any additional ones are defined in the DTD.
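
As a rough sketch, escaping just those five predefined entities could be
written as below. The function name escapeXml is made up for illustration and
is not taken from any library.

```haskell
-- Replace the five characters that have predefined XML entities;
-- every other character is passed through unchanged.
escapeXml :: String -> String
escapeXml = concatMap esc
  where
    esc '&'  = "&amp;"
    esc '<'  = "&lt;"
    esc '>'  = "&gt;"
    esc '"'  = "&quot;"
    esc '\'' = "&apos;"
    esc c    = [c]
```

Note that this must be applied exactly once to raw text; running it over text
that is already escaped would turn &amp; into &amp;amp;.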

But you can *always* use numeric character references like:

&#nnnn; (decimal code point)
or
&#xhhhh; (hexadecimal code point)

So, you should be able to implement a simple function which whitelists
a few characters ('a'..'z', 'A'..'Z', '0'..'9', ...), and encodes
everything else?
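
A minimal sketch of such a function, assuming the input is an ordinary
(already decoded) Haskell String. The names isSafe, encodeChar and encodeText
and the exact set of whitelisted punctuation are assumptions chosen for
illustration:

```haskell
import Data.Char (isAlphaNum, isAscii, ord)

-- Characters passed through unchanged: ASCII letters and digits plus a
-- little punctuation. The punctuation list is an arbitrary choice;
-- extend or shrink it as needed.
isSafe :: Char -> Bool
isSafe c = isAscii c && (isAlphaNum c || c `elem` " .,:;!?()-_/\n")

-- Keep whitelisted characters; encode anything else as a decimal
-- numeric character reference, e.g. 'é' becomes "&#233;".
encodeChar :: Char -> String
encodeChar c
  | isSafe c  = [c]
  | otherwise = "&#" ++ show (ord c) ++ ";"

encodeText :: String -> String
encodeText = concatMap encodeChar
```

The output is plain ASCII, so no further UTF-8 encoding step should be needed
afterwards. One caveat: some control characters are not allowed in XML 1.0
even as numeric references, so they would have to be dropped separately.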

You might look at the source code for Text.XML.HaXml.Escape and
Network.URI.escapeString for inspiration.

j.

http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references


Re: [Haskell-cafe] Stripping text of xml tags and special symbols

2008-08-05 Thread Benja Fallenstein
Hi Pieter,

2008/8/5 Pieter Laeremans [EMAIL PROTECTED]:
> But the sphinx indexer complains that the xml isn't valid.  When I look at
> the errors this seems due to some documents containing not well formed
> html.

If you need to cope with non-well-formed HTML, try HTML Tidy:

http://tidy.sourceforge.net/
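
If you want to drive it from Haskell, one possible sketch is to pipe each
document through the tidy command-line tool. This assumes the tidy executable
is installed and on the PATH; the names tidyArgs and tidyToXhtml are
hypothetical helpers, not part of any library.

```haskell
import System.Exit (ExitCode (..))
import System.Process (readProcessWithExitCode)

-- Options passed to tidy: convert to well-formed XHTML, treat input and
-- output as UTF-8, and suppress non-essential output.
tidyArgs :: [String]
tidyArgs = ["-asxml", "-utf8", "-quiet"]

-- Feed the document to tidy on stdin and collect the cleaned-up result.
-- tidy exits with 0 on success, 1 if there were warnings and 2 on errors;
-- warnings are treated as success here.
-- Note: the String is written and read using the current locale encoding,
-- so this sketch assumes a UTF-8 locale.
tidyToXhtml :: String -> IO (Either String String)
tidyToXhtml input = do
  (code, out, err) <- readProcessWithExitCode "tidy" tidyArgs input
  pure $ case code of
    ExitSuccess   -> Right out
    ExitFailure 1 -> Right out
    ExitFailure _ -> Left err
```

The Left case carries tidy's diagnostics, so documents it cannot repair can be
logged and skipped rather than handed to the indexer.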

All the best,
- Benja