[Haskell-cafe] Stripping text of xml tags and special symbols

2008-08-05 Thread Pieter Laeremans
Hi, I've got a lot of files which I need to process in order to make them indexable by Sphinx. The files contain the data of a website with a custom Perl-based CMS. Unfortunately, they sometimes contain XML/HTML tags like <i>. And since most of the texts are in Dutch and some are in French, they also
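
As a rough illustration of the tag-stripping step asked about above, here is a minimal Haskell sketch. The name stripTags is hypothetical and does not come from the thread. It assumes that every tag starts at '<' and ends at the next '>', so it has no notion of comments, CDATA sections, or a '>' inside an attribute value.

```haskell
-- Naive tag stripper (hypothetical helper, not from the thread).
-- Everything from a '<' up to and including the next '>' is dropped.
-- If a '<' is never closed, the rest of the input is dropped.
stripTags :: String -> String
stripTags [] = []
stripTags ('<' : rest) = stripTags (drop 1 (dropWhile (/= '>') rest))
stripTags (c : rest) = c : stripTags rest
```

For example, stripTags "a <i>b</i> c" evaluates to "a b c". Because it never checks that tags are balanced, it also tolerates markup that is not well formed, at the cost of being wrong on the edge cases listed above.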

Re: [Haskell-cafe] Stripping text of xml tags and special symbols

2008-08-05 Thread Jeremy Shaw
At Tue, 5 Aug 2008 23:21:43 +0200, Pieter Laeremans wrote: And is there some Haskell function which converts special tokens like & -> &amp; and é -> &egu;? By default, XML only has 5 predefined entities: quot, amp, apos, lt, and gt. Any additional ones are defined in the DTD. But you can *always*
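
To make the point about the five predefined entities concrete, here is a small Haskell sketch of an escaping function. The names escapeXml and escapeChar are hypothetical, not from any library mentioned in the thread. Emitting numeric character references (such as &#233; for é) for non-ASCII characters is an assumption of this sketch, chosen because such references are valid XML without a DTD; it is not necessarily what the message above goes on to recommend.

```haskell
import Data.Char (ord)

-- Escape a string for use as XML text (hypothetical helper).
escapeXml :: String -> String
escapeXml = concatMap escapeChar

-- The five predefined XML entities are handled by name.
-- Assumption: any character above ASCII is written as a decimal
-- numeric character reference, which needs no DTD.
escapeChar :: Char -> String
escapeChar '&'  = "&amp;"
escapeChar '<'  = "&lt;"
escapeChar '>'  = "&gt;"
escapeChar '"'  = "&quot;"
escapeChar '\'' = "&apos;"
escapeChar c
  | ord c > 127 = "&#" ++ show (ord c) ++ ";"
  | otherwise   = [c]
```

For example, escapeXml "café & co" evaluates to "caf&#233; &amp; co". The input must already be decoded into Unicode characters; applying this to raw bytes of a Latin-1 or UTF-8 file would give wrong references.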

Re: [Haskell-cafe] Stripping text of xml tags and special symbols

2008-08-05 Thread Benja Fallenstein
Hi Pieter, 2008/8/5 Pieter Laeremans: But the Sphinx indexer complains that the XML isn't valid. When I look at the errors, this seems to be due to some documents containing HTML that is not well formed. If you need to cope with non-well-formed HTML, try HTML Tidy: