[Haskell-cafe] Stripping text of xml tags and special symbols
Hi,

I've got a lot of files which I need to process in order to make them indexable by sphinx. The files contain the data of a website with a custom Perl-based CMS. Unfortunately they sometimes contain xml/html tags like <i>.

And since most of the texts are in Dutch and some are in French, they also contain a lot of special characters like ë, é, ...

I'm trying to replace the custom Perl-based CMS with a Haskell one, and I would like to add search capability. Since someone wrote sphinx bindings a few weeks ago, I thought I'd try that.

But transforming the files into something that sphinx accepts seems a challenge. Most special character problems seem to go away when I use encodeString (Codec.Binary.UTF8.String) on the indexable data. But the sphinx indexer complains that the xml isn't valid. When I look at the errors, this seems due to some documents containing not well-formed html.

I would like to use a programmatic solution to this problem.

And is there some Haskell function which converts special tokens like & -> &amp; and é -> &eacute;?

thanks in advance,

Pieter

--
Pieter Laeremans [EMAIL PROTECTED]

The future is here. It's just not evenly distributed yet. W. Gibson

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Stripping text of xml tags and special symbols
At Tue, 5 Aug 2008 23:21:43 +0200, Pieter Laeremans wrote:

> And is there some Haskell function which converts special tokens like & -> &amp; and é -> &eacute;?

By default, XML only has 5 predefined entities: quot, amp, apos, lt, and gt. Any additional ones are defined in the DTD. But you can *always* use numeric character references like &#nnnn; (decimal) or &#xhhhh; (hexadecimal).

So you should be able to implement a simple function which whitelists a few characters ('a'..'z', 'A'..'Z', '0'..'9', ...) and encodes everything else.

You might look at the source code for Text.XML.HaXml.Escape and Network.URI.escapeString for inspiration.

j.

http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
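
A minimal sketch of the whitelist-and-encode idea from the reply above, in case it helps. The names isSafe, escapeChar and escapeXml are made up for illustration, and the exact whitelist is an assumption; adjust it to the data. It expects a decoded Unicode String (not UTF-8 bytes) and writes decimal numeric character references for everything that is not whitelisted. Note that it escapes markup rather than stripping it, and it does not filter control characters that XML 1.0 forbids even as references.

```haskell
import Data.Char (isAscii, isAlphaNum, ord)

-- | Characters that are passed through unchanged.
-- ASSUMPTION: this whitelist is only an example; '&', '<', '>' and all
-- non-ASCII characters are deliberately left out so they get encoded.
isSafe :: Char -> Bool
isSafe c = isAscii c && (isAlphaNum c || c `elem` " \t\n.,:;!?()-")

-- | Keep a safe character as-is; encode anything else as a decimal
-- numeric character reference such as "&#233;".
escapeChar :: Char -> String
escapeChar c
  | isSafe c  = [c]
  | otherwise = "&#" ++ show (ord c) ++ ";"

-- | Escape a whole string character by character.
escapeXml :: String -> String
escapeXml = concatMap escapeChar
```

For example, escapeXml "café & <b>" gives "caf&#233; &#38; &#60;b&#62;". Since every non-ASCII character is encoded, the result is plain ASCII, so it should not matter which encoding the output file is written in.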
Re: [Haskell-cafe] Stripping text of xml tags and special symbols
Hi Pieter,

2008/8/5 Pieter Laeremans [EMAIL PROTECTED]:

> But the sphinx indexer complains that the xml isn't valid. When I look at the errors, this seems due to some documents containing not well-formed html.

If you need to cope with non-well-formed HTML, try HTML Tidy: http://tidy.sourceforge.net/

All the best,
- Benja