I had similar issues a while ago. It had to do with UTF-8 encoding as far
as I can recall.
I wanted to "wrap" a multiline string (code listings) within some pandoc
generated HTML of a hakyll page with a container "div". The text to wrap
would be determined using a PCRE regex.
Here the (probably inefficient) implementation:
module Transformations where
import Hakyll
import qualified Text.Regex.PCRE as RE
import qualified Data.ByteString.UTF8 as BSU
import qualified Data.ByteString as BS
-- Wraps numbered code listings within the page body with a div
-- in order to be able to apply some more specific styling.
wrapNumberedCodelistings (Page meta body) =
Page meta newBody
where
newBody = regexReplace' regex wrap body
regex = "]+>.*?"-
wrap x = "" ++ x ++ ""
-- Replace the whole string matched by the given
-- regex using the given replacement function (hopefully UTF8-aware)
regexReplace' :: String -> (String -> String) -> String -> String
regexReplace' pattern replace text = BSU.toString $ go textUTF8
where
patternUTF8 = BSU.fromString pattern
textUTF8 = BSU.fromString text
replaceUTF8 x = BSU.fromString $ replace $ BSU.toString x
regex = RE.makeRegexOpts compOpts RE.defaultExecOpt $
BSU.fromString pattern
compOpts = RE.compMultiline + RE.compDotAll + RE.compUTF8 +
RE.compNoUTF8Check
go part = case RE.matchM regex part of
Just (before, match, after) ->
BS.concat [before, replaceUTF8 match, go after]
_ -> part
The discussion back then was
http://www.haskell.org/pipermail/beginners/2012-June/010064.html
Hope this helps.
Best regards,
Rico Moorman
P.S. Sorry for the double email Magicloud ... didn't hit reply all at first
On Tue, Dec 18, 2012 at 10:43 AM, José Romildo Malaquias <
j.romi...@gmail.com> wrote:
> On Tue, Dec 18, 2012 at 02:28:26PM +0800, Magicloud Magiclouds wrote:
> > Attachment is the test text file.
> > And I tested my regexp as this:
> >
> > Prelude> :m + Text.Regex.PCRE
> > Prelude Text.Regex.PCRE> z <- readFile "test.html"
> > Prelude Text.Regex.PCRE> let (b, m ,a, ss) = z =~ " > href=\"(.*?)\">.*? String,
> > [String])
> > Prelude Text.Regex.PCRE> b
> > ...
> > n of the Triumvirate\r\nDavid
> Rapoza\r\n
> >\r\n Return to Ravnica\r\n
> \r\n
> >10/31/2012\r\n \r\n > class=\"small\"><"
> > Prelude Text.Regex.PCRE> m
> > "a href=\"/magic/magazine/article.aspx?x=mtg/daily/activity/1088\"> > class=\"article-image\" "
> >
> > >From the value of b and m, it was weird that the matching was moved
> forward
> > by 1 char ( the ss (sub matching) was even worse, 2 chars ). Rematch to a
> > and so on gave correct results. It was only the first matching that was
> > broken.
> > Tested with regex-posix (with modified regexp), everything is OK.
>
> I have a similar issue with non-ascii strings. It seems that the
> internal representation used by Haskell and pcre are different and one
> of them is counting bytes and the other is counting code points. So they
> diverge when a multi-byte representation (like utf8) is used.
>
> It has been reported previously. See these threads:
>
>
> http://www.haskell.org/pipermail/haskell-cafe/2012-August/thread.html#102959
>
> http://www.haskell.org/pipermail/haskell-cafe/2012-August/thread.html#103029
>
> I am still waiting for a new release of regex-pcre that fixes this
> issue.
>
> Romildo
>
> ___
> Haskell-Cafe mailing list
> Haskell-Cafe@haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe