Re: [Haskell-cafe] A weird bug of regex-pcre

2012-12-18 Thread Magicloud Magiclouds
I see. A known bug. Thank you all.


On Tue, Dec 18, 2012 at 10:11 PM, Rico Moorman wrote:

> regex = "]+>.*?"-
>>
>
> And mind the sneaky single "-" ... it does not belong there ;-)
>
>
>



-- 
竹密岂妨流水过
山高哪阻野云飞

And for G+, please use magiclouds#gmail.com.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] A weird bug of regex-pcre

2012-12-18 Thread Rico Moorman
>
> regex = "]+>.*?"-
>

And mind the sneaky single "-" ... it does not belong there ;-)


Re: [Haskell-cafe] A weird bug of regex-pcre

2012-12-18 Thread Rico Moorman
I had similar issues a while ago. It had to do with UTF-8 encoding as far
as I can recall.

I wanted to "wrap" a multiline string (code listings) within some pandoc
generated HTML of a hakyll page with a container "div". The text to wrap
would be determined using a PCRE regex.

Here is the (probably inefficient) implementation:

module Transformations where

import Hakyll
import qualified Text.Regex.PCRE as RE
import qualified Data.ByteString.UTF8 as BSU
import qualified Data.ByteString as BS

-- Wraps numbered code listings within the page body with a div
-- in order to be able to apply some more specific styling.
wrapNumberedCodelistings (Page meta body) =
    Page meta newBody
  where
    newBody = regexReplace' regex wrap body
    regex = "]+>.*?"-
    wrap x = "" ++ x ++ ""


-- Replace the whole string matched by the given
-- regex using the given replacement function (hopefully UTF8-aware)
regexReplace' :: String -> (String -> String) -> String -> String
regexReplace' pattern replace text = BSU.toString $ go textUTF8
  where
    patternUTF8 = BSU.fromString pattern
    textUTF8 = BSU.fromString text
    replaceUTF8 x = BSU.fromString $ replace $ BSU.toString x
    regex = RE.makeRegexOpts compOpts RE.defaultExecOpt $
        BSU.fromString pattern
    compOpts = RE.compMultiline + RE.compDotAll + RE.compUTF8 +
        RE.compNoUTF8Check
    go part = case RE.matchM regex part of
        Just (before, match, after) ->
            BS.concat [before, replaceUTF8 match, go after]
        _ -> part
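
The go loop above follows a general pattern: find the first match, keep
the text before it, transform the match, and continue with the text after
it. As a rough sketch of that pattern alone, here is a version that needs
only base. The names Matcher, literal and replaceAllWith are made up for
this illustration, and literal is just a toy stand-in for a real regex
match. The guard on an empty match is an addition of this sketch; without
it a pattern that matches the empty string would recurse forever.

```haskell
import Data.List (isPrefixOf)

-- A matcher splits its input into (before, match, after), or fails.
type Matcher = String -> Maybe (String, String, String)

-- Hypothetical stand-in for a regex: finds the first literal occurrence
-- of the given needle.
literal :: String -> Matcher
literal needle = go []
  where
    go _ [] = Nothing
    go acc rest@(c:cs)
      | needle `isPrefixOf` rest =
          Just (reverse acc, needle, drop (length needle) rest)
      | otherwise = go (c : acc) cs

-- Replace every match, resuming the search after each one.
replaceAllWith :: Matcher -> (String -> String) -> String -> String
replaceAllWith match replace = go
  where
    go text = case match text of
      Just (before, m, after)
        | null m    -> text  -- stop on an empty match instead of looping
        | otherwise -> before ++ replace m ++ go after
      Nothing -> text
```

For example, replaceAllWith (literal "ab") (\m -> "[" ++ m ++ "]") "xabyab"
evaluates to "x[ab]y[ab]".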


The discussion back then was
http://www.haskell.org/pipermail/beginners/2012-June/010064.html

Hope this helps.

Best regards,

Rico Moorman


P.S. Sorry for the double email Magicloud ... didn't hit reply all at first

On Tue, Dec 18, 2012 at 10:43 AM, José Romildo Malaquias <
j.romi...@gmail.com> wrote:

> On Tue, Dec 18, 2012 at 02:28:26PM +0800, Magicloud Magiclouds wrote:
> > Attachment is the test text file.
> > And I tested my regexp as this:
> >
> > Prelude> :m + Text.Regex.PCRE
> > Prelude Text.Regex.PCRE> z <- readFile "test.html"
> > Prelude Text.Regex.PCRE> let (b, m ,a, ss) = z =~ " > href=\"(.*?)\">.*? String,
> > [String])
> > Prelude Text.Regex.PCRE> b
> > ...
> > n of the Triumvirate\r\nDavid
> Rapoza\r\n
> >\r\n  Return to Ravnica\r\n
>  \r\n
> >10/31/2012\r\n  \r\n   > class=\"small\"><"
> > Prelude Text.Regex.PCRE> m
> > "a href=\"/magic/magazine/article.aspx?x=mtg/daily/activity/1088\"> > class=\"article-image\" "
> >
> > From the value of b and m, it was weird that the match was moved forward
> > by 1 char (the ss (sub-match) was even worse: 2 chars). Rematching on a
> > and so on gave correct results. It was only the first match that was
> > broken.
> > Tested with regex-posix (with modified regexp), everything is OK.
>
> I have a similar issue with non-ASCII strings. It seems that the
> internal representations used by Haskell and pcre are different: one
> of them is counting bytes and the other is counting code points. So they
> diverge when a multi-byte encoding (like UTF-8) is used.
>
> It has been reported previously. See these threads:
>
>
> http://www.haskell.org/pipermail/haskell-cafe/2012-August/thread.html#102959
>
> http://www.haskell.org/pipermail/haskell-cafe/2012-August/thread.html#103029
>
> I am still waiting for a new release of regex-pcre that fixes this
> issue.
>
> Romildo
>


Re: [Haskell-cafe] A weird bug of regex-pcre

2012-12-18 Thread José Romildo Malaquias
On Tue, Dec 18, 2012 at 02:28:26PM +0800, Magicloud Magiclouds wrote:
> Attachment is the test text file.
> And I tested my regexp as this:
> 
> Prelude> :m + Text.Regex.PCRE
> Prelude Text.Regex.PCRE> z <- readFile "test.html"
> Prelude Text.Regex.PCRE> let (b, m ,a, ss) = z =~ " href=\"(.*?)\">.*? [String])
> Prelude Text.Regex.PCRE> b
> ...
> n of the Triumvirate\r\nDavid Rapoza\r\n
>\r\n  Return to Ravnica\r\n\r\n
>10/31/2012\r\n  \r\n   class=\"small\"><"
> Prelude Text.Regex.PCRE> m
> "a href=\"/magic/magazine/article.aspx?x=mtg/daily/activity/1088\"> class=\"article-image\" "
> 
> From the value of b and m, it was weird that the match was moved forward
> by 1 char (the ss (sub-match) was even worse: 2 chars). Rematching on a
> and so on gave correct results. It was only the first match that was
> broken.
> Tested with regex-posix (with modified regexp), everything is OK.

I have a similar issue with non-ASCII strings. It seems that the
internal representations used by Haskell and pcre are different: one
of them is counting bytes and the other is counting code points. So they
diverge when a multi-byte encoding (like UTF-8) is used.
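
To make the mismatch concrete, here is a minimal sketch that uses only
base. It assumes the diagnosis above is right, i.e. that offsets come back
counted in UTF-8 bytes while a Haskell String is indexed by code point.
The helper names utf8Width, utf8Length and byteToCharOffset are made up
for this illustration and are not part of regex-pcre.

```haskell
import Data.Char (ord)

-- Number of bytes one code point occupies when encoded as UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- Length of a String in UTF-8 bytes (what a byte-oriented C library sees).
utf8Length :: String -> Int
utf8Length = sum . map utf8Width

-- Translate a byte offset into the UTF-8 encoding of a String back into a
-- code point offset. An offset that falls inside a multi-byte character is
-- rounded up to the next character boundary, and the result is clamped to
-- the length of the String.
byteToCharOffset :: String -> Int -> Int
byteToCharOffset s byteOff =
  min (length s) (length (takeWhile (< byteOff) charStarts))
  where
    -- Byte offset at which each character starts, plus the total length.
    charStarts = scanl (+) 0 (map utf8Width s)
```

In the string "é<a" the "<" starts at byte offset 2 but at code point
offset 1. Using the byte offset directly, splitAt 2 "é<a" gives
("é<", "a"), so the split lands one character too far, much like the
shifted first match described above. Converting first, splitAt
(byteToCharOffset "é<a" 2) "é<a" gives ("é", "<a").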

It has been reported previously. See these threads:

http://www.haskell.org/pipermail/haskell-cafe/2012-August/thread.html#102959
http://www.haskell.org/pipermail/haskell-cafe/2012-August/thread.html#103029

I am still waiting for a new release of regex-pcre that fixes this
issue.

Romildo
