Re: [MarkLogic Dev General] unicode character class 'Letter'

Mary Holstege Thu, 03 Jun 2010 19:40:21 -0700

Quite right you are.  There were some errors in our Unicode tables,
which were also built against an older version of the Unicode standard.
As of MLS 4.2: matches("&#x2cc;","\p{L}") => true()


//Mary

On Thu, 03 Jun 2010 17:03:23 -0700, Michael Sokolov <[email protected]> wrote:

> I ran across an anomaly in MarkLogic this week while trying to evaluate a
> regular expression replacement using the Letter class:
>
> replace ($string, "\P{L}", "")
>
> Some characters which are classed as letters AFAICT according to Unicode
> are not treated as letters by MarkLogic.  For example, &#x2cc;, "MODIFIER
> LETTER LOW VERTICAL LINE", is treated as a non-letter.
>
> This link spells out the details:
> http://www.fileformat.info/info/unicode/char/02cc/index.htm
>
> I wouldn't even have noticed if it weren't for the fact that Saxon did
> something different from ML - and I think Java would do the same (based on
> the evidence on the link above, I haven't tested myself) - in Saxon I had
> to use the "modifier letter" class, \p{Lm}, to remove these characters.
>
> I have to say, it doesn't look like a letter to me (it's a little line, a
> stress marker).  MarkLogic performed as I was expecting at first, but
> that's only because I am not a walking Unicode standard.  I think I'd
> prefer it if ML adhered closely to the Unicode standard in cases like
> this, even if it's counterintuitive, if only so that it would behave the
> same as other standards-compliant software.
>
> -Mike
>
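
The Java behaviour that Mike guessed at but had not tested can be checked with a short sketch. This is only an illustration in plain Java (java.util.regex), not MarkLogic or Saxon code; the class and method names (LetterClassDemo, isLetter, stripNonLetters, stripModifierLetters) are made up for the example. It assumes that Java's \p{L} and \p{Lm} follow the Unicode general categories, under which U+02CC is a modifier letter (Lm) and therefore a letter (L).

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the thread.
public class LetterClassDemo {
    // U+02CC MODIFIER LETTER LOW VERTICAL LINE (general category Lm).
    static final String LOW_VERTICAL_LINE = "\u02CC";

    // Rough Java counterpart of the XQuery call matches($s, "\p{L}")
    // for a single-character string.
    static boolean isLetter(String s) {
        return Pattern.matches("\\p{L}", s);
    }

    // Rough Java counterpart of replace($string, "\P{L}", ""):
    // removes every character that is NOT in the Letter class.
    static String stripNonLetters(String s) {
        return s.replaceAll("\\P{L}", "");
    }

    // Removes only modifier letters (category Lm), such as U+02CC.
    static String stripModifierLetters(String s) {
        return s.replaceAll("\\p{Lm}", "");
    }

    public static void main(String[] args) {
        String sample = "a" + LOW_VERTICAL_LINE + "b 1";
        System.out.println(isLetter(LOW_VERTICAL_LINE));
        System.out.println(stripNonLetters(sample));
        System.out.println(stripModifierLetters(sample));
    }
}
```

With a standards-following engine, stripping \P{L} keeps U+02CC because it counts as a letter, so getting rid of it takes a separate pass over \p{Lm}.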