[MarkLogic Dev General] unicode character class 'Letter'

Michael Sokolov Thu, 03 Jun 2010 17:03:49 -0700

I ran across an anomaly in MarkLogic this week while trying to evaluate a
regular expression replacement using the Letter class:


replace ($string, "\P{L}", "")

Some characters that, AFAICT, Unicode classes as letters are not treated as
letters by MarkLogic.  For example, &#x2cc; (U+02CC, "MODIFIER LETTER LOW
VERTICAL LINE") is treated as a non-letter.

This link spells out the details:
http://www.fileformat.info/info/unicode/char/02cc/index.htm

I wouldn't even have noticed if Saxon hadn't done something different from
ML.  I think Java would do the same as Saxon (based on the evidence at the
link above; I haven't tested it myself).  In Saxon I had to use the "modifier
letter" class, \p{Lm}, to remove these characters.
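
The Java behaviour mentioned above can be checked with a minimal sketch like
the one below. It assumes a standard JDK; the class name LetterClassCheck and
its method names are hypothetical, made up for illustration. It asks the JDK
for the general category of U+02CC and applies both the \P{L} replacement and
a \p{Lm} replacement to a short sample string.

```java
// Sketch only: LetterClassCheck and its methods are hypothetical names,
// not part of MarkLogic, Saxon, or the JDK.
final class LetterClassCheck {
    // U+02CC MODIFIER LETTER LOW VERTICAL LINE
    static final int LOW_VERTICAL_LINE = 0x02CC;

    // True if the JDK's Unicode tables put the code point in category Lm.
    static boolean isModifierLetter(int codePoint) {
        return Character.getType(codePoint) == Character.MODIFIER_LETTER;
    }

    // Same pattern as the XQuery call: delete everything that is not a letter.
    static String stripNonLetters(String s) {
        return s.replaceAll("\\P{L}", "");
    }

    // Delete only modifier letters (category Lm).
    static String stripModifierLetters(String s) {
        return s.replaceAll("\\p{Lm}", "");
    }

    public static void main(String[] args) {
        String sample = "a\u02CCb1";
        System.out.println(isModifierLetter(LOW_VERTICAL_LINE));
        // Length is printed instead of the text to avoid console encoding issues.
        System.out.println(stripNonLetters(sample).length());
        System.out.println(stripModifierLetters(sample));
    }
}
```

If U+02CC survives stripNonLetters and disappears under
stripModifierLetters, Java is treating it as a letter in the Lm subcategory,
which matches what Saxon did.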

I have to say, it doesn't look like a letter to me (it's a little line, a
stress marker).  MarkLogic performed as I was expecting at first, but that's
only because I am not a walking Unicode standard.  I think I'd prefer it if
ML adhered closely to the Unicode standard in cases like this, even when it's
counterintuitive, if only so that it would behave the same as other
standards-compliant software.

-Mike
