Hi Liza,
The xdmp:diacritic-less function returns its string argument with the
diacritics removed from each character. For example:
xdmp:diacritic-less("é")
returns "e"
or, if you want the codepoint:
fn:string-to-codepoints(xdmp:diacritic-less("é"))
returns 101
Does this give you what you are looking for?
Thanks,
-Danny
-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Liza Daly
Sent: Tue 2/27/2007 12:38 PM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] Map of all supported diacriticals?
Thanks Jim, these are good ideas if I'll need to extend the mappings
myself because MarkLogic doesn't provide a more direct interface.
I would still love to hear from someone at ML how to retrieve the list
of mappings that are provided by default, though.
--Liza
James A. Robinson wrote:
>> Is it possible to get a list of all mappings between characters with
>> diacriticals and their "flattened" ASCII equivalents?
>>
>> Similarly, is there a way to extend or modify this mapping in the
>> current version of MarkLogic?
>
> I'm new to MLS, so I don't know if there is a way to do it there.
> Hopefully there is; I think it'd be a really nice feature with regard to
> content management (e.g., being able to build pages friendly to older
> browsers w/o having to go through a lot of external processing steps).
>
> Failing that, in theory I think one could do this using either Java by
> itself or using Java paired with XSLT or perhaps XQuery.
>
> http://www.w3.org/TR/xslt20/#element-output
> http://en.wikipedia.org/wiki/Unicode_normalization
> http://unicode.org/reports/tr15/#Decomposition
>
> What I'm thinking you could do is create a map of the characters you
> want to flatten and process them with something which makes two copies
> of each special character in attribute fields, outputting everything
> in NFD (decomposed) form. You then postprocess that with a program
> operating on the byte level which can strip out the 'diacritic'
> characters from one of the fields, leaving you with a mapping of the
> accented character to a flattened version.
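A minimal sketch of the NFD idea described above, written in Python rather
than the XSLT/Java mentioned in the message (a substitution made for
brevity). It decomposes each character and drops the combining marks. The
function name flatten_map and the sample characters are hypothetical, and
the result reflects Unicode canonical decomposition only; it is not
necessarily the same mapping MarkLogic applies internally.

```python
import unicodedata

def flatten_map(chars):
    """Map each character to its NFD form with combining marks removed."""
    mapping = {}
    for ch in chars:
        decomposed = unicodedata.normalize("NFD", ch)
        # unicodedata.combining() is non-zero for combining marks such as U+0301.
        flattened = "".join(c for c in decomposed if not unicodedata.combining(c))
        mapping[ch] = flattened
    return mapping

# Sample input: e-acute, u-diaeresis, n-tilde, c-cedilla (written as escapes
# so the source file's own normalization form does not matter).
print(flatten_map("\u00e9\u00fc\u00f1\u00e7"))
```

Characters with no canonical decomposition (for example U+00F8, "o with
stroke") come back unchanged, so a complete table would still need
hand-written entries for those.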
>
> Attached below are two files, pmap.xsl and pmap.xml. The .xml was built
> by running Saxon against the .xsl file. It should, I think, be possible
> to then use Java (or C or C++ I suppose) to read in the .xml file using
> a byte stream, and upon hitting the 'flattened' attribute, start dropping
> bytes which aren't in the range 1-127, until you hit the quote. It may or
> may not be possible to do that in a nicer fashion using a real XML parser.
> I suspect most parsers will consume the NFD form and turn it into UTF-16
> or whatever they use to internally represent the characters.
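The byte-level step described above might look roughly like this in Python
(again a stand-in for the Java or C mentioned; strip_non_ascii_bytes is a
hypothetical name). It assumes the NFD text is encoded as UTF-8, where
every byte of a multi-byte sequence is 128 or higher, so dropping those
bytes leaves the ASCII base letters intact.

```python
import unicodedata

def strip_non_ascii_bytes(text):
    """Normalize to NFD, encode as UTF-8, and keep only bytes in the range 1-127."""
    nfd_bytes = unicodedata.normalize("NFD", text).encode("utf-8")
    # In NFD, "e-acute" becomes "e" followed by U+0301, whose UTF-8 bytes
    # (0xCC 0x81) fall outside 1-127 and are discarded here.
    kept = bytes(b for b in nfd_bytes if 1 <= b <= 127)
    return kept.decode("ascii")

print(strip_non_ascii_bytes("caf\u00e9"))  # prints "cafe"
```

One caveat with this approach: a non-ASCII character that has no
decomposition is removed entirely rather than flattened, so it suits
accented Latin letters better than arbitrary text.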
>
>
> Jim
>
>
>
> ------------------------------------------------------------------------
>
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> James A. Robinson [EMAIL PROTECTED]
> Stanford University HighWire Press http://highwire.stanford.edu/
> +1 650 7237294 (Work) +1 650 7259335 (Fax)
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> General mailing list
> [email protected]
> http://xqzone.com/mailman/listinfo/general