RE: [MarkLogic Dev General] Map of all supported diacriticals?

Danny Sokolsky Tue, 27 Feb 2007 13:15:58 -0800

Hi Liza,

The xdmp:diacritic-less function returns its argument with the diacritics
removed, i.e. the flattened character for an accented one.  For example:


xdmp:diacritic-less("é")

returns "e"

or if you want the codepoint:

fn:string-to-codepoints(xdmp:diacritic-less("é"))  

returns 101
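
To get something closer to a full list, one rough approach is to run the
same function over a range of codepoints and keep only the characters it
changes.  This is a minimal sketch, not an official interface to the
mapping table: the range 192 to 383 (U+00C0 to U+017F) and the variable
names are only illustrative assumptions, and it only shows what
xdmp:diacritic-less does to single characters.

```xquery
(: Illustrative range only: U+00C0 to U+017F.  Widen it as needed. :)
for $cp in (192 to 383)
let $char := fn:codepoints-to-string($cp)
let $flat := xdmp:diacritic-less($char)
(: Compare codepoints so the test does not depend on the default collation. :)
where fn:not(fn:deep-equal(fn:string-to-codepoints($char),
                           fn:string-to-codepoints($flat)))
return fn:concat($char, " -> ", $flat)
```

Each item returned pairs a character with its flattened form; characters
that the function leaves unchanged are skipped.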

Does this give you what you are looking for?

Thanks,
-Danny


-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Liza Daly
Sent: Tue 2/27/2007 12:38 PM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] Map of all supported diacriticals?
 
Thanks Jim, these are good ideas if I'll need to extend the mappings 
myself because MarkLogic doesn't provide a more direct interface.

I would still love to hear from someone at ML how to retrieve the list 
of mappings that are provided by default, though.

--Liza

James A. Robinson wrote:
>> Is it possible to get a list of all mappings between characters with 
>> diacriticals and their "flattened" ASCII equivalents?
>>
>> Similarly, is there a way to extend or modify this mapping in the 
>> current version of MarkLogic?
> 
> I'm new to MLS, so I don't know if there is a way to do it there.
> Hopefully there is; I think it'd be a really nice feature with regard to
> content management (e.g., being able to build pages friendly to older
> browsers w/o having to go through a lot of external processing steps).
> 
> Failing that, in theory I think one could do this using either Java by
> itself or using Java paired with XSLT or perhaps XQuery.
> 
>   http://www.w3.org/TR/xslt20/#element-output
>   http://en.wikipedia.org/wiki/Unicode_normalization
>   http://unicode.org/reports/tr15/#Decomposition
> 
> What I'm thinking you could do is create a map of the characters you
> want to flatten and process them with something which makes two copies
> of the special characters in attribute fields, outputting everything
> in NFD (decomposed) normalization form.  You then postprocess that with
> a program operating on the byte level which can strip out the 'diacritic'
> characters from one of the fields, leaving you with a mapping of the
> accented character to a flattened version.
> 
> Attached below are two files, pmap.xsl and pmap.xml.  The .xml was built
> by running Saxon against the .xsl file.  It should, I think, be possible
> to then use Java (or C or C++ I suppose) to read in the .xml file using
> a byte stream, and upon hitting the 'flattened' attribute, start dropping
> bytes which aren't in the range 1-127, until you hit the quote.  It may or
> may not be possible to do that in a nicer fashion using a real XML parser.
> I suspect most parsers will consume the NFD form and turn it into UTF-16
> or whatever they use to internally represent the characters.
> 
> 
> Jim
> 
> 
> 
> ------------------------------------------------------------------------
> 
> 
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> James A. Robinson                       [EMAIL PROTECTED]
> Stanford University HighWire Press      http://highwire.stanford.edu/
> +1 650 7237294 (Work)                   +1 650 7259335 (Fax)
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> General mailing list
> [email protected]
> http://xqzone.com/mailman/listinfo/general
