On 2010-10-03 <17:43:21>, Thomas A. Schmitz wrote:
> OK, I'll write something for German and English, but the thing
> is that we need more input what users expect. For mixtures with
> foreign languages, there might not be generally accepted rules at
> all, so people will define something on an ad-hoc basis.

Hi Thomas and others,

technically speaking the problem is solved by ISO 14651.[1] In
practice, multilingual sorting depends on local rules, of which “one
index per script or language” seems to be the most common.

Some time ago I made an LPeg grammar from the BNF in [1]. It matches
the collation rules from [2], but as I couldn’t figure out how to map
them onto ConTeXt’s sorting mechanism I never got around to actually
capturing the information. As I won’t have the time to try it with
the new structure of sort-lan, I’ll just attach the PEG grammar for
anyone to use as a starting point. Unicode collation would be great
to have in ConTeXt.

> transliteration. The problem with polytonic Greek is that so many
> different unicode characters need to have the same sort entry. If

Isn’t that just what the Greek rules in sort-lan.lua do? If not, then
it would be a bug.

····startsnippet·················································
definitions["gr"] = {
    entries = {
        ["α"] = "α", ["ά"] = "α", ["ὰ"] = "α", ["ᾶ"] = "α", ["ᾳ"] = "α", ["ἀ"] = "α",
        ["ἁ"] = "α", ["ἄ"] = "α", ["ἂ"] = "α", ["ἆ"] = "α", ["ἁ"] = "α", ["ἅ"] = "α",
        ["ἃ"] = "α", ["ἇ"] = "α", ["ᾁ"] = "α", ["ᾴ"] = "α", ["ᾲ"] = "α", ["ᾷ"] = "α",
        ["ᾄ"] = "α", ["ᾂ"] = "α", ["ᾅ"] = "α", ["ᾃ"] = "α", ["ᾆ"] = "α", ["ᾇ"] = "α",
        ["β"] = "β",
····stopsnippet··················································

Always nice to have a decent discussion on sorting ;)

Philipp

[1] http://standards.iso.org/ittf/PubliclyAvailableStandards/c044872_ISO_IEC_14651_2007(E).zip
[2] http://www.iso.org/ittf/ISO14651_2006_TABLE1_En.txt

-- 
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments

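To illustrate the point about the Greek entries: a table like the one quoted above lets many polytonic code points share one sort entry, because each character is replaced by its base letter before strings are compared. What follows is only a minimal sketch in plain Lua, not ConTeXt's actual sorting code. The reduced `entries` table, the `utf8char` pattern and the `sortkey` helper are names assumed for illustration.

```lua
-- Hypothetical illustration; this is NOT how sort-lan.lua is wired into ConTeXt.
-- A handful of the "gr" entries quoted above.
local entries = {
    ["α"] = "α", ["ά"] = "α", ["ὰ"] = "α", ["ᾶ"] = "α", ["ἀ"] = "α", ["ἄ"] = "α",
    ["β"] = "β",
}

-- One UTF-8 character: a non-continuation byte plus its continuation bytes.
local utf8char = "[^\128-\191][\128-\191]*"

-- Replace every character that has an entry by that entry; keep the rest.
local function sortkey(word)
    return (word:gsub(utf8char, function(ch) return entries[ch] or ch end))
end

-- Sort by the derived keys instead of by the raw strings.
local words = { "βά", "ἀβ", "ᾶα" }
table.sort(words, function(a, b) return sortkey(a) < sortkey(b) end)

for _, w in ipairs(words) do
    print(w, sortkey(w))
end
```

With such a mapping, words that differ only in accents or breathings get identical keys, so a secondary comparison on the original string would be needed to order them deterministically.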
local lpeg = require "lpeg"

local C, Cs, Ct, P, R, S, V, match = lpeg.C, lpeg.Cs, lpeg.Ct, lpeg.P, lpeg.R, lpeg.S, lpeg.V, lpeg.match

local rules = P{
    [1] = "weight_table",

    -- Define collation tables as sequences of lines
    -- (tailored_table first: it accepts every line common_template_table
    -- does, and ordered choice would otherwise never reach it)
    --weight_table                = V"common_template_table" + V"tailored_table",
    weight_table                  = V"tailored_table" + V"common_template_table",
    common_template_table         = V"simple_line"^0,
    tailored_table                = V"table_line"^0,

    -- Define the line types
    simple_line                   = (V"symbol_definition" + V"collating_element" + V"weight_assignment" + V"order_end")^-1
                                  * V"line_completion"
                                  --/ function (first) io.write("simple: "..first) end
                                  ,
    --table_line                  = V"simple_line" + V"tailoring_line",
    table_line                    = V"tailoring_line" + V"simple_line",
    tailoring_line                = (V"reorder_after" + V"order_start" + V"reorder_end" + V"section_definition" + V"reorder_section_after")
                                  * V"line_completion"
                                  --/ function (first) io.write("tailoring: "..first) end
                                  ,

    -- Define the basic syntax for collation weighting
    symbol_definition             = P"collating-symbol" * V"space"^1 * V"symbol_element",
    symbol_element                = V"symbol"-V"symbol_range" + V"symbol_range",
    symbol_range                  = V"symbol" * P".." * V"symbol",
    symbol                        = V"simple_symbol" + V"ucs_symbol",
    ucs_symbol                    = (P"<U"  * V"one_to_eight_digit_hex_string" * P">")
                                  + (P"<U-" * V"one_to_eight_digit_hex_string" * P">"),
    simple_symbol                 = P"<" * V"identifier" * P">",
    collating_element             = P"collating-element" * V"space"^1 * V"symbol" * V"space"^1 * P"from" * V"space"^1 * V"quoted_symbol_sequence",
    quoted_symbol_sequence        = P'"' * V"simple_weight"^1 * P'"',
    --weight_assignment           = V"simple_weight" + V"symbol_weight",
    weight_assignment             = V"symbol_weight" + V"simple_weight",
    simple_weight                 = V"symbol_element" + P"UNDEFINED",
    symbol_weight                 = V"symbol_element" * V"space"^1 * V"weight_list",
    weight_list                   = V"level_token" * (V"semicolon" * V"level_token")^0,
    level_token                   = V"symbol_group" + P"IGNORE",
    symbol_group                  = V"symbol_element" + V"quoted_symbol_sequence",
    order_end                     = P"order_end",

    -- Define the tailoring syntax
    reorder_after                 = P"reorder-after" * V"space"^1 * V"target_symbol",
    target_symbol                 = V"symbol",
    order_start                   = P"order_start" * V"space"^1 * V"multiple_level_direction",
    multiple_level_direction      = V"direction" * (V"semicolon" * V"direction")^0 * P",position"^-1,
    direction                     = P"forward" + P"backward",
    reorder_end                   = P"reorder-end",
    section_definition            = V"section_definition_simple" + V"section_definition_list",
    section_definition_simple     = P"section" * V"space"^1 * V"section_identifier",
    section_identifier            = V"identifier",
    section_definition_list       = P"section" * V"space"^1 * V"section_identifier" * V"space"^1 * V"symbol_list",
    symbol_list                   = V"symbol_element" * (V"semicolon" * V"symbol_element")^0,
    reorder_section_after         = P"reorder-section-after" * V"space"^1 * V"section_identifier" * V"space"^1 * V"target_symbol",

    -- Define low-level tokens used by the rest of the syntax
    identifier                    = (V"letter" + V"digit") * V"id_part"^0,
    id_part                       = V"letter" + V"digit" + S"-_",
    line_completion               = V"space"^0 * V"comment"^-1 * V"EOL",
    comment                       = V"comment_char" * V"character"^0,
    -- one to eight digits (a bare ^-8 would also accept zero digits)
    one_to_eight_digit_hex_string = V"hex_upper" * V"hex_upper"^-7,
    hex_numeric_string            = V"hex_upper"^1,
    space                         = S" \t",
    semicolon                     = P";",
    comment_char                  = P"%",
    digit                         = R"09",
    hex_upper                     = V"digit" + S"ABCDEF",
    letter                        = R"az" + R"AZ",
    EOL                           = P"\n",
    character                     = 1-V"EOL",
}

-- Read the table from [2] (saved locally) and print the position
-- right after the last line the grammar could match.
local f = assert(io.open("iso14651.txt", "r"))
local tab = f:read("*all")
f:close()

--rules:print()
print(rules:match(tab))
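
As a possible next step toward capturing the information, here is a hedged sketch that pulls the symbol and its per-level weights out of a single weight-assignment line. It uses a simplified stand-alone pattern rather than the full grammar above: symbols are reduced to `<...>` containing letters, digits, `-` and `_`, and quoted symbol sequences and ranges are not handled. The `parse_weight_line` helper and the sample line are assumptions for illustration only.

```lua
local lpeg = require "lpeg"
local C, Ct, P, R, S = lpeg.C, lpeg.Ct, lpeg.P, lpeg.R, lpeg.S

local space     = S" \t"
local semicolon = P";"
-- Simplified symbol: <U0061>, <BASE>, <U-00010000>, ...
local symbol    = P"<" * (R("az", "AZ", "09") + S"-_")^1 * P">"
-- One weight level is either a symbol or the keyword IGNORE.
local level     = C(symbol + P"IGNORE")
local comment   = P"%" * P(1)^0

-- symbol, whitespace, semicolon-separated levels, optional comment, end of line
local weight_line = Ct(C(symbol) * space^1 * Ct(level * (semicolon * level)^0))
                  * space^0 * comment^-1 * P(-1)

-- Returns { symbol = ..., weights = { ... } }, or nil if the line does not match.
local function parse_weight_line(line)
    local t = weight_line:match(line)
    if t then
        return { symbol = t[1], weights = t[2] }
    end
end

-- Assumed sample line in the style of the ISO 14651 table.
local r = parse_weight_line("<U0061> <S0061>;<BASE>;<MIN>;<U0061> % LATIN SMALL LETTER A")
print(r.symbol, table.concat(r.weights, " | "))
```

The same table captures could be attached to the `symbol_weight` rule of the grammar, at which point the result could be converted into whatever layout sort-lan expects.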
___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : http://foundry.supelec.fr/projects/contextrev/
wiki     : http://contextgarden.net
___________________________________________________________________________________