Hi,

Supported by the FSF.hu Foundation, Hungary, I have developed a fast grammar checker and a simple Python framework to speed up grammar checker development for OpenOffice.org:
http://extensions.services.openoffice.org/node/2301

Translating and slightly modifying the template rules is enough to get a minimal grammar checker for a new language. Please report any language-specific problems you find with the grammar checker (morphological and syntactic rules aren't supported yet). I have already added special casing support for Turkish, Azeri and Dutch, but I haven't tried the grammar checking rules with Asian languages yet, for example.

Best regards,
László

P.S. I left out of the manual that the rules are sentence-level regex patterns, so ^ and $ mean sentence boundaries.

P.S. 2 Documentation, sample English rules:

------------- doc/manual.txt ------------

Adding new language support

1. Rename data/tutorial.dat to your locale ID, i.e. xx_YY.dat or xx.dat (xx is the language identifier, YY is the country identifier).

2. Translate the messages, and modify or add rules (see doc/syntax.txt).

3. Type make in the root directory. Without a Unix or Cygwin environment, you can compile your dat file with the following commands in the pythonpath subfolder (replace slashes with backslashes under Windows):

    cd pythonpath
    python Convert.py ../data/your_locale.dat >lightproof_your_locale.py
    python Locale.py ../data/*.dat >lightproof_lang.py

4. Type make dist to zip the distribution, or use your zip compressor in the root directory, e.g.:

    zip -r lightproof.oxt .

5. Install the extension with the Tools->Extension Manager->Add dialog, then check it in OpenOffice.org under Tools->Options->Language Settings->Writing Aids.

Note: Without country identifiers (xx.dat, not xx_XX.dat data files), grammar checking won't be enabled by default for this language. Choose the Lightproof grammar checker on the Writing Aids options page and click the Edit button. Select your language in the Edit Modules dialog and tick the grammar checker.
-----------------------

---------------------- doc/syntax.txt -------------

= Encoding =

UTF-8

= Rule syntax =

pattern -> replacement # message

Basically, pattern and replacement are passed as parameters to the standard Python re.sub() regular expression function (see the documentation of the Python re module for the regular expression syntax).

Example 0. Report "foo" in the text and suggest "bar":

foo -> bar # Use bar instead of foo.

Note: this rule also matches "foo" inside longer words. For whole-word matching, use the zero-length word boundary regex notation \b.

Example 1. Recognize and suggest a missing hyphen:

\bfoo bar\b -> foo-bar # Missing hyphen.

Here \b marks the beginning and the end of the words.

Example 2. Recognize two or more spaces and suggest a single space:

"  +" -> " " # Extra space.

ASCII " characters protect the spaces in the pattern and in the replacement text. The plus sign means 1 or more repetitions of the preceding space.

Example 3. Suggest a word with correct quotation marks:

\"(\w+)\" -> “\1” # Correct quotation marks.

Here \" is an ASCII quotation mark, \w means an arbitrary letter, and + means 1 or more repetitions of the preceding item. The parentheses define a regex group (the word). In the replacement, \1 is a reference to the (first) group of the pattern.

Example 4. Suggest the missing space after the !, ? or . signs:

\b([?!.])([a-zA-Z]+) -> \1 \2 # Missing space?

The [ and ] define a character class. The replacement will contain the actual matching character (?, ! or .), a space, and the word after the punctuation character.

Note: the ? and . characters have special meanings in regular expressions; use the [?] or [.] patterns to match "?" and "." signs in the text.

== Case-insensitive patterns ==

Add the Python "(?i)" notation to the pattern for case-insensitive matching and capitalized suggestions:

(?i)\bfoo bar\b -> foo-bar # Missing hyphen.

The proofreader will also recognize "Foo bar" and "FOO BAR" (and suggest "Foo-bar" instead of "foo-bar" for capitalized matches).

For more specialized casing, you can use grouping or name definitions (see later):

(?i)\b(Foo) (Bar)\b -> \1-\2 # Missing hyphen.

or multiple rules:

\bFoo Bar\b -> Foo-Bar # Missing hyphen.
\bFOO BAR\b -> FOO-BAR # Missing hyphen.

== Multiple suggestions ==

Use \n (new line) in the replacement text to add multiple suggestions:

foo -> Foo\nFOO\nBar\nBAR # Did you mean:

(This gives the suggestions Foo, FOO, Bar and BAR for the input word "foo".)

== Tests ==

It is recommended to add tests for the rules with the TEST keyword:

foo([xy]) -> bar\1 # Did you mean:
TEST: foox -> barx

The rule precompiler will check the matching and the suggestions of the TESTs.

== Name definitions ==

Lightproof supports name definitions to simplify the description of complex rules.

Definition:

name pattern # name definition

Usage in the rules:

"{name} " -> "{name}. " # Missing dot?

{Name}s in the first part of a rule mean subpatterns (groups). {Name}s in the second part of a rule mean back references to the texts matched by the subpatterns.

Example: thousand markers (10000 -> 10,000 or 10 000)

# definitions
d \d\d\d # name definition: 3 digits
d2 \d\d # 2 digits
D \d{1,3} # 1, 2 or 3 digits

# rules
# ISO thousand marker: space, here: no-break space (U+00A0)
\b{d2}{d}\b -> {d2},{d}\n{d2} {d} # Use thousand marker (common or ISO).
\b{D}{d}{d}\b -> {D},{d},{d}\n{D} {d} {d} # Use thousand markers (common or ISO).
TEST: 123456789 -> 123,456,789\n123 456 789

Note: Lightproof uses named groups for name definitions and their references, adding a hidden number to the group names in the form of "_n". You can use these explicit names in the replacement:

\b{d2}{d}\b -> {d2_1},{d_1}\n{d2_1} {d_1} # Use thousand marker (common or ISO).
\b{D}{d}{d}\b -> {D_1},{d_1},{d_2}\n{D_1} {d_1} {d_2} # Use thousand markers (common or ISO).
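
To illustrate the note above, here is a minimal Python sketch of how {name} references could be turned into numbered named groups and back references. It is only an assumption about the mechanism, not Lightproof's actual code: the helper names expand_pattern, expand_replacement and suggest are hypothetical, the implicit numbering is restarted for each suggestion (as the example rules imply), and an ordinary space stands in for the no-break space.

```python
import re

# Matches {name} but not regex quantifiers such as {3} or {1,3}.
NAME_REF = re.compile(r"\{([A-Za-z]\w*)\}")


def expand_pattern(pattern, definitions):
    """Replace each {name} with a named group name_1, name_2, ... (hypothetical helper)."""
    counts = {}

    def to_group(match):
        name = match.group(1)
        counts[name] = counts.get(name, 0) + 1
        return "(?P<%s_%d>%s)" % (name, counts[name], definitions[name])

    return NAME_REF.sub(to_group, pattern)


def expand_replacement(replacement):
    """Split on the literal \\n and turn {name} / {name_n} into \\g<name_n> templates."""
    templates = []
    for part in replacement.split(r"\n"):
        seen = {}  # implicit numbering restarts for every suggestion

        def to_ref(match):
            name = match.group(1)
            if re.search(r"_\d+$", name):  # explicit form, e.g. {d_2}
                return r"\g<%s>" % name
            seen[name] = seen.get(name, 0) + 1
            return r"\g<%s_%d>" % (name, seen[name])

        templates.append(NAME_REF.sub(to_ref, part))
    return templates


def suggest(text, pattern, replacement, definitions):
    """Return one rewritten text per suggestion in the rule."""
    regex = re.compile(expand_pattern(pattern, definitions))
    return [regex.sub(t, text) for t in expand_replacement(replacement)]


definitions = {"d": r"\d\d\d", "d2": r"\d\d", "D": r"\d{1,3}"}
print(expand_pattern(r"\b{D}{d}{d}\b", definitions))
# -> \b(?P<D_1>\d{1,3})(?P<d_1>\d\d\d)(?P<d_2>\d\d\d)\b
print(suggest("123456789", r"\b{D}{d}{d}\b", r"{D},{d},{d}\n{D} {d} {d}", definitions))
# -> ['123,456,789', '123 456 789']
```

With this numbering, the implicit rule {D},{d},{d} and the explicit rule {D_1},{d_1},{d_2} produce the same suggestions.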

Note: back references of name definitions are reset after each new line (\n) in the replacement; see the previous example and the following one:

E ( |$) # name definition: space or end of sentence
"\b[.][.]{E}" -> .{E}\n…{E} # Period or ellipsis?

See data/template.dat for more examples.

-------------------- data/en_US.dat -----------------

# Sample proofreading rules for English

# punctuation
" ([.?!,:;)”—]($| ))" -> \1 # Extra space before punctuation.
"([(“—]) " -> \1 # Extra space after punctuation.
"^[-—] " -> "– " # Hyphen instead of n-dash.
" [-—]([ ,;])" -> " –\1" # Hyphen instead of n-dash.
TEST: ( item ) -> (item)
TEST: A small - but reliable - example. -> A small – but reliable – example.

# definitions
abc [a-z]+
ABC [A-Z]+
Abc [a-zA-Z]+
punct [?!,:;%‰‱˚“”‘]

{Abc}{punct}{Abc} -> {Abc}{punct} {Abc} # Missing space?
{abc}[.]{ABC} -> {abc}. {ABC} # Missing space?
TEST: missing,space -> missing, space
TEST: missing.Space -> missing. Space

(\d+)x(\d+) -> \1×\2 # Multiplication sign.
TEST: 800x600 -> 800×600

# typography
"[.]{3}" -> "…" # Three dot character.
(^|\b|{punct}|[.]) {2,3}\b -> "\1 " # Extra space.
TEST: Extra  space -> Extra space
TEST: End... -> End…

# quotation
\"(\w[^\"“”]*[\w.?!,])\" -> “\1” # Quotation marks.
\B'(\w[^']*[\w.?!,])'\B -> ‘\1’ # Quotation marks.
TEST: "The 'old' boy." -> “The ‘old’ boy.”

# apostrophe
w \w*
(?i){Abc}'{w} -> {Abc}’{w} # Apostrophe.
TEST: o'clock -> o’clock
TEST: singers' voices -> singers’ voices

# words

# frequent mistakes

# silent h
(?i)\ba (honest(y|ly)?|hour(ly|glass)?|honou?r(abl[ey]|ed|ing|ifics?|s)|heir(less|loom)?)\b -> an \1 # Did you mean:
TEST: A heirloom -> An heirloom

# possessive pronouns
(?i)\b(your|her|our|their)['’]s\b -> \1s # Did you mean:
TEST: Your's -> Yours

# duplicates
\b(and|or|for)\b \1 -> \1 # Did you mean:

(?i)\bcomprises of\b -> comprises # Did you mean:

# rare words (potential errors)

# multiword expressions
\bscot free\b -> scot-free\nscotfree # Did you mean:
TEST: scot free -> scot-free\nscotfree # Suggestions separated by new lines (\n)

(?i)\bying and yang\b -> yin and yang # Did you mean:

# accept foreign words only in multiword expressions
(?i)\bde(?! (facto|jure))\b -> de facto\nde jure # Missing Latin expression?
TEST: de standard -> de facto\nde jure standard

# formats

# Thousand separators: 10000 -> 10,000 (common) or 10 000 (ISO standard)

# definitions
d \d\d\d # name definition: 3 digits
d2 \d\d # 2 digits
D \d|\d\d|\d\d\d # 1, 2 or 3 digits

# ISO thousand separators: space, here: no-break space (U+00A0)
\b{d2}{d}\b -> {d2},{d}\n{d2} {d} # Use thousand separators (common or ISO).
\b{D}{d}{d}\b -> {D},{d},{d}\n{D} {d} {d} # Use thousand separators (common or ISO).
\b{D}{d}{d}{d}\b -> {D},{d},{d},{d}\n{D} {d} {d} {d} # Use thousand separators (common or ISO).
TEST: 1234567890 -> 1,234,567,890\n1 234 567 890