[lingu-dev] Lightproof grammar checker 1.0

Németh László Fri, 24 Apr 2009 20:08:38 -0700

Hi,

Supported by the FSF.hu Foundation, Hungary, I have developed a fast
grammar checker and a simple framework in Python to speed up grammar
checker development for OpenOffice.org:


http://extensions.services.openoffice.org/node/2301

Translating and slightly modifying the template rules is enough for a
minimal grammar checker for a new language.

Please report any language-specific problems you find with the
grammar checker (morphological and syntactic rules aren't supported
yet). I have already added special casing support for Turkish, Azeri
and Dutch, but, for example, I haven't tried the grammar checking rules
with Asian languages yet.

Best regards,
László

P.S. I left out of the manual that the rules are sentence-level
regex patterns, so ^ and $ mean sentence boundaries.
P.S. 2 Documentation, sample English rules:

------------- doc/manual.txt ------------
Adding new language support

1. Rename data/tutorial.dat to your locale ID, i.e. xx_YY.dat (with
   language and country identifiers) or xx.dat (language identifier only).

2. Translate messages, modify or add new rules (see doc/syntax.txt).

3. Type make in the root directory. Without a Unix or Cygwin
   environment, you can compile your dat file with the
   following commands in the pythonpath subfolder (replace slashes
   with backslashes under Windows):

   cd pythonpath
   python Convert.py ../data/your_locale.dat >lightproof_your_locale.py
   python Locale.py ../data/*.dat >lightproof_lang.py

4. Type make dist to zip the distribution, or use your zip compressor
   in the root directory, e.g.:

   zip -r lightproof.oxt .

5. Install the extension with the Tools->Extension Manager->Add
   dialog, then check it in OpenOffice.org under Tools->Options->
   Language Settings->Writing Aids.

   Note: Without a country identifier (xx.dat instead of xx_YY.dat)
   the grammar checker won't be the default for this language. In that
   case select the Lightproof grammar checker on the Writing Aids options
   page and click the Edit button. Select your language in the Edit
   Modules dialog, and tick the grammar checker.
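
If no zip tool is at hand, step 4 can be approximated with Python's
standard zipfile module. This is only a sketch and not part of Lightproof:
the helper name make_oxt is hypothetical, and it simply packs every file
under the given root directory, much like zip -r lightproof.oxt . does.

```python
import os
import zipfile


def make_oxt(root=".", target="lightproof.oxt"):
    """Zip every file under root into target (hypothetical helper)."""
    target_path = os.path.abspath(target)
    names = []
    with zipfile.ZipFile(target_path, "w", zipfile.ZIP_DEFLATED) as oxt:
        for folder, _subdirs, files in os.walk(root):
            for name in files:
                path = os.path.join(folder, name)
                # Do not pack the archive into itself.
                if os.path.abspath(path) == target_path:
                    continue
                arcname = os.path.relpath(path, root)
                oxt.write(path, arcname)
                names.append(arcname)
    return sorted(names)
```

Calling make_oxt() in the root directory of the extension writes
lightproof.oxt there and returns the list of packed file names.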
-----------------------

---------------------- doc/syntax.txt -------------
= Encoding =

UTF-8

= Rule syntax =

pattern -> replacement # message

Basically, pattern and replacement are passed as parameters to the
standard Python re.sub() regular expression function (see the
documentation of the Python re module for the regular expression syntax).

Example 0. Report "foo" in the text and suggest "bar":

foo -> bar                      # Use bar instead of foo.

Note: this rule also matches "foo" inside longer words. To match
whole words only, use the zero-length word boundary regex
notation \b.
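
Since pattern and replacement end up in re.sub(), the effect of a rule
can be tried out directly in Python. A minimal sketch follows; the helper
apply_rule is hypothetical and leaves out messages, quoting and everything
else Lightproof does around re.sub().

```python
import re


def apply_rule(pattern, replacement, text):
    """Return text with every match of pattern replaced (hypothetical helper)."""
    return re.sub(pattern, replacement, text)


# The plain pattern also matches inside longer words.
print(apply_rule(r"foo", "bar", "foo and foobar"))      # bar and barbar
# With \b on both sides only the whole word matches.
print(apply_rule(r"\bfoo\b", "bar", "foo and foobar"))  # bar and foobar
```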

Example 1. Recognize and suggest missing hyphen:

\bfoo bar\b -> foo-bar          # Missing hyphen.

(Here \b marks the beginning and the end of the words.)

Example 2. Recognize two or more spaces and suggest a single space:

"  +" -> " "                    # Extra space.

ASCII " characters protect spaces in the pattern and in the replacement text.
Plus sign means 1 or more repetitions of the previous space.
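
How Lightproof removes the protecting quotes is not shown here, so the
following sketch assumes the simplest behaviour: one pair of surrounding
ASCII quotes is stripped before the text is handed to re.sub(). The helper
name unquote is hypothetical.

```python
import re


def unquote(field):
    """Strip one pair of protecting ASCII quotes, if present (assumed behaviour)."""
    field = field.strip()
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1]
    return field


pattern = unquote('"  +"')      # a space followed by one or more spaces
replacement = unquote('" "')    # a single space
print(re.sub(pattern, replacement, "Extra   space  here"))  # Extra space here
```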

Example 3. Suggest a word with correct quotation marks:

\"(\w+)\" -> “\1”               # Correct quotation marks.

(Here \" is an ASCII quotation mark, \w means an arbitrary word
character (letter, digit or underscore), and + means 1 or more
repetitions of the previous object.
The parentheses define a regex group (the word). In the
replacement, \1 is a reference to the (first) group of the pattern.)
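
The group and its back reference can be checked with plain re.sub(). The
sample sentences below are made up for illustration:

```python
import re

pattern = r'\"(\w+)\"'      # ASCII quotes around one word, captured as group 1
replacement = r"“\1”"       # \1 re-inserts the captured word

print(re.sub(pattern, replacement, 'a "word" here'))  # a “word” here
# \w+ does not match spaces, so a quoted phrase is left untouched.
print(re.sub(pattern, replacement, '"two words"'))    # "two words"
```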

Example 4. Suggest the missing space after the !, ? or . signs:

\b([?!.])([a-zA-Z]+) -> \1 \2   # Missing space?

The [ and ] define a character class; the replacement will contain
the actual matching character (?, ! or .), a space and the word after
the punctuation character.
Note: ? and . characters have special meanings in regular expressions,
use [?] or [.] patterns to check "?" and "." signs in the text.
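
A short check of Example 4 and of the note on special characters, using
only the standard re module (the sample strings are made up):

```python
import re

# The rule of Example 4, passed to re.sub() unchanged.
pattern = r"\b([?!.])([a-zA-Z]+)"
replacement = r"\1 \2"
print(re.sub(pattern, replacement, "Stop!Now"))   # Stop! Now

# An unescaped . matches any character; [.] matches only a literal dot.
print(re.findall(r"a.c", "abc a.c"))    # ['abc', 'a.c']
print(re.findall(r"a[.]c", "abc a.c"))  # ['a.c']
```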

== Case-insensitive patterns ==

Add the Python "(?i)" notation to the pattern for case insensitive
matching and capitalized suggestions:

(?i)\bfoo bar\b -> foo-bar      # Missing hyphen.

The proofreader will also recognize "Foo bar" and "FOO BAR"
(and suggest "Foo-bar" instead of "foo-bar" for capitalized matches).

For more special casing, you can use grouping or name definitions (see
later):

(?i)\b(Foo) (Bar)\b -> \1-\2    # Missing hyphen.

or multiple rules:

\bFoo Bar\b -> Foo-Bar          # Missing hyphen.
\bFOO BAR\b -> FOO-BAR          # Missing hyphen.
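
How Lightproof builds the capitalized suggestion is not described above.
The sketch below assumes one simple scheme: if the matched text starts
with an upper case letter, the first letter of the suggestion is upper
cased too. The helper name suggest is hypothetical.

```python
import re


def suggest(pattern, replacement, text):
    """Return (matched text, suggestion) pairs; the casing logic is assumed."""
    results = []
    for m in re.finditer(pattern, text):
        suggestion = m.expand(replacement)
        # Assumed behaviour: a capitalized match gets a capitalized suggestion.
        if m.group(0)[:1].isupper():
            suggestion = suggestion[:1].upper() + suggestion[1:]
        results.append((m.group(0), suggestion))
    return results


print(suggest(r"(?i)\bfoo bar\b", "foo-bar", "Foo bar and foo bar"))
# [('Foo bar', 'Foo-bar'), ('foo bar', 'foo-bar')]
print(suggest(r"(?i)\b(Foo) (Bar)\b", r"\1-\2", "FOO BAR"))
# [('FOO BAR', 'FOO-BAR')]
```

The second call shows why grouping helps: the back references copy the
matched words, so their original casing is kept.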

== Multiple suggestions ==

Use \n (new line) in the replacement text to add multiple suggestions:

foo -> Foo\nFOO\nBar\nBAR       # Did you mean:

(Foo, FOO, Bar and BAR suggestions for the input word "foo")
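
In Python terms, the \n in the replacement becomes a real new line when
the replacement is expanded, and the result can then be split into single
suggestions. A sketch with a hypothetical helper, suggestions_for:

```python
import re


def suggestions_for(match, replacement):
    """Expand the replacement for one match and split it into suggestions."""
    # expand() processes the replacement like re.sub() does, so the
    # two-character escape \n becomes a real new line character.
    return match.expand(replacement).split("\n")


m = re.search(r"foo", "a foo here")
print(suggestions_for(m, r"Foo\nFOO\nBar\nBAR"))  # ['Foo', 'FOO', 'Bar', 'BAR']
```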

== Tests ==

It is recommended to add tests for the rules with the TEST keyword:

foo([xy]) -> bar\1              # Did you mean:
TEST: foox -> barx

The rule precompiler will check the matching and suggestions
of the TESTs.
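
The idea behind such a check can be sketched as follows. This is not the
precompiler's code: run_test is a hypothetical helper that handles only
the simplest case, a single suggestion compared against the whole
rewritten input.

```python
import re


def run_test(pattern, replacement, test_line):
    """Check one 'TEST: input -> expected' line against a rule (simplified)."""
    body = test_line[len("TEST:"):]
    source, expected = (part.strip() for part in body.split("->", 1))
    return re.sub(pattern, replacement, source) == expected


print(run_test(r"\bfoo bar\b", "foo-bar", "TEST: a foo bar -> a foo-bar"))  # True
print(run_test(r"\bfoo bar\b", "foo-bar", "TEST: a foo bar -> a foobar"))   # False
```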

== Name definitions ==

Lightproof supports name definitions to simplify the
description of the complex rules.

Definition:

name pattern                    # name definition

Usage in the rules:

"{name} " -> "{name}. "         # Missing dot?

{Name}s in the first part of the rules mean
subpatterns (groups). {Name}s in the second
part of the rules mean back references to the
matched texts of the subpatterns.

Example: thousand markers (10000 -> 10,000 or 10 000)

# definitions
d \d\d\d        # name definition: 3 digits
d2 \d\d         # 2 digits
D \d{1,3}       # 1, 2 or 3 digits

# rules
# ISO thousand marker: space, here: no-break space (U+00A0)
\b{d2}{d}\b -> {d2},{d}\n{d2} {d}               # Use thousand marker (common or ISO).
\b{D}{d}{d}\b -> {D},{d},{d}\n{D} {d} {d}       # Use thousand markers (common or ISO).
TEST: 123456789 -> 123,456,789\n123 456 789

Note: Lightproof uses named groups for name definitions and
their references, adding a hidden number to the group names
in the form of "_n". You can use these explicit names in the replacement:

\b{d2}{d}\b -> {d2_1},{d_1}\n{d2_1} {d_1}       # Use thousand marker (common or ISO).
\b{D}{d}{d}\b -> {D_1},{d_1},{d_2}\n{D_1} {d_1} {d_2} # Use thousand markers (common or ISO).

Note: the numbering of back references of name definitions restarts
after each new line character (\n); see the rules above and the
following example:

E ( |$)                         # name definition: space or end of sentence
"\b[.][.]{E}" -> .{E}\n…{E}     # Period or ellipsis?

See data/template.dat for more examples.
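
The name definition mechanism can be imitated with Python named groups.
The sketch below is an assumption about the scheme, not Lightproof's
actual code: expand_pattern and expand_replacement are hypothetical
helpers, each {name} in the pattern becomes a group name_1, name_2, ...,
and the counting in the replacement restarts after every \n.

```python
import re

NAME = re.compile(r"\{(\w+)\}")


def expand_pattern(pattern, definitions):
    """Replace each {name} by a named group name_1, name_2, ... (assumed scheme)."""
    counts = {}

    def group(m):
        name = m.group(1)
        if name not in definitions:
            return m.group(0)           # e.g. a quantifier such as {3}
        counts[name] = counts.get(name, 0) + 1
        return "(?P<%s_%d>%s)" % (name, counts[name], definitions[name])

    return NAME.sub(group, pattern)


def expand_replacement(replacement, definitions):
    """Replace {name} or {name_n} by back references; restart counting after \\n."""
    parts = []
    for suggestion in replacement.split(r"\n"):
        counts = {}

        def ref(m):
            name = m.group(1)
            if name in definitions:     # implicit numbering
                counts[name] = counts.get(name, 0) + 1
                name = "%s_%d" % (name, counts[name])
            return r"\g<%s>" % name     # explicit names are used as written

        parts.append(NAME.sub(ref, suggestion))
    return r"\n".join(parts)


defs = {"d": r"\d\d\d", "d2": r"\d\d", "D": r"\d{1,3}"}
pattern = expand_pattern(r"\b{D}{d}{d}\b", defs)
replacement = expand_replacement(r"{D},{d},{d}\n{D} {d} {d}", defs)
print(re.sub(pattern, replacement, "123456789").split("\n"))
# ['123,456,789', '123 456 789']
```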

-------------------- data/en_US.dat -----------------

# Sample proofreading rules for English

# punctuation

" ([.?!,:;)”—]($| ))" -> \1     # Extra space before punctuation.
"([(“—]) " -> \1                # Extra space after punctuation.

"^[-—] " -> "– "                # Hyphen instead of n-dash.
" [-—]([ ,;])" -> " –\1"        # Hyphen instead of n-dash.

TEST: ( item ) -> (item)
TEST: A small - but reliable - example. -> A small – but reliable – example.

# definitions
abc [a-z]+
ABC [A-Z]+
Abc [a-zA-Z]+
punct [?!,:;%‰‱˚“”‘]

{Abc}{punct}{Abc} -> {Abc}{punct} {Abc} # Missing space?
{abc}[.]{ABC} -> {abc}. {ABC}           # Missing space?
TEST: missing,space -> missing, space
TEST: missing.Space -> missing. Space

(\d+)x(\d+) -> \1×\2 # Multiplication sign.
TEST: 800x600 -> 800×600

# typography
"[.]{3}" -> "…"                 # Three dot character.

(^|\b|{punct}|[.]) {2,3}\b -> "\1 " # Extra space.
TEST: Extra  space -> Extra space
TEST: End... -> End…

# quotation

\"(\w[^\"“”]*[\w.?!,])\" -> “\1”        # Quotation marks.
\B'(\w[^']*[\w.?!,])'\B -> ‘\1’         # Quotation marks.
TEST: "The 'old' boy." -> “The ‘old’ boy.”

# apostrophe

w \w*
(?i){Abc}'{w} -> {Abc}’{w}      # Apostrophe.
TEST: o'clock -> o’clock
TEST: singers' voices -> singers’ voices

# words

# frequent mistakes

# silent h
(?i)\ba (honest(y|ly)?|hour(ly|glass)?|honou?r(abl[ey]|ed|ing|ifics?|s)|heir(less|loom)?)\b -> an \1 # Did you mean:
TEST: A heirloom -> An heirloom

# possessive pronouns
(?i)\b(your|her|our|their)['’]s\b -> \1s # Did you mean:
TEST: Your's -> Yours

# duplicates
\b(and|or|for)\b \1 -> \1 # Did you mean:

(?i)\bcomprises of\b -> comprises # Did you mean:

# rare words (potential errors)

# multiword expressions

\bscot free\b -> scot-free\nscotfree # Did you mean:
TEST: scot free -> scot-free\nscotfree # Suggestions separated by new lines (\n)

(?i)\bying and yang\b -> yin and yang # Did you mean:

# accept foreign words only in multiword expressions

(?i)\bde(?! (facto|jure))\b -> de facto\nde jure # Missing Latin expression?
TEST: de standard -> de facto\nde jure standard

# formats

# Thousand separators: 10000 -> 10,000  (common) or 10 000 (ISO standard)

# definitions
d       \d\d\d          # name definition: 3 digits
d2      \d\d            # 2 digits
D       \d|\d\d|\d\d\d  # 1, 2 or 3 digits

# ISO thousand separators: space, here: no-break space (U+00A0)
\b{d2}{d}\b      -> {d2},{d}\n{d2} {d}                  # Use thousand separators (common or ISO).
\b{D}{d}{d}\b    -> {D},{d},{d}\n{D} {d} {d}            # Use thousand separators (common or ISO).
\b{D}{d}{d}{d}\b -> {D},{d},{d},{d}\n{D} {d} {d} {d}    # Use thousand separators (common or ISO).
TEST: 1234567890 -> 1,234,567,890\n1 234 567 890
