Hi Olivier,

Great work! Now it's possible to write a full rule converter for the existing LanguageTool modules. I will add your library to the lightproof module with an example/test.

By the way, I have ported the lightproof modules to Python 3.3, but there seems to be a registration issue with bundled dictionary packages that contain Lightproof components. Unfortunately, I couldn't test it yesterday because the daily build had a missing-library problem on Ubuntu, so I will write again soon.

Best regards,
László

2012/12/4 Olivier R. <olivier.nore...@gmail.com>

> My connection ended while posting. Here is the full post:
>
>
> Hello everyone,
>
> ## Build indexable binary grammatically tagged dictionaries for Lightproof/Grammalecte ##
>
> The most important limitation for building a grammar checker with
> Lightproof was the lack of grammatically tagged dictionaries. Most
> Hunspell dictionaries, which Lightproof can handle via LibreOffice-UNO,
> are not grammatically tagged and cannot help to retrieve morphological
> information about words.
>
> LanguageTool does not have this problem, since it uses binary indexable
> dictionaries built from huge grammatically tagged lexicons with a
> finite-state automaton (FSA) tool
> (http://www.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciuk/personal/fsa.html)
> written in C. Java has a dedicated library to read these binary files.
>
> But we had nothing like this in Python.
> So I tried to understand how this FSA software in C works, but as I am
> not a C expert and as I was reluctant to depend on yet another piece of
> software, I finally decided to write my own FSA tool to build such
> indexable binary dictionaries.
>
> Why build such dictionaries, you may ask? Because lexicons that contain
> words, lemmas and morphological tags are HUGE (up to several megabytes),
> they are not indexable as is, and making them indexable uses much more
> memory. So the goal is to make them small, compressed, quick to load and
> parse, light on memory, indexable, and readable without having to
> decompress them.
>
> That’s what I did with Python 3.3.
>
> I took all lexicons from LanguageTool and compressed them into binary
> indexable dictionaries readable with my own script.
> The resulting dictionaries are not as small as the ones made with the C
> FSA tool used by LT, but it’s close enough and there is still room for
> improvement. I’ll work on this later.
>
> Here are the results:
>
>
> These dictionaries are about 5-30 % bigger than the LT ones (and,
> surprisingly, sometimes half the size), but in any case they are
> perfectly usable as is.
>
> Consequences:
> - it will be possible to use all existing LT lexicons with Lightproof,
> - we will be able to make a stand-alone version of Lightproof/Grammalecte,
>   as it won’t be necessary to use Hunspell anymore,
> - we will be able to write automated tests and prevent regressions when
>   writing/modifying rules.
>
>
> # Lexicons
>
> Lexicons are simple text documents listing all inflected forms, their
> stems and their morphological tags:
>
>
>
> Fields are separated by a tab character.
>
> With the new tool, lexicons MUST be UTF-8 encoded to be properly converted.
>
>
> # Want to test it?
>
> The code is written in Python 3.3. License: MPL 2.
>
> Two files:
> - fsa_builder.py reads all files listed in "_lexicons.list.txt" and
>   builds binary dictionaries with a specific stemming command.
> - fsa_reader.py reads all files whose name is "[lang].bdic", and if it
>   finds a test file named "[lang].test.txt", writes the results found for
>   each word to a new file.
>
> The builder with uncompressed LT lexicons encoded in UTF-8:
> http://dicollecte.free.fr/download/fsa1/pyFSA_builder.7z [130 MB]
>
> Type:
>
>
>
> And let it run. Warning: building dictionaries is slow, as lexicons are
> huge. For most languages it takes 1 or 2 minutes each. But for German,
> Polish, Galician, Russian and Czech, it took 5 to 10 minutes each, and it
> consumes a huge amount of memory. Czech uses up to 6 GB! You have been
> warned. :)
>
> The dictionary reader with binary dictionaries and test files:
> http://dicollecte.free.fr/download/fsa1/pyFSA_reader.7z [11 MB]
>
> Type:
>
>
>
> Let it run. Count to 1 (or 2 if you have a slow computer). And it’s
> already finished. :)
> It has read all binary dictionaries, read the test files, and written the
> results to other files.
>
> I’ll try to write a more complete web page about this when I have the
> time. I still have to compress it better, for those who might think it’s
> not enough.
>
>
> Regards,
> Olivier R.
>
>
>
> --
> View this message in context:
> http://nabble.documentfoundation.org/Grammar-checking-Using-LanguageTool-lexicons-with-Lightproof-new-possible-tp4022489p4022495.html
> Sent from the Dev mailing list archive at Nabble.com.
> _______________________________________________
> LibreOffice mailing list
> LibreOffice@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/libreoffice
>
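
For readers who want to experiment with the lexicon format described in the quoted post (one entry per line, three tab-separated fields, UTF-8 encoded), here is a minimal Python sketch. It is only an illustration and does not reproduce fsa_builder.py or the binary .bdic format: the function names (parse_lexicon, build_index, load_lexicon) are hypothetical, the sample entries and their tag strings are made up, and the field order (inflected form, stem, tags) is assumed from the description in the post. It builds a plain in-memory dict, not a finite-state automaton.

```python
import io
from collections import defaultdict

# Hypothetical sample data, NOT taken from a real LanguageTool lexicon.
# Assumed field order: inflected form <TAB> stem <TAB> morphological tags.
SAMPLE_LEXICON = (
    "chats\tchat\tnom:mas:pl\n"
    "chat\tchat\tnom:mas:sg\n"
    "mange\tmanger\tverbe:ind:pres:1sg\n"
    "mange\tmanger\tverbe:ind:pres:3sg\n"
)


def parse_lexicon(lines):
    """Yield (form, stem, tags) tuples from tab-separated lexicon lines."""
    for lineno, line in enumerate(lines, 1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                "line %d: expected 3 tab-separated fields, got %d"
                % (lineno, len(fields))
            )
        yield tuple(fields)


def build_index(entries):
    """Map each inflected form to the list of its (stem, tags) readings."""
    index = defaultdict(list)
    for form, stem, tags in entries:
        index[form].append((stem, tags))
    return dict(index)


def load_lexicon(path):
    """Read a lexicon file, decoding it explicitly as UTF-8."""
    with open(path, encoding="utf-8") as f:
        return build_index(parse_lexicon(f))


# Usage with the in-memory sample instead of a real file.
index = build_index(parse_lexicon(io.StringIO(SAMPLE_LEXICON)))
for stem, tags in index["mange"]:
    print("mange", stem, tags)
```

A dict like this is easy to query but keeps every string in memory, which is the cost the compressed, indexable binary dictionaries in the post are meant to avoid.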