Hi Steven,

On Wed, 2011-01-26 at 15:17 +1000, Steven Butler wrote:
> > > One idea, can we generate thesaurus idx file during install? That may
> > > solve few megabytes.
..
> I have had an attempt at this - code attached, it is dual licensed under
> LGPL / MIT although there are no (c) headers in the file (feel free to add
> some).

Wow - great work :-) I've just pushed this to dictionaries/source in
master, and compiled it there. Still need some tweaks to get it called in
the various dictionaries/ makefiles I suppose - but it is a great start,
thanks!

Licensing wise - I'd like to add the standard LGPLv3+/MPL header to it
(see bootstrap/) but having MIT too is fine if you want. I was going to add
it as an easy hack, but you beat me to it :-)

> I have no idea how this would be integrated into the build process as I'm
> not even sure where it is called from, but happy if someone wants to
> take up the challenge and/or incorporate it as an installer process.

So - the installer process is more exciting on Windows I think - we'll
need to see how the setup_native/ tools are called and be inspired by that
I think.

> Here's timing of the CPP version on a Core i5 amd64 generating the
> following indices:
..
> The same set of files using th_gen_idx.pl took around 5 seconds (although
> some basic fixups got it done to 3.5 seconds).

Great - it's trivial; indeed - it rather makes you wonder whether we
need the indexes at all? [ I wonder what they are good for, and/or what
code loads and uses them ;-]. We may discover that in fact there is no need
for them to be indexed - any chance of a dig around?

> What I have noticed while testing the change was that a lot of the
> dictionaries I processed have errors.

Nasty.

> These range from having the entry count incorrect, causing the index
> process to miss a word (lots of these in some dictionaries), to having
> words apparently duplicated either as the next entry, or sometimes a long
> way apart.

That is bad; we should mail the l10n list to ask them to have a look I
suppose.

> I have not attempted to fix these dictionary issues, but if they are
> serious it might be worth having a perl script that is able to validate
> the dictionaries are internally consistent.
> Unfortunately, it would have to use heuristics as the file format makes
> it difficult to tell in general what kind of line is being processed.

Right; we should validate them as we compile the index perhaps - or at
least, look at the parser and see how it has traditionally interpreted them.

> The CPP version attached has a difference from the perl script in that
> when multiple entries are found, they appear to be coming out in reverse
> order to the original perl script. What I'm curious about is what impact
> having multiple entries for a word has when loaded into libreoffice?

Me too ;-)

> For reference I have attached an improved perl version of the perl script
> that runs a couple of seconds faster than the original. I had three to
> four versions in my tree but changing none of them triggered a git diff to
> show the changes so I've attached the full copy.

The native code thing is great; it'd be wonderful if you had some time
to look at hooking it into the build process in dictionaries/ (?)

Thanks muchly!

Michael.

--
michael.me...@novell.com <><, Pseudo Engineer, itinerant idiot
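
For illustration only, here is a rough Python sketch of the "validate them as we compile the index" idea from the thread. It is not the attached C++ or perl code. It assumes the thesaurus .dat layout as I understand it: an encoding line first, then for each entry a `word|count` header followed by `count` meaning lines. It also assumes the index is the encoding line, the number of entries, and sorted `word|byte-offset` lines. The names `build_index` and `looks_like_header` are made up for this sketch, and the header test is a heuristic, as Steven notes it has to be.

```python
import io


def looks_like_header(line: bytes) -> bool:
    """Heuristic: an entry header has the shape b'word|<digits>'."""
    word, sep, count = line.rstrip(b"\r\n").rpartition(b"|")
    return bool(sep) and bool(word) and b"|" not in word and count.isdigit()


def build_index(dat: bytes):
    """Index a thesaurus .dat payload and collect consistency problems.

    Returns (index_lines, problems); problems holds (line number, message).
    The file layout is an assumption, see the note above.
    """
    stream = io.BytesIO(dat)
    encoding = stream.readline().rstrip(b"\r\n")
    index = []        # (word, byte offset of its header line)
    problems = []
    first_seen = {}   # word -> line number of its first header
    lineno = 1        # number of lines consumed so far

    while True:
        offset = stream.tell()
        header = stream.readline()
        if not header:
            break
        lineno += 1
        if not looks_like_header(header):
            # Usually a leftover meaning line: the previous count was too low.
            problems.append((lineno, "expected 'word|count' header"))
            continue
        word, _, count = header.rstrip(b"\r\n").rpartition(b"|")
        if word in first_seen:
            problems.append(
                (lineno, "duplicate of entry at line %d" % first_seen[word]))
        else:
            first_seen[word] = lineno
        index.append((word, offset))

        for _ in range(int(count)):
            pos = stream.tell()
            meaning = stream.readline()
            if not meaning:
                problems.append((lineno, "file ends before all meanings"))
                break
            if looks_like_header(meaning):
                # The declared count is probably too high. Rewind so the
                # entry that would have been swallowed still gets indexed.
                problems.append(
                    (lineno + 1, "header found where a meaning was expected"))
                stream.seek(pos)
                break
            lineno += 1

    lines = [encoding, str(len(index)).encode("ascii")]
    lines += [w + b"|" + str(o).encode("ascii") for w, o in sorted(index)]
    return lines, problems
```

Calling `build_index(open(path, "rb").read())` on a dictionary would give both the index lines and a list of suspicious line numbers to send on to the l10n list. Rewinding when the count looks too high is a design choice in this sketch: a plain index pass would skip the following word instead, which is the "miss a word" failure described above. Entries are sorted by raw bytes, which may not match what the real loader expects.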