[dev] Dictionaries for spell checking etc... (was: Re: [dev] Where our products install to)

Thomas Lange - Sun Germany - ham02 - Hamburg Mon, 11 Feb 2008 07:39:34 -0800

Hi all,

@Caolan, Petr:
I have cross-posted this answer to lingucomponent.dev as well. Since it
is about lingucomponent issues, it would be nice to continue the
discussion there.

@lingucomponent readers:
This mail is a reply to a posting on the openoffice.dev list.

> On Fri, 2008-02-08 at 14:05 +0100, Petr Mladek wrote:
>>
>> I think that the best solution would be to get rid of share/dict/ooo and
>> look for the dictionaries into a common place, for example
>> /usr/share/myspell.
>> 
>> It would be nice get rid of share/dict/ooo/dictionary.lst. The dictionaries 
>> have well defined names. It is possible to create symlinks for compatible 
>> languages, ... Well, there might be problems with symlinks on Windows but it 
>> would be very useful on Linux.
> 
> Specifically wrt dictionaries, as you probably know that's precisely
> what we do on fedora where we've done away with dictionary.lst (well it
> still works if you want to use it) and just auto-detect them and the
> language/locale they service based on their names and add looking in a
> system /usr/share/myspell location as well the shared OOo one and then
> the per-user one.
> 
> 
> If there's any interest in it, then I could try and perhaps upstream
> this work and co-opt the existing --without-myspell-dicts or whatever
> its called into a sort of --with-system-dicts=LOCATION and bind the code
> off that, or something of that nature.

It seems you guys at Fedora already have your own way of getting rid of
the dictionary.lst.

Since we are currently working on the same thing, I'd like to briefly
describe what we are doing. From what I have understood so far, our
concept is different, but it should be possible to use both
concurrently, at least once we sort out the precedence issues that
arise when dictionaries for the same language and purpose are installed
in various places and identified by various means.

Our planned, and for the most part already implemented, idea is to allow
dictionaries to be installed/distributed as extensions. Thus our
approach needs several new configuration entries.
BTW, with OOo 3.0 we want to get rid of the way those things currently
work in OOo.
In the meantime, once my CWS tl41 is finished and integrated, the old
and the new behaviour will both work for a while. For OOo 3.0 a proper
migration from the old way of working to the new one using configuration
entries is planned. After that the old code should be removed.

Now on to what we currently do or did in the CWS:
- The path settings for 'Linguistic' and 'Dictionary' have been
  changed to be multi-paths.
  The new 'Dictionary' path is now dedicated to the personal
  user-dictionaries, as it always should have been.
  The 'Linguistic' path is for data etc. that is to be used
  and found by an actual spell checker, hyphenator, ... implementation.

Thus those configuration settings will soon look like this:

        <node oor:name="Linguistic" oor:op="fuse" oor:mandatory="true">
            <node oor:name="InternalPaths">
                <node oor:name="$(insturl)/share/dict" oor:op="fuse"/>
                <node oor:name="$(insturl)/share/dict/ooo" oor:op="fuse"/>
            </node>
            <prop oor:name="UserPaths">
                <value>$(userurl)/wordbook</value>
            </prop>
        </node>

        <node oor:name="Dictionary" oor:op="fuse" oor:mandatory="true">
            <node oor:name="InternalPaths">
                <node oor:name="$(insturl)/share/wordbook/$(vlang)"
                      oor:op="fuse"/>
            </node>
            <prop oor:name="WritePath">
                <value>$(userurl)/wordbook</value>
            </prop>
        </node>

As you can see, the 'Linguistic' path covers all places where linguistic
data files might previously have been installed.
The 'UserPaths' entry is actually a string list and thus can also hold
more than one path.
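
To illustrate how such a multi-path could be turned into an ordered list
of directories to search, here is a minimal Python sketch. It is not OOo
code: the function names, the values given for $(insturl) and
$(userurl), and the choice to list internal paths before user paths are
all assumptions made for the example.

```python
# Hypothetical sketch, not OOo code: expands $(name) placeholders in a
# multi-path setting and returns the directories in search order.

def expand(path: str, variables: dict[str, str]) -> str:
    """Replace each $(name) placeholder in path with its value."""
    for name, value in variables.items():
        path = path.replace(f"$({name})", value)
    return path


def search_paths(internal: list[str], user: list[str],
                 variables: dict[str, str]) -> list[str]:
    """Return expanded internal paths, then user paths, without duplicates.

    The internal-before-user order is an assumption of this sketch.
    """
    result: list[str] = []
    for raw in internal + user:
        expanded = expand(raw, variables)
        if expanded not in result:
            result.append(expanded)
    return result


# Made-up installation and user directories, for illustration only.
variables = {"insturl": "/opt/ooo", "userurl": "/home/me/.ooo"}

linguistic = search_paths(
    ["$(insturl)/share/dict", "$(insturl)/share/dict/ooo"],  # InternalPaths
    ["$(userurl)/wordbook"],                                 # UserPaths
    variables,
)
print(linguistic)
```

Because 'UserPaths' is a list, a further directory could simply be
appended to the second argument.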

The next thing we did is:
- Spell checkers, hyphenators, ... need to make configuration entries
  that describe what types of dictionary they may make use of.

Such an entry will look like this:

    <node oor:name="SpellCheckers">
        <node oor:name="org.openoffice.lingu.MySpellSpellChecker"
              oor:op="fuse">
            <prop oor:name="SupportedDictionaryFormats"
                  oor:type="oor:string-list">
                <value>DICT_SPELL MySpell_old</value>
            </prop>
        </node>
    </node>

The component has to specify its implementation name and a list of
dictionary formats it may make use of.
We don't yet have implementations that make use of more than one format
at the same time, but we want to be flexible and future-safe with our
new configuration entries.
For example, in the future we could have a dictionary format named
        DICT_SPELL_EXCEPT
that is used to identify exception dictionaries, something that Hunspell
currently does not implement but hopefully will at some point.
Then it would be normal to support the two formats
        DICT_SPELL and DICT_SPELL_EXCEPT
at the same time.

On the other side we now have the new entries for dictionaries:
- Dictionaries need to make entries in the configuration that
  state what they are to be used for.

It may look like this:

    <node oor:name="Dictionaries">
        <node oor:name="HunSpellDic_de_CH" oor:op="fuse">
            <prop oor:name="Locations" oor:type="oor:string-list">
                <value>%origin%/dictionaries/de_CH.aff
                       %origin%/dictionaries/de_CH.dic</value>
            </prop>
            <prop oor:name="Format" oor:type="xs:string">
                <value>DICT_SPELL</value>
            </prop>
            <prop oor:name="Locales" oor:type="oor:string-list">
                <value>de-CH</value>
            </prop>
        </node>
        <node oor:name="HunSpellDic_en_US" oor:op="fuse">
            <prop oor:name="Locations" oor:type="oor:string-list">
                <value>%origin%/dictionaries/en_US.aff
                       %origin%/dictionaries/en_US.dic</value>
            </prop>
            <prop oor:name="Format" oor:type="xs:string">
                <value>DICT_SPELL</value>
            </prop>
            <prop oor:name="Locales" oor:type="oor:string-list">
                <value>en-US</value>
            </prop>
        </node>
    </node>

In particular, this makes it easy to use the very same dictionary for
more than one language, since 'Locales' is a list. A dictionary can,
however, only have one single format. The 'Locations' entry specifies
where to find the files.
What the entry has to look like may depend on the actual spell checker
implementation though. It might not be necessary to list all files
needed, but it is probably safer to do so for the odd case that more
than one spell checker implementation is going to use the same
dictionary.

You may have noticed by now that there is no direct connection
from the dictionaries to the spell checker.
The only link is the indirect connection via the format name.

Thus code has already been added to SvtlinguConfig that a spell checker
can use to get the list of all dictionaries (within all paths listed
for the 'Linguistic' path) that implement a specific format. By
calling the respective function the spell checker immediately gets
the list of dictionaries it can make use of.
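
The indirect link via the format name can be sketched roughly as
follows. This is Python for illustration only, not the actual
SvtlinguConfig interface: the class and function names are made up, and
the data simply mirrors the example configuration entries shown earlier
in this mail.

```python
# Hypothetical sketch of matching dictionaries to a spell checker purely
# by format name. None of these names are real OOo API.
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryEntry:
    name: str
    locations: tuple[str, ...]
    format: str            # exactly one format per dictionary
    locales: tuple[str, ...]


# Mirrors the two 'Dictionaries' configuration entries from this mail.
DICTIONARIES = [
    DictionaryEntry(
        "HunSpellDic_de_CH",
        ("%origin%/dictionaries/de_CH.aff", "%origin%/dictionaries/de_CH.dic"),
        "DICT_SPELL",
        ("de-CH",),
    ),
    DictionaryEntry(
        "HunSpellDic_en_US",
        ("%origin%/dictionaries/en_US.aff", "%origin%/dictionaries/en_US.dic"),
        "DICT_SPELL",
        ("en-US",),
    ),
]

# Mirrors the 'SpellCheckers' entry: implementation name -> formats.
SPELL_CHECKERS = {
    "org.openoffice.lingu.MySpellSpellChecker": ("DICT_SPELL", "MySpell_old"),
}


def dictionaries_for_format(fmt, entries):
    """Return all dictionary entries that declare the given format."""
    return [entry for entry in entries if entry.format == fmt]


def dictionaries_for_checker(impl_name):
    """Collect the dictionaries for every format the checker supports."""
    found = []
    for fmt in SPELL_CHECKERS[impl_name]:
        found.extend(dictionaries_for_format(fmt, DICTIONARIES))
    return found


usable = dictionaries_for_checker("org.openoffice.lingu.MySpellSpellChecker")
print([entry.name for entry in usable])
```

Neither side names the other: a new dictionary becomes visible to the
spell checker as soon as its 'Format' matches one of the supported
formats.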

Thus I think Caolan's approach to auto-detect the available dictionaries
can easily be joined with our planned setup. There are only a limited
number of things to take care of:
- The paths in which to auto-detect installed dictionaries need to be
  added to the list of 'Linguistic' paths.
- We should not mess up a single path with different content, as was
  already done in */user/wordbook, where originally only the personal
  dictionaries belonged and later on the downloaded dictionaries
  for the linguistic were placed as well.
  Even worse, dictionaries with different content then had the
  same extension and were placed into the same directory. *ouch*
- We need to define an order of precedence in case a dictionary (or
  rather different versions of it) is installed in different places.
  Only one should be used...
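
One possible way to express such a precedence rule is sketched below in
Python. It is only an illustration: the function name, the directories
and the order chosen (per-user first, then the shared OOo location, then
the system location) are assumptions, since the actual order is exactly
what still needs to be agreed upon.

```python
# Hypothetical sketch: keep only one dictionary per (format, locale),
# taken from the directory that comes first in the precedence order.

def pick_by_precedence(found, path_order):
    """found: (directory, format, locale, filename) tuples.
    path_order: directories, highest precedence first.
    Every directory in found must also appear in path_order.
    """
    rank = {directory: index for index, directory in enumerate(path_order)}
    chosen = {}
    # Visit candidates from highest to lowest precedence and keep the
    # first one seen for each (format, locale) pair.
    for directory, fmt, locale, filename in sorted(
            found, key=lambda item: rank[item[0]]):
        chosen.setdefault((fmt, locale), (directory, filename))
    return chosen


# Made-up directories; the order itself is an assumption of this sketch.
path_order = [
    "/home/me/.ooo/dict",        # per-user
    "/opt/ooo/share/dict/ooo",   # shared OOo
    "/usr/share/myspell",        # system
]

found = [
    ("/usr/share/myspell", "DICT_SPELL", "en-US", "en_US.dic"),
    ("/home/me/.ooo/dict", "DICT_SPELL", "en-US", "en_US.dic"),
    ("/opt/ooo/share/dict/ooo", "DICT_SPELL", "de-CH", "de_CH.dic"),
]

winners = pick_by_precedence(found, path_order)
for key, value in winners.items():
    print(key, "->", value)
```

With this order the per-user copy of the en-US dictionary shadows the
system-wide one; reversing path_order would give the opposite policy.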

So what do you think Caolan?
Can both of our solutions be joined? I think it should easily be possible.

There is one thing I wonder about though:
When you auto-detect those dictionaries, aren't you indirectly making use
of the code that maintains the 'DataFilesChangedCheckValue' value in the
configuration (and of course the old/current linguistic entries)?
We actually wanted to remove those code parts after OOo 3.0, since with
configuration entries we no longer need to check what files are actually
installed on the hard disk. We wanted to save even the time occasionally
required to scan for dictionaries...

I think I have not missed any important change on our side.
Thus I'm eager to hear the thoughts of both of you about joining forces
here.

Thomas
