Re: switching from Hunspell to Morfologik

Dawid Weiss Mon, 13 Oct 2014 11:30:44 -0700

This is a valid FSA file, but not a valid encoding for the dictionary
you're trying to dump, Jan. That's why you're getting an exception.
For example this entry:


AAA+I

With the SUFFIX encoder (which your .info file implicitly picks), the
"I" after the separator is read as a trim count: truncate 8 bytes from
the three-byte sequence "AAA", which is clearly wrong. It seems to me
that you have data that shouldn't be encoded with anything (and
isn't) -- perhaps the LT colleagues can follow up on this one. The
wiki page at:

http://wiki.languagetool.org/hunspell-support

indeed should clarify the encoder property for the associated .info file as:

fsa.dict.encoder=NONE

If you comment out these obsolete properties in your .info file:

#fsa.dict.uses-prefixes=false
#fsa.dict.uses-infixes=false

and add the above one, the dictionary dumps just fine. In any case,
you can always dump *any* FSA dictionary without applying the decoding
routines; just use:

java -jar morfologik-tools-1.10.0-SNAPSHOT-standalone.jar fsa_dump -d <dict> --raw-data

If you do want to "decode" the data, pass an additional "-x", although
if the underlying data doesn't make sense, exceptions may occur (no
runtime checks are done to verify sanity for performance reasons).
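
To illustrate why the AAA+I entry cannot be decoded with the suffix
encoder, here is a minimal Python sketch (not Morfologik's actual
code). It assumes "+" is the separator and that the first byte after
it, minus 'A', gives the number of bytes to trim from the end of the
word, with any remaining bytes appended as the new suffix. The names
split_entry and decode_suffix are hypothetical, and the explicit
length check is added here only to make the failure visible.

```python
def split_entry(entry: bytes, separator: bytes = b"+"):
    """Split a raw entry into (word, encoded part) at the first separator."""
    word, _, encoded = entry.partition(separator)
    return word, encoded


def decode_suffix(word: bytes, encoded: bytes) -> bytes:
    """Apply a suffix-style code to `word` (illustrative assumption).

    The first byte K means: trim (K - ord('A')) bytes from the end of
    `word`; the remaining bytes of `encoded` are then appended.
    """
    if not encoded:
        raise ValueError("empty encoded part")
    trim = encoded[0] - ord("A")
    # Sanity check added for this sketch only.
    if trim < 0 or trim > len(word):
        raise ValueError(
            f"cannot trim {trim} bytes from {len(word)}-byte word {word!r}"
        )
    return word[: len(word) - trim] + encoded[1:]


if __name__ == "__main__":
    word, encoded = split_entry(b"AAA+I")
    try:
        decode_suffix(word, encoded)
    except ValueError as err:
        # 'I' - 'A' is 8, but the word is only 3 bytes long.
        print("decode failed:", err)
```

With data that really is suffix-encoded, the same function behaves
sensibly: a made-up entry such as walking+D would trim 3 bytes and
yield "walk".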

Dawid

On Mon, Oct 13, 2014 at 3:59 PM, Jan Schreiber
<jan.schrei...@languagetool.org> wrote:
> In case anyone's interested in the exported plain text file, it is here:
> http://sourceforge.net/projects/germandict/files/Morfologik/de_frequency.7z
>
> I sorted the words by frequency class and additionally sorted the
> largest "A" class of least frequent words by word length.
>
> The frequency distribution for the first 200,000 words looks fairly
> plausible, but the vast majority (about 1.4 million word forms) is
> lumped together in one huge class.
>
> Ruud, you said you have larger frequency data sets available for most of
> the languages. If you happen to have data for German available I would
> love to have it, ideally in the gaia format so I don't have to hassle
> with converting it. But a tab-separated list or something like that
> would also be great.
>
> --Jan
>
> On 12.10.2014 18:18, Jan Schreiber wrote:
>> I figured out how to dump the dictionary. All I had to do was create a
>> hunspell subfolder and move the binary dictionary into it, then the
>> exporting process worked as advertised.
>
> _______________________________________________
> Languagetool-devel mailing list
> Languagetool-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
