[
https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052421#comment-13052421
]
Dawid Weiss commented on LUCENE-2341:
-------------------------------------
I did some analyses on both dictionaries.
{noformat}
Number of lines (distinct surface forms):
3.662.366 morfologik.utf8
5.086.141 sgjp.utf8
Distinct words (not in both):
2.729.334 unique.utf8
- upper/lower case (morfologik has upper case forms, morfeusz only lower case
surface forms)
acerze
Acerze
- very rare or jargon;
abszminka
abszytowałem
acetobakteria
acetarsolowi
niebombiasto
hakatystce
hakatystycznościach
warzże
- differences in spelling;
abelard
abélard
- acronyms and super-short stuff
aap
aar
Distinct normalized (lowercase):
2.564.366 lowered.utf8
Most of these are very infrequent words or inflected forms. Each dictionary
has minor differences from the other or is missing some surface forms, as
shown here (mz - morfeusz, mk - morfologik):
mz> hakersko
mz> hakerskość
mz> hakerskości
mz> hakerskością
mz> hakerskościach
mz> hakerskościami
mz> hakerskościom
mk> hakerstw
mk> hakerstwa
...
mk> hakowałyśmy
mk> hakowań
mk> hakowaniach
mk> hakowaniami
mk> hakowaniom
mz> hakowatość
mz> hakowatości
mz> hakowatością
mz> hakowatościach
mz> hakowatościami
mz> hakowatościom
{noformat}
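The comment does not say which commands produced the counts above, so the following is only a sketch of one plausible way to get them, using set operations on the two word lists. The function names and the tiny in-memory samples are illustrative assumptions; in practice the inputs would be the lines of morfologik.utf8 and sgjp.utf8.

```python
def load_forms(lines):
    """Return the set of distinct, non-empty surface forms (one per line)."""
    return {line.strip() for line in lines if line.strip()}


def compare(forms_a, forms_b):
    """Return (forms not in both, the same after lowercasing both sides)."""
    # Symmetric difference: forms present in exactly one dictionary.
    not_in_both = forms_a ^ forms_b
    # Lowercase each side first so that case-only differences
    # (e.g. "Acerze" vs "acerze") no longer count as distinct.
    lowered = {f.lower() for f in forms_a} ^ {f.lower() for f in forms_b}
    return not_in_both, lowered


# Illustrative samples only, not real dictionary contents.
morfologik = load_forms(["Acerze", "acerze", "hakerstw", "hakerstwa"])
sgjp = load_forms(["acerze", "hakersko", "hakerstwa"])

unique, lowered = compare(morfologik, sgjp)
print(len(unique), sorted(unique))
print(len(lowered), sorted(lowered))
```

With real files, `load_forms(open(path, encoding="utf-8"))` would replace the sample lists. Note that lowercasing is only one possible normalization and may not match the one used for lowered.utf8.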
So... the conclusion is pretty consistent with Zipf's law: the two dictionaries
have fairly different coverage, even though both are quite large. We don't have a
frequency dictionary for Polish, but I assume most of these surface forms are
purely theoretical and occur extremely rarely in practice. That said, I think we
should use BOTH dictionaries: after all, there's no harm done if we overdo the
lemmatization a little bit, is there?
So... my proposal would be this: I'll integrate Morfeusz's dictionary into
Morfologik (as an alternative dictionary one can load and use).
Eventually it would probably be sensible to limit the automaton used in
Lucene to surface forms and lemmas only (no POS tags) and to merge both
dictionaries into a single automaton... but this can be a future improvement.
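As a rough illustration of the merge idea (not Morfologik's actual API or build process), the sketch below unions two dictionaries into a single mapping from surface form to lemmas and drops the POS tags. The tab-separated `form, lemma, tags` line layout, the function names, and the sample entries with placeholder tags are all assumptions made for the example. Compiling the merged data into an automaton is out of scope here.

```python
from collections import defaultdict


def parse_entries(lines, sep="\t"):
    """Yield (surface_form, lemma) pairs, ignoring any trailing tag fields."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(sep)
        if len(fields) < 2:
            continue  # skip malformed lines that have no lemma column
        yield fields[0], fields[1]


def merge(*dictionaries):
    """Union several dictionaries into one form -> set-of-lemmas mapping."""
    merged = defaultdict(set)
    for lines in dictionaries:
        for form, lemma in parse_entries(lines):
            merged[form].add(lemma)
    return merged


# Hypothetical entries; the tags are placeholders, not real tagset values.
mk = ["hakerstwa\thakerstwo\ttag-a"]
mz = ["hakerskości\thakerskość\ttag-b", "hakerstwa\thakerstwo\ttag-c"]

merged = merge(mk, mz)
for form in sorted(merged):
    print(form, sorted(merged[form]))
```

Because the values are sets, an entry present in both sources (or repeated with different tags) collapses to a single form and lemma pair, which is the size reduction that dropping POS tags would be expected to give.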
> explore morfologik integration
> ------------------------------
>
> Key: LUCENE-2341
> URL: https://issues.apache.org/jira/browse/LUCENE-2341
> Project: Lucene - Java
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Robert Muir
> Assignee: Dawid Weiss
> Attachments: LUCENE-2341.diff, morfologik-stemming-1.5.0.jar
>
>
> Dawid Weiss mentioned on LUCENE-2298 that there is another Polish stemmer
> available:
> http://sourceforge.net/projects/morfologik/
> This works differently than LUCENE-2298, and ideally would be another option
> for users.