[ 
https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052421#comment-13052421
 ] 

Dawid Weiss commented on LUCENE-2341:
-------------------------------------

I did some analyses on both dictionaries.
{noformat}
Number of lines (distinct surface forms):

  3.662.366 morfologik.utf8
  5.086.141 sgjp.utf8

Distinct words (not in both):

  2.729.334 unique.utf8

  - upper/lower case (morfologik has upper-case forms, morfeusz only
    lower-case surface forms);
    
    acerze
    Acerze

  - very rare or jargon;

    abszminka
    abszytowałem
    acetobakteria
    acetarsolowi
    niebombiasto
    hakatystce
    hakatystycznościach
    warzże

  - differences in spelling;

    abelard
    abélard

  - acronyms and super-short stuff

    aap
    aar

Distinct normalized (lowercase):

  2.564.366 lowered.utf8

  Most of these are very infrequent words or inflected forms. There are minor
  differences or missing surface forms in both dictionaries, as shown here
  (mz - morfeusz, mk - morfologik):

mz> hakersko
mz> hakerskość
mz> hakerskości
mz> hakerskością
mz> hakerskościach
mz> hakerskościami
mz> hakerskościom
mk> hakerstw
mk> hakerstwa
...
mk> hakowałyśmy
mk> hakowań
mk> hakowaniach
mk> hakowaniami
mk> hakowaniom
mz> hakowatość
mz> hakowatości
mz> hakowatością
mz> hakowatościach
mz> hakowatościami
mz> hakowatościom
{noformat}
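
For reference, a comparison like the one above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the procedure actually used: it assumes each dictionary is reduced to a list of surface forms, and that "normalized" means lowercasing both sides before comparing. The function names and the toy word lists are hypothetical.

```python
def compare_surface_forms(forms_a, forms_b, normalize=False):
    """Return the forms that occur in exactly one of the two collections."""
    if normalize:
        # Assumption: "normalized" means lowercasing both sides before comparing.
        forms_a = (w.lower() for w in forms_a)
        forms_b = (w.lower() for w in forms_b)
    return set(forms_a) ^ set(forms_b)


def tagged_listing(forms_mk, forms_mz):
    """Sorted 'mk> form' / 'mz> form' lines for forms present on one side only."""
    mk, mz = set(forms_mk), set(forms_mz)
    tagged = [(w, "mk") for w in mk - mz] + [(w, "mz") for w in mz - mk]
    return [f"{tag}> {w}" for w, tag in sorted(tagged)]


# Toy data built from words quoted above; not the real dictionary contents.
morfologik = ["Acerze", "acerze", "hakerstwa"]
morfeusz = ["acerze", "hakersko"]

unique = compare_surface_forms(morfologik, morfeusz)
lowered = compare_surface_forms(morfologik, morfeusz, normalize=True)
listing = tagged_listing(morfologik, morfeusz)
```

On the toy data the upper-case form counts as a difference only in the non-normalized comparison, which matches the upper/lower case category listed above.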

So... the conclusion is pretty consistent with Zipf's law: the two dictionaries 
differ considerably in coverage, even though both are quite large. We don't 
have a frequency dictionary for Polish, but I assume most of these surface 
forms are purely theoretical and occur very rarely in practice. That said, I 
think we should use BOTH dictionaries -- after all, there's no harm done if we 
overdo the lemmatization process a little bit, is there?
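
To make the "use both" idea concrete, here is a minimal sketch, assuming each dictionary can be viewed as a mapping from a surface form to a set of lemmas. The `lemmatize` function and the toy entries are hypothetical and do not reflect Morfologik's or Morfeusz's actual API or data.

```python
def lemmatize(word, *dictionaries):
    """Return the union of lemmas found for `word` across all dictionaries."""
    lemmas = set()
    for dictionary in dictionaries:
        lemmas.update(dictionary.get(word, ()))
    # Fall back to the word itself when no dictionary knows it.
    return lemmas or {word}


# Illustrative entries only (hypothetical, not copied from either dictionary).
mk = {"hakerstwa": {"hakerstwo"}}
mz = {"hakerskości": {"hakerskość"}}

from_mk = lemmatize("hakerstwa", mk, mz)
from_mz = lemmatize("hakerskości", mk, mz)
unknown = lemmatize("warzże", mk, mz)
```

A form covered by only one dictionary still gets its lemma, and a form covered by neither passes through unchanged, so in this sketch consulting the second dictionary can only add coverage.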

So... my proposal would be this: I'll integrate Morfeusz's dictionary into 
Morfologik (as an alternative dictionary one can load and use).

Eventually it would probably be sensible to limit the automaton used in 
Lucene to storing surface forms and lemmas only (no POS tags) and to merge both 
dictionaries into a single automaton... but this can be a future improvement.
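
The merge described here could look roughly like the sketch below, which builds a plain in-memory mapping rather than an automaton. It assumes each dictionary yields (surface form, lemma, POS tag) triples; that layout, the tag strings, and the function name are assumptions made for illustration only.

```python
from collections import defaultdict


def merge_without_pos(*entry_lists):
    """Merge dictionaries into one surface form -> lemmas map, dropping POS tags."""
    merged = defaultdict(set)
    for entries in entry_lists:
        for surface, lemma, _pos in entries:
            # The POS tag is ignored; duplicates across dictionaries collapse.
            merged[surface].add(lemma)
    return dict(merged)


# Hypothetical triples; the tag strings are placeholders, not real tagsets.
mk_entries = [("hakerstwa", "hakerstwo", "tag-a")]
mz_entries = [
    ("hakerstwa", "hakerstwo", "tag-b"),
    ("hakerskości", "hakerskość", "tag-c"),
]

merged = merge_without_pos(mk_entries, mz_entries)
```

Entries that differ only in their tags collapse into a single pair once the tags are dropped.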



> explore morfologik integration
> ------------------------------
>
>                 Key: LUCENE-2341
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2341
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Robert Muir
>            Assignee: Dawid Weiss
>         Attachments: LUCENE-2341.diff, morfologik-stemming-1.5.0.jar
>
>
> Dawid Weiss mentioned on LUCENE-2298 that there is another Polish stemmer 
> available:
> http://sourceforge.net/projects/morfologik/
> This works differently than LUCENE-2298, and ideally would be another option 
> for users.
