[jira] [Commented] (LUCENE-2341) explore morfologik integration

Dawid Weiss (JIRA) Tue, 21 Jun 2011 23:58:17 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053079#comment-13053079
 ]


Dawid Weiss commented on LUCENE-2341:
-------------------------------------

bq. Dawid, do you think it's reasonable to optimize further and use directly a 
list returned by IStemmer.lookup (instead of copying with addAll) ? My concern 
is that (at least in current DictionaryLookup implementation) that list seems 
to be shared by distinct invocations of the lookup method, which would make the 
use of a specific IStemmer not applicable in thread-safe code.

IStemmer implementations are not thread-safe anyway, so there is no problem in 
reusing that list directly. In fact, the returned WordData objects are reused 
internally as well (this is done to avoid GC overhead), so you can't store them 
either. 

So yes: I missed that, but you'll need to ensure IStemmer instances are not 
shared between threads. This can be done in various ways (thread local, etc.), 
but I think the simplest way to do it would be to instantiate PolishStemmer at 
the MorfologikFilter level. This is cheap (the dictionary is loaded once 
anyway). 
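
To illustrate the reuse issue, here is a minimal sketch. ReusingStemmer and 
PerFilterStemming are hypothetical stand-ins written for this example (they are 
not the morfologik or Lucene classes), and the "stemming" is a toy rule. It 
only shows why a result list that is recycled between lookups must be consumed 
before the next call, and why giving each consumer its own stemmer avoids 
sharing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a stemmer that recycles its result list between
// calls, as described above. This is NOT the morfologik API.
final class ReusingStemmer {
    private final List<String> buffer = new ArrayList<>();

    List<String> lookup(String word) {
        buffer.clear();
        // Toy rule for illustration only: strip a trailing 's'.
        buffer.add(word.endsWith("s") ? word.substring(0, word.length() - 1) : word);
        return buffer; // the same list instance is returned on every call
    }
}

// Hypothetical filter-like class that owns a private stemmer instance, so the
// recycled list is never visible to another consumer.
final class PerFilterStemming {
    private final ReusingStemmer stemmer = new ReusingStemmer();

    String firstStem(String word) {
        List<String> stems = stemmer.lookup(word);
        return stems.get(0); // consume the result before the next lookup
    }

    // Demonstrates the hazard: the first result is overwritten by the second call.
    static boolean listIsReused() {
        ReusingStemmer s = new ReusingStemmer();
        List<String> first = s.lookup("cats");
        List<String> second = s.lookup("dogs");
        return first == second && first.get(0).equals("dog");
    }
}
```

Holding on to "first" across the second lookup is exactly what would go wrong 
if two threads shared one stemmer.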

You can then create two constructors in the analyzer -- one taking a 
PolishStemmer.DICTIONARY and one using the default (I'd suggest MORFOLOGIK). 
Exposing a constructor that takes an IStemmer will do more harm than good -- 
thinking ahead is good, but in this case I don't think there'll be that many 
people interested in subclassing IStemmer (if anything, they'll plug into 
Lucene's infrastructure directly).
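
A rough sketch of that constructor layout, under stated assumptions: 
SketchAnalyzer, SketchFilter and the nested Dictionary enum are hypothetical 
names made up for this example. The enum stands in for 
PolishStemmer.DICTIONARY, and ALTERNATIVE is a placeholder constant, since 
MORFOLOGIK is the only value named here.

```java
// Hypothetical analyzer-like class; not the real Lucene Analyzer API.
final class SketchAnalyzer {
    // Stand-in for PolishStemmer.DICTIONARY. ALTERNATIVE is a placeholder.
    enum Dictionary { MORFOLOGIK, ALTERNATIVE }

    private final Dictionary dictionary;

    // Default constructor: falls back to the suggested default dictionary.
    SketchAnalyzer() {
        this(Dictionary.MORFOLOGIK);
    }

    // Explicit constructor: the caller picks a dictionary, not a stemmer instance.
    SketchAnalyzer(Dictionary dictionary) {
        if (dictionary == null) {
            throw new IllegalArgumentException("dictionary must not be null");
        }
        this.dictionary = dictionary;
    }

    Dictionary dictionary() {
        return dictionary;
    }

    // Each call builds a new filter, which would own its private stemmer.
    SketchFilter newFilter() {
        return new SketchFilter(dictionary);
    }
}

// Hypothetical filter; in the real patch this is where a stemmer would be created.
final class SketchFilter {
    final SketchAnalyzer.Dictionary dictionary;

    SketchFilter(SketchAnalyzer.Dictionary dictionary) {
        this.dictionary = dictionary;
    }
}
```

Because only the dictionary choice is exposed, the analyzer stays free to 
create stemmers wherever it needs them.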

A simple test case spawning 5 or 10 threads in a parallel executor and 
crunching stems on the same analyzer would also be nice to ensure we have 
everything correct wrt multithreading, but it's not that crucial if you don't 
have the time to write it.
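
Such a test could look roughly like the sketch below. ToyAnalyzer is a 
hypothetical stand-in with a toy stemming rule, not the actual analyzer from 
the patch; the point is only the shape of the check: several threads hit the 
same analyzer object and every result is compared against a known answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ConcurrentStemmingCheck {

    // Hypothetical analyzer: shared between threads, but all mutable state
    // is created per call, so nothing is shared across threads.
    static final class ToyAnalyzer {
        String stem(String word) {
            StringBuilder scratch = new StringBuilder(word); // per-call scratch buffer
            if (word.endsWith("s")) {
                scratch.setLength(scratch.length() - 1); // toy rule, illustration only
            }
            return scratch.toString();
        }
    }

    // Runs the same analyzer from several threads and counts wrong answers.
    static int countMismatches(int threads, int iterations) {
        final ToyAnalyzer analyzer = new ToyAnalyzer();
        final String[] words = {"cats", "dogs", "bird"};
        final String[] expected = {"cat", "dog", "bird"};

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                tasks.add(() -> {
                    int bad = 0;
                    for (int i = 0; i < iterations; i++) {
                        int k = i % words.length;
                        if (!expected[k].equals(analyzer.stem(words[k]))) {
                            bad++;
                        }
                    }
                    return bad;
                });
            }
            int total = 0;
            // invokeAll blocks until every task has finished.
            for (Future<Integer> result : pool.invokeAll(tasks)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("stemming task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

A passing run does not prove thread safety, but a non-zero mismatch count or 
an exception from a worker would point to shared state.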

Thanks!

> explore morfologik integration
> ------------------------------
>
>                 Key: LUCENE-2341
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2341
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Robert Muir
>            Assignee: Dawid Weiss
>         Attachments: LUCENE-2341.diff, LUCENE-2341.diff, 
> morfologik-stemming-1.5.0.jar
>
>
> Dawid Weiss mentioned on LUCENE-2298 that there is another Polish stemmer 
> available:
> http://sourceforge.net/projects/morfologik/
> This works differently than LUCENE-2298, and ideally would be another option 
> for users.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

