[
https://issues.apache.org/jira/browse/LUCENE-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868038#action_12868038
]
Robert Muir commented on LUCENE-2413:
-------------------------------------
bq. Might this be much faster than CharArraySet?
I ran indexing tests a while ago (Reuters) with CharArraySet itself implemented
with a DFA, and it was slightly faster, but not by much. I think this is because
English words are usually not very long (average length=5). For other languages
this technique might save some CPU time, but there are some "problems" I imagine:
# building an automaton from a list of words is more expensive, although Dawid
Weiss has implemented an addition to automaton that does this fast.
# in general, building the Automaton, RunAutomaton, etc. is more "heavy" I would
think, but Mike McCandless hacked away a lot of this heaviness when we
converted to UTF-32.
# the CharacterRunAutomaton is not optimized right now: we disabled the
classmap[] for chars because it consumes more RAM. I think if we were to care
about performance on char[] we should make it use a classmap for 0x0-0xffff and
binary search the rest, or something similar. Currently it binary searches on
each input character.
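To make the last point concrete, here is a rough sketch of the "classmap for 0x0-0xffff, binary search the rest" idea. It is plain Java and not Lucene code: the class name HybridClassMap, the method names, and the assumption that the character classes are given as a sorted array of interval start points beginning at 0 are all made up for illustration.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch, not Lucene code: maps a code point to its character
 * class (the index of the interval it falls into). A direct table serves
 * 0x0-0xffff; everything above falls back to a binary search.
 */
final class HybridClassMap {
    private static final int TABLE_SIZE = 0x10000; // covers 0x0-0xffff

    private final int[] points; // sorted, distinct interval starts; points[0] == 0 (assumption)
    private final int[] table;  // precomputed class for every code point below TABLE_SIZE

    HybridClassMap(int[] sortedPoints) {
        if (sortedPoints.length == 0 || sortedPoints[0] != 0) {
            throw new IllegalArgumentException("points must be non-empty and start at 0");
        }
        this.points = sortedPoints.clone();
        this.table = new int[TABLE_SIZE];
        int cls = 0;
        for (int c = 0; c < TABLE_SIZE; c++) {
            // advance to the last interval whose start is <= c
            while (cls + 1 < points.length && points[cls + 1] <= c) {
                cls++;
            }
            table[c] = cls;
        }
    }

    /** Table lookup for char values, binary search for supplementary code points. */
    int classOf(int codePoint) {
        if (codePoint < TABLE_SIZE) {
            return table[codePoint];
        }
        return searchClass(codePoint);
    }

    /** Pure binary search; what a lookup costs without the table. */
    int searchClass(int codePoint) {
        int idx = Arrays.binarySearch(points, codePoint);
        // exact hit: idx is the class; otherwise idx == -(insertionPoint) - 1
        // and the class is insertionPoint - 1
        return idx >= 0 ? idx : -idx - 2;
    }
}
```

With int entries the table costs 256 KB per automaton, which is the RAM trade-off mentioned above; a narrower element type would shrink it when there are few classes.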
Somewhat related, a while ago I tested this with CharArraySet as a DFA, and
opened this issue: LUCENE-2227. But obviously this is not the only way, as this
example shows filtering on the DFA itself (and not using CharArraySet at all).
So in general, I have those concerns right now, but maybe in the future, once
some things are addressed, we could at least make an optional StopFilter impl or
something similar.
One thing I like about this filter personally is that rejected terms always
(optionally) cause the posInc to be increased... I do not think our existing
KeepWordFilter or LengthFilter do this, but maybe I am wrong.
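As a rough illustration of the posInc behavior I mean, here is a plain-Java sketch (no Lucene classes; Token, PosIncFilterSketch and the enablePosInc flag are hypothetical names for this example only). Each rejected term's position increment is carried over and added to the next accepted term, so the gap stays visible to phrase queries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of a filter that preserves position gaps. */
final class PosIncFilterSketch {

    /** Minimal stand-in for a term plus its position increment. */
    static final class Token {
        final String term;
        final int posInc;

        Token(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    /**
     * Keeps the tokens that 'accept' allows. When enablePosInc is true, the
     * increments of rejected tokens are added to the next accepted token;
     * when false, rejected tokens vanish without leaving a gap.
     */
    static List<Token> filter(List<Token> in, Predicate<String> accept, boolean enablePosInc) {
        List<Token> out = new ArrayList<>();
        int skipped = 0; // increments accumulated from rejected tokens
        for (Token t : in) {
            if (accept.test(t.term)) {
                int inc = enablePosInc ? t.posInc + skipped : t.posInc;
                out.add(new Token(t.term, inc));
                skipped = 0;
            } else {
                skipped += t.posInc;
            }
        }
        return out;
    }
}
```

For the input "the quick brown fox" with "the" and "brown" rejected, "quick" and "fox" both come out with posInc=2 when the flag is on, and posInc=1 when it is off.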
> Consolidate all (Solr's & Lucene's) analyzers into modules/analysis
> -------------------------------------------------------------------
>
> Key: LUCENE-2413
> URL: https://issues.apache.org/jira/browse/LUCENE-2413
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Reporter: Michael McCandless
> Assignee: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-2413-charfilter.patch, LUCENE-2413-PFAW+LF.patch,
> LUCENE-2413_commongrams.patch, LUCENE-2413_folding.patch,
> LUCENE-2413_htmlstrip.patch, LUCENE-2413_keep_hyphen_trim.patch,
> LUCENE-2413_mockfilter.patch, LUCENE-2413_mockfilter.patch,
> LUCENE-2413_pattern.patch, LUCENE-2413_porter.patch,
> LUCENE-2413_removeDups.patch, LUCENE-2413_synonym.patch,
> LUCENE-2413_teesink.patch, LUCENE-2413_testanalyzer.patch,
> LUCENE-2413_testanalyzer.patch, LUCENE-2413_tests2.patch,
> LUCENE-2413_wdf.patch
>
>
> We've been wanting to do this for quite some time now... I think, now that
> Solr/Lucene are merged, and we're looking at opening an unstable line of
> development for Solr/Lucene, now is the right time to do it.
> A standalone module for all analyzers also empowers apps to separately
> version the analyzers from which version of Solr/Lucene they use, possibly
> enabling us to remove Version entirely from the analyzers.
> We should also do LUCENE-2309 (decouple, as much as possible, indexer from
> the analysis API), but I don't think that issue needs to block this
> consolidation.
> Once we do this, there is one place where our users can find all the
> analyzers that Solr/Lucene provide.