[jira] [Commented] (LUCENE-7639) Use Suffix Arrays for fast search with leading asterisks

Yakov Sirotkin (JIRA) Tue, 04 Apr 2017 00:39:13 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954722#comment-15954722
 ]


Yakov Sirotkin commented on LUCENE-7639:
----------------------------------------

Many thanks to all for the feedback. Here is the list of changes in 
suffix-array-2.patch:

1. Suffix Array construction is now implemented without recursion, which 
fixes a major bug discovered by the {{TestIndexWriter.testWickedLongTerm}} 
test.
2. wordIds are sorted instead of words, since the words are already sorted in 
the index.
3. {{SegmentTermsEnum}} is used inside {{ListTermsEnum}}.
4. The entire Suffix Array construction is moved to a separate thread to 
avoid startup delays.
5. Properties are renamed to {{lucene.suffixArray.enable}} and 
{{lucene.suffixArray.initializationThreadsCount}}.
6. If {{lucene.suffixArray.initializationThreadsCount}} is set to {{0}}, 
initialization is synchronous and no additional {{ExecutorService}} is 
created.
7. {{CompiledAutomaton}} is used instead of Java's {{Pattern}}.
8. An additional flag {{lucene.suffixArray.optimizeForUTF}} with default 
value {{true}} was added. If it is set to {{false}}, we assume that the index 
can contain any bytes, not necessarily representing UTF characters. In this 
case the code passes some tests that otherwise fail, but for a real 
application it doubles memory consumption and reduces performance.
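
To illustrate items 4-6, here is a minimal sketch of how the two properties 
could drive synchronous versus background initialization. It is not taken 
from the patch: the class name {{SuffixArrayInitSketch}}, the method names, 
and the use of a fixed thread pool are assumptions made for illustration. 
Only the property names and the default values ({{false}} and {{5}}) come 
from this issue.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper, not part of suffix-array-2.patch.
class SuffixArrayInitSketch {
    // Property names as given in this comment.
    static final String ENABLE = "lucene.suffixArray.enable";
    static final String THREADS = "lucene.suffixArray.initializationThreadsCount";

    // Suffix Array support is off unless the flag is explicitly set to true.
    static boolean isEnabled() {
        return Boolean.parseBoolean(System.getProperty(ENABLE, "false"));
    }

    // The default of 5 threads follows the issue description.
    static int threadCount() {
        return Integer.parseInt(System.getProperty(THREADS, "5"));
    }

    // Runs the build task inline when threads == 0 (no ExecutorService is
    // created, null is returned); otherwise submits it to a new pool so the
    // caller is not blocked during startup.
    static ExecutorService initialize(Runnable buildTask, int threads) {
        if (threads == 0) {
            buildTask.run();
            return null;
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        pool.submit(buildTask);
        pool.shutdown(); // no new tasks accepted; the submitted one still runs
        return pool;
    }

    public static void main(String[] args) {
        if (isEnabled()) {
            // Placeholder task standing in for the real construction work.
            initialize(() -> System.out.println("building suffix array"), threadCount());
        }
    }
}
```

With {{-Dlucene.suffixArray.initializationThreadsCount=0}} the build runs on 
the calling thread, which can make tests deterministic at the cost of a 
slower startup.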

> Use Suffix Arrays for fast search with leading asterisks
> --------------------------------------------------------
>
>                 Key: LUCENE-7639
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7639
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Yakov Sirotkin
>         Attachments: suffix-array-2.patch, suffix-array.patch
>
>
> If a query term starts with an asterisk, the FST checks all words in the 
> dictionary, so request processing slows down. This problem can be solved 
> with the Suffix Array approach. Luckily, a Suffix Array can be constructed 
> from the existing index after Lucene starts. Unfortunately, Suffix Arrays 
> require a lot of RAM, so we can use them only when a special flag is set:
> -Dsolr.suffixArray.enable=true
> It is possible to speed up Suffix Array initialization using several 
> threads, so we can control the number of threads with 
> -Dsolr.suffixArray.initialization_treads_count=5
> This system property can be omitted; the default value is 5.  
> The attached patch is the suggested implementation of SuffixArray support. 
> It works for all terms that start with an asterisk and contain at least 3 
> consecutive non-wildcard characters. This patch does not change search 
> results and affects only performance.
> *Update*
> suffix-array-2.patch is an improved version of the first patch. Its system 
> properties are as follows:
> {{lucene.suffixArray.enable}} - set to {{true}} to enable Suffix Array 
> support. Default value - {{false}}.
> {{lucene.suffixArray.initializationThreadsCount}} - number of threads for 
> Suffix Array initialization; if set to {{0}}, no additional threads are 
> used. Default value - {{5}}.
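
For readers unfamiliar with the approach described in the issue, here is a 
minimal sketch of how a suffix array over the term dictionary can answer a 
query such as {{*app*}} with a binary search instead of a scan of every term. 
It is not the implementation in the patch: the class 
{{TermSuffixArraySketch}} and its methods are hypothetical, it works on Java 
strings rather than on the bytes of the index, and the construction is a 
naive sort that materializes suffixes, unlike the non-recursive construction 
mentioned in the comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration, not code from suffix-array-2.patch.
class TermSuffixArraySketch {
    private final String[] terms;
    private final int[] termIds; // which term each sorted suffix belongs to
    private final int[] offsets; // where the suffix starts inside that term

    TermSuffixArraySketch(String[] terms) {
        this.terms = terms;
        // One entry per (term, offset) pair; this is why memory use is high.
        List<int[]> entries = new ArrayList<>();
        for (int t = 0; t < terms.length; t++) {
            for (int o = 0; o < terms[t].length(); o++) {
                entries.add(new int[] {t, o});
            }
        }
        // Naive construction: sort all suffixes lexicographically.
        entries.sort((a, b) -> suffix(a[0], a[1]).compareTo(suffix(b[0], b[1])));
        termIds = new int[entries.size()];
        offsets = new int[entries.size()];
        for (int i = 0; i < entries.size(); i++) {
            termIds[i] = entries.get(i)[0];
            offsets[i] = entries.get(i)[1];
        }
    }

    private String suffix(int term, int offset) {
        return terms[term].substring(offset);
    }

    // Returns the terms that contain the given infix (a query like *infix*).
    SortedSet<String> termsContaining(String infix) {
        // Binary search for the first suffix that is >= infix.
        int lo = 0;
        int hi = termIds.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (suffix(termIds[mid], offsets[mid]).compareTo(infix) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        // Suffixes that start with infix form one contiguous run from lo.
        SortedSet<String> result = new TreeSet<>();
        for (int i = lo; i < termIds.length
                && suffix(termIds[i], offsets[i]).startsWith(infix); i++) {
            result.add(terms[termIds[i]]);
        }
        return result;
    }

    public static void main(String[] args) {
        TermSuffixArraySketch index = new TermSuffixArraySketch(
            new String[] {"apple", "banana", "grape", "pineapple"});
        System.out.println(index.termsContaining("app"));
    }
}
```

The lookup cost depends on the logarithm of the number of suffixes plus the 
number of matches, while the index holds one entry per character of every 
term, which matches the RAM trade-off noted in the description.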



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
