[jira] [Commented] (LUCENE-3289) FST should allow controlling how hard builder tries to share suffixes

2011-07-08 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13061804#comment-13061804
 ] 

Eks Dev commented on LUCENE-3289:
-

bq. The strings are extremely long (more like short documents) and probably 
need to be compressed in some different datastructure, e.g. a word-based one?

That would be indeed cool, e.g. FST with words (ngrams?) as symbols. Ages ago 
we used one trie, for all unique terms to get prefix/edit distance on words and 
one word-trie (symbols were words via symbol table) for documents. I am sure 
this would cut memory requirements significantly for multiword cases when 
compared to char level FST.
e.g. TermDictionary that supports ord() could be used as a symbol table.






 FST should allow controlling how hard builder tries to share suffixes
 -

 Key: LUCENE-3289
 URL: https://issues.apache.org/jira/browse/LUCENE-3289
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.4, 4.0

 Attachments: LUCENE-3289.patch, LUCENE-3289.patch


 Today we have a boolean option to the FST builder telling it whether
 it should share suffixes.
 If you turn this off, building is much faster, uses much less RAM, and
 the resulting FST is a prefix trie.  But, the FST is larger than it
 needs to be.  When it's on, the builder maintains a node hash holding
 every node seen so far in the FST -- this uses up RAM and slows things
 down.
 On a dataset that Elmer (see java-user thread Autocompletion on large
 index on Jul 6 2011) provided (thank you!), which is 1.32 M titles
 avg 67.3 chars per title, building with suffix sharing on took 22.5
 seconds, required 1.25 GB heap, and produced 91.6 MB FST.  With suffix
 sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST.
 I think we should allow this boolean to be shade-of-gray instead:
 usually, how well suffixes can share is a function of how far they are
 from the end of the string, so, by adding a tunable N to only share
 when suffix length  N, we can let caller make reasonable tradeoffs. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3289) FST should allow controlling how hard builder tries to share suffixes

2011-07-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13061195#comment-13061195
 ] 

Michael McCandless commented on LUCENE-3289:


NOTE: patch applies to 3.x.

I ran the patch on the titles, varying the max prefix sharing length:

||Len||FST Size||Seconds||
|1|135446807|8.2|
|2|137632702|8.5|
|3|135177994|8.3|
|4|132782016|8.3|
|5|130415331|8.4|
|6|128086200|8.0|
|7|125797396|8.2|
|8|123552157|8.5|
|9|121358375|8.4|
|10|119228942|8.1|
|11|117181180|8.8|
|12|115229788|8.7|
|13|113388260|9.5|
|14|111664442|9.0|
|15|110059167|9.2|
|16|108572519|9.7|
|17|107201905|9.8|
|18|105942576|10.3|
|19|104791497|10.1|
|20|103745678|11.1|
|21|102801693|10.8|
|22|101957797|11.4|
|23|101206564|11.1|
|24|100541849|11.0|
|25|99956443|11.1|
|26|99443232|12.9|
|27|98995194|13.2|
|28|98604680|13.9|
|29|98264184|13.5|
|30|97969241|13.6|
|31|97714049|13.8|
|32|97494104|14.3|
|33|97304045|14.0|
|34|97140033|14.3|
|35|96998942|14.6|
|36|96877590|16.5|
|37|96773039|16.9|
|38|96682961|16.6|
|39|96605160|17.8|
|40|96537687|18.3|
|41|96479286|17.8|
|42|96428710|17.5|
|43|96384659|18.9|
|44|96346174|17.0|
|45|96312826|19.3|
|46|96283545|17.8|
|47|96257708|19.4|
|48|96235159|19.0|
|49|96215220|18.7|
|50|96197450|19.6|
|51|96181539|17.3|
|52|96167235|16.9|
|53|96154490|17.7|
|54|96143081|18.8|
|55|96132905|17.4|
|56|96123776|17.5|
|57|96115462|20.7|
|58|96108051|19.2|
|59|96101249|19.1|
|60|96095107|18.7|
|ALL|96020343|22.5|

Very very odd that FST size first goes up at N=2... not yet sure why.  But from 
this curve it looks like there is a sweet spot around maybe N=24.  I didn't 
measure required heap here, but it also will go down as N goes down.


 FST should allow controlling how hard builder tries to share suffixes
 -

 Key: LUCENE-3289
 URL: https://issues.apache.org/jira/browse/LUCENE-3289
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.4, 4.0

 Attachments: LUCENE-3289.patch


 Today we have a boolean option to the FST builder telling it whether
 it should share suffixes.
 If you turn this off, building is much faster, uses much less RAM, and
 the resulting FST is a prefix trie.  But, the FST is larger than it
 needs to be.  When it's on, the builder maintains a node hash holding
 every node seen so far in the FST -- this uses up RAM and slows things
 down.
 On a dataset that Elmer (see java-user thread Autocompletion on large
 index on Jul 6 2011) provided (thank you!), which is 1.32 M titles
 avg 67.3 chars per title, building with suffix sharing on took 22.5
 seconds, required 1.25 GB heap, and produced 91.6 MB FST.  With suffix
 sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST.
 I think we should allow this boolean to be shade-of-gray instead:
 usually, how well suffixes can share is a function of how far they are
 from the end of the string, so, by adding a tunable N to only share
 when suffix length  N, we can let caller make reasonable tradeoffs. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3289) FST should allow controlling how hard builder tries to share suffixes

2011-07-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13061453#comment-13061453
 ] 

Robert Muir commented on LUCENE-3289:
-

I think thats probably good for most cases?

In the example you gave, it seems that FST might not be the best algorithm? The 
strings are extremely long (more like short documents) and probably need to be 
compressed in some different datastructure, e.g. a word-based one?

 FST should allow controlling how hard builder tries to share suffixes
 -

 Key: LUCENE-3289
 URL: https://issues.apache.org/jira/browse/LUCENE-3289
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.4, 4.0

 Attachments: LUCENE-3289.patch, LUCENE-3289.patch


 Today we have a boolean option to the FST builder telling it whether
 it should share suffixes.
 If you turn this off, building is much faster, uses much less RAM, and
 the resulting FST is a prefix trie.  But, the FST is larger than it
 needs to be.  When it's on, the builder maintains a node hash holding
 every node seen so far in the FST -- this uses up RAM and slows things
 down.
 On a dataset that Elmer (see java-user thread Autocompletion on large
 index on Jul 6 2011) provided (thank you!), which is 1.32 M titles
 avg 67.3 chars per title, building with suffix sharing on took 22.5
 seconds, required 1.25 GB heap, and produced 91.6 MB FST.  With suffix
 sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST.
 I think we should allow this boolean to be shade-of-gray instead:
 usually, how well suffixes can share is a function of how far they are
 from the end of the string, so, by adding a tunable N to only share
 when suffix length  N, we can let caller make reasonable tradeoffs. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3289) FST should allow controlling how hard builder tries to share suffixes

2011-07-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13061456#comment-13061456
 ] 

Michael McCandless commented on LUCENE-3289:


Yeah I think costly but perfect minimization is the right default.

 FST should allow controlling how hard builder tries to share suffixes
 -

 Key: LUCENE-3289
 URL: https://issues.apache.org/jira/browse/LUCENE-3289
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.4, 4.0

 Attachments: LUCENE-3289.patch, LUCENE-3289.patch


 Today we have a boolean option to the FST builder telling it whether
 it should share suffixes.
 If you turn this off, building is much faster, uses much less RAM, and
 the resulting FST is a prefix trie.  But, the FST is larger than it
 needs to be.  When it's on, the builder maintains a node hash holding
 every node seen so far in the FST -- this uses up RAM and slows things
 down.
 On a dataset that Elmer (see java-user thread Autocompletion on large
 index on Jul 6 2011) provided (thank you!), which is 1.32 M titles
 avg 67.3 chars per title, building with suffix sharing on took 22.5
 seconds, required 1.25 GB heap, and produced 91.6 MB FST.  With suffix
 sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST.
 I think we should allow this boolean to be shade-of-gray instead:
 usually, how well suffixes can share is a function of how far they are
 from the end of the string, so, by adding a tunable N to only share
 when suffix length  N, we can let caller make reasonable tradeoffs. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3289) FST should allow controlling how hard builder tries to share suffixes

2011-07-07 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13061512#comment-13061512
 ] 

Dawid Weiss commented on LUCENE-3289:
-

Exactly. This is a very specific use case (long suggestions).

 FST should allow controlling how hard builder tries to share suffixes
 -

 Key: LUCENE-3289
 URL: https://issues.apache.org/jira/browse/LUCENE-3289
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.4, 4.0

 Attachments: LUCENE-3289.patch, LUCENE-3289.patch


 Today we have a boolean option to the FST builder telling it whether
 it should share suffixes.
 If you turn this off, building is much faster, uses much less RAM, and
 the resulting FST is a prefix trie.  But, the FST is larger than it
 needs to be.  When it's on, the builder maintains a node hash holding
 every node seen so far in the FST -- this uses up RAM and slows things
 down.
 On a dataset that Elmer (see java-user thread Autocompletion on large
 index on Jul 6 2011) provided (thank you!), which is 1.32 M titles
 avg 67.3 chars per title, building with suffix sharing on took 22.5
 seconds, required 1.25 GB heap, and produced 91.6 MB FST.  With suffix
 sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST.
 I think we should allow this boolean to be shade-of-gray instead:
 usually, how well suffixes can share is a function of how far they are
 from the end of the string, so, by adding a tunable N to only share
 when suffix length  N, we can let caller make reasonable tradeoffs. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org