[jira] [Commented] (LUCENE-3289) FST should allow controlling how hard builder tries to share suffixes
[ https://issues.apache.org/jira/browse/LUCENE-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13061804#comment-13061804 ] Eks Dev commented on LUCENE-3289: - bq. The strings are extremely long (more like short documents) and probably need to be compressed in some different datastructure, e.g. a word-based one? That would be indeed cool, e.g. FST with words (ngrams?) as symbols. Ages ago we used one trie, for all unique terms to get prefix/edit distance on words and one word-trie (symbols were words via symbol table) for documents. I am sure this would cut memory requirements significantly for multiword cases when compared to char level FST. e.g. TermDictionary that supports ord() could be used as a symbol table. FST should allow controlling how hard builder tries to share suffixes - Key: LUCENE-3289 URL: https://issues.apache.org/jira/browse/LUCENE-3289 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.4, 4.0 Attachments: LUCENE-3289.patch, LUCENE-3289.patch Today we have a boolean option to the FST builder telling it whether it should share suffixes. If you turn this off, building is much faster, uses much less RAM, and the resulting FST is a prefix trie. But, the FST is larger than it needs to be. When it's on, the builder maintains a node hash holding every node seen so far in the FST -- this uses up RAM and slows things down. On a dataset that Elmer (see java-user thread Autocompletion on large index on Jul 6 2011) provided (thank you!), which is 1.32 M titles avg 67.3 chars per title, building with suffix sharing on took 22.5 seconds, required 1.25 GB heap, and produced 91.6 MB FST. With suffix sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST. I think we should allow this boolean to be shade-of-gray instead: usually, how well suffixes can share is a function of how far they are from the end of the string, so, by adding a tunable N to only share when suffix length N, we can let caller make reasonable tradeoffs. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3289) FST should allow controlling how hard builder tries to share suffixes
[ https://issues.apache.org/jira/browse/LUCENE-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13061195#comment-13061195 ] Michael McCandless commented on LUCENE-3289: NOTE: patch applies to 3.x. I ran the patch on the titles, varying the max prefix sharing length: ||Len||FST Size||Seconds|| |1|135446807|8.2| |2|137632702|8.5| |3|135177994|8.3| |4|132782016|8.3| |5|130415331|8.4| |6|128086200|8.0| |7|125797396|8.2| |8|123552157|8.5| |9|121358375|8.4| |10|119228942|8.1| |11|117181180|8.8| |12|115229788|8.7| |13|113388260|9.5| |14|111664442|9.0| |15|110059167|9.2| |16|108572519|9.7| |17|107201905|9.8| |18|105942576|10.3| |19|104791497|10.1| |20|103745678|11.1| |21|102801693|10.8| |22|101957797|11.4| |23|101206564|11.1| |24|100541849|11.0| |25|99956443|11.1| |26|99443232|12.9| |27|98995194|13.2| |28|98604680|13.9| |29|98264184|13.5| |30|97969241|13.6| |31|97714049|13.8| |32|97494104|14.3| |33|97304045|14.0| |34|97140033|14.3| |35|96998942|14.6| |36|96877590|16.5| |37|96773039|16.9| |38|96682961|16.6| |39|96605160|17.8| |40|96537687|18.3| |41|96479286|17.8| |42|96428710|17.5| |43|96384659|18.9| |44|96346174|17.0| |45|96312826|19.3| |46|96283545|17.8| |47|96257708|19.4| |48|96235159|19.0| |49|96215220|18.7| |50|96197450|19.6| |51|96181539|17.3| |52|96167235|16.9| |53|96154490|17.7| |54|96143081|18.8| |55|96132905|17.4| |56|96123776|17.5| |57|96115462|20.7| |58|96108051|19.2| |59|96101249|19.1| |60|96095107|18.7| |ALL|96020343|22.5| Very very odd that FST size first goes up at N=2... not yet sure why. But from this curve it looks like there is a sweet spot around maybe N=24. I didn't measure required heap here, but it also will go down as N goes down. FST should allow controlling how hard builder tries to share suffixes - Key: LUCENE-3289 URL: https://issues.apache.org/jira/browse/LUCENE-3289 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.4, 4.0 Attachments: LUCENE-3289.patch Today we have a boolean option to the FST builder telling it whether it should share suffixes. If you turn this off, building is much faster, uses much less RAM, and the resulting FST is a prefix trie. But, the FST is larger than it needs to be. When it's on, the builder maintains a node hash holding every node seen so far in the FST -- this uses up RAM and slows things down. On a dataset that Elmer (see java-user thread Autocompletion on large index on Jul 6 2011) provided (thank you!), which is 1.32 M titles avg 67.3 chars per title, building with suffix sharing on took 22.5 seconds, required 1.25 GB heap, and produced 91.6 MB FST. With suffix sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST. I think we should allow this boolean to be shade-of-gray instead: usually, how well suffixes can share is a function of how far they are from the end of the string, so, by adding a tunable N to only share when suffix length N, we can let caller make reasonable tradeoffs. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3289) FST should allow controlling how hard builder tries to share suffixes
[ https://issues.apache.org/jira/browse/LUCENE-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13061453#comment-13061453 ] Robert Muir commented on LUCENE-3289: - I think thats probably good for most cases? In the example you gave, it seems that FST might not be the best algorithm? The strings are extremely long (more like short documents) and probably need to be compressed in some different datastructure, e.g. a word-based one? FST should allow controlling how hard builder tries to share suffixes - Key: LUCENE-3289 URL: https://issues.apache.org/jira/browse/LUCENE-3289 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.4, 4.0 Attachments: LUCENE-3289.patch, LUCENE-3289.patch Today we have a boolean option to the FST builder telling it whether it should share suffixes. If you turn this off, building is much faster, uses much less RAM, and the resulting FST is a prefix trie. But, the FST is larger than it needs to be. When it's on, the builder maintains a node hash holding every node seen so far in the FST -- this uses up RAM and slows things down. On a dataset that Elmer (see java-user thread Autocompletion on large index on Jul 6 2011) provided (thank you!), which is 1.32 M titles avg 67.3 chars per title, building with suffix sharing on took 22.5 seconds, required 1.25 GB heap, and produced 91.6 MB FST. With suffix sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST. I think we should allow this boolean to be shade-of-gray instead: usually, how well suffixes can share is a function of how far they are from the end of the string, so, by adding a tunable N to only share when suffix length N, we can let caller make reasonable tradeoffs. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3289) FST should allow controlling how hard builder tries to share suffixes
[ https://issues.apache.org/jira/browse/LUCENE-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13061456#comment-13061456 ] Michael McCandless commented on LUCENE-3289: Yeah I think costly but perfect minimization is the right default. FST should allow controlling how hard builder tries to share suffixes - Key: LUCENE-3289 URL: https://issues.apache.org/jira/browse/LUCENE-3289 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.4, 4.0 Attachments: LUCENE-3289.patch, LUCENE-3289.patch Today we have a boolean option to the FST builder telling it whether it should share suffixes. If you turn this off, building is much faster, uses much less RAM, and the resulting FST is a prefix trie. But, the FST is larger than it needs to be. When it's on, the builder maintains a node hash holding every node seen so far in the FST -- this uses up RAM and slows things down. On a dataset that Elmer (see java-user thread Autocompletion on large index on Jul 6 2011) provided (thank you!), which is 1.32 M titles avg 67.3 chars per title, building with suffix sharing on took 22.5 seconds, required 1.25 GB heap, and produced 91.6 MB FST. With suffix sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST. I think we should allow this boolean to be shade-of-gray instead: usually, how well suffixes can share is a function of how far they are from the end of the string, so, by adding a tunable N to only share when suffix length N, we can let caller make reasonable tradeoffs. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3289) FST should allow controlling how hard builder tries to share suffixes
[ https://issues.apache.org/jira/browse/LUCENE-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13061512#comment-13061512 ] Dawid Weiss commented on LUCENE-3289: - Exactly. This is a very specific use case (long suggestions). FST should allow controlling how hard builder tries to share suffixes - Key: LUCENE-3289 URL: https://issues.apache.org/jira/browse/LUCENE-3289 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.4, 4.0 Attachments: LUCENE-3289.patch, LUCENE-3289.patch Today we have a boolean option to the FST builder telling it whether it should share suffixes. If you turn this off, building is much faster, uses much less RAM, and the resulting FST is a prefix trie. But, the FST is larger than it needs to be. When it's on, the builder maintains a node hash holding every node seen so far in the FST -- this uses up RAM and slows things down. On a dataset that Elmer (see java-user thread Autocompletion on large index on Jul 6 2011) provided (thank you!), which is 1.32 M titles avg 67.3 chars per title, building with suffix sharing on took 22.5 seconds, required 1.25 GB heap, and produced 91.6 MB FST. With suffix sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST. I think we should allow this boolean to be shade-of-gray instead: usually, how well suffixes can share is a function of how far they are from the end of the string, so, by adding a tunable N to only share when suffix length N, we can let caller make reasonable tradeoffs. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org