[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
[ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13019932#comment-13019932 ] Robert Muir commented on SOLR-2378: --- Just an idea: should we default to this implementation in trunk? It seems to be a significant reduction in RAM. FST-based Lookup (suggestions) for prefix matches. -- Key: SOLR-2378 URL: https://issues.apache.org/jira/browse/SOLR-2378 Project: Solr Issue Type: New Feature Components: spellchecker Reporter: Dawid Weiss Assignee: Dawid Weiss Labels: lookup, prefix Fix For: 4.0 Attachments: SOLR-2378.patch Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST package). This issue is for implementing a relatively basic prefix matcher, we will handle infixes and other types of input matches gradually. Impl. phases: - -write a DFA based suggester effectively identical to ternary tree based solution right now,- - -baseline benchmark against tern. tree (memory consumption, rebuilding speed, indexing speed; reuse Andrzej's benchmark code)- - -modify DFA to encode term weights directly in the automaton (optimize for onlyMostPopular case)- - -benchmark again- - -benchmark again- - -modify the tutorial on the wiki- [http://wiki.apache.org/solr/Suggester] -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
[ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016805#comment-13016805 ] Dawid Weiss commented on SOLR-2378: --- Well spotted, Robert -- indeed, three-byte codepoints were throwing automaton exceptions. I've added a test for this. I also added exact match promotion to the top of the suggestions list, regardless of the score of the exact match. This is controlled by a final flag at the moment... maybe it should become a parameter, I don't know. FST-based Lookup (suggestions) for prefix matches. -- Key: SOLR-2378 URL: https://issues.apache.org/jira/browse/SOLR-2378 Project: Solr Issue Type: New Feature Components: spellchecker Reporter: Dawid Weiss Assignee: Dawid Weiss Labels: lookup, prefix Fix For: 4.0 Attachments: SOLR-2378.patch Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST package). This issue is for implementing a relatively basic prefix matcher, we will handle infixes and other types of input matches gradually. Impl. phases: - -write a DFA based suggester effectively identical to ternary tree based solution right now,- - -baseline benchmark against tern. tree (memory consumption, rebuilding speed, indexing speed; reuse Andrzej's benchmark code)- - -modify DFA to encode term weights directly in the automaton (optimize for onlyMostPopular case)- - -benchmark again- - add infix suggestion support with prefix matches boosted higher (?) - benchmark again - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester] -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
[ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016820#comment-13016820 ] Dawid Weiss commented on SOLR-2378: --- Ok, updated patch. The only thing I would like to add is a real Solr handler test, much like SuggesterTest. Should I add a separate test class or simply add another handler to that config file and test methods to SuggesterTest? Also, this one puzzled me: {code} threshold = config.get(THRESHOLD_TOKEN_FREQUENCY) == null ? 0.0f : (Float)config.get(THRESHOLD_TOKEN_FREQUENCY); {code} What are the conversion rules for NamedList that SolrSpellChecker gets in init()? I have an Integer parameter, but didn't check what is returned for, say, 12 (String, Float, Integer?). FST-based Lookup (suggestions) for prefix matches. -- Key: SOLR-2378 URL: https://issues.apache.org/jira/browse/SOLR-2378 Project: Solr Issue Type: New Feature Components: spellchecker Reporter: Dawid Weiss Assignee: Dawid Weiss Labels: lookup, prefix Fix For: 4.0 Attachments: SOLR-2378.patch Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST package). This issue is for implementing a relatively basic prefix matcher, we will handle infixes and other types of input matches gradually. Impl. phases: - -write a DFA based suggester effectively identical to ternary tree based solution right now,- - -baseline benchmark against tern. tree (memory consumption, rebuilding speed, indexing speed; reuse Andrzej's benchmark code)- - -modify DFA to encode term weights directly in the automaton (optimize for onlyMostPopular case)- - -benchmark again- - add infix suggestion support with prefix matches boosted higher (?) - benchmark again - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester] -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
[ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016372#comment-13016372 ] Dawid Weiss commented on SOLR-2378: --- I've been waiting for somebody to look at this patch, guys, just to confirm that everything is fine with it. If so, I'd like to commit it in and move on to infix suggestions support maybe. FST-based Lookup (suggestions) for prefix matches. -- Key: SOLR-2378 URL: https://issues.apache.org/jira/browse/SOLR-2378 Project: Solr Issue Type: New Feature Components: spellchecker Reporter: Dawid Weiss Assignee: Dawid Weiss Labels: lookup, prefix Fix For: 4.0 Attachments: SOLR-2378.patch Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST package). This issue is for implementing a relatively basic prefix matcher, we will handle infixes and other types of input matches gradually. Impl. phases: - -write a DFA based suggester effectively identical to ternary tree based solution right now,- - -baseline benchmark against tern. tree (memory consumption, rebuilding speed, indexing speed; reuse Andrzej's benchmark code)- - -modify DFA to encode term weights directly in the automaton (optimize for onlyMostPopular case)- - -benchmark again- - add infix suggestion support with prefix matches boosted higher (?) - benchmark again - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester] -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
[ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016378#comment-13016378 ] Robert Muir commented on SOLR-2378: --- Took a quick look: Builder.add(char[], int, int, ..) adds codepoints (Character.codePointAt/Character.charCount) [utf-32 order] but the comparator you use when building the automaton compares characters [utf-16 order]. so if someone has a term in the supplementary range in their index, the order will be inconsistent. So I think the comparator should just compare codepoints (it should iterate with codePointAt/charCount too)? FST-based Lookup (suggestions) for prefix matches. -- Key: SOLR-2378 URL: https://issues.apache.org/jira/browse/SOLR-2378 Project: Solr Issue Type: New Feature Components: spellchecker Reporter: Dawid Weiss Assignee: Dawid Weiss Labels: lookup, prefix Fix For: 4.0 Attachments: SOLR-2378.patch Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST package). This issue is for implementing a relatively basic prefix matcher, we will handle infixes and other types of input matches gradually. Impl. phases: - -write a DFA based suggester effectively identical to ternary tree based solution right now,- - -baseline benchmark against tern. tree (memory consumption, rebuilding speed, indexing speed; reuse Andrzej's benchmark code)- - -modify DFA to encode term weights directly in the automaton (optimize for onlyMostPopular case)- - -benchmark again- - add infix suggestion support with prefix matches boosted higher (?) - benchmark again - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester] -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
[ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016381#comment-13016381 ] Yonik Seeley commented on SOLR-2378: If it causes too much of a lookup performance hit, the Builder could just build in utf-16 order too? FST-based Lookup (suggestions) for prefix matches. -- Key: SOLR-2378 URL: https://issues.apache.org/jira/browse/SOLR-2378 Project: Solr Issue Type: New Feature Components: spellchecker Reporter: Dawid Weiss Assignee: Dawid Weiss Labels: lookup, prefix Fix For: 4.0 Attachments: SOLR-2378.patch Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST package). This issue is for implementing a relatively basic prefix matcher, we will handle infixes and other types of input matches gradually. Impl. phases: - -write a DFA based suggester effectively identical to ternary tree based solution right now,- - -baseline benchmark against tern. tree (memory consumption, rebuilding speed, indexing speed; reuse Andrzej's benchmark code)- - -modify DFA to encode term weights directly in the automaton (optimize for onlyMostPopular case)- - -benchmark again- - add infix suggestion support with prefix matches boosted higher (?) - benchmark again - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester] -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
[ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016384#comment-13016384 ] Robert Muir commented on SOLR-2378: --- I am referring to build-time, not runtime here. run-time can handle supplementary characters wrong and I wouldn't object to committing it, but currently if someone has terms 0x in their index it will preventing the FST from being built at all and suggesting will not work? (as i think the FST will throw exc?) FST-based Lookup (suggestions) for prefix matches. -- Key: SOLR-2378 URL: https://issues.apache.org/jira/browse/SOLR-2378 Project: Solr Issue Type: New Feature Components: spellchecker Reporter: Dawid Weiss Assignee: Dawid Weiss Labels: lookup, prefix Fix For: 4.0 Attachments: SOLR-2378.patch Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST package). This issue is for implementing a relatively basic prefix matcher, we will handle infixes and other types of input matches gradually. Impl. phases: - -write a DFA based suggester effectively identical to ternary tree based solution right now,- - -baseline benchmark against tern. tree (memory consumption, rebuilding speed, indexing speed; reuse Andrzej's benchmark code)- - -modify DFA to encode term weights directly in the automaton (optimize for onlyMostPopular case)- - -benchmark again- - add infix suggestion support with prefix matches boosted higher (?) - benchmark again - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester] -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
[ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016481#comment-13016481 ] Dawid Weiss commented on SOLR-2378: --- Oh, right -- I didn't peek at the inside of Builder.add(char[],...), but I will verify this by trying to add something that has multilingual stuff in it -- will update the patch tomorrow, hopefully. I would also love to have somebody who actually uses suggestions to try to compile it and use it on a production data set to see if my benchmark was approximately right with respect to the speed differences between the different available implementations. FST-based Lookup (suggestions) for prefix matches. -- Key: SOLR-2378 URL: https://issues.apache.org/jira/browse/SOLR-2378 Project: Solr Issue Type: New Feature Components: spellchecker Reporter: Dawid Weiss Assignee: Dawid Weiss Labels: lookup, prefix Fix For: 4.0 Attachments: SOLR-2378.patch Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST package). This issue is for implementing a relatively basic prefix matcher, we will handle infixes and other types of input matches gradually. Impl. phases: - -write a DFA based suggester effectively identical to ternary tree based solution right now,- - -baseline benchmark against tern. tree (memory consumption, rebuilding speed, indexing speed; reuse Andrzej's benchmark code)- - -modify DFA to encode term weights directly in the automaton (optimize for onlyMostPopular case)- - -benchmark again- - add infix suggestion support with prefix matches boosted higher (?) - benchmark again - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester] -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
[ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13015902#comment-13015902 ] Dawid Weiss commented on SOLR-2378: --- This is a git patch (trim the first level to apply without git, should work out of the box) against the trunk that implements FST-based suggester. I've also added a more realistic benchmark that shows performance limitations of the current suggester implementations, especially in the onlyMorePopular mode. Some benchmarks from my machine: Time to create internal data structures out of 50.000 suggestion terms (692kB of UTF8): {noformat} JaspellLookup input: 50,000, time[ms]: 30 [+- 4.93] TSTLookup input: 50,000, time[ms]: 39 [+- 2.56] FSTLookup input: 50,000, time[ms]: 62 [+- 2.52] {noformat} Memory used: {noformat} JaspellLookup size[B]:7,868,863 TSTLookup size[B]:7,914,144 FSTLookup size[B]: 300,148 {noformat} Traversal speed (randomized prefixes of input sequences). We start with full matches: {noformat} -- prefixes: 100-200, num: 7, onlyMorePopular: true JaspellLookup queries: 5, time[ms]: 108 [+- 6.23], ~qps: 462 TSTLookup queries: 5, time[ms]: 54 [+- 1.54], ~qps: 922 FSTLookup queries: 5, time[ms]: 66 [+- 1.30], ~qps: 761 {noformat} These results are overly optimistic because users rarely type in something in full; let's cut the prefix length to 6-9 chars: {noformat} -- prefixes: 6-9, num: 7, onlyMorePopular: true JaspellLookup queries: 5, time[ms]: 155 [+- 4.20], ~qps: 322 TSTLookup queries: 5, time[ms]: 208 [+- 3.99], ~qps: 241 FSTLookup queries: 5, time[ms]: 71 [+- 1.36], ~qps: 700 {noformat} Clearly, the FST is more resilient to the length of the input prefix... let's cut it to 2-4 characters: {noformat} -- prefixes: 2-4, num: 7, onlyMorePopular: true JaspellLookup queries: 5, time[ms]: 420 [+- 5.37], ~qps: 119 TSTLookup queries: 5, time[ms]: 1687 [+- 10.96], ~qps: 30 FSTLookup queries: 5, time[ms]: 90 [+- 2.27], ~qps: 554 {noformat} I didn't have the time to look into it, but TST's result is surprisingly bad with such short prefixes. FST keeps nearly the same perf. level. In the current implementation I throw an exception if somebody tries to get suggestions from FST without sorting by weights (this is equivalent to building the suggester with a single weight for all terms). This could be implemented, but at a small performance penalty -- to be considered if you think it is useful. The above performance problems for short prefixes are interestingly present even with onlyMorePopular set to false: {noformat} -- prefixes: 100-200, num: 7, onlyMorePopular: false JaspellLookup queries: 5, time[ms]: 94 [+- 6.45], ~qps: 532 TSTLookup queries: 5, time[ms]: 46 [+- 0.98], ~qps: 1092 -- prefixes: 6-9, num: 7, onlyMorePopular: false JaspellLookup queries: 5, time[ms]: 123 [+- 3.12], ~qps: 405 TSTLookup queries: 5, time[ms]: 188 [+- 3.03], ~qps: 266 -- prefixes: 2-4, num: 7, onlyMorePopular: false JaspellLookup queries: 5, time[ms]: 225 [+- 5.69], ~qps: 222 TSTLookup queries: 5, time[ms]: 1523 [+- 16.69], ~qps: 33 {noformat} Peek at the code and let me know if you can think of any improvements/ modifications, especially wrt Solr infrastructure (I specifically didn't implement any real solr instance test case). FST-based Lookup (suggestions) for prefix matches. -- Key: SOLR-2378 URL: https://issues.apache.org/jira/browse/SOLR-2378 Project: Solr Issue Type: New Feature Components: spellchecker Reporter: Dawid Weiss Assignee: Dawid Weiss Labels: lookup, prefix Fix For: 4.0 Attachments: SOLR-2378.patch Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST package). This issue is for implementing a relatively basic prefix matcher, we will handle infixes and other types of input matches gradually. Impl. phases: - -write a DFA based suggester effectively identical to ternary tree based solution right now,- - -baseline benchmark against tern. tree (memory consumption, rebuilding speed, indexing speed; reuse Andrzej's benchmark code)- - -modify DFA to encode term weights directly in the automaton (optimize for onlyMostPopular case)- - -benchmark again- - add infix suggestion support with prefix matches boosted higher (?) - benchmark again - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester] -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail:
[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
[ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13015627#comment-13015627 ] Dawid Weiss commented on SOLR-2378: --- I'm done for the day. Lots of benchmarking and tweaking. Conclusions: - Use raw utf16 values of input strings as they are ready to use in the input Strings and don't need to be converted. The automaton is slightly bigger, but not that much (read: most likely because English ascii chars fall into single byte vint range anyway). - I created a non-recursive and recursive version of the main tail-state traversal loop, but the gain is minimal. - I have a relatively fast machine (i7, quad core) and parallel GC combined with tlabs makes local allocations nearly zero-cost. Locally reused structures speed up by 5%, but increase code complexity greatly. - The benchmark currently in the codebase is very skewed. I tried on real life data (that I cannot share, unfortunately) and the results are: {noformat} TSTLookupsize[B]=12,598,794 FSTLookupsize[B]=472,679 * Running 10 iterations for TSTLookup ... - warm-up 5 iterations... - main iterations: TSTLookupbuildTime[ms]=19 lookupTime[ms]=4,813 * Running 10 iterations for FSTLookup ... - warm-up 5 iterations... - main iterations: FSTLookupbuildTime[ms]=68 lookupTime[ms]=263 {noformat} It looks as if I've made a mistake, but not really. With the actual data the fan-out rate is much higher than on long numbers. Also, real terms are shorter than longs and 75% of their prefix is usually a 2-3 letter combination resulting in lots of matches that need to be passed through the priority queue, etc. With the FST weights are part of the search structure (approximated weights, to be exact) and so the cost of building it is higher, but the gain at runtime is substantial. The lookup time should be virtually constant (with very small deviation). - JaspellLookup does not pass the tests if the input prefix has a space inside (?). TSTLookup and FSTLookup work just fine. - I compared against automata API in Morfologik; the speedup is about 30% (only ints are passed, no Arc object decompression, etc.). But then -- the API of Lucene is more flexible. I'll clean up the code and create a final patch for testing/ evaluation tomorrow. FST-based Lookup (suggestions) for prefix matches. -- Key: SOLR-2378 URL: https://issues.apache.org/jira/browse/SOLR-2378 Project: Solr Issue Type: New Feature Components: spellchecker Reporter: Dawid Weiss Assignee: Dawid Weiss Labels: lookup, prefix Fix For: 4.0 Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST package). This issue is for implementing a relatively basic prefix matcher, we will handle infixes and other types of input matches gradually. Impl. phases: - write a DFA based suggester effectively identical to ternary tree based solution right now, - baseline benchmark against tern. tree (memory consumption, rebuilding speed, indexing speed; reuse Andrzej's benchmark code) - modify DFA to encode term weights directly in the automaton (optimize for onlyMostPopular case) - benchmark again - add infix suggestion support with prefix matches boosted higher (?) - benchmark again - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester] -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
[ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13015631#comment-13015631 ] Michael McCandless commented on SOLR-2378: -- Wow those improvements are awesome -- FST 26.7X smaller RAM footprint, 18X faster lookups, but build time is 3.6X slower. This is built on a composite reader, right? Does the build time include the time to enum the terms from MultiTermsEnum? FST-based Lookup (suggestions) for prefix matches. -- Key: SOLR-2378 URL: https://issues.apache.org/jira/browse/SOLR-2378 Project: Solr Issue Type: New Feature Components: spellchecker Reporter: Dawid Weiss Assignee: Dawid Weiss Labels: lookup, prefix Fix For: 4.0 Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST package). This issue is for implementing a relatively basic prefix matcher, we will handle infixes and other types of input matches gradually. Impl. phases: - write a DFA based suggester effectively identical to ternary tree based solution right now, - baseline benchmark against tern. tree (memory consumption, rebuilding speed, indexing speed; reuse Andrzej's benchmark code) - modify DFA to encode term weights directly in the automaton (optimize for onlyMostPopular case) - benchmark again - add infix suggestion support with prefix matches boosted higher (?) - benchmark again - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester] -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
[ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13015634#comment-13015634 ] Dawid Weiss commented on SOLR-2378: --- The build time needs to sort the input again (and create it in the first place). Because Lookup API assumes suggestion keywords can come from a variety of sources there is no guarantee they will be sorted, so we need to sort them before we can build the automaton. Still, I think the numbers are acceptable... if you need on-line construction of these suggestions you'll pick TST (it can add new keywords to the structure dynamically); for a batch-load suggester you'd pick the FST one. It is also very likely that I overlooked something that could bring those numbers down, I'll create a clean patch tomorrow, so everything will be out there for improving. FST-based Lookup (suggestions) for prefix matches. -- Key: SOLR-2378 URL: https://issues.apache.org/jira/browse/SOLR-2378 Project: Solr Issue Type: New Feature Components: spellchecker Reporter: Dawid Weiss Assignee: Dawid Weiss Labels: lookup, prefix Fix For: 4.0 Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST package). This issue is for implementing a relatively basic prefix matcher, we will handle infixes and other types of input matches gradually. Impl. phases: - write a DFA based suggester effectively identical to ternary tree based solution right now, - baseline benchmark against tern. tree (memory consumption, rebuilding speed, indexing speed; reuse Andrzej's benchmark code) - modify DFA to encode term weights directly in the automaton (optimize for onlyMostPopular case) - benchmark again - add infix suggestion support with prefix matches boosted higher (?) - benchmark again - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester] -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
[ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13015011#comment-13015011 ] Dawid Weiss commented on SOLR-2378: --- Soliciting feedback on the following questions: - suggesters currently have float weights associated with terms; can these floats be bucketed and returned in approximation or do they need to be exact copies of the input? For automata, bucketed weights (to, let's say 5-10 different values) provide terrific speed/size improvements, so if this is not a rigid requirement, I'd use them. - is there any info on threading of Solr components? I am in particular looking for mutable object fields in the suggester (can a single suggester instance be accessed by multiple threads at the same time)? I've implemented preliminary FST-based lookup (without weights yet). Speed-wise it doesn't rock because data is converted to/from utf8 on input/output and sorted during construction, but it is still acceptable, even at this early stage I think: {noformat} JaspellLookupbuildTime[ms]=112 lookupTime[ms]=288 TSTLookupbuildTime[ms]=115 lookupTime[ms]=103 FSTLookupbuildTime[ms]=464 lookupTime[ms]=145 {noformat} now... that was speed only, check out the in-memory size :) {noformat} JaspellLookupsize[B]=81,078,997 TSTLookupsize[B]=53,453,696 FSTLookupsize[B]=2,909,396 {noformat} (This benchmark stores very limited vocabulary items -- long numbers only, so it is skewed from reality, but it's nice to see something like this, huh?). FST-based Lookup (suggestions) for prefix matches. -- Key: SOLR-2378 URL: https://issues.apache.org/jira/browse/SOLR-2378 Project: Solr Issue Type: New Feature Components: spellchecker Reporter: Dawid Weiss Assignee: Dawid Weiss Labels: lookup, prefix Fix For: 4.0 Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST package). This issue is for implementing a relatively basic prefix matcher, we will handle infixes and other types of input matches gradually. Impl. phases: - write a DFA based suggester effectively identical to ternary tree based solution right now, - baseline benchmark against tern. tree (memory consumption, rebuilding speed, indexing speed; reuse Andrzej's benchmark code) - modify DFA to encode term weights directly in the automaton (optimize for onlyMostPopular case) - benchmark again - add infix suggestion support with prefix matches boosted higher (?) - benchmark again - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester] -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
[ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13015053#comment-13015053 ] Ryan McKinley commented on SOLR-2378: - bq. is there any info on threading of Solr components? I am in particular looking for mutable object fields in the suggester (can a single suggester instance be accessed by multiple threads at the same time)? Components are initalized at startup and the same instance is used for every request (multi-threaded) If you need to use the same obejects in prepare and process, you can either put then in the request context map, or add something to ResponseBuilder (perhaps better if this is widely used) FST-based Lookup (suggestions) for prefix matches. -- Key: SOLR-2378 URL: https://issues.apache.org/jira/browse/SOLR-2378 Project: Solr Issue Type: New Feature Components: spellchecker Reporter: Dawid Weiss Assignee: Dawid Weiss Labels: lookup, prefix Fix For: 4.0 Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST package). This issue is for implementing a relatively basic prefix matcher, we will handle infixes and other types of input matches gradually. Impl. phases: - write a DFA based suggester effectively identical to ternary tree based solution right now, - baseline benchmark against tern. tree (memory consumption, rebuilding speed, indexing speed; reuse Andrzej's benchmark code) - modify DFA to encode term weights directly in the automaton (optimize for onlyMostPopular case) - benchmark again - add infix suggestion support with prefix matches boosted higher (?) - benchmark again - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester] -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
[ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13013932#comment-13013932 ] Dawid Weiss commented on SOLR-2378: --- I didn't have time to take care of this until now, apologies. So, looking at Lookup#lookup(), I just wanted to clarify: {code} /** * Look up a key and return possible completion for this key. * @param key lookup key. Depending on the implementation this may be * a prefix, misspelling, or even infix. * @param onlyMorePopular return only more popular results * @param num maximum number of results to return * @return a list of possible completions, with their relative weight (e.g. popularity) */ public abstract ListLookupResult lookup(String key, boolean onlyMorePopular, int num); {code} the onlyMorePopular means more popular than... what? I see TSTLookup and JaspellLookup (Andrzej, will you confirm, please?) sorts matches in a priority queue by their associated value (frequency I guess). This makes sense, but onlyMorePopular is misleading -- it should be called onlyMostPopular (those with the native knowledge of English subtlieties, speak up if I'm right here). I also see and wanted to confirm -- the Dictionary can come from various sources, so we can't rely on the presence of the built-in Lucene automaton, can we? Even if I wanted to reuse it, there'd be no easy way to determine if it's a full automaton, or a partial one (because of the gaps/trimming)... I think I'll just implement the solution by building the automaton from whatever Dictionary comes in and serializing/ deserializing it similar to TSTLookup. Sounds ok? FST-based Lookup (suggestions) for prefix matches. -- Key: SOLR-2378 URL: https://issues.apache.org/jira/browse/SOLR-2378 Project: Solr Issue Type: New Feature Components: spellchecker Reporter: Dawid Weiss Assignee: Dawid Weiss Labels: lookup, prefix Fix For: 4.0 Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST package). This issue is for implementing a relatively basic prefix matcher, we will handle infixes and other types of input matches gradually. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
[ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13013933#comment-13013933 ] Andrzej Bialecki commented on SOLR-2378: - bq. I see TSTLookup and JaspellLookup (Andrzej, will you confirm, please?) sorts matches in a priority queue by their associated value (frequency I guess) Correct. I agree that the name is so-so, I inherited it from the spellchecker API - feel free to change it. bq. the Dictionary can come from various sources, ... Yes. This is again a legacy of the Lucene SpellChecker API. I tried to extend this API in Solr without changing it in Lucene (see the *IteratorWrapper classes and TermFreqIterator) but ultimately it would be better to refactor this. FST-based Lookup (suggestions) for prefix matches. -- Key: SOLR-2378 URL: https://issues.apache.org/jira/browse/SOLR-2378 Project: Solr Issue Type: New Feature Components: spellchecker Reporter: Dawid Weiss Assignee: Dawid Weiss Labels: lookup, prefix Fix For: 4.0 Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST package). This issue is for implementing a relatively basic prefix matcher, we will handle infixes and other types of input matches gradually. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
[ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12997702#comment-12997702 ] Dawid Weiss commented on SOLR-2378: --- Hi guys. I probably need those admin rights on SOLR's JIRA project as well because I can't assign this to myself. FST-based Lookup (suggestions) for prefix matches. -- Key: SOLR-2378 URL: https://issues.apache.org/jira/browse/SOLR-2378 Project: Solr Issue Type: New Feature Components: spellchecker Reporter: Dawid Weiss Labels: lookup, prefix Fix For: 4.0 Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST package). This issue is for implementing a relatively basic prefix matcher, we will handle infixes and other types of input matches gradually. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
[ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12997703#comment-12997703 ] Uwe Schindler commented on SOLR-2378: - This time I cannot help - I don't have admin rights in SOLR! In my opinion, we should merge the user rights of both projects. FST-based Lookup (suggestions) for prefix matches. -- Key: SOLR-2378 URL: https://issues.apache.org/jira/browse/SOLR-2378 Project: Solr Issue Type: New Feature Components: spellchecker Reporter: Dawid Weiss Labels: lookup, prefix Fix For: 4.0 Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST package). This issue is for implementing a relatively basic prefix matcher, we will handle infixes and other types of input matches gradually. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
[ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12997786#comment-12997786 ] Mark Miller commented on SOLR-2378: --- bq. Hi guys. I probably need those admin rights on SOLR's JIRA project as well because I can't assign this to myself You should have a committer role now. FST-based Lookup (suggestions) for prefix matches. -- Key: SOLR-2378 URL: https://issues.apache.org/jira/browse/SOLR-2378 Project: Solr Issue Type: New Feature Components: spellchecker Reporter: Dawid Weiss Labels: lookup, prefix Fix For: 4.0 Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST package). This issue is for implementing a relatively basic prefix matcher, we will handle infixes and other types of input matches gradually. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org