[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.

2011-04-14 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13019932#comment-13019932
 ] 

Robert Muir commented on SOLR-2378:
---

Just an idea: should we default to this implementation in trunk?
It seems to be a significant reduction in RAM.


 FST-based Lookup (suggestions) for prefix matches.
 --

 Key: SOLR-2378
 URL: https://issues.apache.org/jira/browse/SOLR-2378
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Reporter: Dawid Weiss
Assignee: Dawid Weiss
  Labels: lookup, prefix
 Fix For: 4.0

 Attachments: SOLR-2378.patch


 Implement a subclass of Lookup based on finite state automata/ transducers 
 (Lucene FST package). This issue is for implementing a relatively basic 
 prefix matcher, we will handle infixes and other types of input matches 
 gradually. Impl. phases:
 - -write a DFA based suggester effectively identical to ternary tree based 
 solution right now,-
 - -baseline benchmark against tern. tree (memory consumption, rebuilding 
 speed, indexing speed; reuse Andrzej's benchmark code)-
 - -modify DFA to encode term weights directly in the automaton (optimize for 
 onlyMostPopular case)-
 - -benchmark again-
 - -benchmark again-
 - -modify the tutorial on the wiki- [http://wiki.apache.org/solr/Suggester]

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.

2011-04-07 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016805#comment-13016805
 ] 

Dawid Weiss commented on SOLR-2378:
---

Well spotted, Robert -- indeed, three-byte codepoints were throwing automaton 
exceptions. I've added a test for this. I also added exact match promotion to 
the top of the suggestions list, regardless of the score of the exact match. 
This is controlled by a final flag at the moment... maybe it should become a 
parameter, I don't know.

 FST-based Lookup (suggestions) for prefix matches.
 --

 Key: SOLR-2378
 URL: https://issues.apache.org/jira/browse/SOLR-2378
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Reporter: Dawid Weiss
Assignee: Dawid Weiss
  Labels: lookup, prefix
 Fix For: 4.0

 Attachments: SOLR-2378.patch


 Implement a subclass of Lookup based on finite state automata/ transducers 
 (Lucene FST package). This issue is for implementing a relatively basic 
 prefix matcher, we will handle infixes and other types of input matches 
 gradually. Impl. phases:
 - -write a DFA based suggester effectively identical to ternary tree based 
 solution right now,-
 - -baseline benchmark against tern. tree (memory consumption, rebuilding 
 speed, indexing speed; reuse Andrzej's benchmark code)-
 - -modify DFA to encode term weights directly in the automaton (optimize for 
 onlyMostPopular case)-
 - -benchmark again-
 - add infix suggestion support with prefix matches boosted higher (?)
 - benchmark again
 - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester]

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.

2011-04-07 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016820#comment-13016820
 ] 

Dawid Weiss commented on SOLR-2378:
---

Ok, updated patch. The only thing I would like to add is a real Solr handler 
test, much like SuggesterTest. Should I add a separate test class or simply add 
another handler to that config file and test methods to SuggesterTest?

Also, this one puzzled me:
{code}
threshold = config.get(THRESHOLD_TOKEN_FREQUENCY) == null ? 0.0f
: (Float)config.get(THRESHOLD_TOKEN_FREQUENCY);
{code}
What are the conversion rules for NamedList that SolrSpellChecker gets in 
init()? I have an Integer parameter, but didn't check what is returned for, 
say, 12 (String, Float, Integer?).

 FST-based Lookup (suggestions) for prefix matches.
 --

 Key: SOLR-2378
 URL: https://issues.apache.org/jira/browse/SOLR-2378
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Reporter: Dawid Weiss
Assignee: Dawid Weiss
  Labels: lookup, prefix
 Fix For: 4.0

 Attachments: SOLR-2378.patch


 Implement a subclass of Lookup based on finite state automata/ transducers 
 (Lucene FST package). This issue is for implementing a relatively basic 
 prefix matcher, we will handle infixes and other types of input matches 
 gradually. Impl. phases:
 - -write a DFA based suggester effectively identical to ternary tree based 
 solution right now,-
 - -baseline benchmark against tern. tree (memory consumption, rebuilding 
 speed, indexing speed; reuse Andrzej's benchmark code)-
 - -modify DFA to encode term weights directly in the automaton (optimize for 
 onlyMostPopular case)-
 - -benchmark again-
 - add infix suggestion support with prefix matches boosted higher (?)
 - benchmark again
 - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester]

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.

2011-04-06 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016372#comment-13016372
 ] 

Dawid Weiss commented on SOLR-2378:
---

I've been waiting for somebody to look at this patch, guys, just to confirm 
that everything is fine with it. If so, I'd like to commit it in and move on to 
infix suggestions support maybe.

 FST-based Lookup (suggestions) for prefix matches.
 --

 Key: SOLR-2378
 URL: https://issues.apache.org/jira/browse/SOLR-2378
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Reporter: Dawid Weiss
Assignee: Dawid Weiss
  Labels: lookup, prefix
 Fix For: 4.0

 Attachments: SOLR-2378.patch


 Implement a subclass of Lookup based on finite state automata/ transducers 
 (Lucene FST package). This issue is for implementing a relatively basic 
 prefix matcher, we will handle infixes and other types of input matches 
 gradually. Impl. phases:
 - -write a DFA based suggester effectively identical to ternary tree based 
 solution right now,-
 - -baseline benchmark against tern. tree (memory consumption, rebuilding 
 speed, indexing speed; reuse Andrzej's benchmark code)-
 - -modify DFA to encode term weights directly in the automaton (optimize for 
 onlyMostPopular case)-
 - -benchmark again-
 - add infix suggestion support with prefix matches boosted higher (?)
 - benchmark again
 - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester]

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.

2011-04-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016378#comment-13016378
 ] 

Robert Muir commented on SOLR-2378:
---

Took a quick look:

Builder.add(char[], int, int, ..) adds codepoints 
(Character.codePointAt/Character.charCount) [utf-32 order] but the comparator 
you use when building the automaton compares characters [utf-16 order]. so if 
someone has a term in the supplementary range in their index, the order will be 
inconsistent.

So I think the comparator should just compare codepoints (it should iterate 
with codePointAt/charCount too)?


 FST-based Lookup (suggestions) for prefix matches.
 --

 Key: SOLR-2378
 URL: https://issues.apache.org/jira/browse/SOLR-2378
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Reporter: Dawid Weiss
Assignee: Dawid Weiss
  Labels: lookup, prefix
 Fix For: 4.0

 Attachments: SOLR-2378.patch


 Implement a subclass of Lookup based on finite state automata/ transducers 
 (Lucene FST package). This issue is for implementing a relatively basic 
 prefix matcher, we will handle infixes and other types of input matches 
 gradually. Impl. phases:
 - -write a DFA based suggester effectively identical to ternary tree based 
 solution right now,-
 - -baseline benchmark against tern. tree (memory consumption, rebuilding 
 speed, indexing speed; reuse Andrzej's benchmark code)-
 - -modify DFA to encode term weights directly in the automaton (optimize for 
 onlyMostPopular case)-
 - -benchmark again-
 - add infix suggestion support with prefix matches boosted higher (?)
 - benchmark again
 - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester]

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.

2011-04-06 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016381#comment-13016381
 ] 

Yonik Seeley commented on SOLR-2378:


If it causes too much of a lookup performance hit, the Builder could just build 
in utf-16 order too?

 FST-based Lookup (suggestions) for prefix matches.
 --

 Key: SOLR-2378
 URL: https://issues.apache.org/jira/browse/SOLR-2378
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Reporter: Dawid Weiss
Assignee: Dawid Weiss
  Labels: lookup, prefix
 Fix For: 4.0

 Attachments: SOLR-2378.patch


 Implement a subclass of Lookup based on finite state automata/ transducers 
 (Lucene FST package). This issue is for implementing a relatively basic 
 prefix matcher, we will handle infixes and other types of input matches 
 gradually. Impl. phases:
 - -write a DFA based suggester effectively identical to ternary tree based 
 solution right now,-
 - -baseline benchmark against tern. tree (memory consumption, rebuilding 
 speed, indexing speed; reuse Andrzej's benchmark code)-
 - -modify DFA to encode term weights directly in the automaton (optimize for 
 onlyMostPopular case)-
 - -benchmark again-
 - add infix suggestion support with prefix matches boosted higher (?)
 - benchmark again
 - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester]

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.

2011-04-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016384#comment-13016384
 ] 

Robert Muir commented on SOLR-2378:
---

I am referring to build-time, not runtime here.

run-time can handle supplementary characters wrong and I wouldn't object to 
committing it,
but currently if someone has terms  0x in their index it will preventing 
the FST from
being built at all and suggesting will not work? (as i think the FST will throw 
exc?) 


 FST-based Lookup (suggestions) for prefix matches.
 --

 Key: SOLR-2378
 URL: https://issues.apache.org/jira/browse/SOLR-2378
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Reporter: Dawid Weiss
Assignee: Dawid Weiss
  Labels: lookup, prefix
 Fix For: 4.0

 Attachments: SOLR-2378.patch


 Implement a subclass of Lookup based on finite state automata/ transducers 
 (Lucene FST package). This issue is for implementing a relatively basic 
 prefix matcher, we will handle infixes and other types of input matches 
 gradually. Impl. phases:
 - -write a DFA based suggester effectively identical to ternary tree based 
 solution right now,-
 - -baseline benchmark against tern. tree (memory consumption, rebuilding 
 speed, indexing speed; reuse Andrzej's benchmark code)-
 - -modify DFA to encode term weights directly in the automaton (optimize for 
 onlyMostPopular case)-
 - -benchmark again-
 - add infix suggestion support with prefix matches boosted higher (?)
 - benchmark again
 - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester]

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.

2011-04-06 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016481#comment-13016481
 ] 

Dawid Weiss commented on SOLR-2378:
---

Oh, right -- I didn't peek at the inside of Builder.add(char[],...), but I will 
verify this by trying to add something that has multilingual stuff in it -- 
will update the patch tomorrow, hopefully. I would also love to have somebody 
who actually uses suggestions to try to compile it and use it on a production 
data set to see if my benchmark was approximately right with respect to the 
speed differences between the different available implementations.

 FST-based Lookup (suggestions) for prefix matches.
 --

 Key: SOLR-2378
 URL: https://issues.apache.org/jira/browse/SOLR-2378
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Reporter: Dawid Weiss
Assignee: Dawid Weiss
  Labels: lookup, prefix
 Fix For: 4.0

 Attachments: SOLR-2378.patch


 Implement a subclass of Lookup based on finite state automata/ transducers 
 (Lucene FST package). This issue is for implementing a relatively basic 
 prefix matcher, we will handle infixes and other types of input matches 
 gradually. Impl. phases:
 - -write a DFA based suggester effectively identical to ternary tree based 
 solution right now,-
 - -baseline benchmark against tern. tree (memory consumption, rebuilding 
 speed, indexing speed; reuse Andrzej's benchmark code)-
 - -modify DFA to encode term weights directly in the automaton (optimize for 
 onlyMostPopular case)-
 - -benchmark again-
 - add infix suggestion support with prefix matches boosted higher (?)
 - benchmark again
 - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester]

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.

2011-04-05 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13015902#comment-13015902
 ] 

Dawid Weiss commented on SOLR-2378:
---

This is a git patch (trim the first level to apply without git, should work out 
of the box) against the trunk that implements FST-based suggester. I've also 
added a more realistic benchmark that shows performance limitations of the 
current suggester implementations, especially in the onlyMorePopular mode. 
Some benchmarks from my machine:

Time to create internal data structures out of 50.000 suggestion terms (692kB 
of UTF8):
{noformat}
JaspellLookup   input: 50,000, time[ms]: 30 [+- 4.93]
TSTLookup   input: 50,000, time[ms]: 39 [+- 2.56]
FSTLookup   input: 50,000, time[ms]: 62 [+- 2.52]
{noformat}

Memory used:
{noformat}
JaspellLookup   size[B]:7,868,863
TSTLookup   size[B]:7,914,144
FSTLookup   size[B]:  300,148
{noformat}

Traversal speed (randomized prefixes of input sequences). We start with full 
matches:
{noformat}
-- prefixes: 100-200, num: 7, onlyMorePopular: true
JaspellLookup   queries: 5, time[ms]: 108 [+- 6.23], ~qps: 462
TSTLookup   queries: 5, time[ms]: 54 [+- 1.54], ~qps: 922
FSTLookup   queries: 5, time[ms]: 66 [+- 1.30], ~qps: 761
{noformat}

These results are overly optimistic because users rarely type in something in 
full; let's cut the prefix length to 6-9 chars:
{noformat}
-- prefixes: 6-9, num: 7, onlyMorePopular: true
JaspellLookup   queries: 5, time[ms]: 155 [+- 4.20], ~qps: 322
TSTLookup   queries: 5, time[ms]: 208 [+- 3.99], ~qps: 241
FSTLookup   queries: 5, time[ms]: 71 [+- 1.36], ~qps: 700
{noformat}

Clearly, the FST is more resilient to the length of the input prefix... let's 
cut it to 2-4 characters:
{noformat}
-- prefixes: 2-4, num: 7, onlyMorePopular: true
JaspellLookup   queries: 5, time[ms]: 420 [+- 5.37], ~qps: 119
TSTLookup   queries: 5, time[ms]: 1687 [+- 10.96], ~qps: 30
FSTLookup   queries: 5, time[ms]: 90 [+- 2.27], ~qps: 554
{noformat}

I didn't have the time to look into it, but TST's result is surprisingly bad 
with such short prefixes. FST keeps nearly the same perf. level.

In the current implementation I throw an exception if somebody tries to get 
suggestions from FST without sorting by weights (this is equivalent to building 
the suggester with a single weight for all terms). This could be implemented, 
but at a small performance penalty -- to be considered if you think it is 
useful. The above performance problems for short prefixes are interestingly 
present even with onlyMorePopular set to false:
{noformat}
-- prefixes: 100-200, num: 7, onlyMorePopular: false
JaspellLookup   queries: 5, time[ms]: 94 [+- 6.45], ~qps: 532
TSTLookup   queries: 5, time[ms]: 46 [+- 0.98], ~qps: 1092
-- prefixes: 6-9, num: 7, onlyMorePopular: false
JaspellLookup   queries: 5, time[ms]: 123 [+- 3.12], ~qps: 405
TSTLookup   queries: 5, time[ms]: 188 [+- 3.03], ~qps: 266
-- prefixes: 2-4, num: 7, onlyMorePopular: false
JaspellLookup   queries: 5, time[ms]: 225 [+- 5.69], ~qps: 222
TSTLookup   queries: 5, time[ms]: 1523 [+- 16.69], ~qps: 33
{noformat}

Peek at the code and let me know if you can think of any improvements/ 
modifications, especially wrt Solr infrastructure (I specifically didn't 
implement any real solr instance test case). 


 FST-based Lookup (suggestions) for prefix matches.
 --

 Key: SOLR-2378
 URL: https://issues.apache.org/jira/browse/SOLR-2378
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Reporter: Dawid Weiss
Assignee: Dawid Weiss
  Labels: lookup, prefix
 Fix For: 4.0

 Attachments: SOLR-2378.patch


 Implement a subclass of Lookup based on finite state automata/ transducers 
 (Lucene FST package). This issue is for implementing a relatively basic 
 prefix matcher, we will handle infixes and other types of input matches 
 gradually. Impl. phases:
 - -write a DFA based suggester effectively identical to ternary tree based 
 solution right now,-
 - -baseline benchmark against tern. tree (memory consumption, rebuilding 
 speed, indexing speed; reuse Andrzej's benchmark code)-
 - -modify DFA to encode term weights directly in the automaton (optimize for 
 onlyMostPopular case)-
 - -benchmark again-
 - add infix suggestion support with prefix matches boosted higher (?)
 - benchmark again
 - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester]

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: 

[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.

2011-04-04 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13015627#comment-13015627
 ] 

Dawid Weiss commented on SOLR-2378:
---

I'm done for the day. Lots of benchmarking and tweaking. Conclusions:
- Use raw utf16 values of input strings as they are ready to use in the input 
Strings and don't need to be converted. The automaton is slightly bigger, but 
not that much (read: most likely because English ascii chars fall into single 
byte vint range anyway).
- I created a non-recursive and recursive version of the main tail-state 
traversal loop, but the gain is minimal.
- I have a relatively fast machine (i7, quad core) and parallel GC combined 
with tlabs makes local allocations nearly zero-cost. Locally reused structures 
speed up by 5%, but increase code complexity greatly.
- The benchmark currently in the codebase is very skewed. I tried on real life 
data (that I cannot share, unfortunately) and the results are:
{noformat}
TSTLookupsize[B]=12,598,794
FSTLookupsize[B]=472,679
* Running 10 iterations for TSTLookup ...
  - warm-up 5 iterations...
  - main iterations:
TSTLookupbuildTime[ms]=19 lookupTime[ms]=4,813
* Running 10 iterations for FSTLookup ...
  - warm-up 5 iterations...
  - main iterations:
FSTLookupbuildTime[ms]=68 lookupTime[ms]=263
{noformat}
It looks as if I've made a mistake, but not really. With the actual data the 
fan-out rate is much higher than on long numbers. Also, real terms are shorter 
than longs and 75% of their prefix is usually a 2-3 letter combination 
resulting in lots of matches that need to be passed through the priority queue, 
etc. With the FST weights are part of the search structure (approximated 
weights, to be exact) and so the cost of building it is  higher, but the gain 
at runtime is substantial. The lookup time should be virtually constant (with 
very small deviation).
- JaspellLookup does not pass the tests if the input prefix has a space inside 
(?). TSTLookup and FSTLookup work just fine.
- I compared against automata API in Morfologik; the speedup is about 30% (only 
ints are passed, no Arc object decompression, etc.). But then -- the API of 
Lucene is more flexible.

I'll clean up the code and create a final patch for testing/ evaluation 
tomorrow.

 FST-based Lookup (suggestions) for prefix matches.
 --

 Key: SOLR-2378
 URL: https://issues.apache.org/jira/browse/SOLR-2378
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Reporter: Dawid Weiss
Assignee: Dawid Weiss
  Labels: lookup, prefix
 Fix For: 4.0


 Implement a subclass of Lookup based on finite state automata/ transducers 
 (Lucene FST package). This issue is for implementing a relatively basic 
 prefix matcher, we will handle infixes and other types of input matches 
 gradually. Impl. phases:
 - write a DFA based suggester effectively identical to ternary tree based 
 solution right now,
 - baseline benchmark against tern. tree (memory consumption, rebuilding 
 speed, indexing speed; reuse Andrzej's benchmark code)
 - modify DFA to encode term weights directly in the automaton (optimize for 
 onlyMostPopular case)
 - benchmark again
 - add infix suggestion support with prefix matches boosted higher (?)
 - benchmark again
 - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester]

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.

2011-04-04 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13015631#comment-13015631
 ] 

Michael McCandless commented on SOLR-2378:
--

Wow those improvements are awesome -- FST 26.7X smaller RAM footprint, 18X 
faster lookups, but build time is 3.6X slower.

This is built on a composite reader, right?  Does the build time include the 
time to enum the terms from MultiTermsEnum?

 FST-based Lookup (suggestions) for prefix matches.
 --

 Key: SOLR-2378
 URL: https://issues.apache.org/jira/browse/SOLR-2378
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Reporter: Dawid Weiss
Assignee: Dawid Weiss
  Labels: lookup, prefix
 Fix For: 4.0


 Implement a subclass of Lookup based on finite state automata/ transducers 
 (Lucene FST package). This issue is for implementing a relatively basic 
 prefix matcher, we will handle infixes and other types of input matches 
 gradually. Impl. phases:
 - write a DFA based suggester effectively identical to ternary tree based 
 solution right now,
 - baseline benchmark against tern. tree (memory consumption, rebuilding 
 speed, indexing speed; reuse Andrzej's benchmark code)
 - modify DFA to encode term weights directly in the automaton (optimize for 
 onlyMostPopular case)
 - benchmark again
 - add infix suggestion support with prefix matches boosted higher (?)
 - benchmark again
 - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester]

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.

2011-04-04 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13015634#comment-13015634
 ] 

Dawid Weiss commented on SOLR-2378:
---

The build time needs to sort the input again (and create it in the first 
place). Because Lookup API assumes suggestion keywords can come from a variety 
of sources there is no guarantee they will be sorted, so we need to sort them 
before we can build the automaton. 

Still, I think the numbers are acceptable... if you need on-line construction 
of these suggestions you'll pick TST (it can add new keywords to the structure 
dynamically); for a batch-load suggester you'd pick the FST one.

It is also very likely that I overlooked something that could bring those 
numbers down, I'll create a clean patch tomorrow, so everything will be out 
there for improving.

 FST-based Lookup (suggestions) for prefix matches.
 --

 Key: SOLR-2378
 URL: https://issues.apache.org/jira/browse/SOLR-2378
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Reporter: Dawid Weiss
Assignee: Dawid Weiss
  Labels: lookup, prefix
 Fix For: 4.0


 Implement a subclass of Lookup based on finite state automata/ transducers 
 (Lucene FST package). This issue is for implementing a relatively basic 
 prefix matcher, we will handle infixes and other types of input matches 
 gradually. Impl. phases:
 - write a DFA based suggester effectively identical to ternary tree based 
 solution right now,
 - baseline benchmark against tern. tree (memory consumption, rebuilding 
 speed, indexing speed; reuse Andrzej's benchmark code)
 - modify DFA to encode term weights directly in the automaton (optimize for 
 onlyMostPopular case)
 - benchmark again
 - add infix suggestion support with prefix matches boosted higher (?)
 - benchmark again
 - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester]

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.

2011-04-02 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13015011#comment-13015011
 ] 

Dawid Weiss commented on SOLR-2378:
---

Soliciting feedback on the following questions:
- suggesters currently have float weights associated with terms; can these 
floats be bucketed and returned in approximation or do they need to be exact 
copies of the input? For automata, bucketed weights (to, let's say 5-10 
different values) provide terrific speed/size improvements, so if this is not a 
rigid requirement, I'd use them.
- is there any info on threading of Solr components? I am in particular looking 
for mutable object fields in the suggester (can a single suggester instance be 
accessed by multiple threads at the same time)?

I've implemented preliminary FST-based lookup (without weights yet). Speed-wise 
it doesn't rock because data is converted to/from utf8 on input/output and 
sorted during construction, but it is still acceptable, even at this early 
stage I think:

{noformat}
JaspellLookupbuildTime[ms]=112 lookupTime[ms]=288
TSTLookupbuildTime[ms]=115 lookupTime[ms]=103
FSTLookupbuildTime[ms]=464 lookupTime[ms]=145
{noformat}

now... that was speed only, check out the in-memory size :)

{noformat}
JaspellLookupsize[B]=81,078,997
TSTLookupsize[B]=53,453,696
FSTLookupsize[B]=2,909,396
{noformat}

(This benchmark stores very limited vocabulary items -- long numbers only, so 
it is skewed from reality, but it's nice to see something like this, huh?).


 FST-based Lookup (suggestions) for prefix matches.
 --

 Key: SOLR-2378
 URL: https://issues.apache.org/jira/browse/SOLR-2378
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Reporter: Dawid Weiss
Assignee: Dawid Weiss
  Labels: lookup, prefix
 Fix For: 4.0


 Implement a subclass of Lookup based on finite state automata/ transducers 
 (Lucene FST package). This issue is for implementing a relatively basic 
 prefix matcher, we will handle infixes and other types of input matches 
 gradually. Impl. phases:
 - write a DFA based suggester effectively identical to ternary tree based 
 solution right now,
 - baseline benchmark against tern. tree (memory consumption, rebuilding 
 speed, indexing speed; reuse Andrzej's benchmark code)
 - modify DFA to encode term weights directly in the automaton (optimize for 
 onlyMostPopular case)
 - benchmark again
 - add infix suggestion support with prefix matches boosted higher (?)
 - benchmark again
 - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester]

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.

2011-04-02 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13015053#comment-13015053
 ] 

Ryan McKinley commented on SOLR-2378:
-

bq. is there any info on threading of Solr components? I am in particular 
looking for mutable object fields in the suggester (can a single suggester 
instance be accessed by multiple threads at the same time)?

Components are initalized at startup and the same instance is used for every 
request (multi-threaded)

If you need to use the same obejects in prepare and process, you can either put 
then in the request context map, or add something to ResponseBuilder (perhaps 
better if this is widely used)

 FST-based Lookup (suggestions) for prefix matches.
 --

 Key: SOLR-2378
 URL: https://issues.apache.org/jira/browse/SOLR-2378
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Reporter: Dawid Weiss
Assignee: Dawid Weiss
  Labels: lookup, prefix
 Fix For: 4.0


 Implement a subclass of Lookup based on finite state automata/ transducers 
 (Lucene FST package). This issue is for implementing a relatively basic 
 prefix matcher, we will handle infixes and other types of input matches 
 gradually. Impl. phases:
 - write a DFA based suggester effectively identical to ternary tree based 
 solution right now,
 - baseline benchmark against tern. tree (memory consumption, rebuilding 
 speed, indexing speed; reuse Andrzej's benchmark code)
 - modify DFA to encode term weights directly in the automaton (optimize for 
 onlyMostPopular case)
 - benchmark again
 - add infix suggestion support with prefix matches boosted higher (?)
 - benchmark again
 - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester]

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.

2011-03-31 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13013932#comment-13013932
 ] 

Dawid Weiss commented on SOLR-2378:
---

I didn't have time to take care of this until now, apologies. So, looking at 
Lookup#lookup(), I just wanted to clarify:

{code}
  /**
   * Look up a key and return possible completion for this key.
   * @param key lookup key. Depending on the implementation this may be
   * a prefix, misspelling, or even infix.
   * @param onlyMorePopular return only more popular results
   * @param num maximum number of results to return
   * @return a list of possible completions, with their relative weight (e.g. 
popularity)
   */
  public abstract ListLookupResult lookup(String key, boolean 
onlyMorePopular, int num);
{code}

the onlyMorePopular means more popular than... what? I see TSTLookup and 
JaspellLookup (Andrzej, will you confirm, please?) sorts matches in a priority 
queue by their associated value (frequency I guess). This makes sense, but 
onlyMorePopular is misleading -- it should be called onlyMostPopular (those 
with the native knowledge of English subtlieties, speak up if I'm right here).

I also see and wanted to confirm -- the Dictionary can come from various 
sources, so we can't rely on the presence of the built-in Lucene automaton, can 
we? Even if I wanted to reuse it, there'd be no easy way to determine if it's a 
full automaton, or a partial one (because of the gaps/trimming)... I think I'll 
just implement the solution by building the automaton from whatever Dictionary 
comes in and serializing/ deserializing it similar to TSTLookup.

Sounds ok?





 FST-based Lookup (suggestions) for prefix matches.
 --

 Key: SOLR-2378
 URL: https://issues.apache.org/jira/browse/SOLR-2378
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Reporter: Dawid Weiss
Assignee: Dawid Weiss
  Labels: lookup, prefix
 Fix For: 4.0


 Implement a subclass of Lookup based on finite state automata/ transducers 
 (Lucene FST package). This issue is for implementing a relatively basic 
 prefix matcher, we will handle infixes and other types of input matches 
 gradually.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.

2011-03-31 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13013933#comment-13013933
 ] 

Andrzej Bialecki  commented on SOLR-2378:
-

bq. I see TSTLookup and JaspellLookup (Andrzej, will you confirm, please?) 
sorts matches in a priority queue by their associated value (frequency I guess)

Correct. I agree that the name is so-so, I inherited it from the spellchecker 
API - feel free to change it.

bq. the Dictionary can come from various sources, ...

Yes. This is again a legacy of the Lucene SpellChecker API. I tried to extend 
this API in Solr without changing it in Lucene (see the *IteratorWrapper 
classes and TermFreqIterator) but ultimately it would be better to refactor 
this.

 FST-based Lookup (suggestions) for prefix matches.
 --

 Key: SOLR-2378
 URL: https://issues.apache.org/jira/browse/SOLR-2378
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Reporter: Dawid Weiss
Assignee: Dawid Weiss
  Labels: lookup, prefix
 Fix For: 4.0


 Implement a subclass of Lookup based on finite state automata/ transducers 
 (Lucene FST package). This issue is for implementing a relatively basic 
 prefix matcher, we will handle infixes and other types of input matches 
 gradually.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.

2011-02-22 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12997702#comment-12997702
 ] 

Dawid Weiss commented on SOLR-2378:
---

Hi guys. I probably need those admin rights on SOLR's JIRA project as well 
because I can't assign this to myself.

 FST-based Lookup (suggestions) for prefix matches.
 --

 Key: SOLR-2378
 URL: https://issues.apache.org/jira/browse/SOLR-2378
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Reporter: Dawid Weiss
  Labels: lookup, prefix
 Fix For: 4.0


 Implement a subclass of Lookup based on finite state automata/ transducers 
 (Lucene FST package). This issue is for implementing a relatively basic 
 prefix matcher, we will handle infixes and other types of input matches 
 gradually.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.

2011-02-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12997703#comment-12997703
 ] 

Uwe Schindler commented on SOLR-2378:
-

This time I cannot help - I don't have admin rights in SOLR! In my opinion, we 
should merge the user rights of both projects.

 FST-based Lookup (suggestions) for prefix matches.
 --

 Key: SOLR-2378
 URL: https://issues.apache.org/jira/browse/SOLR-2378
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Reporter: Dawid Weiss
  Labels: lookup, prefix
 Fix For: 4.0


 Implement a subclass of Lookup based on finite state automata/ transducers 
 (Lucene FST package). This issue is for implementing a relatively basic 
 prefix matcher, we will handle infixes and other types of input matches 
 gradually.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.

2011-02-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12997786#comment-12997786
 ] 

Mark Miller commented on SOLR-2378:
---

bq. Hi guys. I probably need those admin rights on SOLR's JIRA project as well 
because I can't assign this to myself

You should have a committer role now.

 FST-based Lookup (suggestions) for prefix matches.
 --

 Key: SOLR-2378
 URL: https://issues.apache.org/jira/browse/SOLR-2378
 Project: Solr
  Issue Type: New Feature
  Components: spellchecker
Reporter: Dawid Weiss
  Labels: lookup, prefix
 Fix For: 4.0


 Implement a subclass of Lookup based on finite state automata/ transducers 
 (Lucene FST package). This issue is for implementing a relatively basic 
 prefix matcher, we will handle infixes and other types of input matches 
 gradually.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org