[jira] [Commented] (LUCENE-3298) FST has hard limit max size of 2.1 GB

Commented Mon, 05 Dec 2011 04:48:08 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162752#comment-13162752
 ]


Carlos González-Cadenas commented on LUCENE-3298:
-------------------------------------------------

Yeap, at the beginning of this project we tried to implement this autocomplete 
system using regular inverted indexes, but the response time required for 
autocomplete to work from a user perspective is very low (<50ms), and it would 
be quite hard to achieve such a performance with inverted indexes. 

I still think this is the way to go, but as you say we have to be careful with 
the data generation part. Most of the work should be put in making sure that 
the data is well distributed and organized in order to avoid combinatorial 
explosion.

Let me go in detail with the sources of data permutations and the reasoning 
behind them:

1) With regards to infix matches, if a user types "barcelona" we want to match 
"hotels in barcelona". In order to achieve this, we generate:

hotels in barcelona => hotels in barcelona
in barcelona => hotels in barcelona
barcelona => hotels in barcelona

The FST should be able to conflate these prefixes nicely in just one path, 
right?. Therefore this part shouldn't be a problem.

2) In addition, another feature we want to achieve is to be able to match 
inputs without prepositions. That means that if the user types "hotels 
barcelona jacuzzi", we should be able to match "hoteles in barcelona with 
jacuzzi". Now the only way we envision of doing it properly is to generate this 
permutation within the data:

hotels barcelona jacuzzi => hotels in barcelona with jacuzzi

I can see how this can explode the FST by creating different branches. 
Theoretically this could be done at runtime without the need of generating the 
data, but we don't see a way to do it in a clean way. To make things more 
complicated :) we've implemented fuzzy matching at query time (we use a 
levenshtein automata generated with the user input + an edit distance and then 
we intersect with the FST), and this is making very complicated to do 
preposition handling at query time. 

3) PP permutations (i.e. hoteles in barcelona with jacuzzi and hoteles with 
jacuzzi in barcelona). I don't really see a way to work around this. Probably 
we need to be careful and only generate these permutations for the top-K 
cities, in order to limit the potential size.

Summarizing, I believe that we can reduce the set of "bad permutations" a lot 
if we can figure out how to implement the prepositions at runtime. If you have 
any ideas, let me know. Thanks! :)



                
> FST has hard limit max size of 2.1 GB
> -------------------------------------
>
>                 Key: LUCENE-3298
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3298
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-3298.patch
>
>
> The FST uses a single contiguous byte[] under the hood, which in java is 
> indexed by int so we cannot grow this over Integer.MAX_VALUE.  It also 
> internally encodes references to this array as vInt.
> We could switch this to a paged byte[] and make the far larger.
> But I think this is low priority... I'm not going to work on it any time soon.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-3298) FST has hard limit max size of 2.1 GB

Reply via email to