Re: Indexing all permutations of words from the input
OK, an idea from left field, off the top of my head, so don't take it for gospel...

Create a second index to which you send your data, where each phrase is really a "document", and query *that* index for your autosuggest. Perhaps this could be a secondary core. It could even be a set of *special* documents in your existing index that have fields orthogonal to the normal ones.

The idea is that you'd have a "document" consisting of one stored and indexed field that contains "abc xyz foo". Searching for "+abc +foo" (no quotes) would return it, as would searching for "+abc +xyz", or +abc, or +foo...

You could even do some interesting things with dismax if you required some rule like "at least two terms must match if there are three", I think...

You'd have to do something about duplicates here...

Best
Erick

On Thu, Jan 20, 2011 at 4:58 PM, Steven A Rowe wrote:
> Hi Martin,
>
> The co-occurrence filter I'm working on at
> https://issues.apache.org/jira/browse/LUCENE-2749 would do what you want
> (among other things). Still vaporware at this point, as I've only put a
> couple of hours into it, so don't hold your breath :)
>
> Steve
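A minimal sketch of the query side of Erick's suggestion, assuming each phrase has already been indexed as its own document. The core URL and the helper names are hypothetical, and the sketch does no escaping of Lucene query syntax characters; it only shows how typed words could be turned into an all-terms-required query such as "+abc +foo".

```python
from urllib.parse import urlencode


def required_terms_query(user_input):
    """Prefix every whitespace-separated word with '+' so each one is required."""
    return " ".join("+" + word for word in user_input.split())


def select_url(base_url, user_input):
    """Build a /select request URL for the (hypothetical) suggestion core."""
    params = {"q": required_terms_query(user_input)}
    return base_url + "/select?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical core location; adjust to the actual deployment.
    print(required_terms_query("abc foo"))
    print(select_url("http://localhost:8983/solr/suggest", "abc foo"))
```

For the dismax variant mentioned in the message, the `mm` (minimum should match) request parameter is the relevant setting; it is left out here to keep the sketch small.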
RE: Indexing all permutations of words from the input
Hi Martin,

The co-occurrence filter I'm working on at https://issues.apache.org/jira/browse/LUCENE-2749 would do what you want (among other things). Still vaporware at this point, as I've only put a couple of hours into it, so don't hold your breath :)

Steve

> -----Original Message-----
> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
> Sent: Thursday, January 20, 2011 4:46 PM
> To: Martin Jansen
> Cc: solr-user@lucene.apache.org
> Subject: Re: Indexing all permutations of words from the input
>
> Aha, I have no idea if there actually is a better way of achieving that.
> Auto-completion with Solr is always tricky, and I personally have not
> been happy with any of the designs I've seen suggested for it. I'm
> also not entirely sure your design will actually work, though neither am I
> sure it won't!
>
> I am thinking maybe for that auto-complete use, you will actually need
> your field to be NOT tokenized, so you won't want to use the WhiteSpace
> tokenizer after all (I think!) -- unless maybe there's another filter
> you can put at the end of the chain that will take all the tokens and
> join them back together, separated by a single space, as a single
> token. But I do think you'll need the whole multi-word string to be a
> single token in order to use terms.prefix how you want.
>
> If you can't make ShingleFilter do it though, I don't think there are any
> built-in analyzers that will do the transformation you want. You could
> write your own in Java, perhaps based on ShingleFilter -- or it might be
> easier to have your own software make the transformations you want and
> then simply send the pre-transformed strings to Solr when indexing. Then
> you could simply send them to a 'string' type field that won't tokenize.
Re: Indexing all permutations of words from the input
Aha, I have no idea if there actually is a better way of achieving that. Auto-completion with Solr is always tricky, and I personally have not been happy with any of the designs I've seen suggested for it. I'm also not entirely sure your design will actually work, though neither am I sure it won't!

I am thinking maybe for that auto-complete use, you will actually need your field to be NOT tokenized, so you won't want to use the WhiteSpace tokenizer after all (I think!) -- unless maybe there's another filter you can put at the end of the chain that will take all the tokens and join them back together, separated by a single space, as a single token. But I do think you'll need the whole multi-word string to be a single token in order to use terms.prefix how you want.

If you can't make ShingleFilter do it though, I don't think there are any built-in analyzers that will do the transformation you want. You could write your own in Java, perhaps based on ShingleFilter -- or it might be easier to have your own software make the transformations you want and then simply send the pre-transformed strings to Solr when indexing. Then you could simply send them to a 'string' type field that won't tokenize.

On 1/20/2011 4:40 PM, Martin Jansen wrote:
> I'm in the process of setting up an index for term suggestion. In my use
> case people should get the suggestion "abc foo" for the search query
> "abc fo", under the assumption that "abc xyz foo" has been submitted
> to the index.
>
> My current plan is to use TermsComponent with the terms.prefix=
> parameter for this, because it seems to be pretty efficient and I get
> things like correct sorting for free.
>
> I assume there is a better way for achieving this then?
>
> - Martin
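A minimal sketch of the "have your own software make the transformations" option, assuming the client generates every order-preserving word combination before indexing and sends each one as a separate value to an untokenized 'string' field. The function name is hypothetical, and the actual indexing call to Solr is left out.

```python
from itertools import combinations


def word_combinations(phrase):
    """Return every non-empty combination of the words, keeping input order."""
    words = phrase.split()
    result = []
    for size in range(1, len(words) + 1):
        # combinations() preserves the original order of the words.
        for combo in combinations(words, size):
            result.append(" ".join(combo))
    return result


if __name__ == "__main__":
    for value in word_combinations("abc xyz foo"):
        print(value)
```

Note that a phrase of n distinct words yields 2^n - 1 values, so long inputs would likely need a cap on the number of words, and repeated words in the input produce duplicate values.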
Re: Indexing all permutations of words from the input
On 20.01.11 22:19, Jonathan Rochkind wrote:
> Why do you want to do this? What is it meant to accomplish? There might be a
> better way to accomplish what it is you are trying to do; I can't think of
> anything (which doesn't mean it doesn't exist) that would actually require
> what you're trying to do. What sorts of queries do you
> intend to serve with this setup?

I'm in the process of setting up an index for term suggestion. In my use case people should get the suggestion "abc foo" for the search query "abc fo", under the assumption that "abc xyz foo" has been submitted to the index.

My current plan is to use TermsComponent with the terms.prefix= parameter for this, because it seems to be pretty efficient and I get things like correct sorting for free.

I assume there is a better way for achieving this then?

- Martin
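A minimal sketch of the TermsComponent plan, under stated assumptions: the `/terms` handler path, the host, and the field name `suggest` are hypothetical, while `terms`, `terms.fl` and `terms.prefix` are the TermsComponent request parameters. The second helper only imitates prefix matching locally to show why "abc foo" must exist as a single indexed term for the prefix "abc fo" to find it.

```python
from urllib.parse import urlencode


def terms_prefix_url(base_url, field, typed):
    """Build a TermsComponent request for terms in `field` starting with `typed`."""
    params = {"terms": "true", "terms.fl": field, "terms.prefix": typed}
    return base_url + "/terms?" + urlencode(params)


def prefix_matches(indexed_terms, typed):
    """Local imitation of prefix lookup over whole, untokenized terms."""
    return sorted(term for term in indexed_terms if term.startswith(typed))


if __name__ == "__main__":
    print(terms_prefix_url("http://localhost:8983/solr", "suggest", "abc fo"))
    indexed = ["abc", "abc xyz", "abc xyz foo", "abc foo", "xyz", "xyz foo", "foo"]
    print(prefix_matches(indexed, "abc fo"))
```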
Re: Indexing all permutations of words from the input
Why do you want to do this? What is it meant to accomplish? There might be a better way to accomplish what it is you are trying to do; I can't think of anything (which doesn't mean it doesn't exist) that would actually require what you're trying to do. What sorts of queries do you intend to serve with this setup?

I don't believe any analyzer included with Solr out of the box will do exactly what you've specified. You could definitely write your own analyzer in Java to do it. But I still suspect you may not actually need to construct your index like that to accomplish whatever you are trying to accomplish.

The only reason I can think of for caring which words are next to which other words is phrase and proximity searches. However, with what you've specified, phrase and proximity searches wouldn't be at all useful anyway: EVERY word would be next to every other word, so any phrase or proximity search using words that are present at all would match. You might as well not do a phrase or proximity search at all, in which case it should not matter what order the words are in or how close together they are in the index.

Why not just use an ordinary Whitespace Tokenizer, and just do ordinary dismax or lucene queries without using phrase or proximity?

On 1/20/2011 4:03 PM, Martin Jansen wrote:
> Hey there,
>
> I'm looking for a configuration for Solr 1.4 that accomplishes the
> following:
>
> Given the input "abc xyz foo" I would like to add at least the following
> token combinations to the index:
>
>     abc
>     abc xyz
>     abc xyz foo
>     abc foo
>     xyz
>     xyz foo
>     foo
>
> A WhitespaceTokenizer combined with a ShingleFilter will take me there to
> some extent, but won't e.g. add "abc foo" to the index.
>
> Is there a way to do this?
>
> - Martin
Indexing all permutations of words from the input
Hey there,

I'm looking for a configuration for Solr 1.4 that accomplishes the following:

Given the input "abc xyz foo" I would like to add at least the following token combinations to the index:

    abc
    abc xyz
    abc xyz foo
    abc foo
    xyz
    xyz foo
    foo

A WhitespaceTokenizer combined with a ShingleFilter will take me there to some extent, but won't e.g. add "abc foo" to the index.

Is there a way to do this?

- Martin
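To illustrate the gap described in this message, here is a small sketch that imitates shingling in plain Python (it is not the Lucene ShingleFilter itself, and the function name is hypothetical). Shingles are runs of adjacent words, so the non-adjacent pair "abc foo" never appears.

```python
def shingles(phrase, max_size=3):
    """Return all runs of 1..max_size adjacent words, in order of start position."""
    words = phrase.split()
    out = []
    for start in range(len(words)):
        for size in range(1, max_size + 1):
            if start + size <= len(words):
                out.append(" ".join(words[start:start + size]))
    return out


if __name__ == "__main__":
    wanted = ["abc", "abc xyz", "abc xyz foo", "abc foo", "xyz", "xyz foo", "foo"]
    produced = shingles("abc xyz foo")
    print(produced)
    # Combinations from the wish list that adjacency-based shingling cannot produce.
    print([w for w in wanted if w not in produced])
```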