Re: Indexing all permutations of words from the input

2011-01-22 Thread Erick Erickson
OK, Idea from left field off the top of my head, so don't take it for
gospel...

Create a second index to send your data to, in which each phrase is really a
"document", and query *that* index for your autosuggest. Perhaps this could be
a secondary core. It could even be a set of *special* documents in your
existing index whose fields are orthogonal to the normal ones.

The idea is that you'd have a "document" consisting of one stored and indexed
field that contained "abc xyz foo". Searching for "+abc +foo" (no quotes) would
return it, as would searching for "+abc +xyz" or +abc or +foo... You could even
do some interesting things with dismax if you required some rule like "at least
two terms must match if there are three", I think...
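To make the matching rule concrete, here is a minimal sketch in plain Python
(not Solr) of "every term required" versus "at least N terms must match" over
phrase documents. The function name, the sample phrases and the min_match
argument are all made up for illustration; in Solr the second rule would be
expressed through dismax's minimum-should-match (mm) setting rather than code
like this.

```python
def matches(phrase, query_terms, min_match=None):
    """Return True if enough of the query terms occur in the phrase.

    min_match=None requires every term, like "+abc +foo".
    Otherwise at least min_match terms must be present.
    """
    tokens = set(phrase.split())
    hits = sum(1 for term in query_terms if term in tokens)
    if min_match is None:
        required = len(query_terms)
    else:
        required = min(min_match, len(query_terms))
    return hits >= required


# Hypothetical phrase "documents". Building a set first drops exact
# duplicates, which is one crude way to handle repeated phrases.
phrases = sorted({"abc xyz foo", "abc bar", "abc xyz foo"})

# Every term required: only the phrase containing both abc and foo.
all_required = [p for p in phrases if matches(p, ["abc", "foo"])]

# "At least two of three": both phrases contain two of the three terms.
two_of_three = [p for p in phrases
                if matches(p, ["abc", "bar", "foo"], min_match=2)]
```

Note that this only matches whole terms; a partially typed word such as "fo"
would still need some form of prefix matching on top.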

You'd have to do something about duplicates here...

Best
Erick

On Thu, Jan 20, 2011 at 4:58 PM, Steven A Rowe wrote:

> Hi Martin,
>
> The co-occurrence filter I'm working on at
> https://issues.apache.org/jira/browse/LUCENE-2749 would do what you want
> (among other things).  Still vaporware at this point, as I've only put a
> couple of hours into it, so don't hold your breath :)
>
> Steve
>
> > -----Original Message-----
> > From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
> > Sent: Thursday, January 20, 2011 4:46 PM
> > To: Martin Jansen
> > Cc: solr-user@lucene.apache.org
> > Subject: Re: Indexing all permutations of words from the input
> >
> > Aha, I have no idea whether there actually is a better way of achieving
> > that. Auto-completion with Solr is always tricky, and I personally have
> > not been happy with any of the designs I've seen suggested for it. I'm
> > not entirely sure your design will actually work, but neither am I sure
> > it won't!
> >
> > I am thinking that for that auto-complete use, you will actually need
> > your field to be NOT tokenized, so you won't want to use the Whitespace
> > tokenizer after all (I think!) -- unless maybe there's another filter
> > you can put at the end of the chain that will take all the tokens and
> > join them back together, separated by a single space, as a single
> > token. But I do think you'll need the whole multi-word string to be a
> > single token in order to use terms.prefix how you want.
> >
> > If you can't make ShingleFilter do it though, I don't think there are
> > any built-in analyzers that will do the transformation you want. You
> > could write your own in Java, perhaps based on ShingleFilter -- or it
> > might be easier to have your own software make the transformations you
> > want and then simply send the pre-transformed strings to Solr when
> > indexing. Then you could simply send them to a 'string' type field that
> > won't tokenize.
> >
> > On 1/20/2011 4:40 PM, Martin Jansen wrote:
> > > On 20.01.11 22:19, Jonathan Rochkind wrote:
> > >> On 1/20/2011 4:03 PM, Martin Jansen wrote:
> > >>> I'm looking for an   configuration for Solr 1.4 that
> > >>> accomplishes the following:
> > >>>
> > >>> Given the input "abc xyz foo" I would like to add at least the
> > >>> following token combinations to the index:
> > >>>
> > >>>  abc
> > >>>  abc xyz
> > >>>  abc xyz foo
> > >>>  abc foo
> > >>>  xyz
> > >>>  xyz foo
> > >>>  foo
> > >>>
> > >> Why do you want to do this? What is it meant to accomplish? There
> > >> might be a better way to accomplish what you are trying to do; I
> > >> can't think of anything (which doesn't mean it doesn't exist) that
> > >> would actually require what you're trying to do. What sorts of
> > >> queries do you intend to serve with this setup?
> > > I'm in the process of setting up an index for term suggestion. In my
> > > use case people should get the suggestion "abc foo" for the search
> > > query "abc fo", assuming that "abc xyz foo" has been submitted to
> > > the index.
> > >
> > > My current plan is to use TermsComponent with the terms.prefix=
> > > parameter for this, because it seems to be pretty efficient and I get
> > > things like correct sorting for free.
> > >
> > > I assume there is a better way for achieving this then?
> > >
> > > - Martin
>


RE: Indexing all permutations of words from the input

2011-01-20 Thread Steven A Rowe
Hi Martin,

The co-occurrence filter I'm working on at
https://issues.apache.org/jira/browse/LUCENE-2749 would do what you want (among 
other things).  Still vaporware at this point, as I've only put a couple of 
hours into it, so don't hold your breath :)

Steve

> -----Original Message-----
> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
> Sent: Thursday, January 20, 2011 4:46 PM
> To: Martin Jansen
> Cc: solr-user@lucene.apache.org
> Subject: Re: Indexing all permutations of words from the input
> 
> Aha, I have no idea whether there actually is a better way of achieving
> that. Auto-completion with Solr is always tricky, and I personally have
> not been happy with any of the designs I've seen suggested for it. I'm
> not entirely sure your design will actually work, but neither am I sure
> it won't!
>
> I am thinking that for that auto-complete use, you will actually need
> your field to be NOT tokenized, so you won't want to use the Whitespace
> tokenizer after all (I think!) -- unless maybe there's another filter
> you can put at the end of the chain that will take all the tokens and
> join them back together, separated by a single space, as a single
> token. But I do think you'll need the whole multi-word string to be a
> single token in order to use terms.prefix how you want.
>
> If you can't make ShingleFilter do it though, I don't think there are
> any built-in analyzers that will do the transformation you want. You
> could write your own in Java, perhaps based on ShingleFilter -- or it
> might be easier to have your own software make the transformations you
> want and then simply send the pre-transformed strings to Solr when
> indexing. Then you could simply send them to a 'string' type field that
> won't tokenize.
> 
> On 1/20/2011 4:40 PM, Martin Jansen wrote:
> > On 20.01.11 22:19, Jonathan Rochkind wrote:
> >> On 1/20/2011 4:03 PM, Martin Jansen wrote:
> >>> I'm looking for an   configuration for Solr 1.4 that
> >>> accomplishes the following:
> >>>
> >>> Given the input "abc xyz foo" I would like to add at least the
> >>> following token combinations to the index:
> >>>
> >>>  abc
> >>>  abc xyz
> >>>  abc xyz foo
> >>>  abc foo
> >>>  xyz
> >>>  xyz foo
> >>>  foo
> >>>
> >> Why do you want to do this? What is it meant to accomplish? There
> >> might be a better way to accomplish what you are trying to do; I
> >> can't think of anything (which doesn't mean it doesn't exist) that
> >> would actually require what you're trying to do. What sorts of
> >> queries do you intend to serve with this setup?
> > I'm in the process of setting up an index for term suggestion. In my
> > use case people should get the suggestion "abc foo" for the search
> > query "abc fo", assuming that "abc xyz foo" has been submitted to
> > the index.
> >
> > My current plan is to use TermsComponent with the terms.prefix=
> > parameter for this, because it seems to be pretty efficient and I get
> > things like correct sorting for free.
> >
> > I assume there is a better way for achieving this then?
> >
> > - Martin


Re: Indexing all permutations of words from the input

2011-01-20 Thread Jonathan Rochkind
Aha, I have no idea whether there actually is a better way of achieving
that. Auto-completion with Solr is always tricky, and I personally have
not been happy with any of the designs I've seen suggested for it. I'm
not entirely sure your design will actually work, but neither am I sure
it won't!

I am thinking that for that auto-complete use, you will actually need
your field to be NOT tokenized, so you won't want to use the Whitespace
tokenizer after all (I think!) -- unless maybe there's another filter
you can put at the end of the chain that will take all the tokens and
join them back together, separated by a single space, as a single
token. But I do think you'll need the whole multi-word string to be a
single token in order to use terms.prefix how you want.

If you can't make ShingleFilter do it though, I don't think there are
any built-in analyzers that will do the transformation you want. You
could write your own in Java, perhaps based on ShingleFilter -- or it
might be easier to have your own software make the transformations you
want and then simply send the pre-transformed strings to Solr when
indexing. Then you could simply send them to a 'string' type field that
won't tokenize.
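As a rough illustration of the "have your own software make the
transformations" route, the sketch below expands one input string on the
client and builds a Solr XML <add> message with one document per
pre-transformed string. The field name "suggest" is an assumption (it would
have to be declared as a non-tokenized 'string' field in your schema), and a
real schema will probably also need a unique key field, which is left out here.

```python
from itertools import combinations
import xml.etree.ElementTree as ET


def suggestion_strings(text):
    """Expand the input into every order-preserving word combination."""
    words = text.split()
    return [
        " ".join(combo)
        for size in range(1, len(words) + 1)
        for combo in combinations(words, size)
    ]


def build_add_message(text, field_name="suggest"):
    """Build an XML <add> message with one <doc> per expanded string.

    field_name is a hypothetical schema field, not something Solr defines.
    """
    add = ET.Element("add")
    for value in suggestion_strings(text):
        doc = ET.SubElement(add, "doc")
        field = ET.SubElement(doc, "field", name=field_name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


payload = build_add_message("abc xyz foo")
```

The resulting string could then be POSTed to the update handler. Because the
field is not tokenized, each value such as "abc foo" ends up in the index as
one term.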


On 1/20/2011 4:40 PM, Martin Jansen wrote:
> On 20.01.11 22:19, Jonathan Rochkind wrote:
>> On 1/20/2011 4:03 PM, Martin Jansen wrote:
>>> I'm looking for an   configuration for Solr 1.4 that
>>> accomplishes the following:
>>>
>>> Given the input "abc xyz foo" I would like to add at least the following
>>> token combinations to the index:
>>>
>>>  abc
>>>  abc xyz
>>>  abc xyz foo
>>>  abc foo
>>>  xyz
>>>  xyz foo
>>>  foo
>>>
>> Why do you want to do this? What is it meant to accomplish? There might
>> be a better way to accomplish what you are trying to do; I can't think
>> of anything (which doesn't mean it doesn't exist) that would actually
>> require what you're trying to do. What sorts of queries do you intend
>> to serve with this setup?
>
> I'm in the process of setting up an index for term suggestion. In my use
> case people should get the suggestion "abc foo" for the search query
> "abc fo", assuming that "abc xyz foo" has been submitted to the index.
>
> My current plan is to use TermsComponent with the terms.prefix=
> parameter for this, because it seems to be pretty efficient and I get
> things like correct sorting for free.
>
> I assume there is a better way for achieving this then?
>
> - Martin


Re: Indexing all permutations of words from the input

2011-01-20 Thread Martin Jansen
On 20.01.11 22:19, Jonathan Rochkind wrote:
> On 1/20/2011 4:03 PM, Martin Jansen wrote:
>> I'm looking for an  configuration for Solr 1.4 that
>> accomplishes the following:
>>
>> Given the input "abc xyz foo" I would like to add at least the following
>> token combinations to the index:
>>
>> abc
>> abc xyz
>> abc xyz foo
>> abc foo
>> xyz
>> xyz foo
>> foo
>>
> Why do you want to do this? What is it meant to accomplish? There might
> be a better way to accomplish what you are trying to do; I can't think
> of anything (which doesn't mean it doesn't exist) that would actually
> require what you're trying to do. What sorts of queries do you intend
> to serve with this setup?

I'm in the process of setting up an index for term suggestion. In my use
case people should get the suggestion "abc foo" for the search query
"abc fo", assuming that "abc xyz foo" has been submitted to the index.

My current plan is to use TermsComponent with the terms.prefix=
parameter for this, because it seems to be pretty efficient and I get
things like correct sorting for free.
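A minimal sketch of that plan, with several assumptions: the base URL and the
/terms handler path follow the stock example configuration and may differ in
your setup, the field name "suggest" is made up, and prefix_lookup() is only a
local stand-in showing why each suggestion has to be indexed as one whole
term. It does not reproduce how TermsComponent orders its results.

```python
from bisect import bisect_left
from urllib.parse import urlencode


def terms_url(prefix, base="http://localhost:8983/solr/terms",
              field="suggest", limit=10):
    """Build a TermsComponent request URL (base and field are assumed)."""
    params = {
        "terms": "true",
        "terms.fl": field,
        "terms.prefix": prefix,
        "terms.limit": limit,
    }
    return base + "?" + urlencode(params)


def prefix_lookup(sorted_terms, prefix):
    """Return every term starting with prefix from a sorted term list."""
    start = bisect_left(sorted_terms, prefix)
    found = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        found.append(term)
    return found


# The seven combinations of "abc xyz foo", each indexed as a single term.
indexed = sorted(["abc", "abc foo", "abc xyz", "abc xyz foo",
                  "foo", "xyz", "xyz foo"])

suggestions = prefix_lookup(indexed, "abc fo")
url = terms_url("abc fo")
```

With whole-string terms, the prefix "abc fo" can reach "abc foo". If the field
were tokenized on whitespace, the index would only hold the single words and no
term would start with "abc fo".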

I assume there is a better way for achieving this then?

- Martin


Re: Indexing all permutations of words from the input

2011-01-20 Thread Jonathan Rochkind
Why do you want to do this? What is it meant to accomplish? There might
be a better way to accomplish what you are trying to do; I can't think
of anything (which doesn't mean it doesn't exist) that would actually
require what you're trying to do. What sorts of queries do you intend
to serve with this setup?

I don't believe any analyzer included with Solr out of the box will do
exactly what you've specified. You could definitely write your own
analyzer in Java to do it. But I still suspect you may not actually
need to construct your index like that to accomplish whatever you are
trying to accomplish.

The only reason I can think of to care which words are next to which
other words is phrase and proximity searching. However, with what
you've specified, phrase and proximity searches wouldn't be at all
useful anyway: EVERY word would be next to every other word, so any
phrase or proximity search on words present in the field would match.
You might as well not do a phrase or proximity search at all, in which
case it should not matter what order the words are in or how close
together they are in the index. Why not just use an ordinary
WhitespaceTokenizer, and just do ordinary dismax or lucene queries
without using phrase or proximity?


On 1/20/2011 4:03 PM, Martin Jansen wrote:
> Hey there,
>
> I'm looking for an  configuration for Solr 1.4 that
> accomplishes the following:
>
> Given the input "abc xyz foo" I would like to add at least the following
> token combinations to the index:
>
> abc
> abc xyz
> abc xyz foo
> abc foo
> xyz
> xyz foo
> foo
>
> A WhitespaceTokenizer combined with a ShingleFilter will take me there
> to some extent, but won't e.g. add "abc foo" to the index.  Is there a
> way to do this?
>
> - Martin


Indexing all permutations of words from the input

2011-01-20 Thread Martin Jansen
Hey there,

I'm looking for an  configuration for Solr 1.4 that
accomplishes the following:

Given the input "abc xyz foo" I would like to add at least the following
token combinations to the index:

abc
abc xyz
abc xyz foo
abc foo
xyz
xyz foo
foo

A WhitespaceTokenizer combined with a ShingleFilter will take me there
to some extent, but won't e.g. add "abc foo" to the index.  Is there a
way to do this?

- Martin