Re: shingles work in analyzer but not real data

2010-09-07 Thread Chris Hostetter

: Hi Robert, thanks for the response.  I've looked into the query parsers a
: bit and I did find that using the raw parser on a matching multi-word
: keyword works correctly.  I need to have shingling though, in order to
: support query phrases.  It seems odd to have the query parser emitting

The "FieldQParser" should work for this -- unlike the raw QParser it uses 
the Analyzer for the specified field, but has no metacharacters of its 
own.
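
A minimal sketch of what such a request could look like, assuming the parser is selected through local params as `{!field f=...}`. The field name `keywords` is made up for illustration; the thread never names the real field.

```python
from urllib.parse import urlencode


def field_query_params(field, text):
    """Build Solr request parameters that send `text` through the field query parser."""
    # {!field f=...} picks the parser via local params. Everything after the
    # closing brace is passed to that field's query analyzer in one piece,
    # so spaces and punctuation are not treated as query syntax.
    return {"q": "{!field f=%s}%s" % (field, text)}


# "keywords" is a hypothetical field name, not taken from the thread.
params = field_query_params("keywords", "best apple pie")
print(params["q"])
print(urlencode(params))
```

The encoded string is what would be appended to the select URL of whatever Solr instance is in use.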


-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: shingles work in analyzer but not real data

2010-09-03 Thread Dennis Gearon
Thank you mucho much, Lance.


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/3/10, Lance Norskog  wrote:

> From: Lance Norskog 
> Subject: Re: shingles work in analyzer but not real data
> To: solr-user@lucene.apache.org
> Date: Friday, September 3, 2010, 9:55 PM
> http://en.wikipedia.org/wiki/W-shingling

Re: shingles work in analyzer but not real data

2010-09-03 Thread Lance Norskog
http://en.wikipedia.org/wiki/W-shingling

On Fri, Sep 3, 2010 at 6:19 AM, Steven A Rowe  wrote:
> Hi Dennis,
>
> I took a stab at answering this question in the following java-user mailing 
> list post:
>
> http://www.lucidimagination.com/search/document/6cb7b54cce6872b3/lucene_indexes
>
> Steve

RE: shingles work in analyzer but not real data

2010-09-03 Thread Steven A Rowe
Hi Dennis,

I took a stab at answering this question in the following java-user mailing 
list post:

http://www.lucidimagination.com/search/document/6cb7b54cce6872b3/lucene_indexes

Steve

> -Original Message-
> From: Dennis Gearon [mailto:gear...@sbcglobal.net]
> Sent: Friday, September 03, 2010 5:06 AM
> To: solr-user@lucene.apache.org
> Subject: Re: shingles work in analyzer but not real data
> 
> Anyone got a definitive, authoritative link to the definition of a
> 'shingle' in search engine results/technology?

Re: shingles work in analyzer but not real data

2010-09-03 Thread 朱炎詹

See p. 288 of the book "Solr 1.4 Enterprise Search Server" by David Smiley
and Eric Pugh.

Shingling suits phrase queries because it works at the token level. It is 
similar to n-grams, except that n-grams are built within a single term.


We are currently using shingling in our index with shingle size = 3. Be 
careful: index build time and index size can grow dramatically as the max 
shingle size increases.
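
To get a rough feel for that growth, here is a small sketch that counts the tokens a shingle filter would emit for a single field value. It assumes unigrams are kept and every shingle size from 1 up to the maximum is produced; the function name is made up for illustration.

```python
def shingle_count(num_tokens, max_shingle_size):
    """Number of tokens emitted for one value when unigrams are kept."""
    # A value with n tokens contains (n - size + 1) shingles of each size.
    return sum(
        max(0, num_tokens - size + 1)
        for size in range(1, max_shingle_size + 1)
    )


# Token counts for a 10-word value as the maximum shingle size grows.
for size in (1, 2, 3, 4):
    print(size, shingle_count(10, size))
```

This counts emitted tokens only. Longer shingles also tend to be rarer, so the number of distinct terms in the index can grow faster than the count suggests.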


Scott

- Original Message - 
From: "Jeff Rose" 

To: 
Sent: Friday, September 03, 2010 5:35 PM
Subject: Re: shingles work in analyzer but not real data



I don't have any fancy links, but from the documentation shingles make
pretty good sense.

You typically tokenize an input string so that "the best apple pie" becomes
"the" "best" "apple" "pie", so that each term can then be filtered to remove
stop words, take off plurals and suffixes like "ing", etc.  The problem is
if you want to search for multi-word phrases, like "apple pie".  This
default splitting behavior won't let you do that, so to deal with this
problem you can use shingles.  The shingle filter will take in successive
tokens and then produce a series of output tokens composed of the last 1-n
tokens, where n is a setting.  So with shingles of size 2, the default, you
get "the" "the best" "best" "best apple" "apple" "apple pie" from the above
string.  Now we can match "apple pie".

Besides the shingling there is apparently also some concept of position,
which I don't yet understand.

-Jeff


Re: shingles work in analyzer but not real data

2010-09-03 Thread Jeff Rose
I don't have any fancy links, but from the documentation shingles make
pretty good sense.

You typically tokenize an input string so that "the best apple pie" becomes
"the" "best" "apple" "pie", so that each term can then be filtered to remove
stop words, take off plurals and suffixes like "ing", etc.  The problem is
if you want to search for multi-word phrases, like "apple pie".  This
default splitting behavior won't let you do that, so to deal with this
problem you can use shingles.  The shingle filter will take in successive
tokens and then produce a series of output tokens composed of the last 1-n
tokens, where n is a setting.  So with shingles of size 2, the default, you
get "the" "the best" "best" "best apple" "apple" "apple pie" from the above
string.  Now we can match "apple pie".
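
The behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Solr implementation; the function name and the `max_size` parameter are made up, and unigrams are assumed to be kept.

```python
def shingles(tokens, max_size=2):
    """Return every run of 1..max_size adjacent tokens, joined with a space."""
    out = []
    for start in range(len(tokens)):
        for size in range(1, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


print(shingles("the best apple pie".split()))
```

Note that the sketch also emits the trailing unigram "pie", which the list above leaves out.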

Besides the shingling there is apparently also some concept of position,
which I don't yet understand.

-Jeff

On Fri, Sep 3, 2010 at 11:05 AM, Dennis Gearon wrote:

> Anyone got a definitive, authoritative link to the definition of a
> 'shingle' in search engine results/technology?

Re: shingles work in analyzer but not real data

2010-09-03 Thread Dennis Gearon
Anyone got a definitive, authoritative link to the definition of a 'shingle' in 
search engine results/technology?


Dennis Gearon





Re: shingles work in analyzer but not real data

2010-09-03 Thread Jeff Rose
Thanks Steven and Jonathan, we got it working by using a combination of
quoting and the PositionFilterFactory, as shown below.  The
documentation for the position filter doesn't make much sense without
understanding more about how positioning of tokens is taken into account,
but it appears to do the trick.  Does anyone know why position would matter
here?  It seems like tokens would be emitted by a tokenizer, filtered,
joined into pairwise tokens by the shingler, and then matched against the
index.  If position information is also important it seems odd that this is
not discussed in the documentation.  (Same for the pre-tokenizing done by
the query parser, before handing phrases to the tokenizer...)
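
One possible reading of why position matters, sketched as a simplified Python model rather than actual Solr code. It assumes that a query-time analyzer returning tokens at more than one position leads to a phrase-style query, while tokens that all share one position are OR'ed together, and that the position filter resets every increment after the first to zero. All names here are made up for illustration.

```python
def shingle_with_positions(words):
    """Size-2 shingles as (term, position_increment) pairs, unigrams kept."""
    out = []
    for i, word in enumerate(words):
        out.append((word, 1))  # each unigram advances the position by one
        if i + 1 < len(words):
            # the shingle is stacked on the same position as its first word
            out.append((word + " " + words[i + 1], 0))
    return out


def position_filter(tokens, increment=0):
    """Keep the first token's increment and overwrite all the others."""
    return tokens[:1] + [(term, increment) for term, _ in tokens[1:]]


def query_kind(tokens):
    """Classify the query this simplified model would build from the tokens."""
    positions, pos = set(), 0
    for _, inc in tokens:
        pos += inc
        positions.add(pos)
    if len(tokens) == 1:
        return "term"
    return "boolean OR" if len(positions) == 1 else "phrase"


tokens = shingle_with_positions(["apple", "pie"])
print(tokens, query_kind(tokens))
print(position_filter(tokens), query_kind(position_filter(tokens)))
```

Under this model, the unfiltered tokens ask for "pie" one position after "apple" or "apple pie". A keyword indexed as the single token "apple pie" has no such second position, so the phrase cannot match. With every token at one position, the query is an OR that matches on the "apple pie" term alone.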

Anyway, here is our final schema that works as long as we put search phrases
in double quotes.  Thanks for all the help!

-Jeff

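  <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
         <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.TrimFilterFactory" />
         <filter class="solr.LowerCaseFilterFactory"/>
         ...
      </analyzer>
      <analyzer type="query">
         <tokenizer class="solr.PatternTokenizerFactory" pattern="[.,?;: !]"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.TrimFilterFactory" />
         <filter class="solr.ShingleFilterFactory"/>
         <filter class="solr.PositionFilterFactory"/>
      </analyzer>
  </fieldType>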


On Thu, Sep 2, 2010 at 11:47 PM, Jonathan Rochkind  wrote:

> I've run into this before too. Both the dismax and solr-lucene _query
> parsers_ will tokenize a query on whitespace _before_ they pass the query to
> any field analyzers.
> There are some reasons for this, lots of things wouldn't work if they
> didn't do this.
>
> But it makes your approach kind of hard. Try doing your search as a phrase
> search with double quotes, "apple pie", I bet it'll work then -- because
> both dismax and solr-lucene will respect the phrase quotes and NOT tokenize
> the stuff inside there before it gets to the field analyzers.
>
> So if non-tokenized fields like this are all that are included in your
> search, and if you can get your client application to just force phrase
> quoting of everything before sending to Solr, that might work. Otherwise
> I don't know of a good solution. If you figure one out, let me know.
>
> Jonathan


Re: shingles work in analyzer but not real data

2010-09-02 Thread Dennis Gearon
I thought shingles were either a viral infection or roof material?

(Hey, it's crazy friday early for me)
Dennis Gearon



--- On Thu, 9/2/10, Jonathan Rochkind  wrote:

> From: Jonathan Rochkind 
> Subject: Re: shingles work in analyzer but not real data
> To: "solr-user@lucene.apache.org" 
> Cc: "Vishal Patel" , "Michiel Willekens" 
> 
> Date: Thursday, September 2, 2010, 2:47 PM
> I've run into this before too. Both
> the dismax and solr-lucene _query parsers_ will tokenize a
> query on whitespace _before_ they pass the query to any
> field analyzers. 
> There are some reasons for this, lots of things wouldn't
> work if they didn't do this.
> 
> But it makes your approach kind of hard. Try doing your
> search as a phrase search with double quotes, "apple pie", I
> bet it'll work then -- because both dismax and solr-lucene
> will respect the phrase quotes and NOT tokenize the stuff
> inside there before it gets to the field analyzers.
> 
> So if non-tokenized fields like this are all that are
> included in your search, and if you can get your client
> application to just force phrase quoting of everything
> before sending to Solr, that might work. Otherwise I
> don't know of a good solution. If you figure one out, let me
> know.
> 
> Jonathan


Re: shingles work in analyzer but not real data

2010-09-02 Thread Jonathan Rochkind
I've run into this before too. Both the dismax and solr-lucene _query 
parsers_ will tokenize a query on whitespace _before_ they pass the 
query to any field analyzers. 

There are some reasons for this; lots of things wouldn't work if they 
didn't do this.


But it makes your approach kind of hard. Try doing your search as a 
phrase search with double quotes, "apple pie", I bet it'll work then -- 
because both dismax and solr-lucene will respect the phrase quotes and 
NOT tokenize the stuff inside there before it gets to the field analyzers.
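
A rough Python sketch of the effect described above, not of Solr's real parser. It assumes the parser splits the query on whitespace except inside double quotes and hands each chunk to the field analyzer separately; the analyzer is modeled as lowercase plus size-2 shingles, and all function names are made up for illustration.

```python
import re


def shingles(tokens, max_size=2):
    """Return every run of 1..max_size adjacent tokens, joined with a space."""
    out = []
    for start in range(len(tokens)):
        for size in range(1, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


def split_like_query_parser(query):
    """Split on whitespace, but keep each double-quoted phrase as one chunk."""
    pairs = re.findall(r'"([^"]*)"|(\S+)', query)
    return [quoted or bare for quoted, bare in pairs]


def analyzed_terms(query):
    """Run each chunk through the modeled analyzer on its own."""
    terms = []
    for chunk in split_like_query_parser(query):
        terms.extend(shingles(chunk.lower().split()))
    return terms


print(analyzed_terms('apple pie'))
print(analyzed_terms('"apple pie"'))
```

Without quotes the analyzer only ever sees one word at a time, so the two-word shingle is never produced. That would explain why the analysis tool, which is given the whole string, shows a match that the live query does not.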


So if non-tokenized fields like this are all that are included in your 
search, and if you can get your client application to just force phrase 
quoting of everything before sending to Solr, that might work. 
Otherwise I don't know of a good solution. If you figure one out, 
let me know.
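
A rough sketch of the client-side phrase quoting described above, in Python. 
The function name is made up for illustration; it assumes the usual Lucene 
query syntax, where a backslash escapes a double quote inside a phrase.

```python
def quote_as_phrase(user_query):
    """Wrap a raw user query in double quotes so the query parser treats it
    as one phrase; escape embedded backslashes and double quotes first."""
    escaped = user_query.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


print(quote_as_phrase("apple pie"))  # "apple pie"
```

The client would send the returned string as the q parameter instead of the 
raw user input.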


Jonathan

Jeff Rose wrote:

Hi,
  We are using SOLR to match query strings with a keyword database, where
some of the keywords are actually more than one word.  For example a keyword
might be "apple pie" and we only want it to match for a query containing
that word pair, but not one only containing "apple".  Here is the relevant
piece of the schema.xml, defining the index and query pipelines:

  
 
   


 
 




  
   

In the analysis tool this schema looks like it works correctly.  Our
multi-word keywords are indexed as a single entry, and then when a search
phrase contains one of these multi-word keywords it is shingled and matched.
 Unfortunately, when we do the same queries on top of the actual index it
responds with zero matches.  I can see in the index histogram that the terms
are correctly indexed from our mysql datasource containing the keywords, but
somehow the shingling doesn't appear to work on this live data.  Does anyone
have experience with shingling that might have some tips for us, or
otherwise advice for debugging the issue?

Thanks,
Jeff

  


RE: shingles work in analyzer but not real data

2010-09-02 Thread Steven A Rowe
Hi Jeff,

Have you seen PositionFilterFactory?:
 
<http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory>

Steve

> -----Original Message-----
> From: Jeff Rose [mailto:j...@globalorange.nl]
> Sent: Thursday, September 02, 2010 9:06 AM
> To: solr-user@lucene.apache.org
> Subject: Re: shingles work in analyzer but not real data
> 
> On Wed, Sep 1, 2010 at 3:35 PM, Robert Muir  wrote:
> 
> > On Wed, Sep 1, 2010 at 8:21 AM, Jeff Rose  wrote:
> >
> > > Hi,
> > >  We are using SOLR to match query strings with a keyword database,
> where
> > > some of the keywords are actually more than one word.  For example a
> > > keyword
> > > might be "apple pie" and we only want it to match for a query
> containing
> > > that word pair, but not one only containing "apple".  Here is the
> > relevant
> > > piece of the schema.xml, defining the index and query pipelines:
> > >
> > >   > positionIncrementGap="100">
> > > 
> > >   
> > >
> > >
> > > 
> > > 
> > >
> > > 
> > >
> > > 
> > >  
> > >   
> > >
> > > In the analysis tool this schema looks like it works correctly.  Our
> > > multi-word keywords are indexed as a single entry, and then when a
> search
> > > phrase contains one of these multi-word keywords it is shingled and
> > > matched.
> > >  Unfortunately, when we do the same queries on top of the actual index
> it
> > > responds with zero matches.  I can see in the index histogram that the
> > > terms
> > > are correctly indexed from our mysql datasource containing the
> keywords,
> > > but
> > > somehow the shingling doesn't appear to work on this live data.  Does
> > > anyone
> > > have experience with shingling that might have some tips for us, or
> > > otherwise advice for debugging the issue?
> > >
> >
> > query-time shingling probably isn't working with the queryparser you are
> > using, the default lucene one first splits on whitespace before sending
> it
> > to the analyzer: e.g. a query of foo bar is processed as
> TokenStream(foo) +
> > TokenStream(bar)
> >
> > so query-time shingling like this doesn't work as you expect for this
> > reason.
> 
> 
> Hi Robert, thanks for the response.  I've looked into the query parsers a
> bit and I did find that using the raw parser on a matching multi-word
> keyword works correctly.  I need to have shingling though, in order to
> support query phrases.  It seems odd to have the query parser emitting
> tokens though.  If this is the case why would we ever use the
> WhitespaceTokenizer?  Either way, do you know what the correct
> configuration
> should be to actually perform shingling as it is documented to work:
> joining
> adjacent tokens into a single search term?  (e.g. "apple" "pie" should
> become "apple pie")
> 
> Thanks  a lot for the help.
> 
> -Jeff
> 
> P.S. Markus, putting double quotes around the query doesn't seem to have
> any
> effect.  It would be nice to have the analysis debug output on the actual
> queries so that I could see what is being searched for after analysis...


Re: shingles work in analyzer but not real data

2010-09-02 Thread Jeff Rose
On Wed, Sep 1, 2010 at 3:35 PM, Robert Muir  wrote:

> On Wed, Sep 1, 2010 at 8:21 AM, Jeff Rose  wrote:
>
> > Hi,
> >  We are using SOLR to match query strings with a keyword database, where
> > some of the keywords are actually more than one word.  For example a
> > keyword
> > might be "apple pie" and we only want it to match for a query containing
> > that word pair, but not one only containing "apple".  Here is the
> relevant
> > piece of the schema.xml, defining the index and query pipelines:
> >
> >   positionIncrementGap="100">
> > 
> >   
> >
> >
> > 
> > 
> >
> > 
> >
> > 
> >  
> >   
> >
> > In the analysis tool this schema looks like it works correctly.  Our
> > multi-word keywords are indexed as a single entry, and then when a search
> > phrase contains one of these multi-word keywords it is shingled and
> > matched.
> >  Unfortunately, when we do the same queries on top of the actual index it
> > responds with zero matches.  I can see in the index histogram that the
> > terms
> > are correctly indexed from our mysql datasource containing the keywords,
> > but
> > somehow the shingling doesn't appear to work on this live data.  Does
> > anyone
> > have experience with shingling that might have some tips for us, or
> > otherwise advice for debugging the issue?
> >
>
> query-time shingling probably isn't working with the queryparser you are
> using, the default lucene one first splits on whitespace before sending it
> to the analyzer: e.g. a query of foo bar is processed as TokenStream(foo) +
> TokenStream(bar)
>
> so query-time shingling like this doesn't work as you expect for this
> reason.


Hi Robert, thanks for the response.  I've looked into the query parsers a
bit and I did find that using the raw parser on a matching multi-word
keyword works correctly.  I need to have shingling though, in order to
support query phrases.  It seems odd to have the query parser emitting
tokens though.  If this is the case why would we ever use the
WhitespaceTokenizer?  Either way, do you know what the correct configuration
should be to actually perform shingling as it is documented to work: joining
adjacent tokens into a single search term?  (e.g. "apple" "pie" should
become "apple pie")

Thanks a lot for the help.

-Jeff

P.S. Markus, putting double quotes around the query doesn't seem to have any
effect.  It would be nice to have the analysis debug output on the actual
queries so that I could see what is being searched for after analysis...
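
On seeing what is actually searched for: Solr's debugQuery=true request
parameter adds a debug section to the response that includes the parsed
query. A minimal sketch of building such a request URL in Python follows.
The host, port, handler path and the field name "keywords" are assumptions
for illustration only, not taken from the real setup.

```python
from urllib.parse import urlencode

# Assumed for illustration: a local Solr with the default select handler.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_debug_url(query, parser=None, field="keywords"):
    """Build a select URL with debugQuery=true. If parser is given (for
    example "raw"), prefix the query with local params such as
    {!raw f=keywords} to select that query parser for the given field."""
    q = query if parser is None else "{!%s f=%s}%s" % (parser, field, query)
    return SOLR_SELECT + "?" + urlencode({"q": q, "debugQuery": "true"})


print(build_debug_url("apple pie"))
print(build_debug_url("apple pie", parser="raw"))
```

Comparing the parsed query reported for the two URLs should show whether the
two words reach the field as one term or as two.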


Re: shingles work in analyzer but not real data

2010-09-01 Thread Markus Jelsma
If your use-case is limited to this, why don't you encapsulate all queries in 
double quotes? 

On Wednesday 01 September 2010 14:21:47 Jeff Rose wrote:
> Hi,
>   We are using SOLR to match query strings with a keyword database, where
> some of the keywords are actually more than one word.  For example a
>  keyword might be "apple pie" and we only want it to match for a query
>  containing that word pair, but not one only containing "apple".  Here is
>  the relevant piece of the schema.xml, defining the index and query
>  pipelines:
> 
>   
>  
>
> 
> 
>  
>  
> 
> 
> 
> 
>   
>
> 
> In the analysis tool this schema looks like it works correctly.  Our
> multi-word keywords are indexed as a single entry, and then when a search
> phrase contains one of these multi-word keywords it is shingled and
>  matched. Unfortunately, when we do the same queries on top of the actual
>  index it responds with zero matches.  I can see in the index histogram
>  that the terms are correctly indexed from our mysql datasource containing
>  the keywords, but somehow the shingling doesn't appear to work on this
>  live data.  Does anyone have experience with shingling that might have
>  some tips for us, or otherwise advice for debugging the issue?
> 
> Thanks,
> Jeff
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350



Re: shingles work in analyzer but not real data

2010-09-01 Thread Robert Muir
On Wed, Sep 1, 2010 at 8:21 AM, Jeff Rose  wrote:

> Hi,
>  We are using SOLR to match query strings with a keyword database, where
> some of the keywords are actually more than one word.  For example a
> keyword
> might be "apple pie" and we only want it to match for a query containing
> that word pair, but not one only containing "apple".  Here is the relevant
> piece of the schema.xml, defining the index and query pipelines:
>
>  
> 
>   
>
>
> 
> 
>
> 
>
> 
>  
>   
>
> In the analysis tool this schema looks like it works correctly.  Our
> multi-word keywords are indexed as a single entry, and then when a search
> phrase contains one of these multi-word keywords it is shingled and
> matched.
>  Unfortunately, when we do the same queries on top of the actual index it
> responds with zero matches.  I can see in the index histogram that the
> terms
> are correctly indexed from our mysql datasource containing the keywords,
> but
> somehow the shingling doesn't appear to work on this live data.  Does
> anyone
> have experience with shingling that might have some tips for us, or
> otherwise advice for debugging the issue?
>

query-time shingling probably isn't working with the queryparser you are
using, the default lucene one first splits on whitespace before sending it
to the analyzer: e.g. a query of foo bar is processed as TokenStream(foo) +
TokenStream(bar)

so query-time shingling like this doesn't work as you expect for this
reason.
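
To make the splitting concrete, here is a rough Python sketch. It is a toy
model, not Solr code: shingles() is a made-up stand-in for a shingle filter
that emits single words plus word pairs, and the two functions contrast
analyzing the whole query string with analyzing each whitespace-separated
chunk on its own.

```python
def shingles(tokens, max_size=2):
    """Return unigrams plus word n-grams up to max_size, joined by a space."""
    out = []
    for i in range(len(tokens)):
        for n in range(1, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


def analyze(text):
    """Toy query analyzer: whitespace tokenize, lowercase, then shingle."""
    return shingles(text.lower().split())


def analyze_like_analysis_tool(query):
    # The whole string reaches the analyzer at once.
    return analyze(query)


def analyze_like_query_parser(query):
    # The parser splits on whitespace first; each chunk is analyzed alone,
    # so the shingle step never sees two adjacent words.
    terms = []
    for chunk in query.split():
        terms.extend(analyze(chunk))
    return terms


print(analyze_like_analysis_tool("apple pie"))  # ['apple', 'apple pie', 'pie']
print(analyze_like_query_parser("apple pie"))   # ['apple', 'pie']
```

Only the first result contains the term "apple pie", which is the only form
that can match a keyword indexed as a single entry.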


-- 
Robert Muir
rcm...@gmail.com


shingles work in analyzer but not real data

2010-09-01 Thread Jeff Rose
Hi,
  We are using SOLR to match query strings with a keyword database, where
some of the keywords are actually more than one word.  For example a keyword
might be "apple pie" and we only want it to match for a query containing
that word pair, but not one only containing "apple".  Here is the relevant
piece of the schema.xml, defining the index and query pipelines:

  
 
   


 
 




  
   

In the analysis tool this schema looks like it works correctly.  Our
multi-word keywords are indexed as a single entry, and then when a search
phrase contains one of these multi-word keywords it is shingled and matched.
 Unfortunately, when we do the same queries on top of the actual index it
responds with zero matches.  I can see in the index histogram that the terms
are correctly indexed from our mysql datasource containing the keywords, but
somehow the shingling doesn't appear to work on this live data.  Does anyone
have experience with shingling that might have some tips for us, or
otherwise advice for debugging the issue?

Thanks,
Jeff