prefix search

2011-10-25 Thread Radha Krishna Reddy
Hi,

I indexed values like 'Joe Tom' and 'Terry'. When I run a prefix query like
q=t*, I get both 'Joe Tom' and 'Terry' as results. But I want only the
results whose complete string starts with 'T', meaning I want only 'Terry'
as the result.

Can I do this?

Thanks and Regards,
Radha Krishna.


Re: prefix search

2011-10-25 Thread Alireza Salimi
That's because the phrases are being tokenized and then indexed by Solr.
You have to define a new fieldType which is not tokenized.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory

I'm not sure if it would solve your problem
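
A minimal sketch of the difference, in plain Python rather than Solr code. The helper names are made up for illustration, and lowercasing is assumed because q=t* matched 'Terry' in the question:

```python
def whitespace_tokens(value):
    """Mimic a whitespace tokenizer plus lowercasing: one term per word."""
    return [t.lower() for t in value.split()]


def keyword_token(value):
    """Mimic a keyword tokenizer plus lowercasing: the whole value is one term."""
    return [value.lower()]


def prefix_search(docs, prefix, analyze):
    """Return the documents with at least one indexed term starting with prefix."""
    return [d for d in docs if any(t.startswith(prefix) for t in analyze(d))]


docs = ["Joe Tom", "Terry"]
tokenized_hits = prefix_search(docs, "t", whitespace_tokens)
keyword_hits = prefix_search(docs, "t", keyword_token)
print(tokenized_hits)  # ['Joe Tom', 'Terry']
print(keyword_hits)    # ['Terry']
```

With one term per word, 'Joe Tom' matches through its term 'tom'. With the whole value as a single term, only 'Terry' starts with 't'.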


-- 
Alireza Salimi
Java EE Developer


Re: prefix search

2011-10-25 Thread Michael Kuhlmann
I think what Radha Krishna (is this really her name?) means is different:

She wants to return only the matching token instead of the complete
field value.

Indeed, this is not possible. But you could use highlighting
(http://wiki.apache.org/solr/HighlightingParameters), and then extract
the matching part on your own. This shouldn't be too complicated.
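
A rough sketch of the "extract the matching part" step on the client side. It assumes the highlighted snippet wraps matches in <em>...</em> markers; the function name is hypothetical and the snippet is a made-up example:

```python
import re


def highlighted_terms(snippet, pre="<em>", post="</em>"):
    """Return the pieces of text found between the highlight markers."""
    pattern = re.escape(pre) + r"(.*?)" + re.escape(post)
    return re.findall(pattern, snippet)


snippet = "Joe <em>Tom</em>"  # made-up highlighting output for q=t*
terms = highlighted_terms(snippet)
print(terms)  # ['Tom']
```

If the highlighter is configured with different markers, pass them as pre and post.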

-Kuli



meaning of underscore in prefix search.

2010-07-05 Thread stockii

Hello.

I use facet.prefix and terms.prefix for my search. What is the meaning of
the underscore _ in the results? When does Solr change a string into an
underscore? Sometimes it makes no sense to suggest this to the client ...

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>

  <charFilter class="solr.MappingCharFilterFactory"
              mapping="mapping-ISOLatin1Accent.txt"/>

  <filter class="solr.TrimFilterFactory"/>
  <filter class="solr.StandardFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
thx !


Re: Prefix-Search with Stopwords - no results?

2010-05-31 Thread Gert Brinkmann

On 28.05.2010 22:06, Chris Hostetter wrote:

and one text_prefix
defined similarly but with an additional EdgeNGramTokenFilter used when
indexing to generate prefix tokens. then search those fields using
dismax...


To be sure that I understand this right:

Am I right that I should not stopword filter the EdgeNGramTokenFilter 
field? Otherwise I would run into the same problems again, won't I?


Or if stopword filtering is ok on this field: Do you filter the 
stopwords before or after EdgeNGram tokenizing?


Thanks,
Gert


Re: Prefix-Search with Stopwords - no results?

2010-05-29 Thread Gert Brinkmann


Thank you, Chris and Erick, for the answers,

it was new to me that "the*" is expanded to all known "the*" words in the
index. Good to know.


And yes, the AND operation between the query terms is certainly the
problem. (I would like to switch to OR instead. The result set will grow
the more words you search for, but as the results are ordered by
hit quality this would be OK. But the customer does not like this
behaviour, because he thinks that the more words you search for,
the smaller the result set should become. So this is not an option.)


On 28.05.2010 22:06, Chris Hostetter wrote:

word2*) ... in the client, that you instead consider using multiple
fields -- one text defined as you have it now, and one text_prefix
defined similarly but with an additional EdgeNGramTokenFilter used when
indexing to generate prefix tokens. then search those fields using
dismax...

q=word1 word2 word3  qf=text text_prefix  mm=100%  tie=0


Ok, I will think about this. But I wonder if this will be more efficient 
than just not filtering stopwords? (But I have to study the EdgeNGram 
thing first. AFAIK it indexes all WORDS as WORDS, WORD, WOR, WO. So the 
index will be blown up, too?)


What I do not understand about your idea is why I should use a second
text_prefix field. Wouldn't it work with just this text_prefix field,
without the normal text field, too, as I always search for "word" and
"word*" and never without the prefix?


Thanks,
Gert


Re: Prefix-Search with Stopwords - no results?

2010-05-29 Thread Erick Erickson
Well, the index does, indeed, get bigger. But the searches
get much faster because there's no term expansion going
on. It's another time/space tradeoff. I'm afraid you'll have
to just experiment a bit to see if this is an acceptable tradeoff
in your particular situation.

The real memory hit in Lucene comes from *sorting* a field
with many unique terms. I don't think you'll sort on the NGram
field, and disk space is cheap.

Best
Erick




Prefix-Search with Stopwords - no results?

2010-05-28 Thread Gert Brinkmann


Hello,

I am having some problems with solr 1.4. I am indexing and querying data 
using the following fieldType:



<fieldType name="text_de_de" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords_de_de.txt"
            enablePositionIncrements="true"
    />
    <filter class="solr.LengthFilterFactory" min="2" max="200"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de_de.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords_de_de.txt"
            enablePositionIncrements="true"
    />
    <filter class="solr.LengthFilterFactory" min="2" max="200"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>


The application that is using solr does prepare the search string to 
filter out some dangerous characters like brackets and wildcards, etc, 
that otherwise might lead to a wrong query syntax.


All words are searched for as a normal word as well as a prefix. E.g.:
"für solr" is converted by the application to

  (für OR für*) AND (solr OR solr*)

This works fine for normal words. But if you have a stopword like "für"
in this example, the query will be stopword filtered by Solr to
something like this:

  (für*) AND (solr OR solr*)

The problem now is (as I think) that there is no "für*" anymore in the
indexed data, because it was stopword filtered, too. If someone now
copy-pastes a sentence from an indexed document that contains a
stopword, this document will not be found by Solr.


The enablePositionIncrements="true" setting is (AFAIU) only for querying
phrases, but not for my case of "word OR word*" queries.


So, what should I do? Is there a better filter combination that I could 
try? Or am I doing something wrong conceptually? The only solution that 
I have found working is to not use stopword filtering at all.


Greetings,
Gert



Re: Prefix-Search with Stopwords - no results?

2010-05-28 Thread Erick Erickson
Hmmm, I don't really see the problem here. I'll have to use English
examples...

Searching on "the*" (assuming "the" is a stopword) will search on
(them OR theory OR thespian), assuming those three words are in
your index. It will NOT search on "the". So I think you're OK, or are
you seeing anomalous results?

Conceptually, the underlying Lucene looks through your *existing* list of
terms for the field to assemble a clause containing the OR of all the
terms that match the wildcard. Since "the" isn't in the index, it doesn't
get included.
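
As a toy illustration of that expansion (plain Python, not the actual Lucene code; the term set is a made-up example of what is left in the index after stopword removal):

```python
def expand_prefix(index_terms, prefix):
    """Collect every existing index term that starts with the prefix."""
    return sorted(t for t in index_terms if t.startswith(prefix))


# "the" itself was dropped at index time because it is a stopword.
index_terms = {"them", "theory", "thespian", "solr"}
expanded = expand_prefix(index_terms, "the")
print(" OR ".join(expanded))  # them OR theory OR thespian
```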

HTH
Erick




Re: Prefix-Search with Stopwords - no results?

2010-05-28 Thread Chris Hostetter

: Searching on the* (assuming the is a stopword) will search on
: (them OR theory OR thespian) assuming those three words are in
: your index. It will NOT search on the. So I think you're OK, or are
: you seeing anomalous results?

i think the missing pieces to the puzzle here are:

1) wildcard and prefix queries aren't analyzed, so "the*" (or "für*")
doesn't get analyzed, and the system has no way of spotting that it's a
stopword that should be removed from the query -- nor should it in general,
since the fact that "the" is a stopword doesn't mean "the*" is an invalid
query.  I could very conceivably be trying to find words like "thespian"

2) by using the AND operator you are forcing both clauses to match...

:   (für*) AND (solr OR solr*)

...so that query will only turn up results if a document containing a word
that starts with "solr" and a word that starts with "für" exists in your
index.

:  The problem now is (as I think) that there is no für* anymore in the
:  indexed data, because it was stopword filtered, too. If now someone

the *word* "für" doesn't exist in your index because it's a stopword, but
there may be other words in your index starting with the prefix "für" --
and if those words appear in documents that also contain words starting
with "solr" then you will actually get matches.
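
A small sketch of that behaviour in plain Python. It ignores stemming and the other filters, and the two sample documents are invented for illustration:

```python
STOPWORDS = {"für"}


def index_terms(text):
    """Lowercase, split on whitespace, and drop stopwords, as at index time."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def matches_all_prefixes(terms, prefixes):
    """AND of prefix clauses: each prefix must match at least one indexed term."""
    return all(any(t.startswith(p) for t in terms) for p in prefixes)


doc_a = index_terms("Handbuch für Solr")  # stopword "für" is not indexed
doc_b = index_terms("Fürsorge mit Solr")  # "fürsorge" starts with "für"

a_hit = matches_all_prefixes(doc_a, ["für", "solr"])
b_hit = matches_all_prefixes(doc_b, ["für", "solr"])
print(a_hit, b_hit)  # False True
```

The document that literally contains "für" is the one that is missed, which is the effect described in the original question.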

:  So, what should I do? Is there a better filter combination that I could
:  try? Or am I doing something wrong conceptually? The only solution that I
:  have found working is to not use stopword filtering at all.


I would suggest that instead of your existing approach of taking "word1
word2 word3 ..." and converting it to "(word1 OR word1*) AND (word2 OR
word2*) ..." in the client, you consider using multiple
fields -- one "text" defined as you have it now, and one "text_prefix"
defined similarly but with an additional EdgeNGramTokenFilter used when
indexing to generate prefix tokens. then search those fields using
dismax...

q=word1 word2 word3  qf=text text_prefix  mm=100%  tie=0
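
To make the "prefix tokens" idea concrete, here is a minimal sketch in plain Python of what an edge n-gram step produces and why lookups then need no wildcard. The function name, the minimum gram size of 2, and the sample documents are assumptions for illustration, not Solr's implementation:

```python
def edge_ngrams(term, min_gram=2, max_gram=20):
    """Return the leading substrings of term, from min_gram up to max_gram chars."""
    longest = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, longest + 1)]


grams = edge_ngrams("words")
print(grams)  # ['wo', 'wor', 'word', 'words']

# Build a tiny "text_prefix" index: every gram points at the documents containing it.
docs = {1: "solr words", 2: "lucene"}
prefix_index = {}
for doc_id, text in docs.items():
    for term in text.split():
        for gram in edge_ngrams(term):
            prefix_index.setdefault(gram, set()).add(doc_id)

# A prefix lookup is now an exact term lookup, with no term expansion.
hits = prefix_index.get("wor", set())
print(hits)  # {1}
```

The index holds more terms, but each query term is matched directly.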



-Hoss


Highlighting on Prefix-Search Bug/Workaround (Re: query with stemming, prefix and fuzzy?)

2009-02-04 Thread Gert Brinkmann
Mark Miller wrote:

 Currently I think about dropping the stemming and only use
 prefix-search. But as highlighting does not work with a prefix house*
 this is a problem for me. The hint to use house?* instead does not
 work here.
   
 Thats because wildcard queries are also not highlightable now. I
 actually have somewhat of a solution to this that I'll work on soon
 (I've gotten the ground work for it in or ready to be in Lucene). No
 guarantee on when or if it will be accepted in solr though.

As I am writing in Perl (using WebService::Solr), I found a workaround:
use the Search::Tools module to highlight manually in those
cases where Solr does not return snippets. This seems to work fine, but the
drawback is that I need Solr to return the full data field in a query.
This can be expensive on larger documents. But I hope this is just a
temporary workaround until Solr 1.4...

Thanks,
Gert



Re: prefix-search ignores the lowerCaseFilter

2007-10-29 Thread Martin Grotzke

On Thu, 2007-10-25 at 10:48 -0400, Yonik Seeley wrote:
 On 10/25/07, Max Scheffler [EMAIL PROTECTED] wrote:
  Is it possible that the prefix-processing ignores the filters?
 
 Yes, It's a known limitation that we haven't worked out a fix for yet.
 The issue is that you can't just run the prefix through the filters
 because of things like stop words, stemming, minimum length filters,
 etc.
What about not having only facet.prefix but additionally
facet.filtered.prefix that runs the prefix through the filters?
Would that be possible?

Cheers,
Martin

 
 -Yonik
 





Re: prefix-search ignores the lowerCaseFilter

2007-10-29 Thread Yonik Seeley
On 10/29/07, Martin Grotzke [EMAIL PROTECTED] wrote:
 On Thu, 2007-10-25 at 10:48 -0400, Yonik Seeley wrote:
  On 10/25/07, Max Scheffler [EMAIL PROTECTED] wrote:
   Is it possible that the prefix-processing ignores the filters?
 
  Yes, It's a known limitation that we haven't worked out a fix for yet.
  The issue is that you can't just run the prefix through the filters
  because of things like stop words, stemming, minimum length filters,
  etc.

 What about not having only facet.prefix but additionally
 facet.filtered.prefix that runs the prefix through the filters?
 Would that be possible?

The underlying issue remains - it's not safe to treat the prefix like
any other word when running it through the filters.

-Yonik


Re: prefix-search ignores the lowerCaseFilter

2007-10-29 Thread Martin Grotzke

On Mon, 2007-10-29 at 13:31 -0400, Yonik Seeley wrote:
 On 10/29/07, Martin Grotzke [EMAIL PROTECTED] wrote:
  On Thu, 2007-10-25 at 10:48 -0400, Yonik Seeley wrote:
   On 10/25/07, Max Scheffler [EMAIL PROTECTED] wrote:
Is it possible that the prefix-processing ignores the filters?
  
   Yes, It's a known limitation that we haven't worked out a fix for yet.
   The issue is that you can't just run the prefix through the filters
   because of things like stop words, stemming, minimum length filters,
   etc.
 
  What about not having only facet.prefix but additionally
  facet.filtered.prefix that runs the prefix through the filters?
  Would that be possible?
 
 The underlying issue remains - it's not safe to treat the prefix like
 any other word when running it through the filters.
Yes, the user of this feature should definitely know what it
does, but at least there would be the possibility to run the prefix
through e.g. a LowerCaseFilter. After all, the user knows what filters
he has configured. E.g. if you only want a case-insensitive prefix test,
something like a facet.filtered.prefix would be really valuable.

Cheers,
Martin


 
 -Yonik
 





prefix-search ignores the lowerCaseFilter

2007-10-25 Thread Max Scheffler

Hi,

I want to perform a prefix search which ignores case. To do this I
created a fieldType called suggest:


<fieldType name="suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Entries (terms) could be 'foo', 'bar'...

A request like

http://localhost:8983/solr/select/?rows=0&facet=true&q=*:*&facet.field=suggest&facet.prefix=f

returns things like

<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
    <lst name="suggest">
      <int name="foo">12</int>
    </lst>
  </lst>
</lst>

But a request like
http://localhost:8983/solr/select/?rows=0&facet=true&q=*:*&facet.field=suggest&facet.prefix=F

returns just:

<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
    <lst name="suggest"/>
  </lst>
</lst>

That's not what I expected, because the field definition contains a
LowerCaseFilter.


Is it possible that the prefix-processing ignores the filters?

Max


Re: prefix-search ignores the lowerCaseFilter

2007-10-25 Thread Yonik Seeley
On 10/25/07, Max Scheffler [EMAIL PROTECTED] wrote:
 Is it possible that the prefix-processing ignores the filters?

Yes, it's a known limitation that we haven't worked out a fix for yet.
The issue is that you can't just run the prefix through the filters
because of things like stop words, stemming, minimum length filters,
etc.

-Yonik
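
Since the prefix is not run through the field's filters, one possible client-side workaround is to lowercase the prefix before sending the request. This is only a sketch under the assumption that lowercasing is the only normalization the field applies; the function name is hypothetical, and the base URL and field name are taken from the example request in this thread:

```python
from urllib.parse import urlencode


def facet_prefix_url(base, field, prefix):
    """Build a facet request whose prefix is lowercased on the client side."""
    params = [
        ("rows", 0),
        ("facet", "true"),
        ("q", "*:*"),
        ("facet.field", field),
        ("facet.prefix", prefix.lower()),  # mimic the LowerCaseFilter manually
    ]
    return base + "?" + urlencode(params)


url = facet_prefix_url("http://localhost:8983/solr/select/", "suggest", "F")
print(url)
```

This does not help with stemming or stop words, which is the reason given above for not analyzing prefixes in general.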