Solr Phonetic Search returning documents but not Highlight Information

2013-07-04 Thread snkar
We have a pretty simple Solr Schema:

<fields>
  <field name="DocId" type="long" indexed="true" stored="true" required="true"/>
  <field name="DocTitle" type="string" indexed="true" stored="true" required="true"/>
  <field name="Content" type="text_general" indexed="false" stored="true" required="true"/>

  <field name="ContentSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <field name="ContentSearchStemming" type="text_stem" indexed="true" stored="false" multiValued="true"/>
  <field name="ContentSearchPhonetic" type="text_phonetic" indexed="true" stored="false" multiValued="true"/>
  <field name="ContentSearchSynonym" type="text_synonym" indexed="true" stored="false" multiValued="true"/>
  <field name="_version_" type="long" indexed="true" stored="true"/>
</fields>

<uniqueKey>DocId</uniqueKey>
<copyField source="Content" dest="ContentSearch"/>
<copyField source="Content" dest="ContentSearchStemming"/>
<copyField source="Content" dest="ContentSearchPhonetic"/>
<copyField source="Content" dest="ContentSearchSynonym"/>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_stem" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_phonetic" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="Soundex" inject="false"/>
  </analyzer>
</fieldType>

<fieldType name="text_synonym" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

We are indexing documents in Solr using SolrNet and need to support phonetic
search based on the Soundex algorithm. Once documents are indexed, we can
search in the Solr Admin panel using a phonetic query; the relevant document
is returned in the search results, but the highlight collection is empty.

Example use case:
--
We index a text document which contains the word electromagnetic (Soundex
code: E423).
We execute a search in the Solr Admin panel using the following query:
ContentSearchPhonetic:electing (Soundex code: E423).
The search returns one document, but the highlight collection is empty.
Solr is definitely using the Soundex algorithm to locate the document, since
the word electing is not present in the document. But somehow it is not able
to return the highlight data.
The same schema and config successfully return documents along with
highlight data for other approximate searches such as synonym, fuzzy or
stemming. Only for phonetic search do we get no highlight data.
The screenshot from the Solr Admin panel is shown below:
http://lucene.472066.n3.nabble.com/file/n4075492/HighlightIssue.png 
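
For reference, the two Soundex codes quoted above can be checked with a small script. This is a minimal sketch of classic American Soundex in Python, not the encoder that Solr's PhoneticFilterFactory actually runs, which may differ in edge cases. The function name soundex is made up for illustration.

```python
def soundex(word: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    # Letter groups that share a Soundex digit.
    groups = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
              ("L", "4"), ("MN", "5"), ("R", "6"))
    codes = {ch: digit for letters, digit in groups for ch in letters}

    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""

    result = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]  # pad or truncate to four characters


if __name__ == "__main__":
    for term in ("electromagnetic", "electing"):
        print(term, soundex(term))
```

Both words reduce to E423, which is why the query matches. With inject=false the phonetic field holds only such codes in place of the original tokens.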



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Phonetic-Search-returning-documents-but-not-Highlight-Information-tp4075492.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Stemming query in Solr

2013-07-01 Thread snkar
Hi Erick,

Thanks for the reply.

Here is what the situation is:

Relevant portion of Solr Schema:
<field name="Content" type="text_general" indexed="false" stored="true" required="true"/>
<field name="ContentSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="ContentSearchStemming" type="text_stem" indexed="true" stored="false" multiValued="true"/>
<copyField source="Content" dest="ContentSearch"/>
<copyField source="Content" dest="ContentSearchStemming"/>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_stem" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"/>
  </analyzer>
</fieldType>

When I index a document, the content is stored as is in the Content field
and copied over to ContentSearch and ContentSearchStemming for text-based
search and stemming search respectively. So the ContentSearchStemming field
does store the stemmed (reduced) form of the terms. I have checked this with
Luke as well as with Admin Schema Browser --> Term Info. In the Admin
Analysis screen, I have tested and found that if I index the text burning, it
gets reduced to and stored as burn. So far so good.

Now in the UI:
 - Let's say the user enters the term burn and checks the stemming option.
   The expectation is that, since the user has specified stemming, results
   should be returned for the term burn as well as for all terms whose stem
   is burn, i.e. burning, burned, burns, etc.
 - Let's say the user enters the term burning and checks the stemming
   option. The expectation is that, since the user has specified stemming,
   results should be returned for the term burning as well as for all terms
   whose stem is burn, i.e. burn, burned, burns, etc.

The query that gets submitted to Solr: q=ContentSearchStemming:burning
From the debug info:
<str name="rawquerystring">ContentSearchStemming:burning</str>
<str name="querystring">ContentSearchStemming:burning</str>
<str name="parsedquery">ContentSearchStemming:burn</str>
<str name="parsedquery_toString">ContentSearchStemming:burn</str>
So, when the results are returned, I only get hits highlighted for the term
burn, though the same document contains terms like burning and burns.

I thought that stemming should work like this:
1> The stemming filter in the query analyzer chain reduces the input word to
   its stem: burning --> burn
2> The query component scans through the terms and matches those whose stem
   equals the stem of the input term: burns --> burn (matches),
   burning --> burn (matches)
The first point is happening. But it looks like it is executing the search
for an exact text-based match with the stem burn. Hence, burns or burned are
not getting returned.
I hope I was able to make myself clear.

On Fri, 28 Jun 2013 05:59:37 -0700 Erick Erickson [via Lucene]
<ml-node+s472066n4073901...@n3.nabble.com> wrote:

First, this is for the Java version, I hope it extends to C#.

But in your configuration, when you're indexing the stemmer
should be storing the reduced form in the index. Then, when
searching, the search should be against the reduced term.
To check this, try
1> Using the Admin/Analysis page to see what gets stored
   in your index and what your query is transformed to, to
   ensure that you're getting what you expect.

If you want to get in deeper to the details, try
1> use, say, the TermsComponent or Admin/Schema Browser
   or Luke to look in your index and see what's actually
   there.
2> use &debug=query or Admin/Analysis to see what the query
   actually looks like.

Both your use-cases should work fine just with reduction 
_unless_ the particular word you look for doesn't happen to 
trip the stemmer. By that I mean that since it's algorithmically 
based, there may be some edge cases that seem like they 
should be reduced that aren't. I don't know whether fisherman 
would reduce to fish for instance. 

So are you seeing things that really don't work as expected or 
are you just working from the docs? Because I really don't 
see why you wouldn't get what you want given your description. 

Best 
Erick 



Re: Stemming query in Solr

2013-07-01 Thread snkar

> So the general solution is to index the field twice, once with stemming and
> once without in order to have the ability to do both stemmed and exact
> matches

I am already indexing the text twice, using the ContentSearch and
ContentSearchStemming fields. But all this allows me to do is return burning
as well as burn when the user specifies burning as the input search term,
burning being the exact match:

ContentSearch:burning + ContentSearchStemming:burn (reduced from
ContentSearchStemming:burning)

What I cannot figure out is how this helps me instruct Solr to execute the
query for the different grammatical variations of the stem of the input
search term, i.e. a stemming query for burning expanding to a text-based
query for burn, burns, burned, burning, etc.

You mentioned something about synonyms. This is also mentioned in the Solr
Wiki:

> A related technology to stemming is lemmatization, which allows for
> stemming by expansion, taking a root word and 'expanding' it to all of its
> various forms. Lemmatization can be used either at insertion time or at
> query time. Lucene/Solr does not have built-in support for lemmatization
> but it can be simulated by using your own dictionaries and the
> SynonymFilterFactory

I think what I need is exactly this point:

> Lucene/Solr does not have built-in support for lemmatization but it can be
> simulated by using your own dictionaries and the SynonymFilterFactory

But I am not sure how to go about it, or exactly how synonyms can help me
here, as I am not looking for synonyms but rather for the different
expansions of the stemmed word.
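
One possible reading of the wiki's suggestion is to treat all forms of a word as one synonym group, so that a filter configured with expand=true maps any form to every other form. The sketch below, in Python, only generates candidate lines for a synonyms.txt file in the comma-separated format. The WORD_FORMS table and the synonym_lines helper are hypothetical and hand-written here; a real table would have to come from your own dictionary.

```python
from typing import Dict, List

# Hypothetical, hand-written table of word forms (root -> inflected forms).
WORD_FORMS: Dict[str, List[str]] = {
    "burn": ["burns", "burned", "burning", "burnt"],
    "apply": ["applies", "applied", "applying"],
}


def synonym_lines(word_forms: Dict[str, List[str]]) -> List[str]:
    """Build one comma-separated synonym group per root word."""
    lines = []
    for root in sorted(word_forms):
        seen: List[str] = []
        for form in [root] + word_forms[root]:
            form = form.lower()
            if form not in seen:  # drop duplicates, keep first-seen order
                seen.append(form)
        lines.append(", ".join(seen))
    return lines


if __name__ == "__main__":
    # Each printed line is one group of equivalent terms.
    print("\n".join(synonym_lines(WORD_FORMS)))
```

Whether such groups are better applied at index time or at query time depends on the setup, so this is a starting point, not a recommendation.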

On Mon, 01 Jul 2013 03:42:34 -0700 Erick Erickson [via Lucene]
<ml-node+s472066n4074311...@n3.nabble.com> wrote:

bq: But looks like it is executing the search for an exact text based
match with the stem burn.

Right. You need to appreciate index time as opposed to query time stemming.
Your field definition has both turned on. The admin/analysis page will help
here <G>..

At index time, the terms are stemmed, and _only_ the reduced term is put in
the index.
At query time, the same thing happens and _only_ the reduced term is
searched for.
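
The index-time and query-time behaviour described above can be sketched with a toy inverted index. This is only an illustration in Python: toy_stem is a made-up suffix stripper standing in for SnowballPorterFilterFactory (it is not the Snowball algorithm), and the documents are invented.

```python
from collections import defaultdict
from typing import Dict, Set


def toy_stem(token: str) -> str:
    """Toy suffix stripper; a stand-in for a real stemmer, not Snowball."""
    token = token.lower()
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Index time: only the reduced form of each token is stored."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], term: str) -> Set[int]:
    """Query time: the term is reduced the same way before the lookup."""
    return index.get(toy_stem(term), set())


DOCS = {1: "the fire burns", 2: "burning wood", 3: "a burned log", 4: "cold water"}
INDEX = build_index(DOCS)

if __name__ == "__main__":
    print(sorted(search(INDEX, "burn")))     # documents matching the stem burn
    print(sorted(search(INDEX, "burning")))  # same lookup after reduction
    print("burns" in INDEX)                  # original forms are not stored
```

In this toy model a query for burn and a query for burning hit the same documents, while the original surface forms are no longer present in the index.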

By stemming at index time, you lose the original form of the word, it's
just gone and nothing about checking/unchecking the stem bits will recover
it. So the general solution is to index the field twice, once with stemming
and once without in order to have the ability to do both stemmed and exact
matches. I think I saw a clever approach to doing this involving a custom
filter but can't find it now. As I recall it indexed the un-stemmed version
like a synonym with some kind of marker to indicate exact match when
necessary

Best 
Erick 



Re: Stemming query in Solr

2013-07-01 Thread snkar
I was just wondering if another solution might work. If we extract the stem
of the input search term (maybe using a C#-based stemmer, some open source
implementation of the Porter algorithm) when the stemming option is
selected, and submit the query to Solr as a multiple-character wildcard
query on the stem, it should return all the different variations of the
stemmed word.

Example:

Search Term: burning
Stem: burn
Modified Query: burn*
Results: burn, burning, burns, burnt, etc.

I am sure this is not the proper way of doing stemming by expansion, but it
might just get the job done. What do you think? I am trying to think of a
test case where this will fail.
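
Two candidate failure cases can be explored with a small Python sketch. Everything here is hypothetical: the vocabulary is invented, the choice of ContentSearch as the target field is an assumption, and the stem appli for applied is assumed to be what a Porter-style stemmer returns. The helper names wildcard_query and prefix_matches are made up for illustration.

```python
from typing import List


def wildcard_query(field: str, stem: str) -> str:
    """Build a multiple-character wildcard query on the stem."""
    return f"{field}:{stem}*"


def prefix_matches(stem: str, vocabulary: List[str]) -> List[str]:
    """Terms a trailing-wildcard query on the stem would match."""
    return [term for term in vocabulary if term.startswith(stem)]


# Invented vocabulary standing in for the terms of an unstemmed field.
VOCAB = ["applied", "applies", "apply",
         "burn", "burned", "burning", "burnish", "burns", "burnt"]

if __name__ == "__main__":
    print(wildcard_query("ContentSearch", "burn"))
    # Over-matching: burnish shares the prefix but is a different word.
    print(prefix_matches("burn", VOCAB))
    # Under-matching: a stem such as appli is not a prefix of apply.
    print(prefix_matches("appli", VOCAB))
```

So the prefix approach can both pull in unrelated words and miss forms whose spelling differs from the stem; irregular forms would be missed in the same way.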


Stemming query in Solr

2013-06-28 Thread snkar
We have a search system based on Solr, using the SolrNet library in C#, which
supports some advanced search features like fuzzy, synonym and stemming
search. While all of these work, *the expectation from the stemming search
seems to be a combination of stemming by reduction and stemming by expansion,
to cover grammatical variations of a word*. A use case will make it
clearer:

 - a search for fish would also find fishing
 - a search for applied would also find applying, applies, and apply

We implemented stemming using a copyField with SnowballPorterFilterFactory.
*As a result, a search for burning returns results for burning and burn, but
a search for burn does not return results for burning, burnt or burns.*

Since all stemmers supported by Lucene/Solr use stemming by reduction, we
are not sure how to go about this. As per the Solr Wiki:

 A related technology to stemming is lemmatization, which allows for
 stemming by expansion, taking a root word and 'expanding' it to all of
 its various forms. Lemmatization can be used either at insertion time or
 at query time. Lucene/Solr does not have built-in support for
 lemmatization but it can be simulated by using your own dictionaries and
 the SynonymFilterFactory

We are not sure exactly how to go about this in Solr. Any ideas?

We were also thinking of using some C#-based stemmer/lemmatizer library to
get the root of the word, and some public database like WordNet to extract
the different grammatical variations of the stem, and then sending all these
terms to Solr in the query. We have not yet done much research to find a
stable C# stemmer/lemmatizer and a WordNet C# API, but it seems like this
would get too convoluted, and there should be a way to do it from within
Solr.
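
The client-side idea above can be sketched as follows. This Python snippet is only an illustration: FORMS is a hand-written table standing in for whatever a lemmatizer plus a WordNet lookup would return, expanded_query is a made-up helper, and ContentSearch is assumed to be the unstemmed field to query.

```python
from typing import Dict, List

# Hypothetical table standing in for a lemmatizer plus WordNet lookup.
FORMS: Dict[str, List[str]] = {
    "burn": ["burns", "burned", "burning", "burnt"],
    "fish": ["fishes", "fished", "fishing"],
}


def expanded_query(field: str, term: str, forms: Dict[str, List[str]]) -> str:
    """Expand a term into an OR query over all known forms of its root."""
    term = term.lower()
    for root, variants in forms.items():
        group = [root] + variants
        if term in group:
            return f"{field}:(" + " OR ".join(group) + ")"
    # Unknown word: fall back to a plain term query.
    return f"{field}:{term}"


if __name__ == "__main__":
    print(expanded_query("ContentSearch", "burning", FORMS))
    print(expanded_query("ContentSearch", "water", FORMS))
```

The query string grows with the number of forms, which is one reason a solution inside Solr may be preferable.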



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Stemming-query-in-Solr-tp4073862.html
Sent from the Solr - User mailing list archive at Nabble.com.