Re: solr wildcard queries and analyzers

2011-01-12 Thread Kári Hreinsson
Have you made any progress?  Since AnalyzingQueryParser doesn't extend
QParserPlugin, Solr won't use it as-is, but I guess we could implement a
similar parser that does extend QParserPlugin?

Switching parsers seems to be what is needed?  Has no one really solved this
before?

- Kári

Re: solr wildcard queries and analyzers

2011-01-12 Thread Jayendra Patil
Had the same issues with international characters and wildcard searches.

One workaround we implemented was to index the field both with and without the
ASCIIFoldingFilterFactory.
You would have an original field and one with the English equivalents, to be
used during searching.

Wildcard searches with either the English equivalents or the international
terms would then match one of those fields.
Also, lowercase the search terms if you are using the LowerCaseFilter during
indexing.

Regards,
Jayendra
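
The two-field workaround described above can be simulated with a rough Python sketch. This is not Solr configuration or Solr code: the field names `text_orig` and `text_folded`, the helper functions, and the three-character `FOLD` table are all hypothetical and only cover the example word from this thread.

```python
# Hypothetical stand-in for ASCIIFoldingFilterFactory (example word only).
FOLD = str.maketrans({"á": "a", "ö": "o", "ð": "d"})

def index_doc(text):
    """Build the terms of two hypothetical fields for one document."""
    original = [tok.lower() for tok in text.split()]   # lowercased only
    folded = [tok.translate(FOLD) for tok in original]  # lowercased + folded
    return {"text_orig": set(original), "text_folded": set(folded)}

def wildcard_match(doc, prefix):
    """Prefix search across both fields, lowercasing the term client-side."""
    prefix = prefix.lower()
    return any(term.startswith(prefix)
               for terms in doc.values() for term in terms)

doc = index_doc("sjálfsögðu")
print(wildcard_match(doc, "Sjálf"))     # True, via text_orig
print(wildcard_match(doc, "sjalf"))     # True, via text_folded
print(wildcard_match(doc, "sjálfsog"))  # False, mixed spelling fits neither
```

The last call shows a limit of this approach in the sketch: a prefix that mixes accented and unaccented characters matches neither field.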

Re: solr wildcard queries and analyzers

2011-01-12 Thread Matti Oinas
I'm a little busy right now, but I'm going to try to find a suitable
parser; if none is found, I think the only solution is to write
a new one.

solr wildcard queries and analyzers

2011-01-11 Thread Kári Hreinsson
Hi,

I am having a problem with the fact that no text analysis is performed on
wildcard queries.  I have the following field type (a bit simplified):
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.ASCIIFoldingFilterFactory" />
      </analyzer>
    </fieldType>

My problem has to do with Icelandic characters.  When I index a document with a
text field containing the word "sjálfsögðu", it gets indexed as "sjalfsogdu"
(because the ASCIIFoldingFilterFactory replaces the Icelandic characters
with their English equivalents).  Then, when I search (without a wildcard) for
"sjálfsögðu" or "sjalfsogdu", I get that document as a result.  This is
convenient since it lets people search without accented characters and still
get the results they want (e.g. if they are working on computers with English
keyboards).

However, this all falls apart with wildcard searches.  The search string is
then not passed through the filters, so if I search for "sjálf*" I don't get
any results, because the index doesn't contain the original words (I do get
results if I search for "sjalf*").  I know people have had a similar problem
with the case sensitivity of wildcard queries, and most often the solution
seems to be to lowercase the string before passing it on to Solr, which is not
exactly an optimal solution (yet a simple one in that case).  The Icelandic
characters complicate things a bit, and applying the same solution (doing the
lowercasing and character mapping) in my application seems like unnecessary
duplication of code that is already part of Solr, not to mention the added
complexity in my application and possible maintenance down the road.
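
To make the mismatch concrete, here is a minimal Python sketch. It is not Solr code: `analyze`, `prefix_search`, and the three-character `FOLD` table are hypothetical stand-ins for the analyzer chain above and cover only the characters in the example word.

```python
import unicodedata

# Hypothetical stand-in for ASCIIFoldingFilterFactory: only the three
# characters needed for the example word are mapped.
FOLD = str.maketrans({"á": "a", "ö": "o", "ð": "d"})

def analyze(text):
    """Whitespace-tokenize, trim, lowercase, and fold each token."""
    text = unicodedata.normalize("NFC", text)
    return [tok.strip().lower().translate(FOLD) for tok in text.split()]

# Index-time analysis: only the folded form is stored.
index = set(analyze("sjálfsögðu"))

def prefix_search(prefix, analyzed):
    """Match index terms by prefix; analyzed=False mimics a wildcard query."""
    if analyzed:
        prefix = analyze(prefix)[0]
    return sorted(term for term in index if term.startswith(prefix))

print(prefix_search("sjálf", analyzed=False))  # []
print(prefix_search("sjalf", analyzed=False))  # ['sjalfsogdu']
print(prefix_search("sjálf", analyzed=True))   # ['sjalfsogdu']
```

The first call is the failing case: the raw prefix is compared against folded terms and can never match.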

Is there any way around this?  How are people solving this?  Is there a way to
apply the filters to wildcard queries?  I guess removing the
ASCIIFoldingFilterFactory is the simplest solution, but the normalization done
by that filter is often very useful.

I hope I'm not overlooking some obvious explanation. :/

Thanks in advance,
Kári Hreinsson


Re: solr wildcard queries and analyzers

2011-01-11 Thread Matti Oinas
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Analyzers

On wildcard and fuzzy searches, no text analysis is performed on the
search word.

Re: solr wildcard queries and analyzers

2011-01-11 Thread Matti Oinas
Sorry, the message was not meant to be sent here. We are struggling
with the same problem here.

Re: solr wildcard queries and analyzers

2011-01-11 Thread Matti Oinas
This might be the solution.

http://lucene.apache.org/java/3_0_2/api/contrib-misc/org/apache/lucene/queryParser/analyzing/AnalyzingQueryParser.html
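
As I understand it, the general idea behind such a parser is to run the literal parts of a wildcard term through the analyzer while leaving the wildcard characters alone. A rough Python sketch of that idea follows. It is not Lucene's implementation: `analyze_wildcard` and the three-character `FOLD` table are hypothetical, and lowercasing plus folding stands in for the real analyzer chain.

```python
import re

# Hypothetical stand-in for ASCIIFoldingFilterFactory (example word only).
FOLD = str.maketrans({"á": "a", "ö": "o", "ð": "d"})

def analyze_wildcard(term):
    """Lowercase and fold the literal parts of a term, keeping * and ? intact."""
    # The capturing group makes re.split keep the wildcard characters.
    parts = re.split(r"([*?])", term)
    return "".join(
        part if part in ("*", "?") else part.lower().translate(FOLD)
        for part in parts
    )

print(analyze_wildcard("Sjálf*"))     # sjalf*
print(analyze_wildcard("s?álfsög*"))  # s?alfsog*
```

The rewritten term can then be matched against the folded index terms, which is what the unanalyzed wildcard query fails to do.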
