Re: Wildcard-Search Solr 3.5.0

2012-06-03 Thread Erick Erickson
Chiming in late here, just back from vacation. But off the top of my
head, I don't see any reason SnowballPorterFilterFactory shouldn't
be MultiTermAware.

I've created https://issues.apache.org/jira/browse/SOLR-3503 as
a placeholder.

Erick




Re: Wildcard-Search Solr 3.5.0

2012-06-03 Thread Erick Erickson
And I closed the JIRA; see the comments there. The short form is that
it's not worth the effort because of the edge cases. Jack writes
up some of them. For instance, what does stemming
do with a term like organiz*? Sure, it would produce one token (which is
the main restriction on a MultiTermAware filter), but the output
might not be anything equivalent to the stem of organization, maybe
not even of organize. Better to avoid that rat-hole; it seems like one of those
problems that could suck up enormous amounts of time and _still_ not
do what's expected.

If you _really_ want to try this, you could always define your own
multiterm analysis component that included the stemmer, see:
http://www.lucidimagination.com/blog/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/
But don't say I didn't warn you <G>...
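
A rough sketch of what such a hand-written multiterm chain could look like in
schema.xml, assuming Solr 3.6 or later, where analyzer type="multiterm" is
recognized. The field type name text_de_stemwild is made up for illustration,
and the filter selection is an assumption rather than a tested setup. It mainly
shows where the stemmer would go; the caveat above still applies, since the
stemmer treats a prefix like organiz as if it were a complete word.

```xml
<!-- Hypothetical field type: the name text_de_stemwild is invented for this sketch. -->
<fieldType name="text_de_stemwild" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
  </analyzer>
  <!-- Applied to wildcard, prefix and similar multiterm queries only.
       The chain must emit exactly one token, hence the keyword tokenizer. -->
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Risky part: the stemmer sees the wildcard prefix as a whole word. -->
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
  </analyzer>
</fieldType>
```

If you try this, check the result for a handful of prefixes on the field
analysis page before trusting it.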

Best
Erick




RE: Wildcard-Search Solr 3.5.0

2012-05-25 Thread spring
Oh, thanks for the update! I didn't notice that Solr 3.6 has a text_de field
type. These two options... less / more aggressive. Aggressive in terms of
what?

Thank you!




Re: Wildcard-Search Solr 3.5.0

2012-05-25 Thread Jack Krupansky
I don't know the specific rules in these specific stemmers, but generally a 
less aggressive stemming (e.g., plural-only) of paintings would be 
painting, while a more aggressive stemming would be paint. For some 
aggressive stemmers the stemmed word is not even a word.


It would be nice to have doc with some example words for each stemmer.

-- Jack Krupansky





RE: Wildcard-Search Solr 3.5.0

2012-05-25 Thread spring
> I don't know the specific rules in these specific stemmers, but generally a
> less aggressive stemming (e.g., plural-only) of paintings would be
> painting, while a more aggressive stemming would be paint. For some
> aggressive stemmers the stemmed word is not even a word.

Sounds logical :)

> It would be nice to have doc with some example words for each stemmer.

Absolutely!

Thanks a lot!



Re: Wildcard-Search Solr 3.5.0

2012-05-24 Thread Jack Krupansky
I tried it and it does appear to be the SnowballPorterFilterFactory that 
normally does the accent folding but can't here because it is not multi-term 
aware. I did notice that the text_de field type that comes in the Solr 3.6 
example schema handles your case fine. It uses the 
GermanNormalizationFilterFactory to fold accented characters and is 
multi-term aware. Any particular reason you're not using the stock text_de 
field type? It also has three stemming options which might be sufficient for 
your needs.


In any case, try to make your text_de field type closer to the stock 
version, and try to use GermanNormalizationFilterFactory, and that may be 
good enough for your situation.
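
For comparison, the stock text_de type looks roughly like the following. This
is reproduced from memory of the Solr 3.6 example schema, so treat the
attribute values (in particular the stopword file path) as assumptions and
check them against the schema.xml that ships with your release. The two
commented-out filters are the less and more aggressive alternatives to the
default light stemmer.

```xml
<!-- Approximate copy of the stock German type; verify against your own schema.xml. -->
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_de.txt" format="snowball"
            enablePositionIncrements="true"/>
    <!-- Folds umlauts and sharp s; this is the multi-term aware part. -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <!-- Default stemmer. -->
    <filter class="solr.GermanLightStemFilterFactory"/>
    <!-- Less aggressive: <filter class="solr.GermanMinimalStemFilterFactory"/> -->
    <!-- More aggressive: <filter class="solr.SnowballPorterFilterFactory" language="German2"/> -->
  </analyzer>
</fieldType>
```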


-- Jack Krupansky




RE: Wildcard-Search Solr 3.5.0

2012-05-23 Thread spring
Does no one have an idea?

Thx.





Re: Wildcard-Search Solr 3.5.0

2012-05-23 Thread Dmitry Kan
What about bä*? Any hits?

-- Dmitry


-- 
Regards,

Dmitry Kan


RE: Wildcard-Search Solr 3.5.0

2012-05-23 Thread spring
No. No hits for bä*.
It's something with the umlauts but I have no idea what...




Re: Wildcard-Search Solr 3.5.0

2012-05-23 Thread Dmitry Kan
Do the umlauts arrive properly on the server side, with no encoding issues?
Check the query params echoed in the response (XML/JSON/...). Set debugQuery
to true as well, to see if it produces any useful diagnostic info.


-- 
Regards,

Dmitry Kan


RE: Wildcard-Search Solr 3.5.0

2012-05-23 Thread spring
 

> do umlauts arrive properly on the server side, no encoding
> issues?

Yes, that works fine.

It must, since I get hits for Bär or bär.
It's just the combination of umlauts and wildcards.
It must be something with the automagic multiterm feature in Solr 3.6.




Re: Wildcard-Search Solr 3.5.0

2012-05-23 Thread Jens Grivolla
Maybe a filter like ISOLatin1AccentFilter that doesn't get applied when 
using wildcards? How do the terms actually appear in the index?


Jens




RE: Wildcard-Search Solr 3.5.0

2012-05-23 Thread spring
> Maybe a filter like ISOLatin1AccentFilter that doesn't get applied when
> using wildcards? How do the terms actually appear in the index?

Bär gets indexed as bar.

I don't use ISOLatin1AccentFilter. My field definition is this:

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
  </analyzer>
</fieldType>
</types>



RE: Wildcard-Search Solr 3.5.0

2012-05-23 Thread Michael Ryan
I'd guess that this is because SnowballPorterFilterFactory does not implement 
MultiTermAwareComponent. Not sure, though.

-Michael


RE: Wildcard-Search Solr 3.5.0

2012-05-23 Thread spring
> I'd guess that this is because SnowballPorterFilterFactory
> does not implement MultiTermAwareComponent. Not sure, though.

Yes, I think this prevents the automagic multiterm awareness from doing its
job.
Could a custom analyzer chain with analyzer type="multiterm" help? As
described (very, very briefly, too briefly...) here:
http://wiki.apache.org/solr/MultitermQueryAnalysis
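
A minimal sketch of such a chain, assuming Solr 3.6 or later and assuming it is
added inside the existing <fieldType name="text_de"> element, next to the index
and query analyzers. It is an untested illustration, not a confirmed fix: the
idea is only that Bä* would be lowercased and folded to ba*, which can then
match the indexed term bar. GermanNormalizationFilterFactory is the 3.6 filter
mentioned elsewhere in this thread; leaving the stemmer out of this chain is a
deliberate choice for the sketch.

```xml
<!-- Fragment: goes inside the text_de fieldType, after the query analyzer. -->
<analyzer type="multiterm">
  <!-- Keeps the wildcard term as a single token. -->
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <!-- Foo* becomes foo*, matching the lowercased index terms. -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Folds the umlaut, so bä* becomes ba*. -->
  <filter class="solr.GermanNormalizationFilterFactory"/>
</analyzer>
```

Because the index still holds stemmed terms, a long prefix can run past the end
of the stem and miss documents, so short prefixes are the safer case here.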



RE: Wildcard-Search Solr 3.5.0

2012-05-22 Thread spring
> > The text may contain FooBar.
> >
> > When I do a wildcard search like this: Foo* - no hits.
> > When I do a wildcard search like this: foo* - doc is found.
>
> Please see http://wiki.apache.org/solr/MultitermQueryAnalysis


Well, it works in 3.6. With one exception: if I use German umlauts, it does
not work anymore.

Text: Bär

Bä* - no hits
Bär - hits

What can I do in this case?

Thank you



Re: Wildcard-Search Solr 3.5.0

2012-05-20 Thread Ahmet Arslan
> The text may contain FooBar.
>
> When I do a wildcard search like this: Foo* - no hits.
> When I do a wildcard search like this: foo* - doc is found.

Please see http://wiki.apache.org/solr/MultitermQueryAnalysis


RE: Wildcard-Search Solr 3.5.0

2012-05-20 Thread spring
Hi Ahmet,

> Please see http://wiki.apache.org/solr/MultitermQueryAnalysis

So your advice is to upgrade to 3.6?

Thank you



RE: Wildcard-Search Solr 3.5.0

2012-05-20 Thread Ahmet Arslan

> so your advice is to upgrade to 3.6?

Or, as a workaround, you can lowercase wildcard queries on the client side.