Accent Characters

2012-05-24 Thread couto.vicente
Hello All.
I'm new to Solr. I have seen this subject come up a lot, but none of the
answers was satisfactory, or (more likely) I don't know how to set up the
Solr environment properly.
I indexed documents in Solr with a French content field, using the field
type "text_fr" that ships with the Solr schema.xml file.



My spellchecker is almost the same as the one that comes with solrconfig.xml:


  default
  content
  spellchecker
  
  


When I run a search query, with or without accented words, I get
good results.
But if I try spell checking, or even a facet query, Solr seems to
ignore the accented words.
I Googled it a lot but could not find a satisfactory fix.

Can anyone help me?

Thank you!


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Accent-Characters-tp3985931.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Accent Characters

2012-05-25 Thread Jack Krupansky
I tried your scenario with the Solr 3.6 example and it seemed to work fine 
and suggested an accented term for me.


Some possibilities:

1) Your term's edit distance to any accented correction was too high. Check 
your term and count how many characters must be changed to match an accented 
term. Case changes count as well. For a 4-character word, the maximum edit 
distance allowed (by default) is 2. Maybe you simply need to override the 
default for "accuracy", e.g., &spellcheck.accuracy=0.35, compared to the 
default of 0.5.
2) Did you get some other suggestion when you expected the accented term? 
If so, increase the spellcheck.count request parameter from 1 to 10 to see 
other suggestions.
3) You have some other schema/solrconfig changes that you haven't told us 
about.


Try to reproduce your issue against a fresh copy of Solr 3.6 example, and 
then see how your actual configuration (that fails) is different from the 
example.


Here's my test query and the spellcheck result:

http://localhost:8983/solr/spell?q=x%20Cafe%20y&spellcheck=true&spellcheck.collate=true&spellcheck.build=true&spellcheck.count=10

  numFound: 2
  startOffset: 2
  endOffset: 6
  suggestions:
    café
    cofe
  collation: x café y
 


And here was my test doc:

curl 'http://localhost:8983/solr/update?commit=true' \
  -H "Content-Type: text/xml" \
  --data-binary '<add><doc><field name="id">doc-c1</field><field name="content">Internet café - Café au lait - Viennese coffee house - Maid café cofe</field></doc></add>'


Here is a test query that returns zero suggestions, because the edit 
distance is greater than two (capital C, unaccented character, and extra 
character at the end):


http://localhost:8983/solr/spell?q=x%20Cafex%20y&spellcheck=true&spellcheck.collate=true&spellcheck.build=true

But, by overriding the default "accuracy" of 0.5 and dropping it to 0.35, I 
can get the expected suggestion:


http://localhost:8983/solr/spell?q=x%20Cafex%20y&spellcheck=true&spellcheck.collate=true&spellcheck.build=true&spellcheck.accuracy=0.35
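
To make the counting above concrete, here is a minimal Python sketch of the 
edit distance between the two test terms and "café". The similarity formula 
(1 minus the distance divided by the length of the longer string) is an 
assumption made for illustration, not something confirmed in this thread; it 
does happen to agree with the thresholds mentioned here (0.5 for "Cafe", and 
a value between 0.35 and 0.5 for "Cafex").

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """ASSUMPTION: score = 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


target = "caf\u00e9"  # "café" with a precomposed e-acute
for term in ("Cafe", "Cafex"):
    print(term, levenshtein(term, target), round(similarity(term, target), 2))
# Cafe 2 0.5
# Cafex 3 0.4
```

Under that assumption, "Cafex" scores 0.4, which would be rejected at an 
accuracy of 0.5 and accepted at 0.35.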

-- Jack Krupansky



Re: Accent Characters

2012-05-28 Thread couto.vicente
Hi, Jack.
First of all thank you for your help.
Well, I tried again and realized that my problem is not really with Solr.
I ran this query against Solr after starting it with the command "java
-jar start.jar":
http://localhost:8983/solr/coreFR/spell?q=content:pr%C3%A9senta&spellcheck=true&spellcheck.collate=true&rows=0&spellcheck.count=10

It gives me the result:

  status: 0
  QTime: 31
  numFound: 10
  startOffset: 8
  endOffset: 16
  suggestions:
    présente
    présent
    présenté
    présents
    présentant
    présentera
    présentait
    présentes
    présenter
    présentée
  collation: content:présente
  
  


Then I ran exactly the same query after deploying solr.war in Tomcat 7. Here
is my result:

  status: 0
  QTime: 16
  numFound: 10
  startOffset: 8
  endOffset: 16
  suggestions:
    present
    prbsent
    presentant
    presentait
    puisent
    pasent
    pensent
    posent
    dresent
    resenti
  collation: content:present
  
  


Since my application runs under Tomcat, the issue must be with Tomcat. The
odd part is that I already Googled for a fix and found that we have to set
a parameter in Tomcat's server.xml config file:



But it's not working, as you can see.
I feel a little stupid because it doesn't look like a big problem. Surely
people around the world are running Solr under Tomcat with accented queries
just fine!

Thank you



Re: Accent Characters

2012-05-28 Thread Jack Krupansky
The query seems fine - as far as the URL being UTF-8. It seems that the 
documents are not being passed to Solr with UTF-8 encoding. The document is 
not part of the URL. It is HTTP POST data.
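
For anyone checking their own URLs: a small Python sketch showing why 
"pr%C3%A9senta" is the UTF-8 form of the query term, and what the same term 
would look like if the client had percent-encoded it as ISO-8859-1 instead. 
Only the standard library is used; the term is the one from the query above.

```python
from urllib.parse import quote

term = "pr\u00e9senta"  # "présenta"

utf8_form = quote(term, encoding="utf-8")
latin1_form = quote(term, encoding="iso-8859-1")

print(utf8_form)    # pr%C3%A9senta  (two bytes for the e-acute)
print(latin1_form)  # pr%E9senta     (one byte for the e-acute)
```

If the URL shows %C3%A9, the client side of the query is sending UTF-8.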


Try an explicit curl command to add a document and see if it is indexed with 
the accents.
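
If curl is not handy, the same test can be sketched in Python. This is only 
an illustration: the field names "id" and "content", the document id and the 
sample text are assumptions, and the URL is the one from the Jetty example 
(a Tomcat deployment will likely use a different port). The point is that 
the body is encoded as UTF-8 explicitly and the header says so.

```python
import urllib.request
from xml.sax.saxutils import escape


def build_update_request(doc_id: str, content: str) -> urllib.request.Request:
    # ASSUMPTION: field names "id" and "content" are illustrative.
    xml = (
        "<add><doc>"
        f'<field name="id">{escape(doc_id)}</field>'
        f'<field name="content">{escape(content)}</field>'
        "</doc></add>"
    )
    return urllib.request.Request(
        "http://localhost:8983/solr/update?commit=true",
        data=xml.encode("utf-8"),  # explicit encoding, not the platform default
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


req = build_update_request("doc-fr1", "Il pr\u00e9sente le caf\u00e9")
print(req.get_method(), req.get_header("Content-type"))
# To send it to a running Solr instance: urllib.request.urlopen(req)
```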


-- Jack Krupansky




Re: Accent Characters

2012-05-30 Thread Vicente Couto
Hello, Jack.

Yeah, I'm screwed up.

Well, the documents are indexed with the accents.
I started a new clean solr 3.6 configuration, with as few changes as
possible; I'm running two cores, one for English and another one for French.
Here is where I am now: if I run queries through SolrJ, it does some
sort of encoding. For example, I can see in the logs that when I run a
query looking for "pré", I get

INFO: [coreFR] webapp=/solr path=/select
params={fl=*,score&q=content:pré&hl.fl=content&hl.maxAnalyzedChars=10&hl=true}
hits=0 status=0 QTime=0 

And I don't get any results. Explicitly encoding the query as UTF-8 doesn't
work either.
But if I simply put the HTTP calls into the browser address bar, for example,
it works perfectly!
So, how can I "tell" SolrJ not to encode the queries?
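
For reference, the garbled term in the log line above has the shape you get 
when the UTF-8 bytes of "pré" are read back as ISO-8859-1. A short Python 
sketch of that round trip follows; it only shows the pattern and does not 
establish which layer (client, container, or logger) did the wrong decoding.

```python
original = "pr\u00e9"                  # "pré"
utf8_bytes = original.encode("utf-8")
print(utf8_bytes)                      # b'pr\xc3\xa9'

# Reading those bytes with the wrong charset gives "pr" + A-tilde + copyright sign.
garbled = utf8_bytes.decode("iso-8859-1")
print(ascii(garbled))                  # 'pr\xc3\xa9'

# The damage is reversible as long as no bytes were dropped along the way.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == original)            # True
```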

Thank you



Re: Accent Characters

2012-05-30 Thread Jack Krupansky

This might be related:

https://issues.apache.org/jira/browse/SOLR-443

It suggests setting an HTTP header: Content-Type: 
application/x-www-form-urlencoded; charset=UTF-8
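
A rough sketch of what such a request looks like, in Python rather than 
SolrJ, in case it helps with testing outside the application. The URL is 
pieced together from earlier messages in this thread and the parameter 
values are illustrative; whether the servlet container honors the charset 
in the header is something to verify, not something this sketch shows.

```python
import urllib.request
from urllib.parse import urlencode

params = {"q": "content:pr\u00e9", "rows": "0"}

# Percent-encode the parameters as UTF-8; the result is plain ASCII.
body = urlencode(params, encoding="utf-8").encode("ascii")

req = urllib.request.Request(
    "http://localhost:8983/solr/coreFR/select",  # ASSUMPTION: host, port, core
    data=body,
    headers={"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"},
    method="POST",
)
print(body)  # b'q=content%3Apr%C3%A9&rows=0'
# To send it to a running Solr instance: urllib.request.urlopen(req)
```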


-- Jack Krupansky




Re: Accent Characters

2012-05-30 Thread Sami Siren
Vicente,

Are you using CommonsHttpSolrServer or HttpSolrServer? If the latter
then you are probably hitting this:
https://issues.apache.org/jira/browse/SOLR-3375

The remedy is to use CommonsHttpSolrServer.

--
 Sami Siren



Re: Accent Characters

2012-05-31 Thread Vicente Couto
Hello, guys.

Now it's working. Thank you both Jack and Sami.
I fixed my issue by just using server.query(query, METHOD.POST) in SolrJ and,
yes, I was using HttpSolrServer. I have to move to CommonsHttpSolrServer.

Thank you very much.



Filters: acute accent characters replaced with their english counterpart

2008-11-10 Thread joeMcElroy

I need a custom filter added to a field that will replace special
foreign characters with their English counterparts.

For example: ø => o
Grave: À È Ì Ò Ù à è ì ò ù => A E I O U a e i o u
Circumflex: Â Ê Î Ô Û â ê î ô û => A E I O U a e i o u

Is this possible?

joe
-- 
View this message in context: 
http://www.nabble.com/Filters%3A-acute-accent-characters-replaced-with-their-english-counterpart-tp20416888p20416888.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Filters: acute accent characters replaced with their english counterpart

2008-11-10 Thread Jarek Zgoda

On 2008-11-10 at 11:14, joeMcElroy wrote:

> I need a custom filter to be added to a field which will replace special
> foreign characters with their english counterpart.
>
> for example ø => o
> Grave À È Ì Ò Ù à è ì ò ù => A E I O U a e i o u
> Circumflex Â Ê Î Ô Û â ê î ô û  => A E I O U a e i o u
>
> is this possible?


http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-4ebf7aea23b3d6d34a1f8314f9de17334a3e2fac

I wish such a filter existed for Latin2...

--
We read Knuth so you don't have to. - Tim Peters

Jarek Zgoda, R&D, Redefine
[EMAIL PROTECTED]



RE: Filters: acute accent characters replaced with their english counterpart

2008-11-10 Thread Steven A Rowe
Hi Jarek,

On 11/10/2008 at 6:08 AM, Jarek Zgoda wrote:
> On 2008-11-10 at 11:14, joeMcElroy wrote:
> > I need a custom filter to be added to a field which will replace
> > special foreign characters with their english counterpart.
> > 
> > for example ø => o
> > Grave À È Ì Ò Ù à è ì ò ù => A E I O U a e i o u
> > Circumflex Â Ê Î Ô Û â ê î ô û  => A E I O U a e i o u
> > 
> > is this possible?
> 
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-4ebf7aea23b3d6d34a1f8314f9de17334a3e2fac
> 
> I wish such filter exist for Latin2...

The following Lucene patch hasn't been committed yet, and there is no Solr 
Factory counterpart yet, but: ASCIIFoldingFilter folds all accented letters to 
their (accent-stripped, if necessary) ASCII equivalents:
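
To illustrate the kind of folding being discussed, here is a minimal Python 
sketch. It is not the filter's implementation, just an analogous approach 
using Unicode decomposition: accented letters are split into a base letter 
plus combining marks, and the marks are dropped. Letters such as ø have no 
such decomposition, so they need an explicit mapping; the small table below 
is illustrative and far from complete.

```python
import unicodedata

# ASSUMPTION: a tiny hand-written table for letters that do not decompose.
EXTRA = str.maketrans({
    "\u00f8": "o",   # o with stroke
    "\u00d8": "O",
    "\u00e6": "ae",  # ae ligature
    "\u00c6": "AE",
    "\u00df": "ss",  # sharp s
    "\u0142": "l",   # l with stroke (Latin-2)
    "\u0141": "L",
})


def fold_accents(text: str) -> str:
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text.translate(EXTRA))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("\u00c0 \u00c8 \u00cc \u00d2 \u00d9 \u00e0 \u00e8 \u00ec \u00f2 \u00f9"))
# A E I O U a e i o u
print(fold_accents("\u00f8"))  # o
```

In Solr itself the folding would be done by an analysis filter at index and 
query time, so that both sides of the comparison see the same folded terms.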



Steve


Re: Filters: acute accent characters replaced with their english counterpart

2008-11-10 Thread Koji Sekiguchi

joe,

This hasn't been committed yet, but SOLR-822 may be your answer.

https://issues.apache.org/jira/browse/SOLR-822

Koji

  




Re: Filters: acute accent characters replaced with their english counterpart

2008-11-10 Thread joeMcElroy

cheers for the quick response!

joe

