RE: HTMLStripCharFilterFactory does not replace &#233;

2009-11-19 Thread Kundig, Andreas
It now works for me too. The problem was that Tomcat was still running with an 
older version of the configuration, so HTMLStripCharFilterFactory didn't even 
appear in analysis.jsp.

Thank you for looking into this.

Andréas

-----Original Message-----
From: Koji Sekiguchi [mailto:k...@r.email.ne.jp]
Sent: Thursday, 19 November 2009 06:59
To: solr-user@lucene.apache.org
Subject: Re: HTMLStripCharFilterFactory does not replace é

Your first definition of text_fr seems to be correct and should work
as expected. I tested it and it worked fine (mémé was highlighted).

What was the output of HTMLStripCharFilterFactory in analysis.jsp?
In my analysis.jsp, I got ça va mémé ?.

Koji


World Intellectual Property Organization Disclaimer:

This electronic message may contain privileged, confidential and
copyright protected information. If you have received this e-mail
by mistake, please immediately notify the sender and delete this
e-mail and all its attachments. Please ensure all e-mail attachments
are scanned for viruses prior to opening or using.


HTMLStripCharFilterFactory does not replace &#233;

2009-11-18 Thread Kundig, Andreas
Hello

I indexed an HTML document that uses decimal HTML entity encoding: the character 
é (e with an acute accent) is encoded as &#233;. The exact content of the 
document is:

<html><body>&#231;a va m&#233;m&#233; ?</body></html>

A search for 'mémé' returns no document. If I put the line above in Solr 
admin's analysis.jsp, it also doesn't match mémé. There is only a match if I 
replace &#233; with é.
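
For comparison, the result one would expect from stripping the markup and decoding the entities can be reproduced outside Solr. The following is only a minimal Python sketch, assuming a naive tag regex and the standard library's html.unescape; it is not how HTMLStripCharFilterFactory is implemented, and the helper names strip_html and whitespace_tokens are made up for illustration.

```python
import html
import re

# Naive tag pattern: an assumption for this sketch, not Solr's parser.
TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Replace tags with a space, then decode references such as &#233;."""
    without_tags = TAG_RE.sub(" ", text)
    return html.unescape(without_tags)


def whitespace_tokens(text):
    """Split on runs of whitespace, roughly what a whitespace tokenizer does."""
    return text.split()


doc = "<html><body>&#231;a va m&#233;m&#233; ?</body></html>"
tokens = whitespace_tokens(strip_html(doc))
print(tokens)
```

If the token list contains mémé, the input itself is fine and the mismatch is more likely to come from the analyzer configuration that is actually loaded.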

This is how I configured the fieldType:

<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

I tried avoiding the problem by using the MappingCharFilterFactory:

<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

I put the file mapping.txt in the conf directory. It contains just this:

"&#233;" => "é"
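
To illustrate what such a mapping is meant to do, here is a rough Python sketch that applies rules of the form "source" => "target" to a string. It assumes that quoted rule syntax and that lines starting with # are comments; parse_rules and apply_rules are hypothetical helpers, not part of Solr, and the real MappingCharFilterFactory may parse and match differently.

```python
import re

# Assumed rule syntax for this sketch: "source" => "target"
RULE_RE = re.compile(r'^\s*"(.*)"\s*=>\s*"(.*)"\s*$')


def parse_rules(lines):
    """Build a source -> target dict, skipping blank lines and # comments."""
    rules = {}
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        match = RULE_RE.match(line)
        if match is None:
            raise ValueError("not a mapping rule: %r" % line)
        rules[match.group(1)] = match.group(2)
    return rules


def apply_rules(text, rules):
    """Replace every occurrence of a source string, longest sources first."""
    if not rules:
        return text
    sources = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in sources))
    return pattern.sub(lambda m: rules[m.group(0)], text)


rules = parse_rules(['"&#233;" => "é"'])
print(apply_rules("m&#233;m&#233;", rules))
```

Note that in this sketch a line without the surrounding quotes is rejected rather than silently ignored, which makes a malformed mapping file easier to spot.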

This doesn't work either. How can I get this to work?
(I am using Solr 1.4.0.)

Thank you
Andréas Kündig



Re: HTMLStripCharFilterFactory does not replace &#233;

2009-11-18 Thread Koji Sekiguchi

Your first definition of text_fr seems to be correct and should work
as expected. I tested it and it worked fine (mémé was highlighted).

What was the output of HTMLStripCharFilterFactory in analysis.jsp?
In my analysis.jsp, I got ça va mémé ?.

Koji


--
http://www.rondhuit.com/en/