Yes, you're right. I was testing with analysis.jsp, but it chokes on multibyte characters. I modified the JSP to set the request encoding with request.setCharacterEncoding("UTF-8"); and now it works fine. Is this a bug in analysis.jsp?
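To illustrate the kind of charset mismatch involved, here is a minimal standalone sketch. It does not use the servlet API or Solr; the class name CharsetDemo and the helper decodeParam are hypothetical, and it assumes the form value arrives as percent-encoded UTF-8 bytes that the container decodes as ISO-8859-1 when no request encoding is set. Whether that is exactly what happened inside analysis.jsp is an assumption.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; not part of Solr or analysis.jsp.
class CharsetDemo {

    // Decode a percent-encoded parameter value with the given charset name.
    // Stands in for what a container does when it parses request parameters.
    static String decodeParam(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // "hälse" as a browser would send it when the page is UTF-8.
        String raw = "h%C3%A4lse";

        String wrong = decodeParam(raw, "ISO-8859-1"); // assumed default decoding
        String right = decodeParam(raw, "UTF-8");      // decoding after the fix

        System.out.println(wrong.length()); // 6: the two UTF-8 bytes became two chars
        System.out.println(right.length()); // 5: one char for U+00E4

        // The mis-decoded middle is U+00C3 (an uppercase letter) followed by
        // U+00A4 (a currency sign, not a letter).
        System.out.println(Character.isUpperCase(wrong.charAt(1))); // true
        System.out.println(Character.isLetter(wrong.charAt(2)));    // false
    }
}
```

With the wrong charset the single character becomes an uppercase letter followed by a symbol, so a filter that splits on case changes and delimiters could plausibly break the word at that point. In the JSP, the request.setCharacterEncoding("UTF-8") call has to come before the first parameter is read for it to have any effect.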
thanks,
Stefan Oestreicher

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
> Sent: Tuesday, July 15, 2008 6:29 PM
> To: solr-user@lucene.apache.org
> Subject: Re: WordDelimiterFilter splits at non-ASCII chars
>
> On Tue, Jul 15, 2008 at 10:29 AM, Stefan Oestreicher <[EMAIL PROTECTED]> wrote:
> > as I understand the WordDelimiterFilter should split on case changes,
> > word delimiters and changes from character to digit, but it should not
> > differentiate between ASCII and multibyte chars. It does however. The
> > word "hälse" (german plural of "neck") gets split into "h", "ä" and
> > "lse", which unfortunately renders this filter quite unusable for me.
> > Am i missing something or is this a bug?
> > I'm using solr 1.3 built from trunk.
>
> Look for charset issues in communicating with Solr. I just tried this
> with the "text" field via Solr's analysis.jsp and it works fine.
>
> -Yonik