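In org.apache.lucene.demo.HTMLDocument, you need to change the input stream so that the file is read with an explicit encoding. Replace the fis assignment with this (if fis is declared as a FileInputStream, its type also has to change to java.io.Reader):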
fis = new InputStreamReader(new FileInputStream(f), "UTF-16");

-----Original Message-----
From: Fred Toth [mailto:[EMAIL PROTECTED]
Sent: Friday, September 24, 2004 9:25 PM
To: Lucene Users List
Subject: Re: demo IndexHTML parser breaks unicode?

Sorry, that didn't cure it. Again, anyone want to point me to the quickest
replacement HTML parser (that's unicode clean)?

Thanks,

Fred

At 03:17 PM 9/24/2004, you wrote:
>On Friday 24 September 2004 19:58, Fred Toth wrote:
>
> > I've got unicode in my source HTML. In particular, within meta tags,
> > and it's getting broken by the indexer. Note that I'm not trying to
> > query on any of this, just store and retrieve document titles with
> > unicode characters.
>
>Please try again with the code from CVS, Christoph Goller committed a fix
>for this problem (at least I think it was this problem) 1-3 weeks ago.
>
>Regards
> Daniel
>
>--
>http://www.danielnaber.de
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [EMAIL PROTECTED]
>For additional commands, e-mail: [EMAIL PROTECTED]
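
For reference, the encoding change suggested at the top of this message can be tried outside of Lucene with a small standalone program. This is only a sketch using plain java.io: the class name EncodedFileReader and the helper readAll are made up for illustration and are not part of the Lucene demo, and "UTF-16" is just the encoding from the suggestion above. Use whatever encoding your HTML files are actually saved in.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

// Hypothetical demo class, not part of org.apache.lucene.demo.
public class EncodedFileReader {

    // Hypothetical helper: reads the whole file as text, decoding the
    // bytes with the named charset instead of the platform default.
    public static String readAll(File f, String charsetName) throws IOException {
        Reader reader = new InputStreamReader(new FileInputStream(f), charsetName);
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a small HTML fragment with non-ASCII characters as UTF-16.
        File f = File.createTempFile("encoding-demo", ".html");
        f.deleteOnExit();
        String title = "<title>Caf\u00e9 \u00fcber \u4e2d\u6587</title>";
        Writer w = new OutputStreamWriter(new FileOutputStream(f), "UTF-16");
        try {
            w.write(title);
        } finally {
            w.close();
        }

        // Read it back with the same explicit encoding and compare.
        String decoded = readAll(f, "UTF-16");
        System.out.println("round trip preserved text: " + decoded.equals(title));
    }
}
```

If the charset name passed to readAll does not match how the file was written, the decoded text will generally come back garbled, which is the kind of breakage described in the quoted message.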