If you're parsing HTML files, have a look in Lucene at the terms that are indexed and see if you can spot any joined terms.
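As a rough illustration of what a joined term looks like, here is a minimal sketch that does not touch Lucene at all. It assumes the joining happens because text on either side of a tag is concatenated with no whitespace in between, which is only a guess at the cause. The names `TextExtractor`, `terms` and `suspected_joined_terms` are made up for this example; it uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects text nodes; optionally emits a space at every tag boundary."""

    def __init__(self, space_at_tags):
        super().__init__()
        self.space_at_tags = space_at_tags
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.space_at_tags:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if self.space_at_tags:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def terms(html, space_at_tags):
    """Return the set of lowercase alphanumeric tokens in the extracted text."""
    parser = TextExtractor(space_at_tags)
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts).lower()
    return set(re.findall(r"[a-z0-9]+", text))


def suspected_joined_terms(html):
    """Tokens that only appear when tags are dropped without a separator."""
    return sorted(terms(html, False) - terms(html, True))


if __name__ == "__main__":
    sample = "<p>Lucene users</p><p>list archive</p>"
    print(suspected_joined_terms(sample))
```

Running the same comparison against the terms that actually ended up in the index should point at the same kind of suspects. Note that putting a space at every tag will also split a word that has inline markup in the middle of it, so treat the output as candidates to eyeball rather than a definitive list.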
The PDF parser, as you can see from the other mail, is from www.pdfbox.org and I highly recommend it (thanks again Ben!).

On Wed, 14 Aug 2002, Maurits van Wijland wrote:

> Keith,
>
> I haven't noticed the problem with the Parser...but you trigger me
> by saying that you have a PDFParser!!!
>
> Are you able to contribute this PDFParser??
>
> Maurits.
>
> ----- Original Message -----
> From: "Keith Gunn" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Wednesday, August 14, 2002 9:46 AM
> Subject: problems with HTML Parser
>
> > Has anyone noticed that the HTML Parser that comes with
> > Lucene joins terms together when parsing a file.
> > I used to think it was my PDFParser but after fixing that
> > I found out it was the HTMLParser.
> >
> > I managed to find a replacement parser that doesn't join terms.
> >
> > Just wondered if anyone had come across this problem??

--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>