> > It mentions the ff fe bytes (to indicate little-endian 
> > order) I see at the beginning of my document.
> > 
> > The XML files start with the declaration <?xml version="1.0" 
> > encoding="utf-16"?>, which specifies the encoding.


> > When I manually overwrite a document (leaving out the two bytes 
> > and also the encoding), the index is 'repaired' and a search 
> > finds only one hit. It looks like the leading bytes and the 
> > encoding are causing the unexpected search results.

 
> Whoow, must admit I learned something new today :-) Great research Æde, I 
> would not have guessed this off the top of my head. I also know that Lucene 
> trunk has some parts that make use of special chars like \uffff, so I am 
> wondering whether that might cause collisions similar to what you encountered. 

Don't mention it ;) I also learn from you guys, and with your help I got this 
far. I'm just happy that I could return the favour.
 
> Is it possible for you to store the documents as utf-8?

That is definitely an option I am going to explore. Unfortunately the 
application is developed by an 'external' party. I don't know whether they are 
able to change the xml (encoding) though, since I believe they are using 
'standard' windhoos components.
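
If they can't change it, the conversion could also be done on our side before 
the documents are stored. A minimal sketch in Python, assuming the input is 
UTF-16 XML with the ff fe (or fe ff) bytes in front and an encoding="utf-16" 
declaration. The function name utf16_xml_to_utf8 is made up for illustration 
and is not part of Hippo CMS or Lucene, and treating BOM-less input as 
little-endian is my own assumption:

```python
import codecs
import re


def utf16_xml_to_utf8(raw: bytes) -> bytes:
    """Re-encode a UTF-16 XML document as UTF-8 without a byte order mark."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The generic "utf-16" codec reads the BOM, picks the byte order
        # and drops the BOM from the decoded text.
        text = raw.decode("utf-16")
    else:
        # Assumption: input without a BOM is little-endian.
        text = raw.decode("utf-16-le")

    # Rewrite the encoding named in the XML declaration so that it
    # matches the new byte encoding.
    text = re.sub(
        r"(<\?xml[^>]*encoding=[\"'])[^\"']+([\"'])",
        r"\g<1>utf-8\g<2>",
        text,
        count=1,
    )
    return text.encode("utf-8")
```

Reading the file as bytes, passing it through this function and writing the 
result back would give a document without the two leading bytes and with a 
matching declaration, which is what I did by hand above.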

 --Æde


Hippocms-dev: Hippo CMS development public mailinglist

Searchable archives can be found at:
MarkMail: http://hippocms-dev.markmail.org
Nabble: http://www.nabble.com/Hippo-CMS-f26633.html
