Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Bernd Fehling
I suggest avoiding illegal UTF-8 characters by pre-filtering your content stream before loading. Unicode / UTF-8 (hex): U+07FF = df bf; U+0800 = e0 a0 80. So there is no UTF-8 0xffff. It is illegal. Regards. On 27.06.2011 12:40, Markus Jelsma wrote: Hi, I came across the indexing error below. It

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Robert Muir
On Mon, Jun 27, 2011 at 7:11 AM, Bernd Fehling wrote: > > So there is no UTF-8 0xffff. It is illegal. > you are wrong: it is legally encoded as a three byte sequence: ef bf bf
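
A minimal sketch (Python, not from the thread) of the two claims being argued over: the code point U+FFFF does have a well-formed three-byte UTF-8 encoding, while the raw bytes ff ff can never occur in UTF-8. The helper name `is_valid_utf8` is made up for illustration.

```python
# U+FFFF is a Unicode noncharacter, but UTF-8 can still encode it.
encoded = "\uffff".encode("utf-8")
print(" ".join(f"{b:02x}" for b in encoded))  # ef bf bf


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The raw byte 0xff is never legal anywhere in a UTF-8 stream.
print(is_valid_utf8(b"\xff\xff"))      # False
print(is_valid_utf8(b"\xef\xbf\xbf"))  # True
```

So "U+FFFF" (a code point) and "ff ff" (a byte sequence) are different things, which seems to be the source of the disagreement.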

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Bernd Fehling
On 27.06.2011 14:02, Robert Muir wrote: On Mon, Jun 27, 2011 at 7:11 AM, Bernd Fehling wrote: So there is no UTF-8 0xffff. It is illegal. you are wrong: it is legally encoded as a three byte sequence: ef bf bf Unicode U+FFFF is UTF-8 byte sequence "ef bf bf", that is right. But I wa

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Robert Muir
On Mon, Jun 27, 2011 at 8:30 AM, Bernd Fehling wrote: > Unicode U+FFFF is UTF-8 byte sequence "ef bf bf", that is right. > > But I was saying that UTF-8 0xffff (which is byte sequence "ff ff") is > illegal > and that's what the java.io.CharConversionException is complaining about. > "Invalid UTF-

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Bernd Fehling
On 27.06.2011 14:35, Robert Muir wrote: On Mon, Jun 27, 2011 at 8:30 AM, Bernd Fehling wrote: Unicode U+FFFF is UTF-8 byte sequence "ef bf bf", that is right. But I was saying that UTF-8 0xffff (which is byte sequence "ff ff") is illegal and that's what the java.io.CharConversionException

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Robert Muir
On Mon, Jun 27, 2011 at 8:47 AM, Bernd Fehling wrote: > > correct!!! > but what i said, is totally different than what you said. you are still wrong.

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Bernd Fehling
On 27.06.2011 14:48, Robert Muir wrote: On Mon, Jun 27, 2011 at 8:47 AM, Bernd Fehling wrote: correct!!! but what i said, is totally different than what you said. you are still wrong. http://www.unicode.org/faq//utf_bom.html see Q: What is a UTF?

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Mike Sokolov
Actually - you are both wrong! It is true that 0xffff is a valid UTF8 character, and not a valid UTF8 byte sequence. But the parser is reporting (or trying to) that 0xffff is an invalid XML character. And Robert - if the wording offends you, you might want to send a note to Tatu (http://ji

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Mike Sokolov
OK - re-reading your message it seems maybe that is what you were trying to say too, Robert. FWIW I agree with you that XML is rigid, sometimes for purely arbitrary reasons. But nobody has really helped Markus here - unfortunately, there is no easy way out of this mess. What I do to handle i

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Thomas Fischer
Hello, On 27.06.2011 at 12:40, Markus Jelsma wrote: > Hi, > > I came across the indexing error below. It happened in a huge batch update > from Nutch with SolrJ 3.1. Since the crawl was huge it is very hard to trace > the error back to a specific document. So i try my luck here: anyone seen

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread ramires
Hi, it's the same error I mentioned here http://lucene.472066.n3.nabble.com/strange-utf-8-problem-td3094473.html. Also, if you use Solr 1.4.1 there is no problem like that. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-3-1-indexing-error-Invalid-UTF-8-character-0x

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread lee carroll
Hi Markus I've seen similar issue before (but not with solr) when processing files as xml. In our case the problem was due to processing a utf16 file with a byte order mark. This presents itself as 0xffff to the xml parser which is not used by utf8 (the bom unicode would be represented as efbfbf i

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Markus Jelsma
On Monday 27 June 2011 16:33:16 lee carroll wrote: > Hi Markus > > I've seen similar issue before (but not with solr) when processing files as > xml. In our case the problem was due to processing a utf16 file with a > byte order mark. This presents itself as > 0xffff to the xml parser which is n

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Mike Sokolov
I don't think this is a BOM - that would be 0xfeff. Anyway the problem we usually see w/processing XML with BOMs is in UTF8 (which really doesn't need a BOM since it's a byte stream anyway), in which if you transform the stream (bytes) into a reader (chars) before the xml parser can see it, th
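
A small sketch (Python, not from the thread) of the BOM point: the byte order mark is U+FEFF, its UTF-8 encoding is ef bb bf, and a plain UTF-8 decode leaves it in the text as a leading character unless a BOM-aware codec is used. The sample payload `<doc/>` is invented for illustration.

```python
import codecs

# The BOM is the code point U+FEFF; in UTF-8 it is the bytes ef bb bf.
bom_bytes = "\ufeff".encode("utf-8")
print(bom_bytes == codecs.BOM_UTF8)  # True

# Hypothetical XML payload that starts with a UTF-8 BOM.
raw = codecs.BOM_UTF8 + b"<doc/>"

# A plain UTF-8 decode keeps the BOM as a leading U+FEFF character,
# which a parser reading chars (not bytes) may then trip over.
plain = raw.decode("utf-8")

# The "utf-8-sig" codec drops a leading BOM if one is present.
stripped = raw.decode("utf-8-sig")

print(repr(plain[0]))  # '\ufeff'
print(stripped)        # <doc/>
```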

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Markus Jelsma
Hi all, thanks for your comments. I seem to have fixed it by now by simply stripping away all non-character codepoints [1] by iterating over the individual chars and checking them against: if (ch % 0x10000 != 0xffff || ch % 0x10000 != 0xfffe || (ch <= 0xfdd0 && ch >= 0xfdef)) { pass; } Comment

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Markus Jelsma
Of course it doesn't work like this: use AND instead of OR! On Monday 27 June 2011 17:50:01 Markus Jelsma wrote: > Hi all, thanks for your comments. I seem to have fixed it by now by simply > stripping away all non-character codepoints [1] by iterating over the > individual chars and checking them
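
A hedged sketch (Python, not Markus's actual code) of a noncharacter filter along the lines described, with the conditions combined so that a character is dropped when it is a noncharacter. The function names are hypothetical, and the U+FDD0..U+FDEF range test is written according to the Unicode definition of noncharacters, which is my reading rather than text from the thread.

```python
def is_noncharacter(cp: int) -> bool:
    """True for Unicode noncharacters: U+nFFFE, U+nFFFF, U+FDD0..U+FDEF."""
    last_two_of_plane = (cp & 0xFFFF) >= 0xFFFE  # ...FFFE or ...FFFF in any plane
    arabic_block_gap = 0xFDD0 <= cp <= 0xFDEF
    return last_two_of_plane or arabic_block_gap


def strip_noncharacters(text: str) -> str:
    """Return `text` with every noncharacter code point removed."""
    return "".join(ch for ch in text if not is_noncharacter(ord(ch)))


sample = "a\uffffb\ufffec\ufdd0d"
print(strip_noncharacters(sample))  # abcd
```

Expressed as a "keep" test, the same logic is an AND of negations: keep the character only if it is not U+nFFFF and not U+nFFFE and not inside U+FDD0..U+FDEF.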

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Mike Sokolov
Markus - if you want to make sure not to offend XML parsers, you should strip all characters not in this list: http://en.wikipedia.org/wiki/XML#Valid_characters You'll see that article talks about XML 1.1, which accepts a wider range of characters than XML 1.0, and I believe the Woodstox parse
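
For reference, a minimal sketch (Python, not from the thread) of stripping everything outside the XML 1.0 `Char` production, which allows #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. The function name is hypothetical, and whether a given parser enforces exactly this set is an assumption to check against its documentation.

```python
import re

# Anything NOT matched by the XML 1.0 Char production.
_INVALID_XML10 = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Remove code points that XML 1.0 does not allow in a document."""
    return _INVALID_XML10.sub("", text)


# The bell character (U+0007), NUL and U+FFFF are all removed.
print(strip_invalid_xml_chars("a\x07b\uffffc\x00d"))  # abcd
```

Note that this set is not the same as "all noncharacters": it rejects U+FFFE and U+FFFF and most C0 control characters, but it still lets U+FDD0..U+FDEF through.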

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Markus Jelsma
Of course, i can't print the system bell and stuff like that in XML. I'll improve the method to get rid of non-printable control characters as well. On Monday 27 June 2011 18:16:08 Mike Sokolov wrote: > Markus - if you want to make sure not to offend XML parsers, you should > strip all characters