On Fri, Aug 10, 2012 at 11:58 PM, Gary Gregory <garydgreg...@gmail.com> wrote:
> On Fri, Aug 10, 2012 at 4:27 PM, Niall Pemberton <niall.pember...@gmail.com> wrote:
>
>> On Fri, Aug 10, 2012 at 6:44 PM, Gary Gregory <garydgreg...@gmail.com> wrote:
>> > Hi All:
>> >
>> > Does anyone have expertise with BOMInputStream?
>> >
>> > I know that some XML parsers (like the one shipped with the Oracle JRE) do
>> > not detect UTF-32 BOMs (UTF-8 and UTF-16 BOMs are OK) but using
>> > BOMInputStream is supposed to fix the issue.
>> >
>> > These tests I added and @Ignore'd fail:
>> >
>> > - org.apache.commons.io.input.BOMInputStreamTest.testReadXmlWithBOMUtf32Be()
>> > - org.apache.commons.io.input.BOMInputStreamTest.testReadXmlWithBOMUtf32Le()
>> >
>> > More basic tests do work:
>> >
>> > - org.apache.commons.io.input.BOMInputStreamTest.testReadWithBOMUtf32Be()
>> > - org.apache.commons.io.input.BOMInputStreamTest.testReadWithBOMUtf32Le()
>> >
>> > When I look at the Oracle JRE (which uses a copy of Xerces) I see code to
>> > deal with UCS-4, which is a precursor to UTF-32, like UCS-2 is a subset of
>> > UTF-16, but as the test shows, Xerces fails to parse a UTF-32 document.
>> >
>> > Any thoughts?
>>
>> Hi Gary,
>>
>> I enabled the tests and ran them. I'm a bit confused about what the
>> issue is because the lines that use the BOMInputStream to *skip* the
>> UTF-32 BOM do not fail for me:
>>
>>     parseXml(new BOMInputStream(createUtf32BeDataStream(data, true), ByteOrderMark.UTF_32BE));
>>     parseXml(new BOMInputStream(createUtf32LeDataStream(data, true), ByteOrderMark.UTF_32LE));
>>
>> whereas the lines after those, which do not use any Commons IO components,
>> fail:
>>
>>     parseXml(createUtf32BeDataStream(data, true));
>>     parseXml(createUtf32LeDataStream(data, true));
>>
>> So this just means that the XML parser doesn't deal with a UTF-32 BOM.
>>
>> Really though, BOMInputStream doesn't provide anything that
>> helps parse the XML properly - it has two purposes: 1) BOM detection
>> and 2) BOM removal/skipping.
>>
>> What we do have in Commons is XMLInputStream - this uses various
>> techniques to detect the encoding, including using BOMInputStream to try
>> BOM detection, and then uses that encoding with a Reader to process
>> the bytes properly.
>>
>
> Ok, thank you Niall, my initial experiment with XMLStreamReader works, I'll
> continue in this direction at work.

Yes sorry, meant XMLStreamReader!

Niall

> Gary
>
>>
>> Niall
>>
>> > Thank you,
>> > Gary
>> >
>> > --
>> > E-Mail: garydgreg...@gmail.com | ggreg...@apache.org
>> > JUnit in Action, 2nd Ed: http://bit.ly/ECvg0
>> > Spring Batch in Action: <http://s.apache.org/HOq>http://bit.ly/bqpbCK
>> > Blog: http://garygregory.wordpress.com
>> > Home: http://garygregory.com/
>> > Tweet! http://twitter.com/GaryGregory

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org
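
For readers following the thread, the two steps Niall describes (detect and skip the BOM, then decode the remaining bytes through a Reader with the matching encoding) can be sketched with the JDK alone. This is only an illustration under stated assumptions, not how BOMInputStream or the Commons IO reader class is implemented: the class name Utf32BomSketch and its methods are hypothetical, and it handles only the two UTF-32 BOMs discussed above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Commons IO.
class Utf32BomSketch {

    // U+FEFF encoded as UTF-32, big-endian and little-endian.
    private static final byte[] BOM_UTF32_BE = {0x00, 0x00, (byte) 0xFE, (byte) 0xFF};
    private static final byte[] BOM_UTF32_LE = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x00};

    // Step 1: detect and skip a UTF-32 BOM. Step 2: wrap the rest in a Reader
    // that uses the detected charset. Without a BOM, the bytes are pushed back
    // and the caller-supplied fallback charset is used.
    static Reader openSkippingUtf32Bom(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 4);
        byte[] head = new byte[4];
        int n = 0;
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break; // end of stream before four bytes were available
            }
            n += r;
        }
        if (n == 4 && Arrays.equals(head, BOM_UTF32_BE)) {
            return new InputStreamReader(pushback, Charset.forName("UTF-32BE"));
        }
        if (n == 4 && Arrays.equals(head, BOM_UTF32_LE)) {
            return new InputStreamReader(pushback, Charset.forName("UTF-32LE"));
        }
        if (n > 0) {
            pushback.unread(head, 0, n); // no BOM: give the bytes back
        }
        return new InputStreamReader(pushback, fallback);
    }

    // Convenience wrapper for in-memory data: decodes everything to a String.
    static String decode(byte[] data, Charset fallback) {
        try (Reader reader = openSkippingUtf32Bom(new ByteArrayInputStream(data), fallback)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            int r;
            while ((r = reader.read(buf)) != -1) {
                sb.append(buf, 0, r);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Build a UTF-32BE document with a leading BOM.
        byte[] body = "<root>hi</root>".getBytes(Charset.forName("UTF-32BE"));
        byte[] data = new byte[BOM_UTF32_BE.length + body.length];
        System.arraycopy(BOM_UTF32_BE, 0, data, 0, BOM_UTF32_BE.length);
        System.arraycopy(body, 0, data, BOM_UTF32_BE.length, body.length);

        System.out.println(decode(data, StandardCharsets.UTF_8)); // prints <root>hi</root>
    }
}
```

The resulting Reader could then be handed to an XML parser as a character stream (for example through org.xml.sax.InputSource), so the parser never has to recognise the UTF-32 BOM itself. Whether that fits a given case is a judgment call; the Commons IO reader class mentioned above also looks at other hints besides the BOM.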