[MarkLogic Dev General] BOM char and UTF-16

Josh Warner-Burke Wed, 08 Feb 2012 13:49:13 -0800

I emailed about a week ago about a problem I was having with XCC and large
files.  I got some very good advice which said I needed to use
session.insertContent to get the file in.  I'm done with that conversion
but am now dealing with the problems resulting from the change.


What I'm looking at right now is a file that is UTF-16 and begins with the
two bytes of a BOM (byte order mark) - which I have learned is actually
relevant in telling any string parser/consumer what order the bytes in each
pair are in...

I wrote some code that strips out the BOMs but it seems to screw the
encoding up altogether.  I also put in code to set the encoding to UTF-16
in the ContentCreateOptions.  Without stripping BOMs, I get this:
Invalid root text "&#255;&#254;" at [uri] line 1

To deal with UTF-16, don't you *need* those BOMs?  What am I missing here?
FYI the first line of the files looks like:

<?xml version="1.0" encoding="UTF-16" standalone="yes"?>

So it's clearly UTF-16.

There is some leeway in terms of how I create the Content object to feed to
insertContent - currently I'm treating it as a byte[] - but I could do a
string conversion, etc., if that's what I need to do.  Any help is
appreciated.
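
For reference, the "&#255;&#254;" in the error corresponds to the bytes 0xFF
0xFE, which is the little-endian UTF-16 BOM read as two single-byte
characters.  A minimal sketch of the string-conversion route, assuming the
file fits in memory as a byte[]: it looks at the BOM to pick the byte order,
skips the BOM bytes, and decodes the rest to a Java String.  It uses only the
JDK; the class and method names (BomDecoder, charsetFromBom, decode) are
made up for illustration and are not part of XCC.  It does not handle UTF-32
BOMs, and it leaves the XML declaration's encoding="UTF-16" untouched.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of XCC: picks a charset from a leading BOM.
final class BomDecoder {
    private BomDecoder() {}

    /** Charset implied by a leading BOM, or null when no BOM is present. */
    static Charset charsetFromBom(byte[] data) {
        if (data.length >= 3 && u(data[0]) == 0xEF && u(data[1]) == 0xBB
                && u(data[2]) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (data.length >= 2) {
            if (u(data[0]) == 0xFF && u(data[1]) == 0xFE) {
                return StandardCharsets.UTF_16LE; // FF FE: little-endian
            }
            if (u(data[0]) == 0xFE && u(data[1]) == 0xFF) {
                return StandardCharsets.UTF_16BE; // FE FF: big-endian
            }
        }
        return null;
    }

    /** Number of BOM bytes to skip for a detected charset (0 for null). */
    static int bomLength(Charset detected) {
        if (detected == null) {
            return 0;
        }
        return detected.equals(StandardCharsets.UTF_8) ? 3 : 2;
    }

    /** Decodes data using the BOM if present, otherwise the fallback charset. */
    static String decode(byte[] data, Charset fallback) {
        Charset detected = charsetFromBom(data);
        int skip = bomLength(detected);
        Charset cs = (detected != null) ? detected : fallback;
        // The BOM bytes are skipped, so the String starts at the real content.
        return new String(data, skip, data.length - skip, cs);
    }

    // Treats a signed Java byte as an unsigned value 0..255.
    private static int u(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        // FF FE followed by "<a/>" in little-endian UTF-16.
        byte[] le = {(byte) 0xFF, (byte) 0xFE, '<', 0, 'a', 0, '/', 0, '>', 0};
        String xml = decode(le, StandardCharsets.UTF_8);
        System.out.println(xml);                                          // <a/>
        System.out.println(xml.getBytes(StandardCharsets.UTF_8).length); // 4
    }
}
```

The decoded String could then be handed to something like
ContentFactory.newContent(uri, string, options) instead of the raw byte[].
Whether the server also wants the declaration changed, or whether leaving
the bytes alone and setting the encoding option is enough, is not something
this sketch settles.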

-- 
Josh Warner-Burke
42SIX Solutions
(m): 410-493-4362
(e): jwbu...@42six.com
http://www.42six.com
