Re: [Ohrrpgce] Binary data in RELOAD to XML?

Mike Caron Sun, 22 Aug 2010 10:24:35 -0700

On 8/22/2010 3:23 AM, Ralph Versteegen wrote:

On 22 August 2010 08:14, Mike Caron <caron.m...@gmail.com> wrote:

On 8/21/2010 3:55 PM, Ralph Versteegen wrote:


On 21 August 2010 05:22, Mike Caron <caron.m...@gmail.com> wrote:


On 8/20/2010 1:15 PM, Ralph Versteegen wrote:


On 21 August 2010 05:11, James Paige <b...@hamsterrepublic.com> wrote:


On Fri, Aug 20, 2010 at 12:55:50PM -0400, Mike Caron wrote:


On 8/20/2010 12:52 PM, Ralph Versteegen wrote:


On 21 August 2010 04:43, Mike Caron <caron.m...@gmail.com> wrote:


On 8/20/2010 12:41 PM, Ralph Versteegen wrote:


On 21 August 2010 03:58, Mike Caron <caron.m...@gmail.com> wrote:


On 8/20/2010 11:45 AM, Ralph Versteegen wrote:


On 21 August 2010 02:24, Mike Caron <caron.m...@gmail.com> wrote:


On 8/20/2010 9:58 AM, Ralph Versteegen wrote:


Currently reload2xml can't properly export binary data stored in
strings in reload files. It would be nice to be able to hand edit xml
and convert back. What's the preferred way to write binary? &#nnn;
escape codes, or that Base64 stuff? (Is that actually part of the xml
standard?)


Strictly speaking, we only really need to escape characters below 32.
Everything else should be okay to write out.


Follow-up question: what about bytes above 127? Is there a chance of
confusing xml parsers into guessing UTF8 encoding? I suppose it
doesn't matter too much, since we can force libxml2 to read with ASCII
encoding (but it would be nice to know that we have to do so).


The best way to handle this, actually, would be to add the XML header,
which allows you to specify the encoding. In our case, it should look
something like:

<?xml version="1.0" encoding="iso-8859-1"?>

In light of that, I would prefer the &#nnnn; syntax, since it means
that no one has to do any extra work to process the resulting XML file.
If you used Base64 (which has nothing to do with XML), you'd have to
mark it as such in order to distinguish it from a regular string that
just so happens to look like Base64.


OK, cool. I'll add that then soonish (unless you jumped on the chance
to do so).


No, go ahead.


Change of plan required.

' "&#0;" is not permitted, however, as the null character is one of
the control characters excluded from XML, even when using a numeric
character reference.[14] An alternative encoding mechanism such as
Base64 is needed to represent such characters. ' - Wikipedia

And I just confirmed libxml2 spits.
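The restriction quoted above can be checked with a short sketch. It uses Python's standard-library ElementTree parser (built on expat) as a stand-in, which is an assumption for illustration only; the thread itself is about libxml2. The helper name `parses` is hypothetical.

```python
import xml.etree.ElementTree as ET

def parses(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# A tab (character 9) is one of the few control characters XML 1.0 allows.
tab_ok = parses("<foo>&#9;</foo>")
# The null character is excluded, even as a numeric character reference.
nul_ok = parses("<foo>&#0;</foo>")

print(tab_ok, nul_ok)
```

So a numeric reference is enough for a tab but not for a null byte, which is why a different encoding is needed for arbitrary binary strings.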


Damn. I guess that kind of makes sense, though.

Maybe... emit a <null/> element instead? Ugh.

Or, Base64 it. Sigh.


Well, I know next to nothing about XML, what's the idiomatic way to do
that? Add a special attribute to nodes containing base64-encoded data,
like <foo base="64">assdfasdf234</foo>?


No, the proper way to do this is something like this:

<basenode xmlns:reload="http://hamsterrepublic.com/RELOAD">
    <foo reload:encoding="base64">...</foo>
</basenode>

Note: that namespace URL doesn't have to exist, it just has to be
unique.
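A rough sketch of how the namespaced attribute might be written and read back follows. It uses Python's standard-library ElementTree rather than the actual reload2xml/xml2reload tools, and the helper names `set_binary` and `get_value` are hypothetical; only the namespace URI and the `reload:encoding="base64"` attribute come from the thread.

```python
import base64
import xml.etree.ElementTree as ET

# Namespace URI as proposed in the thread; it only has to be unique.
RELOAD_NS = "http://hamsterrepublic.com/RELOAD"
ENCODING_ATTR = "{%s}encoding" % RELOAD_NS

def set_binary(node, data: bytes) -> None:
    """Store bytes as Base64 text and mark the node as encoded."""
    node.set(ENCODING_ATTR, "base64")
    node.text = base64.b64encode(data).decode("ascii")

def get_value(node):
    """Return decoded bytes for marked nodes, plain text otherwise."""
    text = node.text or ""
    if node.get(ENCODING_ATTR) == "base64":
        return base64.b64decode(text)
    return text

# Ask the serializer to use the "reload" prefix for this namespace.
ET.register_namespace("reload", RELOAD_NS)

root = ET.Element("basenode")
foo = ET.SubElement(root, "foo")
set_binary(foo, b"\x00\x01binary\xff")

xml_text = ET.tostring(root, encoding="unicode")
round_tripped = get_value(ET.fromstring(xml_text).find("foo"))
print(xml_text)
```

Because the marker lives in its own namespace, a document that happens to contain an ordinary attribute called `encoding` or `base` is not mistaken for encoded data.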


At first I thought you may be pulling my leg and going overboard...
but such is XML!


Actually, we could just cheese it and say:

<root xmlns:reload="uri:reload">

But, that's considered bad form.

The only reason I'm suggesting using namespaces is because the structure
and the data are intermingled. How is xml2reload supposed to know if
base="64" means that the node is encoded, or it represents a
strangely-based number, or that all your base count up to 64?


reload:base="64" would be perfectly unambiguous if it were agreed
upon, though. But I agree it would be pretty weird.


It would be, yes. But, then when we add some other encoding for some
reason... etc etc

Another, more XMLish way, might be to encode documents like this:

<RELOADDocument>
    <node>
        <name>root</name>
        <type>null</type>
        <children>
            <node>
                <name>mynode</name>
                <type>string</type>
                <content>123</content>
                <children>...</children>
            </node>
        </children>
    </node>
</RELOADDocument>

Now who's pulling whose leg? ;)


Ugh I hate XML (and libxml) so much. I volunteered for this when I
thought it was going to be a quick fix with &#nnn; but now I'm in XML
deeper than I ever hoped to get in my life.


Should have taken the blue pill...

Anyway some more problems I'd like <s>approval</s> suggestions on:
* null-name nodes? We already use these in RELOAD documents. The
solution I came up with is to write the name as 'reload:_' and convert
it back to a null name in xml2reload (while still being able to
distinguish the null-name nodes libxml2 inserts). Was this a good
choice?


Sure, that seems like a reasonable solution.

* strings with leading/trailing whitespace.


Whitespace is a bitch in XML. I had quite a bit of trouble, back when I did
the plotscripting dictionary, since hssed wanted such particular spacing for
the help file!


I am quite certain now that using XML to represent RELOAD documents as
text was a horrible idea, and we should have used something very
simple instead, just like textbox export. I could have implemented
this hypothetical textual import/export in half the time I just spent
fixing all the problems with xml import/export. But it's too late :(

How would you represent arbitrarily recursive data structures in the
textbox export format?

Anyway, the whole point of even caring about XML at all was to give
people a means of creating arbitrary RELOAD documents that they could
access by plotscripting. If this is causing you too much pain, we can
switch to another format. Like JSON, for example.

xml2reload currently
strips it, but it doesn't have to. My first idea was:
<foo reload:encoding="exact">    lots of
white   space<child>...</child>
          <child>...</child>
</foo>
As you can see, the first child if any has to be smack against the end
of the string, while others can be indented normally.
If that weren't bad enough, there's a complication. If the entire
string is whitespace, libxml2 discards it. So my suggestion is to
enclose the string in quotation marks. As a bonus, you can discard
whitespace after the end.
<foo reload:encoding="quoted">" lots of
white   space   "
          <child>...</child>
          <child>...</child>
</foo>


I propose a slightly different idea:

<foo><reload:ws>    lots of
white   space</reload:ws>
          <child>...</child>
          <child>...</child>
</foo>

Also, looking at this, there's no reason we can't shorten the namespace from
"reload" to "r". The important part is the URI.


Great.

<foo><r:ws>    lots of
white   space</r:ws>
          <child>...</child>
          <child>...</child>
</foo>

Anyway, the semantics of the <ws> tag would be that everything inside it is
preserved exactly.


Doesn't quite work, because libxml2 will discard strings composed
entirely of whitespace, hence my suggested quotation marks.

<foo><r:qs>"

    "</r:qs>
          <child>...</child>
          <child>...</child>
</foo>

Well, we could support both ws and qs ("quoted string") since the
quotes are unnecessary 98% of the time.


I still think you're over thinking this.

Actually, now that I think about it, XML already has a syntax for that:

<foo><![CDATA[  lots of
white   space   ]]>
          <child>...</child>
          <child>...</child>
</foo>

A CDATA section means "everything inside it is text, no processing should be
done to it at all, even if it looks like an entity or an element". So...
yeah.
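As a quick illustration of the CDATA semantics described above, here is a small sketch using Python's standard-library ElementTree parser. That parser is an assumption for illustration; it says nothing about how libxml2 exposes CDATA sections.

```python
import xml.etree.ElementTree as ET

# The child element follows the CDATA section directly, so the text
# before the first child is exactly the CDATA content.
doc = "<foo><![CDATA[  lots of\nwhite   space   ]]><child/></foo>"
foo = ET.fromstring(doc)
text = foo.text

# A CDATA section that contains nothing but spaces.
blank = ET.fromstring("<foo><![CDATA[   ]]></foo>").text

print(repr(text), repr(blank))
```

With this parser the leading and trailing whitespace comes through untouched, including the all-whitespace case.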


A quick test reveals the text in CDATA sections is not transparently
handed to me by libxml2, and I have no patience for its useless
documentation.

Yeah, I agree, it's... really bad. However, I will do a bit of testing
to figure out what's going on.


It's always possible that we could switch to a different library.

* Differentiating between zero-length strings and null valued nodes.
This would be solved by the above "quoted" encoding.


Just out of curiosity, do you have any reason for needing this distinction?
There really is no difference between:

<node></node>

and

<node />

In either RELOAD or XML.
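On the XML side, the equivalence can be seen in a short sketch with Python's standard-library ElementTree (an assumption for illustration; the RELOAD side is not modelled here):

```python
import xml.etree.ElementTree as ET

empty = ET.fromstring("<node></node>")
self_closing = ET.fromstring("<node />")

# Both forms parse to an element with no text and no children.
same = (empty.text == self_closing.text) and len(empty) == len(self_closing) == 0
print(empty.text, self_closing.text, same)
```

After parsing there is nothing left that tells the two spellings apart, so a zero-length string and a null value cannot be distinguished without some extra marker.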


I thought we already had "client" code that tested the type of a node
to determine whether it was as expected, but I was mistaken.

Perhaps we should remove the NodeType function and replace it with
NodeIsInt, etc, so that no one makes the mistake of testing the type of
a node. We don't need NodeIsString because everything is representable
as a string. Then a null string would test true for NodeIsNull, but I
guess a zero integer shouldn't.


Yeah, I agree with this.

After all that's solved... a perfect XML representation of a RELOAD
document is still likely to produce a different document when
converted, because of differences in the string table. Oh well.


Why do you care about the string table? The string table is supposed to be
an implementation detail. The consumer of a RELOAD document doesn't know or
care what order the strings are stored in, only that the contract of "I put
this string in now, I get this string out later" is maintained.


I was just a little annoyed that having gotten some documents to
apparently translate back and forth correctly, I couldn't actually be
sure.

??? if you're comparing two RELOAD documents, look at how reloadtest.bas
does it.

Relatedly, I'd like to point out that strings are never freed from the
string table - their refcounts can't even be decremented. It's a
pretty minor problem.

Hmm, it seems that you are correct. Fortunately, there is no memory
leak, since the table is cleaned up when the document is destroyed.

Make it http://hamsterrepublic.com/ohrrpgce/RELOAD and I will make it a
redirection to the reload docs.

Does the xmlns:reload thing have to go in each and every enclosing
parent node of a node that has base64?... or does it just go once in
the root node?

---
James


I assume namespaces are meant to be in scope for all descendant nodes


That is correct. The scope of namespaces works pretty much as you would
expect.
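A small sketch of that scoping rule, again using Python's standard-library ElementTree as an assumed stand-in for whichever parser is used. The element names `a` and `b` and the Base64 payload are made up for the example; the URI is the one suggested in the thread.

```python
import xml.etree.ElementTree as ET

# The prefix is declared once, on the root element only.
doc = """<root xmlns:r="http://hamsterrepublic.com/ohrrpgce/RELOAD">
  <a><b><foo r:encoding="base64">AAEC</foo></b></a>
</root>"""

foo = ET.fromstring(doc).find("a/b/foo")

# The parser resolves the prefix on the deeply nested attribute
# against the declaration inherited from the root.
encoding = foo.get("{http://hamsterrepublic.com/ohrrpgce/RELOAD}encoding")
print(encoding)
```

Declaring the namespace once on the root is enough for every descendant that uses the prefix.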

_______________________________________________
Ohrrpgce mailing list
ohrrpgce@lists.motherhamster.org
http://lists.motherhamster.org/listinfo.cgi/ohrrpgce-motherhamster.org

