Re: [SLUG] Why XML bites and why it is NOT a markup language

2005-06-10 Thread Robert Collins
On Fri, 2005-06-10 at 10:24 +1000, [EMAIL PROTECTED] wrote:
...
> I thought that libxml2 was widely accepted, used by gnome, etc.
> I checked the manpage and nowhere does it say "this parser sucks",
> maybe I should submit a documentation bug?

libxml2 can do it just fine: it allows overriding of document encoding.
I was referring to the perl parser you are using. If that happens to be
bindings to libxml2, then they are incomplete.

> Hoping for a quick fix, I tried the expat based parser instead
> (which also has perl bindings) with the following program:
> 
...
> Sadly, this parser gives output in a different format so changing
> parser has now broken the rest of my program *SOB*. On the wild
> chance of an undocumented feature I went back to the original libxml2
> based parser and tried inserting options from the expat bindings:

?? you mean the structured data you get back is different ? eeek!
...
> Frighteningly enough, this actually works... 
> 
> Woo hoo! I got XML to actually work!
>  
> > Convert the (probably cp-1252) text into UTF-8, then parse it. Or set an
> > encoding in the header; it looks like the perl bindings suck a certain
> > amount.
> 
> By the looks of it, the bindings are better than the manpage
> is willing to admit. I still don't like XML because it is nutty
> that it should screw up so easily. My feeling is that if this
> sort of technology cannot make things EASIER to deal with then
> might as well go with something that does.

I think you are conflating two problems here. If you had an ASCII,
tab-delimited format and someone gave you an EBCDIC, tab-delimited file,
you'd have to tell your parser to use EBCDIC.

GIGO - oldest rule in the book.

Rob

-- 
GPG key available at: .


-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html

Re: [SLUG] Why XML bites and why it is NOT a markup language

2005-06-10 Thread Rev Simon Rumble

This one time, at band camp, [EMAIL PROTECTED] wrote:

> My vote still goes to plain ASCII with single character delimiters
> (e.g. TAB or one of the DLE/DCn set) because of simplicity.

And you will work out what character set is in use how, exactly?

-- 
Rev Simon Rumble <[EMAIL PROTECTED]>
www.rumble.net

The Tourist Engineer
Because geeks travel too.
http://engineer.openguides.org/

Inflation is the one form of taxation that can be imposed
without legislation.

- Milton Friedman



Re: [SLUG] Why XML bites and why it is NOT a markup language

2005-06-09 Thread telford

On Fri, Jun 10, 2005 at 09:48:12AM +1000, Robert Collins wrote:

> Uhm. Sure. Here's a gig of download; your 500K of usable detail can be
> found spread throughout it.

I think that even with perfectly well-formed XML you will find that the
ratio of usable data to cruft is worse than 50%.

More importantly, if a crap data feed is the only data you can get
then the choice is deal with it or barf and die. Which is preferable?

> Seriously, XML itself is no more brittle than your ASCII file; it's what
> you put in it that makes a specific XML environment brittle or not. It's
> just SGML after all - which is precisely what HTML is. The parser you
> are using sucks - sorry, but that's the root of your problem.

I thought that libxml2 was widely accepted, used by gnome, etc.
I checked the manpage and nowhere does it say "this parser sucks",
maybe I should submit a documentation bug?

Hoping for a quick fix, I tried the expat based parser instead
(which also has perl bindings) with the following program:


#!/usr/bin/perl -w

use XML::Parser;
my $parser = new XML::Parser(Style => 'Debug');
my $doc = $parser->parsefile( "podcast.xml" );


Which gives this result:

Unrecognized character \xE2 at ./test2.pl line 4.


Same problem, less detailed error message... at least the manpage
for this parser does explain how to set an encoding so the following
program does work:

#!/usr/bin/perl -w

use strict;
use XML::Parser;

# ProtocolEncoding tells expat which character encoding to assume for the input.
my $parser = XML::Parser->new(Style => 'Debug',
                              ProtocolEncoding => 'ISO-8859-1');
my $doc = $parser->parsefile("podcast.xml");


Sadly, this parser gives output in a different format so changing
parser has now broken the rest of my program *SOB*. On the wild
chance of an undocumented feature I went back to the original libxml2
based parser and tried inserting options from the expat bindings:


#!/usr/bin/perl -w

use XML::LibXML;

my $parser = XML::LibXML->new(ProtocolEncoding => 'ISO-8859-1');
$parser->recover(1);
$parser->pedantic_parser(0);
$parser->validation(0);
my $doc = $parser->parse_file( "podcast.xml" );
my $root = $doc->documentElement();


Frighteningly enough, this actually works... 

Woo hoo! I got XML to actually work!
 
> Convert the (probably cp-1252) text into UTF-8, then parse it. Or set an
> encoding in the header; it looks like the perl bindings suck a certain
> amount.

By the looks of it, the bindings are better than the manpage
is willing to admit. I still don't like XML because it is nutty
that it should screw up so easily. My feeling is that if this
sort of technology cannot make things EASIER to deal with then
might as well go with something that does.

My vote still goes to plain ASCII with single character delimiters
(e.g. TAB or one of the DLE/DCn set) because of simplicity.

- Tel  ( http://bespoke.homelinux.net/ )


Re: [SLUG] Why XML bites and why it is NOT a markup language

2005-06-09 Thread Robert Collins
On Thu, 2005-06-09 at 20:10 +1000, [EMAIL PROTECTED] wrote:
...
> Thus if anyone is going to design a communications language it
> should be robust, and that means it can recover from problems
> and can guarantee resynchronisation from an arbitrary seek.
> XML doesn't live up to the promise of being a universal markup
> language because it is too annoying and too brittle.

Uhm. Sure. Here's a gig of download; your 500K of usable detail can be
found spread throughout it.

Seriously, XML itself is no more brittle than your ASCII file; it's what
you put in it that makes a specific XML environment brittle or not. It's
just SGML after all - which is precisely what HTML is. The parser you
are using sucks - sorry, but that's the root of your problem.

> By the way, how DO I get perl to read such a file?
> 
> Do I have to write my own parser?

Convert the (probably cp-1252) text into UTF-8, then parse it. Or set an
encoding in the header; it looks like the perl bindings suck a certain
amount.
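
A minimal sketch of the first suggestion, in Perl, assuming the feed really
is cp-1252 (a guess in this thread, not a confirmed fact). It uses only the
core Encode module; cp1252_to_utf8 is a name made up for illustration, and
the XML parsing step itself is left out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: re-encode raw cp-1252 bytes as UTF-8 bytes.
# Assumption: the input really is cp-1252; Encode cannot verify that.
sub cp1252_to_utf8 {
    my ($bytes) = @_;
    my $text = decode('cp1252', $bytes);    # bytes -> Perl characters
    return encode('UTF-8', $text);          # characters -> UTF-8 bytes
}

# 0x92 is the cp-1252 "curly" apostrophe (U+2019).
my $utf8 = cp1252_to_utf8("It\x92s broken");

# Print the converted bytes in hex so the multi-byte sequence is visible.
print join(' ', map { sprintf '%02X', ord } split //, $utf8), "\n";
```

The converted bytes can then be handed to whichever parser is in use; a
conforming parser treats a document with no encoding declaration and no
byte order mark as UTF-8.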

Rob

-- 
GPG key available at: .



Re: [SLUG] Why XML bites and why it is NOT a markup language

2005-06-09 Thread telford

On Fri, Jun 10, 2005 at 02:14:05AM +1000, Jeff Waugh wrote:
> Keep in mind that all forms of RSS are absolute abominations, most feeds are
> completely broken, and it has not encouraged anyone to use XML properly.

RSS is an example of people trying to make XML work for a real-world
application. No doubt someone read the XML hype and figured that it would
be good for the purpose of data transfer.

> XML is
> quite good in general. RSS and all its related muck (as well as HTML if
> we're being honest) gives it a bad name, given that most people experience
> XML through one of these formats first.

But HTML has been amazingly successful in the "real world" where imperfection
is a way of life. Generally speaking, if you send a browser bad HTML you still
get some sort of rendered output... usually not too far from what you expected.

It seems to me that XML is only useful when the same programmer has control
over the sender and the receiver. For example when a program saves to XML then
loads from XML again (such as gnumeric). This doesn't constitute data transfer,
it is really just a library that makes it easier to save and load
structured data.

> Look at Mark Pilgrim's feedparser (used in Planet) to see just how stupid it
> all is.

But what XML people call "broken" is more accurately described as "normal",
the whole idea of a markup language is to be flexible and to have ways to
cope with minor incompatibility between sender and receiver.

- Tel  ( http://bespoke.homelinux.net/ )


Re: [SLUG] Why XML bites and why it is NOT a markup language

2005-06-09 Thread telford

On Thu, Jun 09, 2005 at 09:13:49PM +1000, Jamie Honan wrote:

> I think the key is 'validating'...

If you checked my perl example, I specifically turn validation off
in an attempt to get the data to load. It didn't help.

> > I've got three answers to the above. Most importantly, you don't
> > use markup language for talking to a mars rover... you use a 
> > programming language and we all agree that programming languages
> > are brittle and always will be. Another (still significant)
> > point is that you can always take a robust language (e.g. simple
> > TAB delimited text file) and make it robust by adding a CRC or
                                         ~~~~~~
> > some sort of signature system... you cannot take a brittle
> > language and make it robust. 
> 
> I'm having trouble parsing this :) robust and robust?

Sorry about that, should have said that you can take a
robust language and deliberately make it brittle by adding
a CRC. There's nothing you can add to a brittle language
in order to make it robust. Thus a "standard" data transfer
language should be designed to be robust and then have the
option for users to add a CRC or md5sum if required.
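
One way to picture that layering, as a rough Perl sketch: the tab-delimited
record stays readable on its own and an MD5 field is bolted on for those who
want it. seal_record and open_record are hypothetical names; Digest::MD5 is
a core module. The sketch assumes fields are byte strings containing no tabs
or newlines.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Append an MD5 of the record body as a final, optional field.
sub seal_record {
    my @fields = @_;
    my $body = join "\t", @fields;
    return $body . "\t" . md5_hex($body);
}

# Return the fields if the trailing checksum matches, or an empty list if
# the line was damaged. Only this line is lost; the next one is unaffected.
sub open_record {
    my ($line) = @_;
    chomp $line;
    my ($body, $sum) = $line =~ /\A(.*)\t([0-9a-f]{32})\z/s or return;
    return unless md5_hex($body) eq $sum;
    return split /\t/, $body, -1;
}

my $line = seal_record('speed', '1');
print join('|', open_record($line)), "\n";      # intact record

(my $damaged = $line) =~ s/\t1\t/\t3\t/;         # "1" flipped to "3"
print scalar(() = open_record($damaged)), " fields from damaged line\n";
```

A reader that does not care about the checksum can simply ignore the last
field, which is the point of making it an add-on rather than part of the
grammar.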

> That's got to be a different issue. You parse after you get
> known correct data ...

How is that possible? You have to parse what you get and
then figure out what it means. Having a program that believes
its inputs must be known correct before it will do anything
seems to contradict the most basic rules of data handling.

Thanks for the links...

- Tel  ( http://bespoke.homelinux.net/ )


Re: [SLUG] Why XML bites and why it is NOT a markup language

2005-06-09 Thread Jeff Waugh


Keep in mind that all forms of RSS are absolute abominations, most feeds are
completely broken, and it has not encouraged anyone to use XML properly. XML
is quite good in general. RSS and all its related muck (as well as HTML if
we're being honest) gives it a bad name, given that most people experience
XML through one of these formats first.

Look at Mark Pilgrim's feedparser (used in Planet) to see just how stupid it
all is.

- Jeff

-- 
GNOME Summit 2005: October 10th-12th    http://live.gnome.org/Boston2005
 
   "It's weird being without white noise." - Catie Flick


Re: [SLUG] Why XML bites and why it is NOT a markup language

2005-06-09 Thread Jamie Honan

> Yes I realise that in an ideal world the  tag would contain
> encoding information and yes I realise that in order to be correct UTF-8
> it must encode characters above 127 in a special way and this encoding

I'm not an xml fan nor an expert (calling Mike and others), but as I
understand things you have valid xml (which includes valid encoding) and
_other stuff_.

Plainly, you have _other stuff_ pretending to be xml. It wouldn't
pass a validating parser, thus it is not xml.

That's the whole point. It has to pass strict rules to be called xml.

> And here comes the gist of this rant...

Good, hope it's cathartic.

> The fundamental difference between a programming language and a markup
> language is that a programming language can have parser errors and
> syntax errors whereas a markup language cannot (by definition) have
> any errors at all under any circumstances. The parser for a markup

Not as I see it. The markup must pass tests as well.

> language must be fully robust to all possible inputs and although it
> certainly can result in various severity of WARNINGs but nothing must
> stop the parser.

OK. Present random bits to your black box, stand back and declare,
"Call yeself a parser eh, well parse this Jimmy!"

> Fundamentally, XML is crap as a markup language because it simply
> isn't possible to build a fully robust parser. Worse than that, you
> can't recover state (even approximately) in the presence of a damaged
> document, XML is brittle, as brittle as any programming language.

No, xml is just overhyped. That's not its fault. xml requires more
resources than csv files, but it does more.

> Let's make a simple comparison... suppose I do all my data transfer 
> by simple tab delimited ASCII files with one record per line.
> If a line gets damaged, I might lose that line, I might even lose
> the line after the damaged line but at least I have the rest  of the
> document. If I jump into a plain ASCII file at a random location then
> I can scan around the local area until I find the end of a line and
> I can resynchronise to the local records. This technique can be used
> to perform a fast binary search directly into an ASCII file that is
> sorted by line -- can you do this with XML? Of course not... your
> basic parse-state is broken the moment you seek to anywhere at all,
> and that state is perpetually unrecoverable because something that
> looks like a tag can exist within a string or you can have a CDATA
> or some other stupid thing.
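
For what it's worth, the resynchronising binary search described above can
be sketched in a few lines of Perl. This is only an illustration under
stated assumptions: find_record and first_field are made-up names, the file
is sorted bytewise on its first TAB-separated field, lines end in "\n", and
the handle is a seekable plain file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# First TAB-separated field of a line ('' for an empty line).
sub first_field {
    my ($line) = @_;
    chomp $line;
    my ($key) = split /\t/, $line, 2;
    return defined $key ? $key : '';
}

# Look up $key in a file sorted bytewise by first field.
# Returns the matching record (without newline), or undef if absent.
sub find_record {
    my ($fh, $key) = @_;
    my ($lo, $hi) = (0, -s $fh);
    # Find the smallest offset whose *following* line has a key ge $key.
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        seek $fh, $mid, 0;
        my $partial = <$fh>;    # resynchronise: skip to the next line start
        my $line    = <$fh>;
        if (!defined $line || first_field($line) ge $key) { $hi = $mid }
        else                                               { $lo = $mid + 1 }
    }
    # $lo is now a line start at most one record before the target.
    seek $fh, $lo, 0;
    while (my $line = <$fh>) {
        my $k = first_field($line);
        if ($k eq $key) { chomp $line; return $line }
        last if $k gt $key;
    }
    return;
}

# Small demonstration on a temporary three-record file.
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "apple\t1\n", "banana\t2\n", "cherry\t3\n";
close $out;

open my $fh, '<:raw', $path or die "open: $!";
my $hit = find_record($fh, 'banana');
print defined $hit ? $hit : 'not found', "\n";
```

Every probe lands at an arbitrary byte, throws away the partial line, and
carries on from the next newline, which is exactly the recovery step the
paragraph above relies on.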

I think the key is 'validating'...

> I've got three answers to the above. Most importantly, you don't
> use markup language for talking to a mars rover... you use a 
> programming language and we all agree that programming languages
> are brittle and always will be. Another (still significant)
> point is that you can always take a robust language (e.g. simple
> TAB delimited text file) and make it robust by adding a CRC or
> some sort of signature system... you cannot take a brittle
> language and make it robust. 

I'm having trouble parsing this :) robust and robust?

Maybe the difference between communication protocol and
markup?

> Finally, your brittle language
> still isn't full protection against a comms failure because
> sometimes a single bit flip (like turning "1" into "3") will
> have disastrous consequences to the command but will look fine to
> the parser. So the parser isn't real safety, it is at best a
> false sense of safety. You still need CRCs and the like.

That's got to be a different issue. You parse after you get
known correct data ...

> Thus if anyone is going to design a communications language it
> should be robust, and that means it can recover from problems
> and can guarantee resynchronisation from an arbitrary seek.
> XML doesn't live up to the promise of being a universal markup
> language because it is too annoying and too brittle.
> 
> By the way, how DO I get perl to read such a file?
> 
> Do I have to write my own parser?

Hack-o-matic. You think it's mac, yah? Maybe run it first
through something that converts to UTF-8 or whatever then try
to parse. But, fundamentally, you've been lied to. It isn't xml.

BTW:

A page of xml alternatives:
http://www.pault.com/pault/pxml/xmlalternatives.html

Some of my favourites, which you will hate because they
are similarly strict:

yaml (good perl support, very readable by humans)
http://www.yaml.org/

sexp
(two implementation versions - s expressions, think lisp data)

and my current belle-o-the-ball:
ubf

http://www.sics.se/~joe/ubf/site/home.html

ubf is stricter than xml, smaller and good for marshalling.

I'm glad you're cranky Telford, seeing someone else in a bad
mood makes me feel better.

Regards
Jamie


Re: [SLUG] Why XML bites and why it is NOT a markup language

2005-06-09 Thread Rev Simon Rumble
This one time, at band camp, [EMAIL PROTECTED] wrote:

> Sure enough there is a high-ascii item in there and some Mac user has
> no doubt used a proprietary bingle-bongle encoding for a single quote
> even though there is a perfectly good ASCII encoding for the same.

No there's not.  It's a different character from the tick mark '

> Never mind blaming the Mac user... they do that sort of thing, it's a
> fact of the universe, nothing will ever change a Mac user. However,
> what really shits me is that the XML parser dies totally and completely
> when it hits a single high-ascii character. This is with the "recover"
> flag set, and both "pedantic" and "validation" switched off. Basically
> it is running in the most lenient possible mode that it can possibly
> operate in and a single bad character still nails it.

It's not just Mac users.  MS Word also does this.  In fact, if they've 
copied-and-pasted from MS Word into IE, it should be in Unicode.  So 
the web form they're submitting it through is making the wrong 
assumption about the encoding, more likely.

Of course it doesn't help that UTF-8 _looks_ like ASCII most of the time 
in the English world, so the developer wasn't prodded to notice this.
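
That overlap is easy to check for mechanically. As a rough sketch
(classify_bytes is a hypothetical helper, not part of any library), the core
Encode module can at least tell pure ASCII from valid UTF-8 from everything
else; it cannot say which 8-bit encoding "everything else" is.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Classify a byte string as 'ascii', 'utf-8' or 'other'.
sub classify_bytes {
    my ($bytes) = @_;
    # No byte above 0x7F: indistinguishable between ASCII and UTF-8.
    return 'ascii' unless $bytes =~ /[^\x00-\x7F]/;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    # Strict decoding dies on malformed input; eval turns that into a flag.
    my $ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
    return $ok ? 'utf-8' : 'other';
}

# Plain apostrophe, UTF-8 curly apostrophe, cp-1252 curly apostrophe.
for my $sample ("It's", "It\xE2\x80\x99s", "It\x92s") {
    print classify_bytes($sample), "\n";
}
```

A feed made only of the first kind of string gives a developer no hint that
the encoding was never declared, which is how the problem goes unnoticed.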

Send a POLITE email to the ABC's webmasters.  They actually respond.  If 
you put the right keywords into it, it might even get forwarded to 
someone who understands it.

-- 
Rev Simon Rumble <[EMAIL PROTECTED]>
www.rumble.net

The Tourist Engineer
Because nerds travel too.
http://engineer.openguides.org/

 "The only intuitive interface is the nipple.
  After that, it's all learned. "
- attributed to Bruce Ediger

