Michael Lazzaro wrote:
So, is anyone working on a P6ML, and/or is there any discussion/agreement of what it would entail?

Imho P6ML is a bad idea, if it means what I think it means (creating a parser for quasi-MLs). People will laugh at our folly, and rightly so, for trying to parse all the horrors of the world in a sensible manner will lead to the same madness that happened with HTML. People will also hate us, and rightly so, for increasing tolerance for that kind of behaviour goes against the work accomplished over the past five years.


There are a number of pockets of bugosity that still produce broken XML, but they are being quenched one by one. Being too kind to them will only encourage them. As someone who works with XML every single second of my work time (and much of my fun time), I understand only too well the frustration of developers faced with other people's buggy output, and I do want to help. But as someone who also had to parse other people's random formats before we had XML, I would like to stress that the current situation is *much* better than it was. Encouraging people to produce broken data by putting effort into that area at more or less language-level visibility is a step backwards ("Oh, it's broken but they use Perl so it doesn't matter").

If it means creating a /toolset/ to make recovering data from quasi-XML (aka tag soup) easier, then it is an interesting area of research. I can think of two approaches:

- have a parametrisable XML grammar. By default it would really parse XML, and barf with extreme prejudice on errors. However, individual rules would be relaxable and modifiable to accept different, possibly slightly broken, input. This is imho the least desirable approach.

- base a quasi-parser on something that does quasi-parsing well, namely an HTML parser, which would be wrapped to look like an XML parser but would be able to correct most typical problems (poorly defined entities, missing end tags, encoding errors, etc). Advantages are: a) it addresses 98% of existing problems, b) trying to solve the remaining issues in any non ad hoc manner is suicidal, c) it can be pointed out to developers in trouble, and d) it has very low general public visibility. Oh, and e) the perl-xml community is already on it, expect something in the coming month.
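
To make the second approach a bit more concrete, here is a rough sketch of the wrapping idea. It is written in Python rather than Perl purely for illustration, using the standard library's html.parser and xml.etree.ElementTree. The names SoupRecoverer and parse are made up for this example, and it is in no way the tool the perl-xml community is working on. It stays strict by default and only falls back to the lenient HTML tokenizer when explicitly asked to:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class SoupRecoverer(HTMLParser):
    """Hypothetical wrapper: lenient HTML tokenizer in, well-formed XML tree out."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # names of elements opened but not yet closed

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; XML needs a string value.
        fixed = {name: (value if value is not None else "") for name, value in attrs}
        self.builder.start(tag, fixed)
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: drop it
        # Close every element left open since the matching start tag.
        while True:
            current = self.open_tags.pop()
            self.builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # ignore text outside the root element
            self.builder.data(data)

    def recover(self, text):
        self.feed(text)
        self.close()
        while self.open_tags:  # close whatever is still open at end of input
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def parse(text, recover=False):
    """Strict by default; recovery only when the caller asks for it."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        if not recover:
            raise
        return SoupRecoverer().recover(text)


# Unquoted attribute and two missing end tags: rejected unless recover=True.
root = parse("<doc><item id=1>first<item>second</doc>", recover=True)
print(ET.tostring(root, encoding="unicode"))
```

Note the limits: the HTML tokenizer lowercases element names, and since the sketch has no idea which elements are meant to be siblings, the second item ends up nested inside the first. That is exactly the kind of ad hoc guess a real tool would have to make per format.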

Either way, I really think it shouldn't be called P6ML.

--
Robin Berjon <[EMAIL PROTECTED]>
Research Engineer, Expway        http://expway.fr/
7FC0 6F5F D864 EFB8 08CE  8E74 58E6 D5DB 4889 2488


