Re: [xml] HTML Parser problems with chunk parser if HTML keywords overlap chunk border

Daniel Veillard Wed, 21 Jun 2006 07:55:46 -0700

On Wed, Jun 21, 2006 at 04:29:56PM +0200, Cyrill Osterwalder wrote:
> Hi all
> 
> After some more research I believe I have found the reason for the
> problem with the CDATA parsing. In case PARSE_HTML_RECOVER is true, the
> following check in htmlParseTryOrFinish() is not sufficient before calling
> htmlParseScript():
> 
> /*
>  * Handle SCRIPT/STYLE separately
>  */
> if ((!terminate) &&
>     (htmlParseLookupSequence(ctxt, '<', '/', 0, 0) < 0))
>         goto done;
> htmlParseScript(ctxt);
> 
> 
> This code makes sure that there is an end tag starting somewhere in the
> buffer that is going to be processed by htmlParseScript(). However, in
> recovery mode, htmlParseScript() will consume the "</" characters if the
> real CDATA end tag is not fully inside the current chunk (as described
> in the problem report).


  True. I was thinking about something like that. This is all due to
script and style having different parsing constraints.
  Why do you use PARSE_HTML_RECOVER? The parser already does recovery
to some extent without it (I mean the HTML parser :-).

> I don't have a patch recommendation for the moment but I see two
> possibilities:
> 
> a) htmlParseTryOrFinish() could guarantee that the buffer contains the
> desired close tag (or terminate is true). I guess that this could be
> done using multiple htmlParseLookupSequence() calls and checking for the
> tag name in a loop...?

  Hmm, well, we could check for the current element and make 2 specific
tests in that case. This would be very hard anyway: people are going to come
with '</ style' or '</foo>' and expect that to close the open tag, and
'style "</" style' and expect it not to close it...
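
  As a rough illustration of option a), the extra check could look
something like the sketch below. find_close_tag() is a hypothetical
helper, not a libxml2 function, and it works on a plain char buffer
rather than on the parser context. It only recognizes the plain
'</name' form, so the '</ style' and quoted "</" cases are deliberately
left unhandled.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of libxml2): return the offset of the
 * first "</" in buf that is followed by the complete element name,
 * compared case-insensitively, or -1 if no such sequence lies fully
 * inside the first len bytes.
 */
static int find_close_tag(const char *buf, size_t len, const char *name)
{
    size_t nlen = strlen(name);
    size_t i, k;

    /* Only positions where "</" plus the whole name fit are candidates. */
    for (i = 0; i + 2 + nlen <= len; i++) {
        if (buf[i] != '<' || buf[i + 1] != '/')
            continue;
        for (k = 0; k < nlen; k++) {
            if (tolower((unsigned char)buf[i + 2 + k]) !=
                tolower((unsigned char)name[k]))
                break;
        }
        if (k == nlen)
            return (int)i;
    }
    return -1;
}
```

  With something like this, the push parser would wait for more data
(unless terminate is set) whenever the result is negative, instead of
handing a half-visible end tag to htmlParseScript().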
  
> b) htmlParseScript would have to be more powerful in order to recognize
> that it is trying to do xmlStrncasecmp() on an incomplete tag string. In
> that case it should break and be called again by htmlParseTryOrFinish().
> That on the other hand would have to be more careful with the switch to
> the end tag processing after the call to htmlParseScript().

  Not sure it's much better
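
  For comparison, the core of option b) is a three-way answer at each
"</": match, no match, or not enough data yet. A minimal sketch,
assuming the caller knows how many bytes remain in the chunk;
check_close_tag() and the enum are hypothetical names, not libxml2 API:

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type (not part of libxml2). */
enum close_match { CLOSE_NO_MATCH, CLOSE_MATCH, CLOSE_INCOMPLETE };

/*
 * Hypothetical helper: cur points at a possible "</name" and avail is
 * the number of bytes left in the current chunk. The bytes that are
 * present are compared case-insensitively; if they all agree but the
 * chunk ends before the name is complete, report CLOSE_INCOMPLETE so
 * the caller can stop and retry once more data has arrived.
 */
static enum close_match check_close_tag(const char *cur, size_t avail,
                                        const char *name)
{
    size_t nlen = strlen(name);
    size_t have, n, k;

    if (avail >= 1 && cur[0] != '<')
        return CLOSE_NO_MATCH;
    if (avail >= 2 && cur[1] != '/')
        return CLOSE_NO_MATCH;
    if (avail < 2)
        return CLOSE_INCOMPLETE;

    have = avail - 2;                 /* bytes available after "</" */
    n = (have < nlen) ? have : nlen;  /* bytes that can be compared */
    for (k = 0; k < n; k++) {
        if (tolower((unsigned char)cur[2 + k]) !=
            tolower((unsigned char)name[k]))
            return CLOSE_NO_MATCH;
    }
    return (have < nlen) ? CLOSE_INCOMPLETE : CLOSE_MATCH;
}
```

  The CLOSE_INCOMPLETE case is the one that a plain xmlStrncasecmp() on
the remaining input cannot distinguish from a mismatch.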

> Possibility a) looks better to me and I might try to implement an
> example patch.

  You can try, but it's all very messy IMHO. I will take patches if they
are not obviously broken (it could be a good idea to provide examples for
the test suite too).

   thanks

Daniel

-- 
Daniel Veillard      | Red Hat http://redhat.com/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml
