Re: [Chicken-users] html->sxml (html-parser egg) does not decode entities in html attributes, ideas why?

Hi,

>> Thanks for your email. I'm somewhat confused by what you say. Through investigation, it seems html->sxml will decode entities, so long as they aren't within an HTML element attribute. Could you clarify whether that default applies globally or just to attributes?
>
> Yes, sorry, I misread my own code :)
>
> The default is to _decode_ entities:
>
> #;1> (html->sxml "&quot;")
> (*TOP* "\"")
>
> And as you say, it currently just doesn't process attributes:
>
> #;2> (html->sxml "<div data-foo=\"&quot;\">")
> (*TOP* (div (@ (data-foo "&quot;"))))
>
> I'll fix this.

Thanks for this Alex and sorry for taking so long to come back to you.

When Philip first reported this we were running html-parser 0.5.0 on CHICKEN 4.7.0. We're currently upgrading to CHICKEN 4.9.0 and we were trying the latest html-parser, version 0.5.2.

Unfortunately we've had a couple of problems: one with empty attributes and another that seems a bit more sinister.

html-parser 0.5.0 works on both 4.7.0 and 4.9.0. html-parser 0.5.1 and 0.5.2 don't work on either 4.7.0 or 4.9.0, so I've isolated the problem to changes introduced in 0.5.1.

Empty attributes now seem to decode to the string "()".

During &quot; deserialisation when inside an attribute, we seem to get data from earlier in the stream introduced:

(define empty "<div data=\"\">empty</div>")
(define content "<br>\r\n<br>\r\n<div data=\"(sxml (@ (attr &quot;12345&quot;)) body)\">div body</div>")

0.5.0
-----
#;> (html->sxml empty)
(*TOP* (div (@ (data "")) "empty"))
#;> (html->sxml content)
(*TOP* (br) "\r\n" (br) "\r\n" (div (@ (data "(sxml (@ (attr &quot;12345&quot;)) body)")) "div body"))

0.5.1
-----
#;> (html->sxml empty)
(*TOP* (div (@ (data "()")) "empty"))
#;> (html->sxml content)
(*TOP* (br) "\r\n" (br) "\r\n" (div (@ (data "(sxml (@ (attr \"\r\nbr\r\nbr12345\"\r\nbr\r\nbr)) body)")) "div body"))

The data in attr seems to be taken from data elsewhere:

#;> (html->sxml "<first>\r\n<br>\r\n<second /><div data=\"(sxml (@ (attr &quot;12345&quot;)) body)\">div body</div>")
(*TOP* (first "\r\n" (br) "\r\n" (second) (div (@ (data "(sxml (@ (attr \"second\r\nbr\r\n12345\"second\r\nbr\r\n)) body)")) "div body")))

Thanks for all your help maintaining this and, once again, sorry it took so long for us to put your newer versions into our code.

Regards,
@ndy

--
andy...@ashurst.eu.org
http://www.ashurst.eu.org/
0x7EBA75FF

___
Chicken-users mailing list
Chicken-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-users

Re: [Chicken-users] html->sxml (html-parser egg) does not decode entities in html attributes, ideas why?

On Fri, May 9, 2014 at 6:44 AM, Andy Bennett andy...@ashurst.eu.org wrote:

> Empty attributes now seem to decode to the string "()".

Fixed.

> During &quot; deserialisation when inside an attribute, we seem to get data from earlier in the stream introduced:

I couldn't reproduce this. Could you check with the latest fix?

Re: [Chicken-users] html->sxml (html-parser egg) does not decode entities in html attributes, ideas why?

Hi,

>> Empty attributes now seem to decode to the string "()".
>
> Fixed.

Thanks! :-)

That works for me now:

-----
#;4> (html->sxml empty)
(*TOP* (div (@ (data "")) "empty"))
-----

>> During &quot; deserialisation when inside an attribute, we seem to get data from earlier in the stream introduced:
>
> I couldn't reproduce this. Could you check with the latest fix?

Which CHICKEN are you using?

I can reproduce it with 0.5.2 on 4.9.0rc1:

-----
#;5> (html->sxml content)
(*TOP* (br) "\r\n" (br) "\r\n" (div (@ (data "(sxml (@ (attr \"\r\nbr\r\nbr12345\"\r\nbr\r\nbr)) body)")) "div body"))
-----

...but not with 0.5.2 on 4.8.0.4.

-----
#;4> (html->sxml content)
(*TOP* (br) "\r\n" (br) "\r\n" (div (@ (data "(sxml (@ (attr &quot;12345&quot;)) body)")) "div body"))
-----

With 0.5.3 on 4.9.0rc1 it seems to work:

-----
#;5> (html->sxml content)
(*TOP* (br) "\r\n" (br) "\r\n" (div (@ (data "(sxml (@ (attr \"12345\")) body)")) "div body"))
-----

...but perhaps it's worth chasing this down a bit further?

Thanks for all your help with this. :-)

Regards,
@ndy

Re: [Chicken-users] html->sxml (html-parser egg) does not decode entities in html attributes, ideas why?

On Fri, May 9, 2014 at 8:26 AM, Andy Bennett andy...@ashurst.eu.org wrote:

> Which CHICKEN are you using?
>
> I can reproduce it with 0.5.2 on 4.9.0rc1:

Never mind, I had only checked 0.5.3. I can see it in 0.5.2.

Re: [Chicken-users] html->sxml (html-parser egg) does not decode entities in html attributes, ideas why?

On Sat, Nov 23, 2013 at 11:19 AM, Jim Ursetto zbignie...@gmail.com wrote:

> Alex,
>
> Looks like there's a regression of sorts in html-parser 0.5.1.
>
> 0.5.0
> #;> (html->sxml "<foo bar></foo>")
> (*TOP* (foo (@ (bar))))
>
> 0.5.1
> #;> (html->sxml "<foo bar></foo>")
> Error: (cadr) bad argument type: ()

Oops, fixed.

> Arguably, empty attributes should result in a value of "" as per http://dev.w3.org/html5/markup/syntax.html#syntax-attr-empty ; for example,
>
> #;> (html->sxml "<foo bar></foo>")
> (*TOP* (foo (@ (bar ""))))
>
> although I'd also be satisfied with a return to the status quo ante, in which a null cdr signifies empty.

Given that I can see pros and cons to both approaches, I'm inclined to leave as-is for now.

--
Alex

> Jim
>
> On Sep 8, 2013, at 7:30 AM, Alex Shinn alexsh...@gmail.com wrote:
>
>> On Thu, Sep 5, 2013 at 12:39 AM, Philip Kent phi...@knodium.com wrote:
>>
>>> Hi Alex,
>>>
>>> Excellent! Thanks for looking into it and for the tip re custom parsers - I was trying to understand that code!
>>
>> It should work now, let me know if you have any problems.
>>
>> --
>> Alex

Re: [Chicken-users] html->sxml (html-parser egg) does not decode entities in html attributes, ideas why?

Alex,

Looks like there's a regression of sorts in html-parser 0.5.1.

0.5.0
#;> (html->sxml "<foo bar></foo>")
(*TOP* (foo (@ (bar))))

0.5.1
#;> (html->sxml "<foo bar></foo>")
Error: (cadr) bad argument type: ()

Arguably, empty attributes should result in a value of "" as per http://dev.w3.org/html5/markup/syntax.html#syntax-attr-empty ; for example,

#;> (html->sxml "<foo bar></foo>")
(*TOP* (foo (@ (bar ""))))

although I'd also be satisfied with a return to the status quo ante, in which a null cdr signifies empty.

Jim

On Sep 8, 2013, at 7:30 AM, Alex Shinn alexsh...@gmail.com wrote:

> On Thu, Sep 5, 2013 at 12:39 AM, Philip Kent phi...@knodium.com wrote:
>
>> Hi Alex,
>>
>> Excellent! Thanks for looking into it and for the tip re custom parsers - I was trying to understand that code!
>
> It should work now, let me know if you have any problems.
>
> --
> Alex

Re: [Chicken-users] html->sxml (html-parser egg) does not decode entities in html attributes, ideas why?

Hi Alex,

Thank you for fixing this. Unfortunately I am not able to test this right now, but Andy (andyjpb) should be able to; I'll see what he says.

Thanks,
Philip

From: Alex Shinn alexsh...@gmail.com
Date: Sunday, 8 September 2013 13:30
To: Philip Kent phi...@knodium.com
Cc: chicken-users@nongnu.org
Subject: Re: [Chicken-users] html->sxml (html-parser egg) does not decode entities in html attributes, ideas why?

On Thu, Sep 5, 2013 at 12:39 AM, Philip Kent phi...@knodium.com wrote:

> Hi Alex,
>
> Excellent! Thanks for looking into it and for the tip re custom parsers - I was trying to understand that code!

It should work now, let me know if you have any problems.

--
Alex

Re: [Chicken-users] html->sxml (html-parser egg) does not decode entities in html attributes, ideas why?

On Thu, Sep 5, 2013 at 12:39 AM, Philip Kent phi...@knodium.com wrote:

> Hi Alex,
>
> Excellent! Thanks for looking into it and for the tip re custom parsers - I was trying to understand that code!

It should work now, let me know if you have any problems.

--
Alex

Re: [Chicken-users] html->sxml (html-parser egg) does not decode entities in html attributes, ideas why?

Hi Alex,

Thanks for your email. I'm somewhat confused by what you say. Through investigation, it seems html->sxml will decode entities, so long as they aren't within an HTML element attribute. Could you clarify whether that default applies globally or just to attributes?

Thanks,
Philip

From: Alex Shinn alexsh...@gmail.com
Sent: 04 September 2013 03:51
To: Philip Kent
Cc: chicken-users@nongnu.org
Subject: Re: [Chicken-users] html->sxml (html-parser egg) does not decode entities in html attributes, ideas why?

On Tue, Sep 3, 2013 at 11:19 PM, Philip Kent phi...@knodium.com wrote:

> Hi all,
>
> I noticed an issue today with the html-parser egg, where it does not seem to decode entities within an attribute of an element. I have included an example below.
>
> #;14> (html->sxml "<div data-foo=\"&quot;\">")
> (*TOP* (div (@ (data-foo "&quot;"))))
>
> Expected: (*TOP* (div (@ (data-foo "\""))))
>
> I was wondering if anyone could provide some thoughts as to why this might be happening? I have taken a look at the html-parser egg but have not seen much (but then this goes far beyond my knowledge of Scheme!)

html-parser processes entities, but the default for html->sxml is just to leave them encoded as-is. I'm not sure if that's the best default, but I will at least provide a convenient option to get the decoded strings.

--
Alex

Re: [Chicken-users] html->sxml (html-parser egg) does not decode entities in html attributes, ideas why?

Hi Alex,

Excellent! Thanks for looking into it and for the tip re custom parsers - I was trying to understand that code!

Philip

From: Alex Shinn alexsh...@gmail.com
Sent: 04 September 2013 14:00
To: Philip Kent
Cc: chicken-users@nongnu.org
Subject: Re: [Chicken-users] html->sxml (html-parser egg) does not decode entities in html attributes, ideas why?

On Wed, Sep 4, 2013 at 8:23 PM, Philip Kent phi...@knodium.com wrote:

> Hi Alex,
>
> Thanks for your email. I'm somewhat confused by what you say. Through investigation, it seems html->sxml will decode entities, so long as they aren't within an HTML element attribute. Could you clarify whether that default applies globally or just to attributes?

Yes, sorry, I misread my own code :)

The default is to _decode_ entities:

#;1> (html->sxml "&quot;")
(*TOP* "\"")

And as you say, it currently just doesn't process attributes:

#;2> (html->sxml "<div data-foo=\"&quot;\">")
(*TOP* (div (@ (data-foo "&quot;"))))

I'll fix this.

What I was referring to before is that you can customize what is done with entities with

(make-html-parser 'entity: (lambda (name) ...))

and can customize non-default entity names:

(make-html-parser 'entities: '(("quot" . "\"") ...))

but again, these are currently ignored in attributes.

--
Alex
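
For readers following along, here is a minimal sketch of the idea behind an entity alist of the shape shown above, applied by hand to a raw attribute string. It is plain Scheme written for illustration and is not html-parser's own code; the names `default-entities`, `index-of` and `decode-entities` are hypothetical and do not come from the egg.

```scheme
;; Hypothetical helpers for illustration only; not part of html-parser.
;; The alist has the same shape as the 'entities: argument: name . replacement.
(define default-entities
  '(("quot" . "\"") ("amp" . "&") ("lt" . "<") ("gt" . ">")))

;; Index of the first occurrence of char c in s at or after start, or #f.
(define (index-of s c start)
  (let loop ((i start))
    (cond ((>= i (string-length s)) #f)
          ((char=? (string-ref s i) c) i)
          (else (loop (+ i 1))))))

;; Replace each "&name;" whose name appears in `entities`.
;; Unknown names and a bare "&" are copied through unchanged.
(define (decode-entities s entities)
  (let loop ((start 0) (acc '()))
    (let ((amp (index-of s #\& start)))
      (if (not amp)
          ;; No more "&": append the tail and join the collected pieces.
          (apply string-append
                 (reverse (cons (substring s start (string-length s)) acc)))
          (let* ((semi (index-of s #\; (+ amp 1)))
                 (hit (and semi
                           (assoc (substring s (+ amp 1) semi) entities))))
            (if hit
                ;; Known entity: emit the text before it, then the replacement.
                (loop (+ semi 1)
                      (cons (cdr hit) (cons (substring s start amp) acc)))
                ;; Not a known entity: keep the "&" and continue after it.
                (loop (+ amp 1)
                      (cons (substring s start (+ amp 1)) acc))))))))
```

Under these assumptions, `(decode-entities "&quot;12345&quot;" default-entities)` yields the seven-character string `"12345"` including its quotes, which is the result the thread expects for entities inside an attribute value. Each entity is decoded in a single left-to-right pass, so `&amp;quot;` becomes `&quot;` rather than a quote character.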

[Chicken-users] html->sxml (html-parser egg) does not decode entities in html attributes, ideas why?

Hi all,

I noticed an issue today with the html-parser egg, where it does not seem to decode entities within an attribute of an element. I have included an example below.

#;14> (html->sxml "<div data-foo=\"&quot;\">")
(*TOP* (div (@ (data-foo "&quot;"))))

Expected: (*TOP* (div (@ (data-foo "\""))))

I was wondering if anyone could provide some thoughts as to why this might be happening? I have taken a look at the html-parser egg but have not seen much (but then this goes far beyond my knowledge of Scheme!)

DerGuteMoritz mentioned on IRC that htmlprag behaves the same way.

Any help you can give would be appreciated greatly!

Thanks,
Philip Kent

Re: [Chicken-users] html->sxml (html-parser egg) does not decode entities in html attributes, ideas why?

On Tue, Sep 3, 2013 at 8:51 PM, Alex Shinn alexsh...@gmail.com wrote:

> html-parser processes entities, but the default for html->sxml is just to leave them encoded as-is. I'm not sure if that's the best default,

I'm not going to suggest that this is a major problem, especially since you are not claiming html-parser conforms to any particular standard, and the docs clearly indicate its pragmatic focus. But just for the record, if you wanted to be an XML-1.1-conformant processor, you would have to normalize attribute values, which includes dereferencing character entities:

http://www.w3.org/TR/xml11/#AVNormalize

As for the non-XML varieties of HTML, well ... life is too short to go digging into all that hoary SGML stuff. Did that once upon a time ... but I was younger then, and thought markup languages were the greatest thing since sliced bread ;-)

--
Matt Gushee
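
To make the normalization point concrete, here is a rough sketch of what that step looks like for a CDATA attribute, as I read the linked section: whitespace characters become a single space each, and references are replaced by the character they name. It is a simplification and not a conformant implementation. It only knows the five predefined XML entities, only accepts numeric references in the ASCII range, and all helper names (`normalize-attribute-value` and friends) are hypothetical.

```scheme
;; Simplified sketch of XML attribute-value normalization (CDATA case).
;; All names here are hypothetical; this is not html-parser code.
(define predefined-entities
  '(("quot" . "\"") ("amp" . "&") ("apos" . "'") ("lt" . "<") ("gt" . ">")))

;; Space, carriage return, line feed and tab count as XML whitespace.
(define (xml-whitespace? c)
  (memv (char->integer c) '(32 13 10 9)))

;; Assumption: restrict numeric references to ASCII to stay portable.
(define (code-point->string n)
  (and n (exact? n) (integer? n) (< 0 n 128)
       (string (integer->char n))))

;; Resolve the text between "&" and ";" to a string, or #f if unrecognized.
(define (resolve-reference ref)
  (let ((len (string-length ref)))
    (cond ((and (> len 2) (string=? (substring ref 0 2) "#x"))
           (code-point->string (string->number (substring ref 2 len) 16)))
          ((and (> len 1) (char=? (string-ref ref 0) #\#))
           (code-point->string (string->number (substring ref 1 len) 10)))
          (else
           (let ((hit (assoc ref predefined-entities)))
             (and hit (cdr hit)))))))

;; Index of the next ";" in s at or after start, or #f.
(define (find-semicolon s start)
  (let loop ((i start))
    (cond ((>= i (string-length s)) #f)
          ((char=? (string-ref s i) #\;) i)
          (else (loop (+ i 1))))))

(define (normalize-attribute-value s)
  (let ((len (string-length s)))
    (let loop ((i 0) (acc '()))
      (if (>= i len)
          (apply string-append (reverse acc))
          (let ((c (string-ref s i)))
            (cond ((xml-whitespace? c)
                   ;; Literal whitespace is replaced by one space.
                   (loop (+ i 1) (cons " " acc)))
                  ((char=? c #\&)
                   (let* ((semi (find-semicolon s (+ i 1)))
                          (text (and semi
                                     (resolve-reference
                                      (substring s (+ i 1) semi)))))
                     (if text
                         ;; Referenced text is appended as-is, not re-scanned.
                         (loop (+ semi 1) (cons text acc))
                         (loop (+ i 1) (cons "&" acc)))))
                  (else (loop (+ i 1) (cons (string c) acc)))))))))
```

Note the asymmetry this sketch preserves: a literal tab in the attribute becomes a space, but a newline written as `&#10;` survives as a newline, because referenced characters are appended without being normalized again.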