Re: Standalone html parser
Anees Shaikh [EMAIL PROTECTED] writes:

> I'm trying to use the code in html-parse.c (v1.7) in standalone mode

Excellent!

> For some reason, <img src=...> tags are recognized but then skipped
> almost every time they are encountered. When using the full program
> and recursive retrieve, the images are in fact retrieved so it seems
> that the parser does work correctly when not in standalone mode.
>
> It seems that the following condition is met when parsing img tag
> attributes
>
>   /* Establish bounds of attribute name. */
>   attr_name_begin = p;          /* <foo bar ...> */
>                                 /*      ^        */
>   while (NAME_CHAR_P (*p))
>     ADVANCE (p);
>   attr_name_end = p;            /* <foo bar ...> */
>                                 /*         ^     */
>   if (attr_name_begin == attr_name_end)
>     goto backout_tag;
>
> Can someone shed some light on this?

For some reason, the parser does not advance past the attribute name.
Try going into the debugger and printing the value of P. You should
find out why the parser refuses to advance beyond attr_name_begin.
Perhaps it thinks it has reached the end of file? (Are you calling it
with the proper text length?) Perhaps the text is corrupted due to
another bug in your program and the attribute name is invalid? A
number of things could be wrong.

When I wrote the parser, I primarily tested it in standalone mode, so
it should work thus.
Re: Q: (problem) wget on dos/win: question marks in url
Did you try putting quotes around the URL? Don't have time to test at
the moment, but my memory is that I used quoted URL in (this or
another) DOS app and it did work.

Rick

On Fri, 29 Jun 2001, Reto Kohli wrote:

> hello list,
>
> i ran across a problem today which is so obvious that i'm wondering
> why there is no built-in workaround (yet?) in the windows port..
> (i am not subscribed to this list, so please flame me at
> [EMAIL PROTECTED]; thanks! ;)
>
> consider the following wget call (and think dos:)
>
>   wget http://mydomain.org/index.html?foo=bar
>
> well? -- dos (and your average windows, too) does of course not allow
> you to write files with question marks in their filenames! so, wget
> will complain
>
>   Cannot write to 'mydomain.org/index.html?foo=bar',
>
> which is sadly true, and i will never get those wonderful pages..
> (b.g. die die die! ;)
>
> -- can this be circumdone without recompiling wget?
> -- any suggestions?
>
> thank you very much and sorry if this has already been asked 1e99
> times.
>
> ** please cc: any reply to [EMAIL PROTECTED] **
>
> gimi
>
> --
> 3d animation and video postproduction
> mailto:[EMAIL PROTECTED]
> http://www.psico.ch/

Rick Palazola
Scientific Software Engineer
Mutant Mouse Informatics Coordinating Center
Project Site: http://www.jax.org/mmrc/icc/index.html
email: [EMAIL PROTECTED]
phone: 207-288-6440
The Jackson Laboratory
600 Main Street
Bar Harbor, ME 04609-1500
Re: Q: (problem) wget on dos/win: question marks in url
On 29 Jun 2001, at 8:13, Rick Palazola [EMAIL PROTECTED] wrote:

> On Fri, 29 Jun 2001, Reto Kohli wrote:
>
> > consider the following wget call (and think dos:)
> >
> >   wget http://mydomain.org/index.html?foo=bar
> >
> > well? -- dos (and your average windows, too) does of course not
> > allow you to write files with question marks in their filenames!
> > so, wget will complain
> >
> >   Cannot write to 'mydomain.org/index.html?foo=bar',
> >
> > which is sadly true, and i will never get those wonderful pages..
> > (b.g. die die die! ;)
>
> Did you try putting quotes around the URL? Don't have time to test
> at the moment, but my memory is that I used quoted URL in (this or
> another) DOS app and it did work.

The following characters are not legal in DOS filenames, even long
(VFAT) filenames:

  \ / : * ? " < > |
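Since the error comes from writing the file rather than from reading the
command line, one conceivable workaround is to map the illegal
characters to something harmless before the file is created. The sketch
below only illustrates that idea: sanitize_dos_component is a
hypothetical helper, not a Wget function, and the choice of replacement
character is an assumption.

```c
#include <assert.h>
#include <string.h>

/* Characters that DOS and Windows (FAT, VFAT) reject in file names.  */
static const char dos_illegal[] = "\\/:*?\"<>|";

/* Hypothetical helper, not part of Wget: replace every illegal
   character in NAME (a single path component, so '/' and '\' are
   replaced too) with REPLACEMENT, in place.  Returns the number of
   characters that were replaced.  */
static int
sanitize_dos_component (char *name, char replacement)
{
  int changed = 0;

  assert (name != NULL);
  /* The replacement itself must be a legal, non-NUL character.  */
  assert (strchr (dos_illegal, replacement) == NULL);

  for (; *name; name++)
    if (strchr (dos_illegal, *name))
      {
        *name = replacement;
        changed++;
      }
  return changed;
}
```

Applied to the component "index.html?foo=bar" with '_' as the
replacement, this would give "index.html_foo=bar". It has to be applied
per component, after the URL path has been split into directories,
because it treats '/' as illegal as well.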
Re: Standalone html parser
So I think the problem is with malformed img tags. The parser fails if
the tag is of this form:

  <img src="/library/homepage/images/curve.gif" alt="" border="0" />

Note the end of the tag is closed with /> instead of just > as in the
spec. When the parser finds the / it sets attr_name_begin to the / and
then attr_name_end gets set to the same thing. If I edit the html file
to change the tag to:

  <img src="/library/homepage/images/curve.gif" alt="" border="0">

it is recognized correctly.

Unfortunately in this case the parser also seg faults in the call to
strlen() in the array_allowed() function. I haven't looked closely at
this yet but it only shows up when a follow_tags list is passed to
map_html_tags. That is, if you use NULL pointers for follow_tags and
follow_attrs, there is no seg fault. I was asking the parser to just
tell me about img tags.

This problem with img tags seems to be quite common (redhat.com,
ibm.com, microsoft.com) maybe due to some authoring tools.

Thanks.

-- Anees

> > For some reason, <img src=...> tags are recognized but then
> > skipped almost every time they are encountered. When using the
> > full program and recursive retrieve, the images are in fact
> > retrieved so it seems that the parser does work correctly when not
> > in standalone mode.
> >
> > It seems that the following condition is met when parsing img tag
> > attributes
> >
> >   /* Establish bounds of attribute name. */
> >   attr_name_begin = p;          /* <foo bar ...> */
> >                                 /*      ^        */
> >   while (NAME_CHAR_P (*p))
> >     ADVANCE (p);
> >   attr_name_end = p;            /* <foo bar ...> */
> >                                 /*         ^     */
> >   if (attr_name_begin == attr_name_end)
> >     goto backout_tag;
> >
> > Can someone shed some light on this?
>
> For some reason, the parser does not advance past the attribute
> name. Try going into the debugger and printing the value of P. You
> should find out why the parser refuses to advance beyond
> attr_name_begin. Perhaps it thinks it has reached the end of file?
> (Are you calling it with the proper text length?) Perhaps the text
> is corrupted due to another bug in your program and the attribute
> name is invalid? A number of things could be wrong.
>
> When I wrote the parser, I primarily tested it in standalone mode,
> so it should work thus.
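The failure described in this message can be reproduced outside the
parser with a few lines of C. This is only a sketch: name_char_p is an
assumed stand-in for NAME_CHAR_P (the real macro may accept a different
character set), and attr_name_length is a hypothetical helper that
mirrors the quoted scanning loop.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Assumed stand-in for the parser's NAME_CHAR_P; the real macro may
   accept a different set of characters.  */
static int
name_char_p (int c)
{
  return isalnum (c) || c == '-' || c == '_' || c == '.' || c == ':';
}

/* Return the length of the attribute name starting at P, never
   reading at or past END.  A result of 0 corresponds to the
   attr_name_begin == attr_name_end case that jumps to backout_tag.  */
static size_t
attr_name_length (const char *p, const char *end)
{
  const char *begin = p;

  assert (p != NULL && end != NULL && p <= end);
  while (p < end && name_char_p ((unsigned char) *p))
    p++;
  return (size_t) (p - begin);
}
```

Called on text that starts at `border="0" />`, the helper reports a
six-character name. Called on the remaining `/>`, it reports zero,
because '/' is not a name character, and that is the condition under
which the whole tag is backed out.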
Re: Standalone html parser
Anees Shaikh [EMAIL PROTECTED] writes:

> So I think the problem is with malformed img tags. The parser fails
> if the tag is of this form:
>
>   <img src="/library/homepage/images/curve.gif" alt="" border="0" />
> [...]
> This problem with img tags seems to be quite common (redhat.com,
> ibm.com, microsoft.com) maybe due to some authoring tools.

That's supposed to be legal XML and some people are using it for XHTML
compliance -- the final / says that the tag is closed immediately. I
plan to fix the parser to not barf on it.

> Note the end of the tag is closed with /> instead of just > as in
> the spec. When the parser finds the / it sets attr_name_begin to the
> / and then attr_name_end gets set to the same thing.

Yes. If it weren't for the XML novelties, it would be a feature.

> Unfortunately in this case the parser also seg faults in the call to
> strlen() in the array_allowed() function.

Be careful that you're correctly consing up the arrays (they have to
be NULL-terminated) and that your stack isn't corrupted or something
like that. Also, I've recently fixed an important bug in
DO_REALLOC_FROM_ALLOCA:

Index: wget.h
===================================================================
RCS file: /pack/anoncvs/wget/src/wget.h,v
retrieving revision 1.23
retrieving revision 1.25
diff -u -r1.23 -r1.25
--- wget.h      2001/05/27 19:35:15     1.23
+++ wget.h      2001/06/26 09:48:51     1.25
@@ -231,24 +231,24 @@
 {                                                                \
   /* Avoid side-effectualness.  */                               \
   long do_realloc_needed_size = (needed_size);                   \
-  long do_realloc_newsize = 0;                                   \
-  while ((sizevar) < (do_realloc_needed_size)) {                 \
-    do_realloc_newsize = 2*(sizevar);                            \
+  long do_realloc_newsize = (sizevar);                           \
+  while (do_realloc_newsize < do_realloc_needed_size) {          \
+    do_realloc_newsize <<= 1;                                    \
     if (do_realloc_newsize < 16)                                 \
       do_realloc_newsize = 16;                                   \
-    (sizevar) = do_realloc_newsize;                              \
   }                                                              \
-  if (do_realloc_newsize)                                        \
+  if (do_realloc_newsize != (sizevar))                           \
     {                                                            \
       if (!allocap)                                              \
         XREALLOC_ARRAY (basevar, type, do_realloc_newsize);      \
       else                                                       \
         {                                                        \
           void *drfa_new_basevar = xmalloc (do_realloc_newsize); \
-          memcpy (drfa_new_basevar, basevar, sizevar);           \
+          memcpy (drfa_new_basevar, basevar, (sizevar));         \
           (basevar) = drfa_new_basevar;                          \
           allocap = 0;                                           \
         }                                                        \
+      (sizevar) = do_realloc_newsize;                            \
     }                                                            \
 } while (0)
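Reading the diff, the old macro appears to have updated sizevar inside
the loop, so the later memcpy copied the new, larger size out of the
old, smaller buffer. The patched version computes the new size in a
local variable and assigns sizevar only after the copy. The size
computation on its own can be sketched as a plain function; grown_size
is a hypothetical name used for illustration and is not in wget.h.

```c
#include <assert.h>

/* Compute the new buffer size the way the patched macro does: start
   from the current size, double until it covers NEEDED, and never
   grow to anything smaller than 16.  Returns CURRENT unchanged when
   no growth is required.  Overflow of long is not handled here.  */
static long
grown_size (long current, long needed)
{
  long newsize = current;

  assert (current >= 0 && needed >= 0);
  while (newsize < needed)
    {
      newsize <<= 1;
      if (newsize < 16)
        newsize = 16;
    }
  return newsize;
}
```

A caller would compare the result with the current size, reallocate or
copy only when the two differ, copy exactly the old number of bytes,
and store the new size last. That ordering is the one the patch
establishes.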
Re: xhtml? Re: Standalone html parser
Anees Shaikh [EMAIL PROTECTED] writes:

> Hrvoje, you mentioned that you planned to modify the parser to
> handle these tags. Any ideas on timetable?

How about now? :-)

I have created a simple patch that deals with this. However,
preliminary testing indicated a problem with the semantics. Normally,
Wget allows this:

  <img src=/foo/bar.gif>

With this patch, this causes the tag to back out (i.e. be proclaimed
invalid) because it expects '/' to be followed by '>'.

Of course, I could make the attribute value swallow everything until
whitespace or '>' and *then* look for a '/', but what about this:

  <img src=foo/>

Is the '/' a part of `foo' or does it close the tag? Is the whitespace
preceding the closing '/' mandatory? If so, we can make it so that the
following all differ:

  <img src=foo/>      # <img src="foo/">
  <img src=foo />     # <img src="foo"></img>
  <img src=foo/ />    # <img src="foo/"></img>

Any opinions? Advice? Standards-lawyer-speak?