Re: Standalone html parser

2001-06-29 Thread Hrvoje Niksic

Anees Shaikh [EMAIL PROTECTED] writes:

 I'm trying to use the code in html-parse.c (v1.7) in standalone mode

Excellent!

 For some reason, <img src=...> tags are recognized but then skipped
 almost every time they are encountered.  When using the full program
 and recursive retrieve, the images are in fact retrieved so it seems
 that the parser does work correctly when not in standalone mode.
 
 It seems that the following condition is met when parsing img
 tag attributes
 
   /* Establish bounds of attribute name. */
   attr_name_begin = p;  /* <foo bar ...> */
                         /*      ^        */
   while (NAME_CHAR_P (*p))
     ADVANCE (p);
   attr_name_end = p;    /* <foo bar ...> */
                         /*         ^     */
   if (attr_name_begin == attr_name_end)
     goto backout_tag;
 
 Can someone shed some light on this?

For some reason, the parser does not advance past the attribute name.
Try going into the debugger and printing the value of P.  You should
find out why the parser refuses to advance beyond attr_name_begin.

Perhaps it thinks it has reached the end of file?  (Are you calling it
with the proper text length?)  Perhaps the text is corrupted due to
another bug in your program and the attribute name is invalid?  A
number of things could be wrong.

When I wrote the parser, I primarily tested it in standalone mode,
so it should work thus.



Re: Q: (problem) wget on dos/win: question marks in url

2001-06-29 Thread Rick Palazola


Did you try putting quotes around the URL?  Don't have time to test at the
moment, but my memory is that I used quoted URL in (this or another) DOS
app and it did work.

Rick

On Fri, 29 Jun 2001, Reto Kohli wrote:

 hello list,
 
 i ran across a problem today which is so
 obvious that i'm wondering why there is
 no built-in workaround (yet?) in the windows
 port..
 (i am not subscribed to this list, so please
 flame me at [EMAIL PROTECTED]; thanks! ;)
 
 consider the following wget call (and think dos:)
 
   wget http://mydomain.org/index.html?foo=bar
 
 well? -- dos (and your average windows, too)
 does of course not allow you to write files
 with question marks in their filenames!
 so, wget will complain
 
   Cannot write to 'mydomain.org/index.html?foo=bar',
 
 which is sadly true, and i will never get those
 wonderful pages.. (b.g. die die die! ;)
 
 -- can this be circumdone without recompiling
 wget? -- any suggestions?
 
 thank you very much and sorry if this
 has already been asked 1e99 times.
 
 ** please cc: any reply to [EMAIL PROTECTED] **
 
 gimi
 
 --
 3d animation and video postproduction
 mailto:[EMAIL PROTECTED]
 http://www.psico.ch/
 
 

Rick Palazola
Scientific Software Engineer
Mutant Mouse Informatics Coordinating Center
Project Site: http://www.jax.org/mmrc/icc/index.html
email: [EMAIL PROTECTED]
phone: 207-288-6440
The Jackson Laboratory
600 Main Street
Bar Harbor, ME  04609-1500




Re: Q: (problem) wget on dos/win: question marks in url

2001-06-29 Thread Ian Abbott

On 29 Jun 2001, at 8:13, Rick Palazola [EMAIL PROTECTED] wrote:
 On Fri, 29 Jun 2001, Reto Kohli wrote:
  consider the following wget call (and think dos:)
  
    wget http://mydomain.org/index.html?foo=bar
  
  well? -- dos (and your average windows, too)
  does of course not allow you to write files
  with question marks in their filenames!
  so, wget will complain
  
    Cannot write to 'mydomain.org/index.html?foo=bar',
  
  which is sadly true, and i will never get those
  wonderful pages.. (b.g. die die die! ;)
 
 Did you try putting quotes around the URL?  Don't have time to test at the
 moment, but my memory is that I used quoted URL in (this or another) DOS
 app and it did work.

The following characters are not legal in DOS filenames, even long 
(VFAT) filenames:

\ / : * ? " < > |
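
Since quoting the URL on the command line does not change the name of the file that gets written, one possible workaround is to rewrite the local file name before opening it. The following is only a minimal sketch of that idea, not something Wget is known to do: the helper name sanitize_dos_filename and the choice of replacement character are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Characters that DOS/Windows reject in a file name component. */
static const char dos_illegal_chars[] = "\\/:*?\"<>|";

/* Hypothetical helper (not part of Wget): replace every illegal
   character in NAME, in place, with REPLACEMENT.  NAME is treated as a
   single path component, so '/' and '\\' are replaced as well.
   Returns the number of characters that were changed. */
size_t
sanitize_dos_filename (char *name, char replacement)
{
  size_t changed = 0;

  assert (name != NULL);
  /* The replacement itself must be a legal, non-NUL character. */
  assert (strchr (dos_illegal_chars, replacement) == NULL);

  for (; *name; name++)
    if (strchr (dos_illegal_chars, *name))
      {
        *name = replacement;
        changed++;
      }
  return changed;
}
```

Applied to "index.html?foo=bar" with '_' as the replacement, this would yield "index.html_foo=bar". Two different URLs can map to the same sanitized name, so a real implementation would also need to deal with collisions.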





Re: Standalone html parser

2001-06-29 Thread Anees Shaikh


So I think the problem is with malformed img tags.  The parser fails
if the tag is of this form:

   <img src=/library/homepage/images/curve.gif alt="" border=0 />

Note the end of the tag is closed with /> instead of just > as in
the spec.  When the parser finds the /, it sets attr_name_begin to
the / and then attr_name_end gets set to the same thing, so the
tag is backed out.

If I edit the html file to change the tag to:

   <img src=/library/homepage/images/curve.gif alt="" border=0>

it is recognized correctly.
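
To make the failure concrete, here is a small self-contained sketch of the attribute-name scan. It is not the code from html-parse.c: name_char_p is a simplified stand-in for NAME_CHAR_P (its exact character set is an assumption), and attr_name_length is a hypothetical helper. It also takes an explicit end pointer, since a wrong text length is another way to end up with an empty name.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Simplified stand-in for the parser's NAME_CHAR_P test.  This is an
   assumption, not Wget's definition: letters, digits and a few
   punctuation characters may appear in a name; '/' and '>' may not. */
static int
name_char_p (unsigned char c)
{
  return isalnum (c) || c == '-' || c == '_' || c == '.' || c == ':';
}

/* Return the length of the attribute name starting at P, never reading
   at or past END.  A result of 0 corresponds to the
   attr_name_begin == attr_name_end case that jumps to backout_tag. */
size_t
attr_name_length (const char *p, const char *end)
{
  const char *begin = p;

  assert (p != NULL && end != NULL && p <= end);
  while (p < end && name_char_p ((unsigned char) *p))
    p++;
  return (size_t) (p - begin);
}
```

With the text "border=0 />" the scan stops at '=' and reports a six-character name. With "/>" it stops immediately and reports zero, which is the empty-name condition described above.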

Unfortunately in this case the parser also seg faults in the call to
strlen() in the array_allowed() function.  I haven't looked closely at
this yet but it only shows up when a follow_tags list is passed to
map_html_tags.  That is, if you use NULL pointers for follow_tags and
follow_attrs, there is no seg fault.  I was asking the parser to just
tell me about img tags.

This problem with img tags seems to be quite common (redhat.com,
ibm.com, microsoft.com) maybe due to some authoring tools.

Thanks.


-- Anees








Re: Standalone html parser

2001-06-29 Thread Hrvoje Niksic

Anees Shaikh [EMAIL PROTECTED] writes:

 So I think the problem is with malformed img tags.  The parser fails
 if the tag is of this form:
 
    <img src=/library/homepage/images/curve.gif alt="" border=0 />
[...]
 This problem with img tags seems to be quite common (redhat.com,
 ibm.com, microsoft.com) maybe due to some authoring tools.

That's supposed to be legal XML and some people are using it for
XHTML compliance -- the final / says that the tag is closed
immediately.  I plan to fix the parser to not barf on it.

 Note the end of the tag is closed with /> instead of just > as
 in the spec.  When the parser finds the /, it sets
 attr_name_begin to the / and then attr_name_end gets set to the
 same thing.

Yes.  If it weren't for the XML novelties, it would be a feature.

 Unfortunately in this case the parser also seg faults in the call to
 strlen() in the array_allowed() function.

Be careful that you're correctly consing up the arrays (they have to
be NULL-terminated) and that your stack isn't corrupted or something
like that.
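
As an illustration of why the terminating NULL matters, here is a minimal sketch of a lookup over such an array. It is only modelled on the idea behind array_allowed(), not copied from Wget; the name name_allowed and its exact signature are assumptions, and the comparison is case-sensitive to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup over a NULL-terminated array of names.
   A NULL ARRAY means "no filter", so every name is allowed.
   NAME need not be NUL-terminated; LEN gives its length. */
int
name_allowed (const char *const *array, const char *name, size_t len)
{
  assert (name != NULL);
  if (array == NULL)
    return 1;

  /* The loop stops at the NULL sentinel.  Without it, the loop would
     run off the end of the array and hand a garbage pointer to
     strlen(), which is one way to get the kind of crash described. */
  for (; *array != NULL; array++)
    if (strlen (*array) == len && memcmp (*array, name, len) == 0)
      return 1;
  return 0;
}
```

A caller interested only in img tags would declare the list as { "img", NULL }. Leaving out the final NULL compiles cleanly but makes the loop read past the end of the array.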

Also, I've recently fixed an important bug in DO_REALLOC_FROM_ALLOCA:

Index: wget.h
===
RCS file: /pack/anoncvs/wget/src/wget.h,v
retrieving revision 1.23
retrieving revision 1.25
diff -u -r1.23 -r1.25
--- wget.h  2001/05/27 19:35:15 1.23
+++ wget.h  2001/06/26 09:48:51 1.25
@@ -231,24 +231,24 @@
 {  \
   /* Avoid side-effectualness.  */  \
   long do_realloc_needed_size = (needed_size);  \
-  long do_realloc_newsize = 0;  \
-  while ((sizevar) < (do_realloc_needed_size)) {  \
-    do_realloc_newsize = 2*(sizevar);  \
+  long do_realloc_newsize = (sizevar);  \
+  while (do_realloc_newsize < do_realloc_needed_size) {  \
+    do_realloc_newsize <<= 1;  \
     if (do_realloc_newsize < 16)  \
       do_realloc_newsize = 16;  \
-    (sizevar) = do_realloc_newsize;  \
   }  \
-  if (do_realloc_newsize)  \
+  if (do_realloc_newsize != (sizevar))  \
     {  \
       if (!allocap)  \
         XREALLOC_ARRAY (basevar, type, do_realloc_newsize);  \
       else  \
         {  \
           void *drfa_new_basevar = xmalloc (do_realloc_newsize);  \
-          memcpy (drfa_new_basevar, basevar, sizevar);  \
+          memcpy (drfa_new_basevar, basevar, (sizevar));  \
           (basevar) = drfa_new_basevar;  \
           allocap = 0;  \
         }  \
+      (sizevar) = do_realloc_newsize;  \
     }  \
 } while (0)
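
Reading the diff, the old macro updated sizevar inside the loop, so the later memcpy used the new, larger size when copying out of the old buffer. The new version computes the size in a local variable and assigns sizevar only after the copy. The growth policy itself can be sketched as a plain function; grown_size is a hypothetical name used for illustration and is not part of wget.h.

```c
#include <assert.h>

/* Growth policy of the corrected macro, written as a function for
   clarity.  Starting from the current size, double until NEEDED fits,
   never going below 16.  Sizes are assumed to be non-negative and
   small enough that doubling does not overflow. */
long
grown_size (long current, long needed)
{
  long newsize = current;

  assert (current >= 0 && needed >= 0);
  while (newsize < needed)
    {
      newsize <<= 1;
      if (newsize < 16)
        newsize = 16;
    }
  return newsize;
}
```

When the result equals the current size nothing has to be reallocated, which matches the do_realloc_newsize != (sizevar) test in the patch.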
 



Re: xhtml? Re: Standalone html parser

2001-06-29 Thread Hrvoje Niksic

Anees Shaikh [EMAIL PROTECTED] writes:

 Hrvoje, you mentioned that you planned to modify the parser to
 handle these tags.  Any ideas on timetable?

How about now?  :-)

I have created a simple patch that deals with this.  However,
preliminary testing indicated a problem with the semantics.
Normally, Wget allows this:

<img src=/foo/bar.gif>

With this patch, this causes the tag to back out (i.e. be proclaimed
invalid) because it expects '/' to be followed by '>'.  Of course, I
could make the attribute value swallow everything until whitespace or
'>' and *then* look for a '/', but what about this:

<img src=foo/>

Is the '/' a part of `foo' or does it close the tag?

Is the whitespace preceding the closing '/' mandatory?  If so, we can
make it so that the following all differ:

<img src=foo/>     # <img src="foo/">

<img src=foo />    # <img src="foo"></img>

<img src=foo/ />   # <img src="foo/"></img>

Any opinions?  Advice?  Standards-lawyer-speak?
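
For discussion, here is a minimal sketch of the "whitespace is mandatory" rule. It is not the patch mentioned above: the struct and function names are hypothetical, the input is NUL-terminated for brevity, and the value is assumed to be the last attribute in the tag.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Result of scanning an unquoted attribute value (hypothetical type). */
struct unquoted_value
{
  size_t length;    /* number of characters in the value */
  int self_closed;  /* 1 if the value is followed by whitespace and "/>" */
};

/* Scan an unquoted value starting at S.  The value swallows everything
   up to whitespace or '>'.  A '/' closes the tag only when whitespace
   separates it from the value; otherwise it belongs to the value. */
struct unquoted_value
scan_unquoted_value (const char *s)
{
  struct unquoted_value r = { 0, 0 };
  const char *p = s;

  assert (s != NULL);
  while (*p && *p != '>' && !isspace ((unsigned char) *p))
    p++;
  r.length = (size_t) (p - s);

  /* Skip the whitespace that must precede a closing '/'. */
  while (isspace ((unsigned char) *p))
    p++;
  if (*p == '/' && p[1] == '>')
    r.self_closed = 1;
  return r;
}
```

Under this rule "foo/>" gives the value "foo/" in an open tag, "foo />" gives "foo" in a closed tag, and "foo/ />" gives "foo/" in a closed tag, so the three cases stay distinct and "/foo/bar.gif>" keeps working.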