So it turns out the main bug does not affect amd64 (or more generally, roughly 
systems where (LONG_MAX>>2) >= INT_MAX).

There are at least two bugs in the tag comparison:
- Assumes number equality is the same as VALUE equality.
- Assumes hash equality is the same as string equality.

The first causes the bug: Small numbers are stored as (VALUE)((n<<1)|1); 
numbers not representable as such are stored as a bignum. If you're fortunate 
enough to be using amd64, the size of a VALUE (unsigned long) is sufficiently 
larger than the size of a hash (int) that you can always represent it as a 
"fixed" number and you never see this bug.

The second bug means sometimes <tag></othertag> will appear to be valid if tag 
and othertag have the same hash and both hashes are "fixable". Searching for 
hash collisions is not that difficult - the first example should work on amd64, 
the second works on x86:

  $ ruby -r hpricot -e "print Hpricot.XML('<afuaf></zhgaa>')"
  <afuaf></zhgaa></afuaf>

  $ ruby -r hpricot -e "print Hpricot.XML('<aaaayi></zkbaaa>')"
  <aaaayi></aaaayi>


The patch below never bothers with the hash comparison because you have to do 
the string comparison anyway. If you want to re-add the hash comparison as a 
possible optimization (for e.g. the HTML case), the better way is to replace 
INT2NUM with INT2FIX instead of doing a bignum comparison.  (hpricot_scan.rl is 
also patched, though you can just apply the same patch file to both.)

The other two bugs are still there:
- The "BogusETag" prints as </tag>, so it still produces potentially invalid 
XML. It should probably print something more sensible (<tag hpricot="bogus 
whatever"/>?).
- Invalid XML is still accepted in XML mode, while the correct thing to do is 
fail.


There's another oddity in hpricot_scan.rl/hpricot_scan.c:
- The definition of H_ELE(N) for a cBogusETag does H_ELE_SET(ele, H_ELE_ATTR, 
rb_str_new(raw, rawlen));. I'm not sure what this is for (ele.attr should be a 
dict, for a start).


Anyone feel like submitting this upstream?

--- hpricot_scan.c.0    2009-07-10 13:48:13.000000000 +0100
+++ hpricot_scan.c.3    2009-07-10 14:19:16.000000000 +0100
@@ -326,11 +326,11 @@
     // (see also: the search above for fixups)
     //
     name = INT2NUM(rb_str_hash(tag));
     while (e != S->doc)
     {
-      if (H_ELE_GET(e, H_ELE_HASH) == name)
+      if (rb_str_cmp(H_ELE_GET(e, H_ELE_TAG),tag) == 0)
       {
         match = e;
         break;
       }
 
--- hpricot_scan.rl.0   2009-07-10 14:43:44.000000000 +0100
+++ hpricot_scan.rl.3   2009-07-10 14:44:07.000000000 +0100
@@ -359,11 +359,11 @@
     // (see also: the search above for fixups)
     //
     name = INT2NUM(rb_str_hash(tag));
     while (e != S->doc)
     {
-      if (H_ELE_GET(e, H_ELE_HASH) == name)
+      if (rb_str_cmp(H_ELE_GET(e, H_ELE_TAG),tag) == 0)
       {
         match = e;
         break;
       }
 



-- 
To UNSUBSCRIBE, email to debian-bugs-rc-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Reply via email to