So it turns out the main bug does not affect amd64 (or more generally, roughly systems where (LONG_MAX>>2) >= INT_MAX).
There are at least two bugs in the tag comparison: - Assumes number equality is the same as VALUE equality. - Assumes hash equality is the same as string equality. The first causes the bug: Small numbers are stored as (VALUE)((n<<1)|1); numbers not representable as such are stored as a bignum. If you're fortunate enough to be using amd64, the size of a VALUE (unsigned long) is sufficiently larger than the size of a hash (int) that you can always represent it as a "fixed" number and you never see this bug. The second bug means sometimes <tag></othertag> will appear to be valid if tag and othertag have the same hash and both hashes are "fixable". Searching for hash collisions is not that difficult - the first example should work on amd64, the second works on x86: $ ruby -r hpricot -e "print Hpricot.XML('<afuaf></zhgaa>')" <afuaf></zhgaa></afuaf> $ ruby -r hpricot -e "print Hpricot.XML('<aaaayi></zkbaaa>')" <aaaayi></aaaayi> The patch below never bothers with the hash comparison because you have to do the string comparison anyway. If you want to re-add the hash comparison as a possible optimization (for e.g. the HTML case), the better way is to replace INT2NUM with INT2FIX instead of doing a bignum comparison. (hpricot_scan.rl is also patched, though you can just apply the same patch file to both.) The other two bugs are still there: - The "BogusETag" prints as </tag>, so it still produces potentially invalid XML. It should probably print something more sensible (<tag hpricot="bogus whatever"/>?). - Invalid XML is still accepted in XML mode, while the correct thing to do is fail. There's another oddity in hpricot_scan.rl/hpricot_scan.c: - The definition of H_ELE(N) for a cBogusETag does H_ELE_SET(ele, H_ELE_ATTR, rb_str_new(raw, rawlen));. I'm not sure what this is for (ele.attr should be a dict, for a start). Anyone feel like submitting this upstream? --- hpricot_scan.c.0 2009-07-10 13:48:13.000000000 +0100 +++ hpricot_scan.c.3 2009-07-10 14:19:16.000000000 +0100 @@ -326,11 +326,11 @@ // (see also: the search above for fixups) // name = INT2NUM(rb_str_hash(tag)); while (e != S->doc) { - if (H_ELE_GET(e, H_ELE_HASH) == name) + if (rb_str_cmp(H_ELE_GET(e, H_ELE_TAG),tag) == 0) { match = e; break; } --- hpricot_scan.rl.0 2009-07-10 14:43:44.000000000 +0100 +++ hpricot_scan.rl.3 2009-07-10 14:44:07.000000000 +0100 @@ -359,11 +359,11 @@ // (see also: the search above for fixups) // name = INT2NUM(rb_str_hash(tag)); while (e != S->doc) { - if (H_ELE_GET(e, H_ELE_HASH) == name) + if (rb_str_cmp(H_ELE_GET(e, H_ELE_TAG),tag) == 0) { match = e; break; } -- To UNSUBSCRIBE, email to debian-bugs-rc-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org