Bob Ramsey wrote:

Thanks for the replies.  This appears to be the right answer:

where page_body regexp '.*<img .*>.*' and page_body not regexp '.*<img .* alt= .*>.*';

I don't think so. Here's why:

First, REGEXPs can match partial strings, so the .* is unnecessary at the beginning and end. Second, no space is required after alt=, so that should be "alt=.*" instead of "alt= .*", but neither of those is the big problem.

The regexp

  page_body REGEXP '<img .*>'

will match rows where  page_body has at least one <img..> tag.

The regexp

  page_body REGEXP '<img .* alt=.*>'

will match rows where page_body has at least one <img..> tag and at least one "alt=" somewhere afterward. So,

  page_body REGEXP '<img .*>' AND page_body NOT REGEXP '<img .* alt=.*>'

matches rows where page_body has at least one <img..> tag and *NO* "alt=" anywhere after the first "<img".

Do you see the problem?  Consider a row with the following in the page_body:

 'blah, blah <img src="image1.jpg"> blah, <img src="image2.jpg" alt="2">'

<img .* alt=.*> matches the second img tag, so this row is not returned.

Now consider

  'blah, blah <img src="math1.jpg"> alt=3, <img src="math2.jpg">'
              <img |-------.*-----| alt=|--------.*-----------|>

<img .* alt=.*> matches this row, too, so it also is not returned.

Now consider

  'blah, blah <img src="math1.jpg" alt=""> blah'

This also matches '<img .* alt=.*>', so it won't be returned.

From your description, I think you want to find these rows.

We could improve the regexp by insisting that the alt text be part of the img tag, like this:

  body REGEXP '<img [^>]* alt=".+"'

The '[^>]*' means "any number of characters which are not '>'". The '.+' requires that there be at least one character in the alt text.

Unfortunately, this still doesn't fix the basic problem. If a single img tag has alt text, you won't get the row no matter how many images in that row's page_body don't have alt text. If I understand you correctly, you want to locate all pages where any img tag is missing the alt text. I don't think that can easily be accomplished in mysql. You would have to make a list of all the forms an img tag without alt text could take, make a regexp for each one, and link them all together with ORs in your WHERE clause. Here's a start:

WHERE
s RLIKE '<img src="[^"]+">'
OR
s RLIKE '<img src="[^"]*" width="?[[:digit:]]+"? height="?[[:digit:]]+"?>'
OR
s RLIKE '<img src="[^"]*" height="?[[:digit:]]+"? width="?[[:digit:]]+"?>';

(replace s with page_body) This query will be a slow, full-table scan, which will only get slower as you add more regexps. Essentially, what we are doing here is trying to write SQL to have the database parse the stored HTML for validity. I'd suggest the db is the wrong tool, as it wasn't designed for that. Add in the fact that other things could be wrong besides missing alt text, and I think your best bet would be to extract the pages and parse them in an external app specifically designed to validate HTML.

Michael

bob

======================================================================
Bob Ramsey          SYSTEMS ADMINISTRATION AND SYSTEMS PROGRAMMING III
MA, Management of Information Systems 2004
MA, English Literature, 1992
ph:  1(319)335-9956                              187 Boyd Law Building
fax: 1(319)335-9019                  University of Iowa College of Law
mailto:[EMAIL PROTECTED]                Iowa City, IA 52242-1113
For Hardware and Software questions, call 5-9124
======================================================================


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]



  • ... Paul DuBois
    • ... Bob Ramsey
      • ... Chris Blackwell
        • ... Bob Ramsey
      • ... Santino
        • ... Gerald Taylor
      • ... SGreen
        • ... Bob Ramsey
          • ... Michael Stassen
    • ... Dave Shariff Yadallee - System Administrator a.k.a. The Root of the Problem

Reply via email to