Reed L. O'Brien wrote:
> I am trying to write a script to search and replace a sizable chunk of
> text in about 460 html pages.
> It is an old form that uses a search engine no longer available.
>
> I am wondering if anyone has any advice on the best way to go about that.
> There is more than one layout placement for the form, but I would be
> happy to correct the last few by hand as more than 90% are the same.
>
> So my ideas have been,
> use regex to find <form>.*</form> and replace it with <form>newform</form>.
> Unfortunately there is more than just the search form. So this would just
> clobber all of them. So I could use <form>.*knownName of
> SearchButton.*</form> --> <form>newform</form>

If you are sure 'knownName of SearchButton' only occurs in the form you want to replace, this seems like a good option. Only use non-greedy matching:

<form>.*?knownName of SearchButton.*?</form>

Without the ? you will match from the start of the first form in the page to the end of the last form, as long as the search form is one of them. Two things to watch for: '.' does not match newlines unless you compile the pattern with re.DOTALL, and even the non-greedy version can start at an earlier <form> and run past its </form> if another form comes before the search form, so test it on a page that has more than one form.

> Or should I read each file in as a big string and break on the form
> tags, test the strings as necessary and operate on them if the conditions
> are met. Unfortunately I think there are wide variances in white
> space and line breaks. Even the order of the tags is inconsistent. So
> I think I am stuck with the first option...
>
> Unless there is some module or package I haven't found that works on
> html in just the way that I want. I found htmlXtract but it is for
> Python 1.5 and not immediately intuitive.

You might be able to find a module that will read the HTML into a structured form, work on that, and write it out again. Whether this is easy or practical depends a lot on how well-formed your HTML is, and how important it is to keep exactly the same form when you write it back out. You could take a look at ElementTidy, for example:
http://effbot.org/zone/element-tidylib.htm

But I think the regex solution sounds good.

Kent
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
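
As a rough illustration of the regex approach discussed above, here is a minimal Python sketch, not a tested solution for the actual pages. The marker oldSearchButton, the replacement markup in NEW_FORM, and the helper names replace_search_form and rewrite_tree are all hypothetical stand-ins. The sketch assumes the form tags may carry attributes and mixed case, and it swaps the plain .*? for a "tempered" version that refuses to step over a </form>, so a match cannot begin in an earlier form and spill into the search form.

```python
import re
from pathlib import Path

# Hypothetical values: substitute the real button name and the real new form.
MARKER = "oldSearchButton"
NEW_FORM = '<form action="/search">new form</form>'

# (?:(?!</form>).)*? consumes characters one at a time, but only while the
# text ahead is not "</form>", so the match stays inside a single form.
INSIDE_FORM = r"(?:(?!</form>).)*?"
SEARCH_FORM = re.compile(
    r"<form\b[^>]*>" + INSIDE_FORM + re.escape(MARKER) + INSIDE_FORM + r"</form>",
    re.DOTALL | re.IGNORECASE,  # '.' spans newlines; <FORM> matches too
)


def replace_search_form(html):
    """Return (new_html, count) with every form containing MARKER replaced."""
    # A function is used as the replacement so NEW_FORM is inserted literally.
    return SEARCH_FORM.subn(lambda match: NEW_FORM, html)


def rewrite_tree(root):
    """Rewrite *.html files under root; return the paths that had no match."""
    untouched = []
    for path in sorted(Path(root).rglob("*.html")):
        # latin-1 maps every byte to one character, so the rest of the
        # file is written back byte for byte (NEW_FORM must be latin-1 safe).
        html = path.read_bytes().decode("latin-1")
        new_html, count = replace_search_form(html)
        if count:
            path.write_bytes(new_html.encode("latin-1"))
        else:
            untouched.append(path)
    return untouched
```

The list returned by rewrite_tree names the pages where nothing matched, which would be the ones left to fix by hand. It would be prudent to run this on a copy of the site first.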