Matching zero only once using RE
Hi,

I've looked at a lot of pages on the net and still can't seem to nail this. Would someone more knowledgeable in regular expressions please point out what I'm doing wrong?

I am trying to see if a web page contains the exact text:

  You have found 0 matches

But instead I seem to be matching all sorts of unexpected "You have found ... matches" lines, for example:

  You have found 34 matches
  You have found 189 matches
  You have found 16,734 matches
  You have found 1,706 matches
  You have found 300 matches

The last two I thought I had eliminated, but sadly it seems not; the examples above actually match my expression below. :(

Here is what I'm doing:

zeromatch = []
SecondarySearchTerm = 'You found (0){1} matches'
primarySearchTerm = 'Looking for Something'
primarySearchTerm2 = 'has been an error connecting'

# pagetext is all the body text on a web page.
# I'm using COM to drive MSIE, and pagetext = doc.body.outerText

if (re.search(primarySearchTerm, pagetext) or
        re.search(primarySearchTerm2, pagetext)):
    failedlinks.append(link)
elif re.search(SecondarySearchTerm, pagetext):
    zeromatch.append(link)

I've tried other REs but had even more spectacular failures. Any help would be greatly appreciated.

Thanks in advance,
Greg Moore
Software Test Shop.com
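A minimal sketch of one way to pin the test down, assuming the page really does contain the literal phrase "You have found 0 matches": spell out the full phrase (the posted pattern drops the word "have") and put a word boundary after the 0 so it cannot match part of a larger number such as 300 or 16,734. The sample pages below are made up for illustration.

import re

# made-up sample pages; only the first should count as a zero-result page
pages = [
    'head ... You have found 0 matches ... foot',
    'head ... You have found 300 matches ... foot',
    'head ... You have found 16,734 matches ... foot',
]

# full phrase including "have"; \b keeps the 0 from matching inside 300, 16,734, etc.
zero_pattern = re.compile(r'You have found\s+0\b\s+matches')

for pagetext in pages:
    print zero_pattern.search(pagetext) is not None, pagetext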
Re: Matching zero only once using RE
Hi, thanks to all of you. Mike, I like your committee idea; where can I join? lol

Greg.
Oh what a twisted thread we weave....
Hi,

First off, I'm not using anything from Twisted. I just liked the subject line. :)

The folks of this list have been most helpful before, and I'm hoping that you'll take pity on the dazed and confused. I've read stuff on this group and various websites and books until my head is spinning... Here is a brief summary of what I'm trying to do, with an example below.

I have the code below in a single-threaded version and use it to test a list of roughly 6000 URLs to ensure that they "work". If they fail, I track the kind of failures and then generate a report. Currently it takes about 7-9 hours to run through the entire list. I basically create a list from a file containing a list of URLs and then iterate over the list, checking each page as I go. I get all sorts of flack because it takes so long, so I thought I could speed it up by using a Queue and X number of threads. Seems easier said than done.

However, in my test below I can't even get it to catch a single error in my if statement in the run() function. I'm stumped as to why. Any help would be greatly appreciated, and, if you're so inclined, pointers on how to limit the work to a given number of threads. Thank you in advance! I really do appreciate it.

Here is what I have so far... Yes, there are some things that are unused from previous tests. Oh, and to give proper credit, this is based on some code from
http://starship.python.net/crew/aahz/OSCON2000/SCRIPT2.HTM

import threading, Queue
from time import sleep, time
import urllib2
import formatter
import string

#toscan = Queue.Queue
#scanned = Queue.Queue
#workQueue = Queue.Queue()

MAX_THREADS = 10
timeout = 90       # sets timeout for urllib2.urlopen()
failedlinks = []   # list for failed urls
zeromatch = []     # list for 0 result searches
t = 0              # used to store starting time for getting a page.
pagetime = 0       # time it took to load page
slowestpage = 0    # slowest page time
fastestpage = 10   # fastest page time
cumulative = 0     # total time to load all pages (used to calc. avg)

ST_zeroMatch = 'You found 0 products'
ST_zeroMatch2 = 'There are no products matching your selection'

class Retriever(threading.Thread):
    def __init__(self, URL):
        self.done = 0
        self.URL = URL
        self.urlObj = ''
        self.ST_zeroMatch = ST_zeroMatch
        print '__init__:self.URL', self.URL
        threading.Thread.__init__(self)

    def run(self):
        print 'In run()'
        print "Retrieving:", self.URL
        #self.page = urllib.urlopen(self.URL)
        #self.body = self.page.read()
        #self.page.close()
        self.t = time()
        self.urlObj = urllib2.urlopen(self.URL)
        self.pagetime = time() - t
        self.webpg = self.urlObj.read()
        print 'Retriever.run: before if'
        print 'matching', self.ST_zeroMatch
        print ST_zeroMatch
        # why does this always drop through even though the if should be true?
        if (ST_zeroMatch or ST_zeroMatch2) in self.webpg:
            # I don't think I want to use self.zeromatch, do I?
            print '** Found zeromatch'
            zeromatch.append(url)
        #self.parse()
        print 'Retriever.run: past if'
        print 'exiting run()'
        self.done = 1

# the last 2 Shop.com URLs should trigger the zeromatch condition
sites = ['http://www.foo.com/',
         'http://www.shop.com',
         'http://www.shop.com/op/aprod-~zzsome+thing',
         'http://www.shop.com/op/aprod-~xyzzy'
         #'http://www.yahoo.com/ThisPageDoesntExist'
        ]

threadList = []
URLs = []
workQueue = Queue.Queue()

for item in sites:
    workQueue.put(item)

print workQueue
print
print 'b4 test in sites'
for test in sites:
    retriever = Retriever(test)
    retriever.start()
    threadList.append(retriever)

print 'threadList:'
print threadList
print 'past for test in sites:'

while threading.activeCount() > 1:
    print 'Zzz...'
    sleep(1)

print 'entering retriever for loop'
for retriever in threadList:
    #URLs.extend(retriever.run())
    retriever.run()

print 'zeromatch:', zeromatch
# even though there are two URLs that should be here, nothing ever gets appended to the list.
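Not the thread's accepted answer, just a minimal sketch of the bounded worker-pool pattern the post is reaching for, written against the same Python 2-era Queue/threading/urllib2 modules the code above uses: a fixed number of worker threads pull URLs from a shared Queue, so the thread count stays capped at MAX_THREADS no matter how many URLs are queued. The URL list and search string are placeholders.

import threading, Queue, urllib2

MAX_THREADS = 10
ST_zeroMatch = 'You found 0 products'

work_queue = Queue.Queue()
zeromatch = []                     # shared result list
zeromatch_lock = threading.Lock()  # guards appends from multiple threads

def worker():
    while True:
        try:
            url = work_queue.get_nowait()  # next URL, or stop when the queue is empty
        except Queue.Empty:
            return
        try:
            webpg = urllib2.urlopen(url).read()
            if ST_zeroMatch in webpg:      # test each term separately, not (a or b) in webpg
                zeromatch_lock.acquire()
                try:
                    zeromatch.append(url)
                finally:
                    zeromatch_lock.release()
        except Exception, e:
            print 'failed:', url, e

# placeholder URLs -- substitute the real 6000-entry list here
for u in ['http://www.example.com/', 'http://www.example.com/no-such-page']:
    work_queue.put(u)

threads = [threading.Thread(target=worker) for i in range(MAX_THREADS)]
for th in threads:
    th.start()
for th in threads:
    th.join()  # wait for every worker to drain the queue

print 'zeromatch:', zeromatch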
Re: Oh what a twisted thread we weave....
Tom,

Thanks for the reply, and sorry for the delay in getting back to you. Thanks for pointing out my logic problem; I had added the 2nd part of the if statement at the last minute...

Yes, I have a single-threaded version. It's several hundred lines and uses COM to write the results out to an Excel spreadsheet. I was trying to better understand threading and queues before I started hacking on my current code... maybe that was a mistake... Hey, I'm still learning, and I learn a lot just by reading stuff posted to this group. I hope at some point I can help others in the same way.

Here are the relevant parts of the code (no COM stuff). Here is a summary:

# see if url exists
# if exists then
#     hit page
#     get text of page
#     see if text of page contains search terms
#     if it does then
#         update appropriate counters and lists
#     else update static line and do the next one
# when done with Links list
#     - calculate totals and times
#     - write info to xls file
# end.

# utils are functions and classes that I wrote
# from utils import PrintStatic, HttpExists2
#
# My version of 'easyExcel' with extensions and improvements.
#
import excelled
import urllib2
import time
import socket
import os
#import msvcrt  # for printstatic
from datetime import datetime
import pythoncom
from sys import exc_info, stdout, argv, exit

# search terms to use for matching.
#primarySearchTerm = 'Narrow your'
ST_lookingFor = 'Looking for Something'
ST_errorConnecting = 'there has been an error connecting'
ST_zeroMatch = 'You found 0 products'
ST_zeroMatch2 = 'There are no products matching your selection'

# initialize globals
timeout = 90       # sets timeout for urllib2.urlopen()
failedlinks = []   # list for failed urls
zeromatch = []     # list for 0 result searches
pseudo404 = []     # list for shop.com 404 pages
t = 0              # used to store starting time for getting a page.
count = 0          # number of tests so far
pagetime = 0       # time it took to load page
slowestpage = 0    # slowest page time
fastestpage = 10   # fastest page time
cumulative = 0     # total time to load all pages (used to calc. avg)

# version number of the program
version = 'B2.9'

def ShopCom404(testUrl):
    """
    checks url for the shop.com 404 url
    shop.com 404 url -- returns status 200
    http://www.shop.com/amos/cc/main/404/ccsyn/260
    """
    if '404' in testUrl:
        return True
    else:
        return False

#
# main program
#
try:
    links = open(testfile).readlines()
except:
    exc, err, tb = exc_info()
    print 'There is a problem with the file you specified. Check the file and re-run the program.\n'
    #print str(exc)
    print str(err)
    print
    exit()

# timeout in seconds
socket.setdefaulttimeout(timeout)

totalNumberTests = len(links)
print 'URLCheck ' + version + ' by Greg Moore (c) 2005 Shop.com\n\n'

# asctime() returns a human readable time stamp whereas time() doesn't
startTimeStr = time.asctime()
start = datetime.today()

for url in links:
    count = count + 1
    # HttpExists2 - checks to see if URL exists and detects redirection.
    # Handles 404's and exceptions better. Returns a tuple depending on results:
    # if found: true and final url. if not found: false and attempted url
    pgChk = HttpExists2(url)
    if pgChk[0] == False:
        # failed url exists check
        failedlinks.append(pgChk[1])
    elif ShopCom404(pgChk[1]):
        # our version of a 404
        pseudo404.append(url)
    if pgChk[0] and not ShopCom404(url):
        # if valid page and not a 404, then get the page and check it.
        try:
            t = time.time()
            urlObj = urllib2.urlopen(url)
            pagetime = time.time() - t
            webpg = urlObj.read()
            if (ST_zeroMatch in webpg) or (ST_zeroMatch2 in webpg):
                zeromatch.append(url)
            elif ST_errorConnecting in webpg:
                # for some reason we got the error page
                # so add it to the failed urls
                failmsg = 'Error Connecting Page with: ' + url
                failedlinks.append(failmsg)
        except:
            print 'exception with: ' + url
    # figure page times
    cumulative += pagetime
    if pagetime > slowestpage:
        slowestpage = pagetime, url.strip()
    elif pagetime < fastestpage:
        fastestpage = pagetime, url.strip()
    msg = 'testing ' + str(count) + ' of ' + str(totalNumberTests) + \
          '. Currnet runt
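For anyone skimming the thread, the logic problem referred to above is the earlier post's (ST_zeroMatch or ST_zeroMatch2) in self.webpg test. A quick illustration (not code from the original program) of why that form only ever checks the first term, and two common ways to write it:

ST_zeroMatch = 'You found 0 products'
ST_zeroMatch2 = 'There are no products matching your selection'
webpg = '... There are no products matching your selection ...'

# (a or b) evaluates to a when a is a non-empty string, so this only
# tests the first term and misses pages containing the second one.
print (ST_zeroMatch or ST_zeroMatch2) in webpg          # False

# test each term explicitly ...
print ST_zeroMatch in webpg or ST_zeroMatch2 in webpg   # True

# ... or use any() over a tuple of terms
print any(term in webpg for term in (ST_zeroMatch, ST_zeroMatch2))  # True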
Calculating average time
Hi,

I'm hoping that someone can point me in the right direction with this. What I would like to do is calculate the average time it takes to load a page. I've been searching the net and reading lots, but I haven't found anything that helps too much. I'm testing our web site and hitting 6000+ URLs per test. Here is a subset of what I'm doing:

import re
import IEC  # IE controller from http://www.mayukhbose.com/python/IEC/index.php
from win32com.client import Dispatch
import time
import datetime
from sys import exc_info, stdout, argv, exit

failedlinks = []
links = open(testfile).readlines()
totalNumberTests = len(links)
ie = IEC.IEController()
start = datetime.datetime.today()
# asctime() returns a human readable time stamp whereas time() doesn't
startTimeStr = time.asctime()

for link in links:
    start = datetime.datetime.today()
    ie.Navigate(link)
    end = datetime.datetime.today()
    pagetext = ie.GetDocumentText()
    # check the returned web page for some things
    if not re.search(searchterm, pagetext):
        failedlinks.append(link)

ie.CloseWindow()
finished = datetime.datetime.today()
finishedTimeStr = time.asctime()
# then I print out results, times, etc.

So:

1. Is there a better time function to use?
2. To calculate the average times, do I need to split up min, sec, and msec and then just do a standard average calculation, or is there a better way?
3. Is there a more efficient way to do this?
4. Kind of OT, but is there any control like this for Mozilla or Firefox?

This is not intended to be any sort of load tester, just a URL validation and page check.

Thanks in advance.
Greg.
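A small sketch of just the timing arithmetic, independent of the IEC controller: time.time() returns seconds as a float, so there is no need to split minutes, seconds, and milliseconds apart; keep a running total and divide by the number of pages. navigate_stub below is a stand-in for the real ie.Navigate() call.

import time

def navigate_stub(link):
    # stand-in for ie.Navigate(link); the sleep simulates a page load
    time.sleep(0.1)

links = ['http://www.example.com/a', 'http://www.example.com/b']

total = 0.0
slowest = (0.0, None)
for link in links:
    t0 = time.time()            # seconds since the epoch, as a float
    navigate_stub(link)
    elapsed = time.time() - t0  # page-load time in fractional seconds
    total += elapsed
    if elapsed > slowest[0]:
        slowest = (elapsed, link)

average = total / len(links)
print 'average: %.3f sec, slowest: %.3f sec (%s)' % (average, slowest[0], slowest[1])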
Re: Calculating average time
Thanks, Skip. As usual, I want to make it harder than it actually is.
Re: import Help Needed - Newbie
A search on Google for odbchelper resulted in:

http://linux.duke.edu/~mstenner/free-docs/diveintopython-3.9-1/py/odbchelper.py

I think this will help you.

Greg.