Matching zero only once using RE
Hi,

I've looked at a lot of pages on the net and still can't seem to nail this. Would someone more knowledgeable in regular expressions please point out what I'm doing wrong?

I am trying to see if a web page contains the exact text:

  You have found 0 matches

But instead I seem to be matching all sorts of unexpected "You have found ... matches" lines, for example:

  You have found 34 matches
  You have found 189 matches
  You have found 16,734 matches
  You have found 1,706 matches
  You have found 300 matches

The last two I thought I had eliminated, but sadly it seems not; the examples above actually match my expression below. :(

Here is what I'm doing:

zeromatch = []
SecondarySearchTerm = 'You found (0){1} matches'
primarySearchTerm = 'Looking for Something'
primarySearchTerm2 = 'has been an error connecting'

# pagetext is all the body text on a web page.
# I'm using COM to drive MSIE, and pagetext = doc.body.outerText

if (re.search(primarySearchTerm, pagetext) or
        re.search(primarySearchTerm2, pagetext)):
    failedlinks.append(link)
elif re.search(SecondarySearchTerm, pagetext):
    zeromatch.append(link)

I've tried other REs but had even more spectacular failures. Any help would be greatly appreciated.

Thanks in advance,
Greg Moore
Software Test Shop.com
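A minimal sketch of one way to pin the test down, assuming the page really does contain the literal phrase "You have found 0 matches": spell out the full phrase (the posted pattern drops the word "have") and put a word boundary after the 0 so it cannot match part of a larger number such as 300 or 16,734. The sample pages below are made up for illustration.

import re

# made-up sample pages; only the first should count as a zero-result page
pages = [
    'head ... You have found 0 matches ... foot',
    'head ... You have found 300 matches ... foot',
    'head ... You have found 16,734 matches ... foot',
]

# full phrase including "have"; \b keeps the 0 from matching inside 300, 16,734, etc.
zero_pattern = re.compile(r'You have found\s+0\b\s+matches')

for pagetext in pages:
    print zero_pattern.search(pagetext) is not None, pagetext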
Re: Matching zero only once using RE
Hi, thanks to all of you. Mike, I like your committee idea; where can I join? lol

Greg.
Oh what a twisted thread we weave....
Hi,

First off, I'm not using anything from Twisted. I just liked the subject line. :)

The folks of this list have been most helpful before, and I'm hoping that you'll take pity on the dazed and confused. I've read stuff on this group and various websites and books until my head is spinning... Here is a brief summary of what I'm trying to do, with an example below.

I have the code below in a single-threaded version and use it to test a list of roughly 6000 URLs to ensure that they "work". If they fail, I track the kind of failures and then generate a report. Currently it takes about 7-9 hours to run through the entire list. I basically create a list from a file containing a list of URLs and then iterate over the list, checking each page as I go. I get all sorts of flack because it takes so long, so I thought I could speed it up by using a Queue and X number of threads. Seems easier said than done.

However, in my test below I can't even get it to catch a single error in my if statement in the run() function. I'm stumped as to why. Any help would be greatly appreciated, and, if you're so inclined, pointers on how to limit the work to a given number of threads. Thank you in advance! I really do appreciate it.

Here is what I have so far... Yes, there are some things that are unused from previous tests. Oh, and to give proper credit, this is based on some code from
http://starship.python.net/crew/aahz/OSCON2000/SCRIPT2.HTM

import threading, Queue
from time import sleep, time
import urllib2
import formatter
import string

#toscan = Queue.Queue
#scanned = Queue.Queue
#workQueue = Queue.Queue()

MAX_THREADS = 10
timeout = 90       # sets timeout for urllib2.urlopen()
failedlinks = []   # list for failed urls
zeromatch = []     # list for 0 result searches
t = 0              # used to store starting time for getting a page.
pagetime = 0       # time it took to load page
slowestpage = 0    # slowest page time
fastestpage = 10   # fastest page time
cumulative = 0     # total time to load all pages (used to calc. avg)

ST_zeroMatch = 'You found 0 products'
ST_zeroMatch2 = 'There are no products matching your selection'

class Retriever(threading.Thread):
    def __init__(self, URL):
        self.done = 0
        self.URL = URL
        self.urlObj = ''
        self.ST_zeroMatch = ST_zeroMatch
        print '__init__:self.URL', self.URL
        threading.Thread.__init__(self)

    def run(self):
        print 'In run()'
        print "Retrieving:", self.URL
        #self.page = urllib.urlopen(self.URL)
        #self.body = self.page.read()
        #self.page.close()
        self.t = time()
        self.urlObj = urllib2.urlopen(self.URL)
        self.pagetime = time() - t
        self.webpg = self.urlObj.read()
        print 'Retriever.run: before if'
        print 'matching', self.ST_zeroMatch
        print ST_zeroMatch
        # why does this always drop through even though the if should be true?
        if (ST_zeroMatch or ST_zeroMatch2) in self.webpg:
            # I don't think I want to use self.zeromatch, do I?
            print '** Found zeromatch'
            zeromatch.append(url)
        #self.parse()
        print 'Retriever.run: past if'
        print 'exiting run()'
        self.done = 1

# the last 2 Shop.com URLs should trigger the zeromatch condition
sites = ['http://www.foo.com/',
         'http://www.shop.com',
         'http://www.shop.com/op/aprod-~zzsome+thing',
         'http://www.shop.com/op/aprod-~xyzzy'
         #'http://www.yahoo.com/ThisPageDoesntExist'
        ]

threadList = []
URLs = []
workQueue = Queue.Queue()

for item in sites:
    workQueue.put(item)

print workQueue
print
print 'b4 test in sites'
for test in sites:
    retriever = Retriever(test)
    retriever.start()
    threadList.append(retriever)

print 'threadList:'
print threadList
print 'past for test in sites:'

while threading.activeCount() > 1:
    print 'Zzz...'
    sleep(1)

print 'entering retriever for loop'
for retriever in threadList:
    #URLs.extend(retriever.run())
    retriever.run()

print 'zeromatch:', zeromatch
# even though there are two URLs that should be here, nothing ever gets appended to the list.
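Not the thread's accepted answer, just a minimal sketch of the bounded worker-pool pattern the post is reaching for, written against the same Python 2-era Queue/threading/urllib2 modules the code above uses: a fixed number of worker threads pull URLs from a shared Queue, so the thread count stays capped at MAX_THREADS no matter how many URLs are queued. The URL list and search string are placeholders.

import threading, Queue, urllib2

MAX_THREADS = 10
ST_zeroMatch = 'You found 0 products'

work_queue = Queue.Queue()
zeromatch = []                     # shared result list
zeromatch_lock = threading.Lock()  # guards appends from multiple threads

def worker():
    while True:
        try:
            url = work_queue.get_nowait()  # next URL, or stop when the queue is empty
        except Queue.Empty:
            return
        try:
            webpg = urllib2.urlopen(url).read()
            if ST_zeroMatch in webpg:      # test each term separately, not (a or b) in webpg
                zeromatch_lock.acquire()
                try:
                    zeromatch.append(url)
                finally:
                    zeromatch_lock.release()
        except Exception, e:
            print 'failed:', url, e

# placeholder URLs -- substitute the real 6000-entry list here
for u in ['http://www.example.com/', 'http://www.example.com/no-such-page']:
    work_queue.put(u)

threads = [threading.Thread(target=worker) for i in range(MAX_THREADS)]
for th in threads:
    th.start()
for th in threads:
    th.join()  # wait for every worker to drain the queue

print 'zeromatch:', zeromatch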
Re: Oh what a twisted thread we weave....
Tom,

Thanks for the reply, and sorry for the delay in getting back to you. Thanks for pointing out my logic problem; I had added the 2nd part of the if statement at the last minute...

Yes, I have a single-threaded version. It's several hundred lines and uses COM to write the results out to an Excel spreadsheet. I was trying to better understand threading and queues before I started hacking on my current code... maybe that was a mistake... Hey, I'm still learning, and I learn a lot just by reading stuff posted to this group. I hope at some point I can help others in the same way.

Here are the relevant parts of the code (no COM stuff). Here is a summary:

# see if url exists
# if exists then
#     hit page
#     get text of page
#     see if text of page contains search terms
#     if it does then
#         update appropriate counters and lists
#     else update static line and do the next one
# when done with Links list
#     - calculate totals and times
#     - write info to xls file
# end.

# utils are functions and classes that I wrote
# from utils import PrintStatic, HttpExists2
#
# My version of 'easyExcel' with extensions and improvements.
#
import excelled
import urllib2
import time
import socket
import os
#import msvcrt  # for printstatic
from datetime import datetime
import pythoncom
from sys import exc_info, stdout, argv, exit

# search terms to use for matching.
#primarySearchTerm = 'Narrow your'
ST_lookingFor = 'Looking for Something'
ST_errorConnecting = 'there has been an error connecting'
ST_zeroMatch = 'You found 0 products'
ST_zeroMatch2 = 'There are no products matching your selection'

# initialize globals
timeout = 90       # sets timeout for urllib2.urlopen()
failedlinks = []   # list for failed urls
zeromatch = []     # list for 0 result searches
pseudo404 = []     # list for shop.com 404 pages
t = 0              # used to store starting time for getting a page.
count = 0          # number of tests so far
pagetime = 0       # time it took to load page
slowestpage = 0    # slowest page time
fastestpage = 10   # fastest page time
cumulative = 0     # total time to load all pages (used to calc. avg)

# version number of the program
version = 'B2.9'

def ShopCom404(testUrl):
    """
    checks url for the shop.com 404 url
    shop.com 404 url -- returns status 200
    http://www.shop.com/amos/cc/main/404/ccsyn/260
    """
    if '404' in testUrl:
        return True
    else:
        return False

#
# main program
#
try:
    links = open(testfile).readlines()
except:
    exc, err, tb = exc_info()
    print 'There is a problem with the file you specified. Check the file and re-run the program.\n'
    #print str(exc)
    print str(err)
    print
    exit()

# timeout in seconds
socket.setdefaulttimeout(timeout)

totalNumberTests = len(links)
print 'URLCheck ' + version + ' by Greg Moore (c) 2005 Shop.com\n\n'

# asctime() returns a human readable time stamp whereas time() doesn't
startTimeStr = time.asctime()
start = datetime.today()

for url in links:
    count = count + 1
    # HttpExists2 - checks to see if URL exists and detects redirection.
    # Handles 404's and exceptions better. Returns a tuple depending on results:
    # if found: true and final url. if not found: false and attempted url
    pgChk = HttpExists2(url)
    if pgChk[0] == False:
        # failed url exists check
        failedlinks.append(pgChk[1])
    elif ShopCom404(pgChk[1]):
        # our version of a 404
        pseudo404.append(url)
    if pgChk[0] and not ShopCom404(url):
        # if valid page and not a 404, then get the page and check it.
        try:
            t = time.time()
            urlObj = urllib2.urlopen(url)
            pagetime = time.time() - t
            webpg = urlObj.read()
            if (ST_zeroMatch in webpg) or (ST_zeroMatch2 in webpg):
                zeromatch.append(url)
            elif ST_errorConnecting in webpg:
                # for some reason we got the error page
                # so add it to the failed urls
                failmsg = 'Error Connecting Page with: ' + url
                failedlinks.append(failmsg)
        except:
            print 'exception with: ' + url
    # figure page times
    cumulative += pagetime
    if pagetime > slowestpage:
        slowestpage = pagetime, url.strip()
    elif pagetime < fastestpage:
        fastestpage = pagetime, url.strip()
    msg = 'testing ' + str(count) + ' of ' + str(totalNumberTests) + \
          '. Currnet runt
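For anyone skimming the thread, the logic problem referred to above is the earlier post's (ST_zeroMatch or ST_zeroMatch2) in self.webpg test. A quick illustration (not code from the original program) of why that form only ever checks the first term, and two common ways to write it:

ST_zeroMatch = 'You found 0 products'
ST_zeroMatch2 = 'There are no products matching your selection'
webpg = '... There are no products matching your selection ...'

# (a or b) evaluates to a when a is a non-empty string, so this only
# tests the first term and misses pages containing the second one.
print (ST_zeroMatch or ST_zeroMatch2) in webpg          # False

# test each term explicitly ...
print ST_zeroMatch in webpg or ST_zeroMatch2 in webpg   # True

# ... or use any() over a tuple of terms
print any(term in webpg for term in (ST_zeroMatch, ST_zeroMatch2))  # True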
Calculating average time
Hi,

I'm hoping that someone can point me in the right direction with this. What I would like to do is calculate the average time it takes to load a page. I've been searching the net and reading lots, but I haven't found anything that helps too much. I'm testing our web site and hitting 6000+ URLs per test. Here is a subset of what I'm doing:

import re
import IEC  # IE controller from http://www.mayukhbose.com/python/IEC/index.php
from win32com.client import Dispatch
import time
import datetime
from sys import exc_info, stdout, argv, exit

failedlinks = []
links = open(testfile).readlines()
totalNumberTests = len(links)
ie = IEC.IEController()
start = datetime.datetime.today()
# asctime() returns a human readable time stamp whereas time() doesn't
startTimeStr = time.asctime()

for link in links:
    start = datetime.datetime.today()
    ie.Navigate(link)
    end = datetime.datetime.today()
    pagetext = ie.GetDocumentText()
    # check the returned web page for some things
    if not re.search(searchterm, pagetext):
        failedlinks.append(link)

ie.CloseWindow()
finished = datetime.datetime.today()
finishedTimeStr = time.asctime()
# then I print out results, times, etc.

So:

1. Is there a better time function to use?
2. To calculate the average times, do I need to split up min, sec, and msec and then just do a standard average calculation, or is there a better way?
3. Is there a more efficient way to do this?
4. Kind of OT, but is there any control like this for Mozilla or Firefox?

This is not intended to be any sort of load tester, just a URL validation and page check.

Thanks in advance.
Greg.
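A small sketch of just the timing arithmetic, independent of the IEC controller: time.time() returns seconds as a float, so there is no need to split minutes, seconds, and milliseconds apart; keep a running total and divide by the number of pages. navigate_stub below is a stand-in for the real ie.Navigate() call.

import time

def navigate_stub(link):
    # stand-in for ie.Navigate(link); the sleep simulates a page load
    time.sleep(0.1)

links = ['http://www.example.com/a', 'http://www.example.com/b']

total = 0.0
slowest = (0.0, None)
for link in links:
    t0 = time.time()            # seconds since the epoch, as a float
    navigate_stub(link)
    elapsed = time.time() - t0  # page-load time in fractional seconds
    total += elapsed
    if elapsed > slowest[0]:
        slowest = (elapsed, link)

average = total / len(links)
print 'average: %.3f sec, slowest: %.3f sec (%s)' % (average, slowest[0], slowest[1])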
Re: Calculating average time
Thanks, Skip. As usual, I want to make it harder than it actually is.
Re: import Help Needed - Newbie
A search on Google for odbchelper resulted in:

http://linux.duke.edu/~mstenner/free-docs/diveintopython-3.9-1/py/odbchelper.py

I think this will help you.

Greg.