On Mon, Oct 15, 2012 at 1:17 PM, Norman Khine <nor...@khine.net> wrote:
> i made an update: https://gist.github.com/3891927 which works based on
> some of the recommendations posted here.
>
> any suggestions for improvement?
I can't make any specific recommendations about using BeautifulSoup
since I haven't used it in a long time. Here are a few general
suggestions.

PEP 8 recommends using 4-space indents instead of tabs, if possible.

Move constant definitions out of loops. For example, there's no point
in repeatedly building the dict literal and assigning it to "headers"
for each page_no. Also, "headers" is never actually used; to send
custom headers you need to make a Request object:

http://docs.python.org/library/urllib2#urllib2.Request

Don't put unnecessary parentheses (round brackets) around assigned
expressions. For example, replace

    req = ('%s%s' % (origin_site, page_no))
    ...
    assoc_link = (tag.string)

with

    req = '%s%s' % (origin_site, page_no)
    ...
    assoc_link = tag.string

I'd also move the "%s" into origin_site:

    origin_site = ('http://typo3.nimes.fr/index.php?'
                   'id=annuaire_assos&theme=0&rech=&num_page=%s')

    req = origin_site % page_no

The parentheses around the string literal are necessary here because
it's split over two lines. (I use Gmail's webmail interface, which
wraps lines at 69 columns -- sometimes; it tries to be clever.)

Use "continue" to avoid unnecessary nesting:

    for page_no in pages:
        ...
        try:
            doc = urllib2.urlopen(req)
        except urllib2.URLError, e:
            continue
        soup = BeautifulSoup.BeautifulSoup(doc)
        for row in soup.findAll('tr', {'class': 'menu2'}):
            assoc_data = []
            ...
            for i in soup.findAll('a', {'href': '#'}):
                if 'associations' not in i.attrMap['onclick']:
                    continue
                req = i.attrMap['onclick'].split("'")[1]
                try:
                    doc = urllib2.urlopen(req)
                except urllib2.URLError, e:
                    continue
                soup = BeautifulSoup.BeautifulSoup(doc)

Don't test "if emails != []". Use "if emails" instead: a non-empty
list is always true, and an empty list is always false.

Finally, in several places you create an empty list and populate it
with a for loop. For simple cases like these, comprehensions and
generator expressions execute faster, yet they're still easy to write
and easy to understand.
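To make a couple of those points concrete, here's a minimal, self-contained sketch (names adapted from the snippets above, data made up) of hoisting the URL template out of the loop, formatting it with %, and truth-testing a list directly instead of comparing against []:

```python
# Constant defined once, outside any loop; the parentheses are only
# there because the literal is split over two lines.
origin_site = ('http://typo3.nimes.fr/index.php?'
               'id=annuaire_assos&theme=0&rech=&num_page=%s')

def page_urls(pages):
    """Build one request URL per page number."""
    return [origin_site % page_no for page_no in pages]

urls = page_urls([0, 1, 2])

emails = []
if emails:  # empty list -> false, so this branch is skipped
    raise AssertionError('not reached for an empty list')
```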
For example, you can replace the following:

    assoc_data = []
    for assoc_theme in soup.findAll('u'):
        assoc_data.append(assoc_theme.renderContents())
    for assoc_name in soup.findAll('td', {'width': '70%'}):
        assoc_data.append(assoc_name.renderContents())

with something like this:

    assoc_data = [assoc_theme.renderContents()
                  for assoc_theme in soup.findAll('u')]
    assoc_data.extend(assoc_name.renderContents()
                      for assoc_name in soup.findAll('td',
                                                     {'width': '70%'}))

At least to me this is just as readable as the original, and the
generated code is more efficient. If, however, BeautifulSoup is the
limiting factor here, the efficiency gain will be chump change in the
big picture. Still, in a simple case like this it's hard to argue
against using a comprehension or generator expression. In cases that
use complex expressions for the value and conditions, I think it makes
more sense to use a for loop, which is easier to read and debug.

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
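If you want to convince yourself the two forms are equivalent, here's a toy version you can run as-is, with plain lists standing in for the BeautifulSoup result sets (the data is made up for illustration):

```python
# Stand-ins for soup.findAll(...) results.
themes = ['sport', 'culture']
names = ['Club A', 'Club B']

# Loop version: create an empty list, append in two for loops.
assoc_data = []
for assoc_theme in themes:
    assoc_data.append(assoc_theme.upper())
for assoc_name in names:
    assoc_data.append(assoc_name.upper())

# Comprehension + extend version: same result, less ceremony.
assoc_data2 = [assoc_theme.upper() for assoc_theme in themes]
assoc_data2.extend(assoc_name.upper() for assoc_name in names)

assert assoc_data == assoc_data2
```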