On Tue, Feb 2, 2010 at 1:39 PM, Norman Khine <nor...@khine.net> wrote:
> On Tue, Feb 2, 2010 at 4:19 PM, Kent Johnson <ken...@tds.net> wrote:
>> On Tue, Feb 2, 2010 at 9:33 AM, Norman Khine <nor...@khine.net> wrote:
>>> On Tue, Feb 2, 2010 at 1:27 PM, Kent Johnson <ken...@tds.net> wrote:
>>>> On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine <nor...@khine.net> wrote:
>>>> Why do you use repr() here?
>>
>> It smells of programming by guess rather than a correct solution to
>> some problem. What happens if you take it out?
>
> when i take it out, i get an empty list.
>
> whereas both
> data = repr( file.read().decode('latin-1') )
> and
> data = repr( file.read().decode('utf-8') )
>
> returns the full list.

Try this version:

data = file.read()

get_records = re.compile(r"""openInfoWindowHtml\(.*?\ticon: myIcon\n""", re.DOTALL).findall
get_titles = re.compile(r"""<strong>(.*)<\/strong>""").findall
get_urls = re.compile(r"""a href=\"\/(.*)\">En savoir plus""").findall
get_latlngs = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\n\s*(\-?\d+\.\d*)\)""").findall

then as before.

Your repr() call is essentially removing newlines from the input by
converting them to literal '\n' pairs. This allows your regex to work
without the DOTALL modifier.

Note you will get slightly different results with my version - it will
give you correct utf-8 text for the titles whereas yours gives \
escapes. For example one of the titles is "CGTSM (Satére Mawé)". Your
version returns

{'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804', '-79.649735'), 'title': 'CGTSM (Sat\\xe9re Maw\\xe9)'}

Mine gives

{'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804', '-79.649735'), 'title': 'CGTSM (Sat\xc3\xa9re Maw\xc3\xa9)'}

This is showing the repr() of the title so they both have \ but note
that yours has two \\ indicating that the \ is in the text; mine has
only one \.

Kent
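
P.S. A small self-contained sketch of the two effects described above, in case it helps to see them in isolation. The sample string and pattern are made up for illustration (the real page source is not shown in this thread), and the code is written for Python 3, where ascii() is used as a rough stand-in for what repr() does to a unicode string in Python 2.

```python
import re

# Hypothetical sample text and pattern, not taken from the real page.
text = "start\nmiddle\nend"
pattern = r"start.*end"

# By default '.' does not match a real newline, so nothing is found.
without_flag = re.findall(pattern, text)

# re.DOTALL lets '.' match newlines, so the raw text can be searched directly.
with_dotall = re.findall(pattern, text, re.DOTALL)

# repr() turns each newline into two characters, a backslash and an 'n'.
# '.' matches both of them, so the pattern succeeds without DOTALL,
# but the matched text now contains the escapes instead of real newlines.
escaped = repr(text)
via_repr = re.findall(pattern, escaped)

print(without_flag)
print(with_dotall)
print(via_repr)

# The same thing happens to non-ASCII characters.
title = "Sat\u00e9re"
print(title.encode("utf-8"))  # real UTF-8 bytes for the accented character
print(ascii(title))           # a backslash escape stored in the text itself
```

The last two lines correspond to the single \ versus double \\ difference in the two dicts: encoding keeps the real character as UTF-8 bytes, while escaping bakes a backslash sequence into the string.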