On Tue, Feb 2, 2010 at 1:39 PM, Norman Khine <nor...@khine.net> wrote:
> On Tue, Feb 2, 2010 at 4:19 PM, Kent Johnson <ken...@tds.net> wrote:
>> On Tue, Feb 2, 2010 at 9:33 AM, Norman Khine <nor...@khine.net> wrote:
>>> On Tue, Feb 2, 2010 at 1:27 PM, Kent Johnson <ken...@tds.net> wrote:
>>>> On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine <nor...@khine.net> wrote:

>>>> Why do you use repr() here?

>>
>> It smells of programming by guess rather than a correct solution to
>> some problem. What happens if you take it out?
>
> when i take it out, i get an empty list.
>
> whereas both
> data = repr( file.read().decode('latin-1') )
> and
> data = repr( file.read().decode('utf-8') )
>
> returns the full list.

Try this version:

data = file.read()

get_records = re.compile(r"""openInfoWindowHtml\(.*?\ticon: myIcon\n""", re.DOTALL).findall
get_titles = re.compile(r"""<strong>(.*)<\/strong>""").findall
get_urls = re.compile(r"""a href=\"\/(.*)\">En savoir plus""").findall
get_latlngs = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\n\s*(\-?\d+\.\d*)\)""").findall

then as before.

Your repr() call is essentially removing newlines from the input by
converting them to literal '\n' pairs. This allows your regex to work
without the DOTALL modifier.
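
Here is a small sketch of that effect. The sample string is made up for
illustration and is not your real page source, and the pattern is a
shortened version of the one above:

```python
import re

# Hypothetical sample standing in for the page source (not the real data).
text = "openInfoWindowHtml(<strong>Title</strong>\n\ticon: myIcon\n"

# Without DOTALL, '.' does not match a newline, so nothing is found.
plain = re.compile(r"openInfoWindowHtml\(.*?icon: myIcon")
print(plain.findall(text))

# With DOTALL, '.' also matches '\n', so the record is found.
dotall = re.compile(r"openInfoWindowHtml\(.*?icon: myIcon", re.DOTALL)
print(len(dotall.findall(text)))

# repr() turns each newline into a backslash followed by 'n', so the
# escaped string has no real newlines and the plain regex now matches.
escaped = repr(text)
print("\n" in escaped)
print(len(plain.findall(escaped)))
```

The first findall() returns an empty list; the other two each find one
record.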

Note you will get slightly different results with my version - it will
give you correct utf-8 text for the titles whereas yours gives \
escapes. For example one of the titles is "CGTSM (Satére Mawé)". Your
version returns

{'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
'-79.649735'), 'title': 'CGTSM (Sat\\xe9re Maw\\xe9)'}

Mine gives
{'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
'-79.649735'), 'title': 'CGTSM (Sat\xc3\xa9re Maw\xc3\xa9)'}

This is showing the repr() of the title so they both have \ but note
that yours has two \\ indicating that the \ is in the text; mine has
only one \.
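
A quick way to see the difference is to compare lengths. This is only a
sketch using a shortened piece of the title; the variable names are mine:

```python
# Escaped text: backslash, 'x', 'e', '9' are four separate characters.
escaped = "Sat\\xe9re"

# Real UTF-8 data: \xc3\xa9 are the two bytes that encode e-acute.
raw = b"Sat\xc3\xa9re"

print(len(escaped))   # 9 characters
print(len(raw))       # 7 bytes

# Only the real UTF-8 bytes decode back to the accented character.
print(raw.decode("utf-8") == u"Sat\u00e9re")
```

The escaped version can never be decoded to the accented character
directly, because the accent is no longer there, only a description of it.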

Kent
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
