Re: [Tutor] parse text file

2010-06-03 Thread Steven D'Aprano
On Fri, 4 Jun 2010 12:45:52 am Colin Talbert wrote:

> I thought when you did a for uline in input_file each single line
> would go into memory independently, not the entire file.

for line in file:

reads one line at a time, but file.read() tries to read everything in 
one go. However, it should fail with MemoryError, not just stop 
silently.
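
A minimal sketch of the difference described above, written for Python 3 (an assumption; the thread itself uses Python 2). The helper names and the throwaway demo file are hypothetical and only serve the illustration.

```python
import os
import tempfile


def count_bytes_by_line(path):
    """Sum line lengths while holding only one line in memory at a time."""
    total = 0
    with open(path, "rb") as handle:
        for line in handle:  # the file object yields one line per iteration
            total += len(line)
    return total


def count_bytes_by_read(path):
    """Load the whole file into memory first, then measure it."""
    with open(path, "rb") as handle:
        return len(handle.read())


# Build a small throwaway file so the sketch is self-contained.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"first line\nsecond line\nthird line\n")

by_line = count_bytes_by_line(demo_path)
by_read = count_bytes_by_read(demo_path)
os.remove(demo_path)
```

Both helpers should report the same total; only the peak memory use differs, which is what matters for a file measured in gigabytes.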

> > I'm pretty sure that this is not your code, because you can't call
> > len() on a bz2 file. If you try, you get an error:
>
> You are so correct.  I'd been trying numerous things to read in this
> file and had deleted the code that I meant to put here and so wrote
> this from memory incorrectly.  The code that I wrote should have
> been:
>
> import bz2
> input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
> str=input_file.read()
> len(str)
>
> Which indeed does return only 900000.


Unfortunately, I can't download your bz2 file myself to test it, but I 
think I *may* have found the problem. It looks like the current bz2 
module only supports files written as a single stream, and not multiple 
stream files. This is why the BZ2File class has no "append" mode. See 
this bug report:

http://bugs.python.org/issue1625

My hypothesis is that your bz2 file consists of either multiple streams, 
or multiple bz2 files concatenated together, and the BZ2File class 
stops reading after the first.

I can test my hypothesis:

>>> bz2.BZ2File('a.bz2', 'w').write('this is the first chunk of text')
>>> bz2.BZ2File('b.bz2', 'w').write('this is the second chunk of text')
>>> bz2.BZ2File('c.bz2', 'w').write('this is the third chunk of text')
>>> # concatenate the files
... d = file('concate.bz2', 'w')
>>> for name in "abc":
...     f = file('%c.bz2' % name, 'rb')
...     d.write(f.read())
...
>>> d.close()
>>>
>>> bz2.BZ2File('concate.bz2', 'r').read()
'this is the first chunk of text'

And sure enough, BZ2File only sees the first chunk of text!

But if I open it in a stand-alone bz2 utility (I use the Linux 
application Ark), I can see all three chunks of text. So I think we 
have a successful test of the hypothesis.
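
The same experiment can be sketched in memory with Python 3 (an assumption; the session above is Python 2). It uses only `bz2.compress` and `bz2.BZ2Decompressor` from the standard library; the variable names are illustrative.

```python
import bz2

chunks = [
    b"this is the first chunk of text",
    b"this is the second chunk of text",
    b"this is the third chunk of text",
]

# Each bz2.compress() call produces one complete, independent stream,
# so joining the results mimics concatenating three .bz2 files.
concatenated = b"".join(bz2.compress(chunk) for chunk in chunks)

# A single BZ2Decompressor stops at the end of the first stream.
decomp = bz2.BZ2Decompressor()
first_only = decomp.decompress(concatenated)

# Whatever follows the first stream is handed back, still compressed.
leftover = decomp.unused_data
```

A single decompressor object sees only the first chunk, which matches the behaviour reported for BZ2File above; the remaining streams sit untouched in `unused_data`.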


Assuming this is the problem you are having, you have a number of 
possible solutions:

(1) Re-create the bz2 file from a single stream.

(2) Use another application to expand the bz2 file and then read 
directly from that, skipping BZ2File altogether.

(3) Upgrade to Python 2.7 or 3.2, and hope the patch is applied.

(4) Backport the patch to your version of Python and apply it yourself.

(5) Write your own bz2 utility.

Not really a very appetising series of choices there, I must admit. 
Probably (1) or (2) are the least worst.
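
For what it is worth, option (5) need not be a large job. Below is a rough Python 3 sketch (an assumption; it is not the patch from the bug report) that walks every stream in a file by starting a fresh `BZ2Decompressor` whenever the previous one reports end of stream. The function name `iter_multistream` and the chunk size are hypothetical. As far as I know, Python 3.3 and later also handle multi-stream input directly in `BZ2File` and `bz2.decompress`.

```python
import bz2
import io


def iter_multistream(fileobj, chunk_size=64 * 1024):
    """Yield decompressed bytes from every bz2 stream in fileobj, in order."""
    decomp = bz2.BZ2Decompressor()
    while True:
        raw = fileobj.read(chunk_size)
        if not raw:
            break
        while raw:
            data = decomp.decompress(raw)
            if data:
                yield data
            if decomp.eof:
                # End of one stream: leftover bytes belong to the next one.
                raw = decomp.unused_data
                decomp = bz2.BZ2Decompressor()
            else:
                raw = b""


# Three independent streams glued together, as in the hypothesis above.
parts = [b"first chunk ", b"second chunk ", b"third chunk"]
blob = b"".join(bz2.compress(part) for part in parts)

# A deliberately tiny chunk size exercises the stream-boundary handling.
recovered = b"".join(iter_multistream(io.BytesIO(blob), chunk_size=7))
```

Because the generator yields data piece by piece, the caller never needs the whole decompressed file in memory at once.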



-- 
Steven D'Aprano
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-06-03 Thread Sander Sweers
On 3 June 2010 21:02, Colin Talbert  wrote:

> I couldn't find any example of it in use and wasn't having any luck getting
> it to work based on the documentation.


Good examples of the bz2 module can be found at [1].

greets
Sander

[1] http://www.doughellmann.com/PyMOTW/bz2/


Re: [Tutor] parse text file

2010-06-03 Thread Vincent Davis
On Thu, Jun 3, 2010 at 1:02 PM, Colin Talbert  wrote:

>
> Dave,
> I think you are probably right about using decompressor.  I
> couldn't find any example of it in use and wasn't having any luck getting it
> to work based on the documentation.  Maybe I should try harder on this
> front.
>

Is it possible to write a Python script to transfer this to an HDF5 file?  Would
this help?
Thanks
Vincent


> Colin Talbert
> GIS Specialist
> US Geological Survey - Fort Collins Science Center
> 2150 Centre Ave. Bldg. C
> Fort Collins, CO 80526
>
> (970) 226-9425
> talbe...@usgs.gov
>
>
>
>  From: Dave Angel  To:
> Colin Talbert 
> Cc: Steven D'Aprano , tutor@python.org Date: 06/03/2010
> 12:36 PM Subject: Re: [Tutor] parse text file
> --
>
>
>
> Colin Talbert wrote:
> > 
> > You are so correct.  I'd been trying numerous things to read in this file
> > and had deleted the code that I meant to put here and so wrote this from
> > memory incorrectly.  The code that I wrote should have been:
> >
> > import bz2
> > input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
> > str=input_file.read()
> > len(str)
> >
> > Which indeed does return only 900000.
> >
> > Which is also the number returned when you sum the length of all the lines
> > returned in a for line in file with:
> >
> >
> > import bz2
> > input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
> > lengthz = 0
> > for uline in input_file:
> >     lengthz = lengthz + len(uline)
> >
> > print lengthz
> >
> > 
> >
> >
> Seems to me for such a large file you'd have to use
> bz2.BZ2Decompressor.  I have no experience with it, but its purpose is
> for sequential decompression -- decompression where not all the data is
> simultaneously available in memory.
>
> DaveA
>
>
>
>
>
>
  *Vincent Davis
720-301-3003 *
vinc...@vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>


Re: [Tutor] parse text file

2010-06-03 Thread Colin Talbert
Dave,
I think you are probably right about using decompressor.  I 
couldn't find any example of it in use and wasn't having any luck getting 
it to work based on the documentation.  Maybe I should try harder on this 
front.

Colin Talbert
GIS Specialist
US Geological Survey - Fort Collins Science Center
2150 Centre Ave. Bldg. C
Fort Collins, CO 80526

(970) 226-9425
talbe...@usgs.gov




From:
Dave Angel 
To:
Colin Talbert 
Cc:
Steven D'Aprano , tutor@python.org
Date:
06/03/2010 12:36 PM
Subject:
Re: [Tutor] parse text file



Colin Talbert wrote:
> 
> You are so correct.  I'd been trying numerous things to read in this file
> and had deleted the code that I meant to put here and so wrote this from
> memory incorrectly.  The code that I wrote should have been:
>
> import bz2
> input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
> str=input_file.read()
> len(str)
>
> Which indeed does return only 900000.
>
> Which is also the number returned when you sum the length of all the lines
> returned in a for line in file with:
>
>
> import bz2
> input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
> lengthz = 0
> for uline in input_file:
>     lengthz = lengthz + len(uline)
>
> print lengthz
>
> 
> 
>
Seems to me for such a large file you'd have to use 
bz2.BZ2Decompressor.  I have no experience with it, but its purpose is 
for sequential decompression -- decompression where not all the data is 
simultaneously available in memory.

DaveA





Re: [Tutor] parse text file

2010-06-03 Thread Dave Angel

Colin Talbert wrote:


You are so correct.  I'd been trying numerous things to read in this file 
and had deleted the code that I meant to put here and so wrote this from 
memory incorrectly.  The code that I wrote should have been:


import bz2
input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
str=input_file.read()
len(str)

Which indeed does return only 900000.

Which is also the number returned when you sum the length of all the lines 
returned in a for line in file with:



import bz2
input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
lengthz = 0
for uline in input_file:
    lengthz = lengthz + len(uline)

print lengthz


  

Seems to me for such a large file you'd have to use 
bz2.BZ2Decompressor.  I have no experience with it, but its purpose is 
for sequential decompression -- decompression where not all the data is 
simultaneously available in memory.
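
A small Python 3 sketch of what sequential decompression looks like in practice (an assumption on my part, since nobody in the thread had a working example). Compressed bytes are fed to `bz2.BZ2Decompressor` a piece at a time and only the running total is kept. The helper name and sample payload are hypothetical, and the sketch assumes a single-stream input.

```python
import bz2


def decompressed_length(compressed_pieces):
    """Count decompressed bytes without keeping the output around."""
    decomp = bz2.BZ2Decompressor()
    total = 0
    for piece in compressed_pieces:
        # Each call returns whatever output is ready so far (possibly b"").
        total += len(decomp.decompress(piece))
    return total


payload = b"<node/>\n" * 10000
packed = bz2.compress(payload)

# Feed the compressed data in 16-byte slices to imitate reading a big file.
pieces = [packed[i:i + 16] for i in range(0, len(packed), 16)]
length = decompressed_length(pieces)
```

Note that a decompressor raises EOFError if it is fed more data after its stream has ended, so a multi-stream file needs extra handling.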


DaveA



Re: [Tutor] parse text file

2010-06-03 Thread Alan Gauld


"Colin Talbert"  wrote

> I thought when you did a for uline in input_file each single line
> would go into memory independently, not the entire file.

That's true, but your code snippet showed you using read(),
which reads the whole file...

> > I'm pretty sure that this is not your code, because you can't call
> > len() on a bz2 file. If you try, you get an error:
>
> You are so correct.  I'd been trying numerous things to read in this
> file and had deleted the code that I meant to put here and so wrote
> this from memory incorrectly.  The code that I wrote should have been:
>
> import bz2
> input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
> str=input_file.read()
> len(str)

This again uses read(), which reads the whole file.

> Which is also the number returned when you sum the length of all the
> lines returned in a for line in file with:
>
> import bz2
> input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
> lengthz = 0
> for uline in input_file:
>     lengthz = lengthz + len(uline)


I'm not sure how

for line in file

will work for binary files. It may read the whole thing since
the concept of lines really only applies to text. So it may
be the same result as using read()

Try looping using read(n) where n is some buffer size
(1024 might be a good value?).
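
The read(n) loop suggested above might look roughly like this in Python 3 (an assumption; the thread uses Python 2). The function name, the demo file and the buffer size are hypothetical.

```python
import bz2
import os
import tempfile


def count_decompressed_bytes(path, buffer_size=1024):
    """Read a bz2 file in fixed-size blocks and return the total length."""
    total = 0
    with bz2.BZ2File(path, "rb") as handle:
        while True:
            block = handle.read(buffer_size)
            if not block:  # an empty result signals end of file
                break
            total += len(block)
    return total


# Create a small compressed file so the sketch can run on its own.
fd, demo_path = tempfile.mkstemp(suffix=".bz2")
os.close(fd)
with bz2.BZ2File(demo_path, "wb") as out:
    out.write(b"x" * 5000)

total = count_decompressed_bytes(demo_path)
os.remove(demo_path)
```

At most buffer_size decompressed bytes are held at any moment, regardless of how large the file is.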

HTH,

--
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/




Re: [Tutor] parse text file

2010-06-03 Thread Colin Talbert
Hello Steven,
Thanks for the reply.  Also this is my first post to tu...@python 
so I'll reply all in the future.


However, a file of that size changes things drastically. You can't 
expect to necessarily be able to read the entire 9.2 gigabyte BZ2 file 
into memory at once, let alone the unpacked 131 GB text file, EVEN if 
your computer has more than 9.2 GB of memory. So your tests need to 
take this into account.

I thought when you did a for uline in input_file each single line would go 
into memory independently, not the entire file.



I'm pretty sure that this is not your code, because you can't call len() 
on a bz2 file. If you try, you get an error:

You are so correct.  I'd been trying numerous things to read in this file 
and had deleted the code that I meant to put here and so wrote this from 
memory incorrectly.  The code that I wrote should have been:

import bz2
input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
str=input_file.read()
len(str)

Which indeed does return only 900000.

Which is also the number returned when you sum the length of all the lines 
returned in a for line in file with:


import bz2
input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
lengthz = 0
for uline in input_file:
    lengthz = lengthz + len(uline)

print lengthz


Thanks again for your help and sorry for the bad code in the previous 
submittal.


Colin Talbert
GIS Specialist
US Geological Survey - Fort Collins Science Center
2150 Centre Ave. Bldg. C
Fort Collins, CO 80526

(970) 226-9425
talbe...@usgs.gov




From:
Steven D'Aprano 
To:
tutor@python.org
Date:
06/02/2010 03:42 PM
Subject:
Re: [Tutor] parse text file
Sent by:
tutor-bounces+talbertc=usgs@python.org



Hi Colin,

I'm taking the liberty of replying to your message back to the list, as 
others hopefully may be able to make constructive comments. When 
replying, please ensure that you reply to the tutor mailing list rather 
than the individual.


On Thu, 3 Jun 2010 12:20:10 am Colin Talbert wrote:

> > Without seeing your text file, and the code you use to read the text
> > file, there's no way of telling what is going on, but I can guess
> > the most likely causes:
>
> Since the file is 9.2 gig it wouldn't make sense to send it to you. 

And I am very glad you didn't try *smiles*

However, a file of that size changes things drastically. You can't 
expect to necessarily be able to read the entire 9.2 gigabyte BZ2 file 
into memory at once, let alone the unpacked 131 GB text file, EVEN if 
your computer has more than 9.2 GB of memory. So your tests need to 
take this into account.

> > (2) There's a bug in your code so that you stop reading after
> > 900,000 bytes.
> The code is simple enough that I'm pretty sure there is not a
> bug in it.
>
> import bz2
> input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
> print len(input_file)
>
> returns 900000

I'm pretty sure that this is not your code, because you can't call len() 
on a bz2 file. If you try, you get an error:


>>> x = bz2.BZ2File('test.bz2', 'w')  # create a temporary file
>>> x.write("some data")
>>> x.close()
>>> input_file = bz2.BZ2File('test.bz2', 'r')  # open it
>>> print len(input_file)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'bz2.BZ2File' has no len()


So whatever your code actually is, I'm fairly sure it isn't what you say 
here.



-- 
Steven D'Aprano


Re: [Tutor] parse text file

2010-06-02 Thread Steven D'Aprano
Hi Colin,

I'm taking the liberty of replying to your message back to the list, as 
others hopefully may be able to make constructive comments. When 
replying, please ensure that you reply to the tutor mailing list rather 
than the individual.


On Thu, 3 Jun 2010 12:20:10 am Colin Talbert wrote:

> > Without seeing your text file, and the code you use to read the text
> > file, there's no way of telling what is going on, but I can guess
> > the most likely causes:
>
> Since the file is 9.2 gig it wouldn't make sense to send it to you. 

And I am very glad you didn't try *smiles*

However, a file of that size changes things drastically. You can't 
expect to necessarily be able to read the entire 9.2 gigabyte BZ2 file 
into memory at once, let alone the unpacked 131 GB text file, EVEN if 
your computer has more than 9.2 GB of memory. So your tests need to 
take this into account.

> > (2) There's a bug in your code so that you stop reading after
> > 900,000 bytes.
> The code is simple enough that I'm pretty sure there is not a
> bug in it.
>
> import bz2
> input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
> print len(input_file)
>
> returns 900000

I'm pretty sure that this is not your code, because you can't call len() 
on a bz2 file. If you try, you get an error:


>>> x = bz2.BZ2File('test.bz2', 'w')  # create a temporary file
>>> x.write("some data")
>>> x.close()
>>> input_file = bz2.BZ2File('test.bz2', 'r')  # open it
>>> print len(input_file)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'bz2.BZ2File' has no len()


So whatever your code actually is, I'm fairly sure it isn't what you say 
here.



-- 
Steven D'Aprano


Re: [Tutor] parse text file

2010-06-02 Thread bob gailer

Please always reply-all so a copy goes to the list.

On 6/1/2010 6:49 PM, Colin Talbert wrote:


Bob thanks for your response,
The file is about 9.3 gig and no, I don't want to read the whole 
thing at once.  I want to read it in line by line.  Still, it will read 
in to the same point (900000 characters) and then act as if it came to 
the end of the file.  Below is the code I am using for this:



import bz2

input_file = bz2.BZ2File(r"C:\temp\planet-latest.osm.bz2","rb")
linecount = 0
for uline in input_file:
    print linecount
    linecount+=1








Colin Talbert
GIS Specialist
US Geological Survey - Fort Collins Science Center
2150 Centre Ave. Bldg. C
Fort Collins, CO 80526

(970) 226-9425
talbe...@usgs.gov



From:   bob gailer 
To: Colin Talbert 
Cc: tutor@python.org
Date:   06/01/2010 04:43 PM
Subject:    Re: [Tutor] parse text file






On 6/1/2010 5:40 PM, Colin Talbert wrote:

   I am also experiencing this same problem.  (Also on an OSM bz2 
file).  It appears to be working but then partway through reading a 
file it simply ends.  I did track down that file length is always 
900000 so it appears to be related to some sort of buffer constraint.



Any other ideas?

How big is the file?

Is it necessary to read the entire thing at once?

Try opening with mode rb


import bz2

input_file = bz2.BZ2File(r"C:\temp\planet-latest.osm.bz2","r")
try:
   all_data = input_file.read()
   print str(len(all_data))
finally:
   input_file.close()


--
Bob Gailer
919-636-4239
Chapel Hill NC



Re: [Tutor] parse text file

2010-06-01 Thread Steven D'Aprano
On Wed, 2 Jun 2010 07:40:33 am Colin Talbert wrote:
> I am also experiencing this same problem.  (Also on an OSM bz2
> file).  It appears to be working but then partway through reading a
> file it simply ends.  I did track down that file length is always
> 900000 so it appears to be related to some sort of buffer constraint.

Without seeing your text file, and the code you use to read the text 
file, there's no way of telling what is going on, but I can guess the 
most likely causes:

(1) Your text file is actually only 900,000 bytes long, and so there's 
no problem at all.
(2) There's a bug in your code so that you stop reading after 900,000 
bytes.
(3) You're on Windows, and the text file contains an End-Of-File 
character ^Z after 900,000 bytes, and Windows supports that for 
backward compatibility with DOS.

And a distant (VERY distant) number 4, there's a bug in the 
implementation of read() in Python which somehow nobody has noticed 
before now.

As for your second issue, reading bz2 files:

> import bz2
>
> input_file = bz2.BZ2File(r"C:\temp\planet-latest.osm.bz2","r")

You're opening a binary file in text mode. I'm pretty sure that is not 
going to work well. Try passing 'rb' as the mode instead.

> try:
>     all_data = input_file.read()
>     print str(len(all_data))

You don't need to call str() before calling print. print is perfectly 
happy to operate on integers:

print len(all_data)

will work.


-- 
Steven D'Aprano


Re: [Tutor] parse text file

2010-06-01 Thread bob gailer

On 6/1/2010 5:40 PM, Colin Talbert wrote:


I am also experiencing this same problem.  (Also on an OSM bz2 
file).  It appears to be working but then partway through reading a 
file it simply ends.  I did track down that file length is always 
900000 so it appears to be related to some sort of buffer constraint.



Any other ideas?


How big is the file?

Is it necessary to read the entire thing at once?

Try opening with mode rb



import bz2

input_file = bz2.BZ2File(r"C:\temp\planet-latest.osm.bz2","r")
try:
    all_data = input_file.read()
    print str(len(all_data))
finally:
    input_file.close()



--
Bob Gailer
919-636-4239
Chapel Hill NC



Re: [Tutor] parse text file

2010-06-01 Thread Colin Talbert
I am also experiencing this same problem.  (Also on an OSM bz2 
file).  It appears to be working but then partway through reading a file 
it simply ends.  I did track down that file length is always 900000 so it 
appears to be related to some sort of buffer constraint.


Any other ideas?

import bz2

input_file = bz2.BZ2File(r"C:\temp\planet-latest.osm.bz2","r")
try:
    all_data = input_file.read()
    print str(len(all_data))
finally:
    input_file.close()






Colin Talbert
GIS Specialist
US Geological Survey - Fort Collins Science Center
2150 Centre Ave. Bldg. C
Fort Collins, CO 80526

(970) 226-9425
talbe...@usgs.gov


Re: [Tutor] parse text file

2010-02-03 Thread Norman Khine
On Tue, Feb 2, 2010 at 11:36 PM, Kent Johnson  wrote:
> On Tue, Feb 2, 2010 at 4:56 PM, Norman Khine  wrote:
>> On Tue, Feb 2, 2010 at 10:11 PM, Kent Johnson  wrote:
>
>>> Try this version:
>>>
>>> data = file.read()
>>>
>>> get_records = re.compile(r"""openInfoWindowHtml\(.*?\ticon:
>>> myIcon\n""", re.DOTALL).findall
>>> get_titles = re.compile(r"""<strong>(.*)<\/strong>""").findall
>>> get_urls = re.compile(r"""a href=\"\/(.*)\">En savoir plus""").findall
>>> get_latlngs = 
>>> re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\n\s*(\-?\d+\.\d*)\)""").findall
>>>
>>> then as before.
>>>
>>> Your repr() call is essentially removing newlines from the input by
>>> converting them to literal '\n' pairs. This allows your regex to work
>>> without the DOTALL modifier.
>>>
>>> Note you will get slightly different results with my version - it will
>>> give you correct utf-8 text for the titles whereas yours gives \
>>> escapes. For example one of the titles is "CGTSM (Satére Mawé)". Your
>>> version returns
>>>
>>> {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
>>> '-79.649735'), 'title': 'CGTSM (Sat\\xe9re Maw\\xe9)'}
>>>
>>> Mine gives
>>> {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
>>> '-79.649735'), 'title': 'CGTSM (Sat\xc3\xa9re Maw\xc3\xa9)'}
>>>
>>> This is showing the repr() of the title so they both have \ but note
>>> that yours has two \\ indicating that the \ is in the text; mine has
>>> only one \.
>>
>> i am no expert, but there seems to be a bigger difference.
>>
>> with repr(), i get:
>> Sat\\xe9re Maw\\xe9
>>
>> where as you get
>>
>> Sat\xc3\xa9re Maw\xc3\xa9
>>
>> repr()'s
>> é == \\xe9
>> whereas on your version
>> é == \xc3\xa9
>
> Right. Your version has four actual characters in the result - \, x,
> e, 9. This is the escaped representation of the unicode representation
> of e-acute. (The \ is doubled in the repr display.)
>
> My version has two bytes in the result, with the values c3 and a9.
> This is the utf-8 representation of e-acute.
>
> If you want to accurately represent (i.e. print) the title at some
> later time you probably want the utf-8 representation.
>>
>>>
>>> Kent
>>>
>>
>> also, i still get an empty list when i run the code as suggested.
>
> You didn't change the regexes. You have to change \\t and \\n to \t
> and \n because the source text now has actual tabs and newlines, not
> the escaped representations.
>
> I know this is confusing, I'm sorry I don't have time or patience to
> explain more.

thanks for your time, i did realise after i posted the email that the
regex needed to be changed.

>
> Kent
>


Re: [Tutor] parse text file

2010-02-03 Thread spir
On Tue, 2 Feb 2010 22:56:22 +0100
Norman Khine  wrote:

> i am no expert, but there seems to be a bigger difference.
> 
> with repr(), i get:
> Sat\\xe9re Maw\\xe9
> 
> where as you get
> 
> Sat\xc3\xa9re Maw\xc3\xa9
> 
> repr()'s
> é == \\xe9
> whereas on your version
> é == \xc3\xa9

This is a rather complicated issue mixing Python str, unicode strings, and their 
repr().
Kent is right in that the *Python string* "\xc3\xa9" is the UTF-8 encoded 
representation of 'é' (2 bytes), while \xe9 is the *Unicode code point* for 'é', 
which should only appear in a unicode string.
So:
   unicode.encode(u"\u00e9", "utf8") == "\xc3\xa9"
or more simply:
   u"\u00e9".encode("utf8") == "\xc3\xa9"
Conversely:
   unicode("\xc3\xa9", "utf8") == u"\u00e9" -- decoding

The question is: what do you want to do with the result? You'll need either the 
utf8 form "\xc3\xa9" (for output) or the unicode string u"\u00e9" (for 
processing). But what you actually get is a kind of mix, actually the (python 
str) repr of a unicode string.
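
The three forms discussed above can be lined up in a short Python 3 sketch (an assumption; in Python 3 `str` is the unicode type and `bytes` plays the role of the Python 2 `str`). The variable names are illustrative.

```python
# One character of text: e-acute, code point U+00E9.
text = "\u00e9"

# Its UTF-8 encoding is the two bytes c3 a9.
utf8_bytes = text.encode("utf-8")

# Decoding the bytes gives the original character back.
round_trip = utf8_bytes.decode("utf-8")

# The escaped form is four ordinary characters: backslash, x, e, 9.
escaped = text.encode("unicode_escape").decode("ascii")
```

The escaped form is what ends up in the data when repr() is applied to a unicode string, which is the mix described above.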

> also, i still get an empty list when i run the code as suggested.

? Strange. Have you checked the re.DOTALL? (else regex patterns stop matching 
at \n by default)


Denis


la vita e estrany

http://spir.wikidot.com/


Re: [Tutor] parse text file

2010-02-02 Thread Kent Johnson
On Tue, Feb 2, 2010 at 4:56 PM, Norman Khine  wrote:
> On Tue, Feb 2, 2010 at 10:11 PM, Kent Johnson  wrote:

>> Try this version:
>>
>> data = file.read()
>>
>> get_records = re.compile(r"""openInfoWindowHtml\(.*?\ticon:
>> myIcon\n""", re.DOTALL).findall
>> get_titles = re.compile(r"""<strong>(.*)<\/strong>""").findall
>> get_urls = re.compile(r"""a href=\"\/(.*)\">En savoir plus""").findall
>> get_latlngs = 
>> re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\n\s*(\-?\d+\.\d*)\)""").findall
>>
>> then as before.
>>
>> Your repr() call is essentially removing newlines from the input by
>> converting them to literal '\n' pairs. This allows your regex to work
>> without the DOTALL modifier.
>>
>> Note you will get slightly different results with my version - it will
>> give you correct utf-8 text for the titles whereas yours gives \
>> escapes. For example one of the titles is "CGTSM (Satére Mawé)". Your
>> version returns
>>
>> {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
>> '-79.649735'), 'title': 'CGTSM (Sat\\xe9re Maw\\xe9)'}
>>
>> Mine gives
>> {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
>> '-79.649735'), 'title': 'CGTSM (Sat\xc3\xa9re Maw\xc3\xa9)'}
>>
>> This is showing the repr() of the title so they both have \ but note
>> that yours has two \\ indicating that the \ is in the text; mine has
>> only one \.
>
> i am no expert, but there seems to be a bigger difference.
>
> with repr(), i get:
> Sat\\xe9re Maw\\xe9
>
> where as you get
>
> Sat\xc3\xa9re Maw\xc3\xa9
>
> repr()'s
> é == \\xe9
> whereas on your version
> é == \xc3\xa9

Right. Your version has four actual characters in the result - \, x,
e, 9. This is the escaped representation of the unicode representation
of e-acute. (The \ is doubled in the repr display.)

My version has two bytes in the result, with the values c3 and a9.
This is the utf-8 representation of e-acute.

If you want to accurately represent (i.e. print) the title at some
later time you probably want the utf-8 representation.
>
>>
>> Kent
>>
>
> also, i still get an empty list when i run the code as suggested.

You didn't change the regexes. You have to change \\t and \\n to \t
and \n because the source text now has actual tabs and newlines, not
the escaped representations.

I know this is confusing, I'm sorry I don't have time or patience to
explain more.

Kent


Re: [Tutor] parse text file

2010-02-02 Thread Norman Khine
On Tue, Feb 2, 2010 at 10:11 PM, Kent Johnson  wrote:
> On Tue, Feb 2, 2010 at 1:39 PM, Norman Khine  wrote:
>> On Tue, Feb 2, 2010 at 4:19 PM, Kent Johnson  wrote:
>>> On Tue, Feb 2, 2010 at 9:33 AM, Norman Khine  wrote:
>>>> On Tue, Feb 2, 2010 at 1:27 PM, Kent Johnson  wrote:
>>>>> On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine  wrote:
>>>>>
>>>>> Why do you use repr() here?
>>>>>
>>>
>>> It smells of programming by guess rather than a correct solution to
>>> some problem. What happens if you take it out?
>>
>> when i take it out, i get an empty list.
>>
>> whereas both
>> data = repr( file.read().decode('latin-1') )
>> and
>> data = repr( file.read().decode('utf-8') )
>>
>> returns the full list.
>
> Try this version:
>
> data = file.read()
>
> get_records = re.compile(r"""openInfoWindowHtml\(.*?\ticon:
> myIcon\n""", re.DOTALL).findall
> get_titles = re.compile(r"""<strong>(.*)<\/strong>""").findall
> get_urls = re.compile(r"""a href=\"\/(.*)\">En savoir plus""").findall
> get_latlngs = 
> re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\n\s*(\-?\d+\.\d*)\)""").findall
>
> then as before.
>
> Your repr() call is essentially removing newlines from the input by
> converting them to literal '\n' pairs. This allows your regex to work
> without the DOTALL modifier.
>
> Note you will get slightly different results with my version - it will
> give you correct utf-8 text for the titles whereas yours gives \
> escapes. For example one of the titles is "CGTSM (Satére Mawé)". Your
> version returns
>
> {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
> '-79.649735'), 'title': 'CGTSM (Sat\\xe9re Maw\\xe9)'}
>
> Mine gives
> {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
> '-79.649735'), 'title': 'CGTSM (Sat\xc3\xa9re Maw\xc3\xa9)'}
>
> This is showing the repr() of the title so they both have \ but note
> that yours has two \\ indicating that the \ is in the text; mine has
> only one \.

i am no expert, but there seems to be a bigger difference.

with repr(), i get:
Sat\\xe9re Maw\\xe9

where as you get

Sat\xc3\xa9re Maw\xc3\xa9

repr()'s
é == \\xe9
whereas on your version
é == \xc3\xa9

>
> Kent
>

also, i still get an empty list when i run the code as suggested.


Re: [Tutor] parse text file

2010-02-02 Thread Kent Johnson
On Tue, Feb 2, 2010 at 1:39 PM, Norman Khine  wrote:
> On Tue, Feb 2, 2010 at 4:19 PM, Kent Johnson  wrote:
>> On Tue, Feb 2, 2010 at 9:33 AM, Norman Khine  wrote:
>>> On Tue, Feb 2, 2010 at 1:27 PM, Kent Johnson  wrote:
>>>> On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine  wrote:
>>>>
>>>> Why do you use repr() here?
>>>>
>>
>> It smells of programming by guess rather than a correct solution to
>> some problem. What happens if you take it out?
>
> when i take it out, i get an empty list.
>
> whereas both
> data = repr( file.read().decode('latin-1') )
> and
> data = repr( file.read().decode('utf-8') )
>
> returns the full list.

Try this version:

data = file.read()

get_records = re.compile(r"""openInfoWindowHtml\(.*?\ticon: myIcon\n""", re.DOTALL).findall
get_titles = re.compile(r"""<strong>(.*)<\/strong>""").findall
get_urls = re.compile(r"""a href=\"\/(.*)\">En savoir plus""").findall
get_latlngs = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\n\s*(\-?\d+\.\d*)\)""").findall

then as before.

Your repr() call is essentially removing newlines from the input by
converting them to literal '\n' pairs. This allows your regex to work
without the DOTALL modifier.
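
To make the point about repr() and DOTALL concrete, here is a small Python 3 sketch (an assumption; the thread uses Python 2) on made-up input. The sample text and pattern are hypothetical stand-ins for the real map data.

```python
import re

# Two records that each span several lines, with real newlines and tabs.
source = "open(\n\tfirst\n)\nopen(\n\tsecond\n)\n"

# By default '.' does not match a newline, so nothing is found.
without_dotall = re.findall(r"open\(.*?\)", source)

# With re.DOTALL, '.' also matches newlines and both records are found.
with_dotall = re.findall(r"open\(.*?\)", source, re.DOTALL)

# repr() turns each newline into the two characters backslash and 'n',
# so the same pattern matches even without DOTALL.
escaped = repr(source)
on_repr = re.findall(r"open\(.*?\)", escaped)
```

The matches found on the repr() text contain backslash-n character pairs rather than real newlines, which is why patterns written for one form of the data fail on the other.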

Note you will get slightly different results with my version - it will
give you correct utf-8 text for the titles whereas yours gives \
escapes. For example one of the titles is "CGTSM (Satére Mawé)". Your
version returns

{'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
'-79.649735'), 'title': 'CGTSM (Sat\\xe9re Maw\\xe9)'}

Mine gives
{'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
'-79.649735'), 'title': 'CGTSM (Sat\xc3\xa9re Maw\xc3\xa9)'}

This is showing the repr() of the title so they both have \ but note
that yours has two \\ indicating that the \ is in the text; mine has
only one \.

Kent


Re: [Tutor] parse text file

2010-02-02 Thread Norman Khine
On Tue, Feb 2, 2010 at 4:19 PM, Kent Johnson  wrote:
> On Tue, Feb 2, 2010 at 9:33 AM, Norman Khine  wrote:
>> On Tue, Feb 2, 2010 at 1:27 PM, Kent Johnson  wrote:
>>> On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine  wrote:
>>>
>>>> here are the changes:
>>>>
>>>> import re
>>>> file=open('producers_google_map_code.txt', 'r')
>>>> data =  repr( file.read().decode('utf-8') )
>>>
>>> Why do you use repr() here?
>>
>> i have latin-1 chars in the producers_google_map_code.txt' file and
>> this is the only way to get it to read the data.
>>
>> is this incorrect?
>
> Well, the repr() call is after the file read. If your data is latin-1
> you should decode it as latin-1, not utf-8:
> data = file.read().decode('latin-1')
>
> Though if the decode('utf-8') succeeds, and you do have non-ascii
> characters in the data, they are probably encoded in utf-8, not
> latin-1. Are you sure you have latin-1?
>
> The repr() call converts back to ascii text, maybe that is what you want?
>
> Perhaps you put in the repr because you were having trouble printing?
>
> It smells of programming by guess rather than a correct solution to
> some problem. What happens if you take it out?

when i take it out, i get an empty list.

whereas both
data = repr( file.read().decode('latin-1') )
and
data = repr( file.read().decode('utf-8') )

returns the full list.

here is the file
http://cdn.admgard.org/documents/producers_google_map_code.txt

>
> Kent
>


Re: [Tutor] parse text file

2010-02-02 Thread Kent Johnson
On Tue, Feb 2, 2010 at 9:33 AM, Norman Khine  wrote:
> On Tue, Feb 2, 2010 at 1:27 PM, Kent Johnson  wrote:
>> On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine  wrote:
>>
>>> here are the changes:
>>>
>>> import re
>>> file=open('producers_google_map_code.txt', 'r')
>>> data =  repr( file.read().decode('utf-8') )
>>
>> Why do you use repr() here?
>
> i have latin-1 chars in the producers_google_map_code.txt' file and
> this is the only way to get it to read the data.
>
> is this incorrect?

Well, the repr() call is after the file read. If your data is latin-1
you should decode it as latin-1, not utf-8:
data = file.read().decode('latin-1')

Though if the decode('utf-8') succeeds, and you do have non-ascii
characters in the data, they are probably encoded in utf-8, not
latin-1. Are you sure you have latin-1?
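
One way to check which encoding the bytes are really in is to try utf-8 first and fall back to latin-1 only if that fails. The helper below is a hypothetical sketch (decode_bytes is not part of any library), shown on two made-up byte strings:

```python
def decode_bytes(raw):
    """Return (text, encoding_used) for a byte string."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # latin-1 maps every possible byte to a character, so this call
        # never fails; success here proves nothing about the real encoding.
        return raw.decode("latin-1"), "latin-1"

# The same word encoded two different ways.
utf8_sample = u"Maw\u00e9".encode("utf-8")
latin1_sample = u"Maw\u00e9".encode("latin-1")

utf8_result = decode_bytes(utf8_sample)
latin1_result = decode_bytes(latin1_sample)
```

Because non-ascii latin-1 bytes are rarely valid utf-8, a successful utf-8 decode is good evidence that the data really is utf-8.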

The repr() call converts back to ascii text, maybe that is what you want?

Perhaps you put in the repr because you were having trouble printing?

It smells of programming by guess rather than a correct solution to
some problem. What happens if you take it out?

Kent


Re: [Tutor] parse text file

2010-02-02 Thread Norman Khine
hello,
thank you all for the advice, here is the updated version with the changes.

import re
file = open('producers_google_map_code.txt', 'r')
data = repr( file.read().decode('utf-8') )

get_records = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""").findall
get_titles = re.compile(r"""(.*)<\/strong>""").findall
get_urls = re.compile(r"""a href=\"\/(.*)\">En savoir plus""").findall
get_latlngs = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""").findall

records = get_records(data)
block_record = []
for record in records:
    namespace = {}
    titles = get_titles(record)
    title = titles[-1] if titles else None
    urls = get_urls(record)
    url = urls[-1] if urls else None
    latlngs = get_latlngs(record)
    latlng = latlngs[-1] if latlngs else None
    block_record.append( {'title':title, 'url':url, 'lating':latlng} )

print block_record


On Tue, Feb 2, 2010 at 1:27 PM, Kent Johnson  wrote:
> On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine  wrote:
>
>> here are the changes:
>>
>> import re
>> file=open('producers_google_map_code.txt', 'r')
>> data =  repr( file.read().decode('utf-8') )
>
> Why do you use repr() here?

i have latin-1 chars in the 'producers_google_map_code.txt' file and
this is the only way to get it to read the data.

is this incorrect?

>
>> get_record = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""")
>> get_title = re.compile(r"""(.*)<\/strong>""")
>> get_url = re.compile(r"""a href=\"\/(.*)\">En savoir plus""")
>> get_latlng = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""")
>>
>> records = get_record.findall(data)
>> block_record = []
>> for record in records:
>>        namespace = {}
>>        titles = get_title.findall(record)
>>        for title in titles:
>>                namespace['title'] = title
>
>
> This is odd, you don't need a loop to get the last title, just use
>  namespace['title'] = get_title.findall(html)[-1]
>
> and similarly for url and latings.
>
> Kent
>
>
>>        urls = get_url.findall(record)
>>        for url in urls:
>>                namespace['url'] = url
>>        latlngs = get_latlng.findall(record)
>>        for latlng in latlngs:
>>                namespace['latlng'] = latlng
>>        block_record.append(namespace)
>>
>> print block_record
>>>
>>> The def of "namespace" would be clearer imo in a single line:
>>>    namespace = {title:t, url:url, lat:g}
>>
>> i am not sure how this will fit into the code!
>>
>>> This also reveals a kind of name confusion, doesn't it?
>>>
>>>
>>> Denis
>>>
>>>
>>>
>>>
>>> 
>>>
>>> la vita e estrany
>>>
>>> http://spir.wikidot.com/


Re: [Tutor] parse text file

2010-02-02 Thread Kent Johnson
On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine  wrote:

> here are the changes:
>
> import re
> file=open('producers_google_map_code.txt', 'r')
> data =  repr( file.read().decode('utf-8') )

Why do you use repr() here?

> get_record = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""")
> get_title = re.compile(r"""(.*)<\/strong>""")
> get_url = re.compile(r"""a href=\"\/(.*)\">En savoir plus""")
> get_latlng = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""")
>
> records = get_record.findall(data)
> block_record = []
> for record in records:
>        namespace = {}
>        titles = get_title.findall(record)
>        for title in titles:
>                namespace['title'] = title


This is odd, you don't need a loop to get the last title, just use
  namespace['title'] = get_title.findall(record)[-1]

and similarly for url and latlngs.

Kent


>        urls = get_url.findall(record)
>        for url in urls:
>                namespace['url'] = url
>        latlngs = get_latlng.findall(record)
>        for latlng in latlngs:
>                namespace['latlng'] = latlng
>        block_record.append(namespace)
>
> print block_record
>>
>> The def of "namespace" would be clearer imo in a single line:
>>    namespace = {title:t, url:url, lat:g}
>
> i am not sure how this will fit into the code!
>
>> This also reveals a kind of name confusion, doesn't it?
>>
>>
>> Denis
>>
>>
>>
>>
>> 
>>
>> la vita e estrany
>>
>> http://spir.wikidot.com/


Re: [Tutor] parse text file

2010-02-02 Thread Dave Angel

Norman Khine wrote:

thanks denis,

On Tue, Feb 2, 2010 at 9:30 AM, spir  wrote:
  

On Mon, 1 Feb 2010 16:30:02 +0100
Norman Khine  wrote:



On Mon, Feb 1, 2010 at 1:19 PM, Kent Johnson  wrote:
  

On Mon, Feb 1, 2010 at 6:29 AM, Norman Khine  wrote:



thanks, what about the whitespace problem?
  

\s* will match any amount of whitespace including newlines.


thank you, this worked well.

here is the code:

###
import re
file=open('producers_google_map_code.txt', 'r')
data =  repr( file.read().decode('utf-8') )

block = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""")
b = block.findall(data)
block_list = []
for html in b:
    namespace = {}
    t = re.compile(r"""(.*)<\/strong>""")
    title = t.findall(html)
    for item in title:
        namespace['title'] = item
    u = re.compile(r"""a href=\"\/(.*)\">En savoir plus""")
    url = u.findall(html)
    for item in url:
        namespace['url'] = item
    g = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""")
    lat = g.findall(html)
    for item in lat:
        namespace['LatLng'] = item
    block_list.append(namespace)

###

can this be made better?
  

The 3 regex patterns are constants: they can be put out of the loop.

You may also rename b to blocks, and find a more accurate name for
block_list; eg block_records, where record = set of (named) fields.

A short desc and/or example of the overall and partial data formats can greatly 
help later review, since regex patterns alone are hard to decode.



here are the changes:

import re
file=open('producers_google_map_code.txt', 'r')
data =  repr( file.read().decode('utf-8') )

get_record = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""")
get_title = re.compile(r"""(.*)<\/strong>""")
get_url = re.compile(r"""a href=\"\/(.*)\">En savoir plus""")
get_latlng = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""")

records = get_record.findall(data)
block_record = []
for record in records:
    namespace = {}
    titles = get_title.findall(record)
    for title in titles:
        namespace['title'] = title
    urls = get_url.findall(record)
    for url in urls:
        namespace['url'] = url
    latlngs = get_latlng.findall(record)
    for latlng in latlngs:
        namespace['latlng'] = latlng
    block_record.append(namespace)

print block_record
  

The def of "namespace" would be clearer imo in a single line:
   namespace = {title:t, url:url, lat:g}



i am not sure how this will fit into the code!

  

This also reveals a kind of name confusion, doesn't it?


Denis




Your variable 'file' is hiding a built-in name for the file type.  No 
harm in this example, but it's a bad habit to get into.


What did you intend to happen if the number of titles, urls, and latlngs 
are not each exactly one?  As you have it now, if there's more than one, 
you spend time adding them all to the dictionary, but only the last one 
survives.  And if there aren't any, you don't make an entry in the 
dictionary.


If that's the exact behavior you want, then you could replace the loop 
with an if statement:   (untested)


if titles:
    namespace['title'] = titles[-1]


On the other hand, if you want a None in your dictionary for missing 
information, then something like:  (untested)


for record in records:
    titles = get_title.findall(record)
    title = titles[-1] if titles else None
    urls = get_url.findall(record)
    url = urls[-1] if urls else None
    latlngs = get_latlng.findall(record)
    latlng = latlngs[-1] if latlngs else None
    block_record.append( {'title':title, 'url':url, 'lating':latlng} )


DaveA


Re: [Tutor] parse text file

2010-02-02 Thread Stefan Behnel
Norman Khine, 02.02.2010 10:16:
> get_record = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""")
> get_title = re.compile(r"""(.*)<\/strong>""")
> get_url = re.compile(r"""a href=\"\/(.*)\">En savoir plus""")
> get_latlng = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""")
> 
> records = get_record.findall(data)
> block_record = []
> for record in records:
>   namespace = {}
>   titles = get_title.findall(record)
>   for title in titles:
>   namespace['title'] = title

I usually go one step further:

find_all_titles = re.compile(r"""(.*)<\/strong>""").findall
for record in records:
    titles = find_all_titles(record)

Both faster and more readable (as is so common in Python).
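
A self-contained sketch of the same idiom, binding the compiled pattern's findall method to a name once so the loop body is a plain function call. The pattern and the two sample records are assumptions chosen for illustration:

```python
import re

# Bind the bound method once, outside the loop; this avoids repeating the
# attribute lookup on every iteration.
find_all_numbers = re.compile(r"-?\d+\.\d*").findall

# Hypothetical records, only for illustration.
records = [
    "GLatLng(27.729912, 85.31559)",
    "GLatLng(-18.889851, -66.770897)",
]

results = [find_all_numbers(record) for record in records]
```

Each entry of results is the list of number strings found in the corresponding record.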

Stefan



Re: [Tutor] parse text file

2010-02-02 Thread Norman Khine
thanks denis,

On Tue, Feb 2, 2010 at 9:30 AM, spir  wrote:
> On Mon, 1 Feb 2010 16:30:02 +0100
> Norman Khine  wrote:
>
>> On Mon, Feb 1, 2010 at 1:19 PM, Kent Johnson  wrote:
>> > On Mon, Feb 1, 2010 at 6:29 AM, Norman Khine  wrote:
>> >
>> >> thanks, what about the whitespace problem?
>> >
>> > \s* will match any amount of whitespace including newlines.
>>
>> thank you, this worked well.
>>
>> here is the code:
>>
>> ###
>> import re
>> file=open('producers_google_map_code.txt', 'r')
>> data =  repr( file.read().decode('utf-8') )
>>
>> block = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""")
>> b = block.findall(data)
>> block_list = []
>> for html in b:
>>       namespace = {}
>>       t = re.compile(r"""(.*)<\/strong>""")
>>       title = t.findall(html)
>>       for item in title:
>>               namespace['title'] = item
>>       u = re.compile(r"""a href=\"\/(.*)\">En savoir plus""")
>>       url = u.findall(html)
>>       for item in url:
>>               namespace['url'] = item
>>       g = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""")
>>       lat = g.findall(html)
>>       for item in lat:
>>               namespace['LatLng'] = item
>>       block_list.append(namespace)
>>
>> ###
>>
>> can this be made better?
>
> The 3 regex patterns are constants: they can be put out of the loop.
>
> You may also rename b to blocks, and find a more accurate name for 
> block_list; eg block_records, where record = set of (named) fields.
>
> A short desc and/or example of the overall and partial data formats can 
> greatly help later review, since regex patterns alone are hard to decode.

here are the changes:

import re
file=open('producers_google_map_code.txt', 'r')
data =  repr( file.read().decode('utf-8') )

get_record = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""")
get_title = re.compile(r"""(.*)<\/strong>""")
get_url = re.compile(r"""a href=\"\/(.*)\">En savoir plus""")
get_latlng = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""")

records = get_record.findall(data)
block_record = []
for record in records:
    namespace = {}
    titles = get_title.findall(record)
    for title in titles:
        namespace['title'] = title
    urls = get_url.findall(record)
    for url in urls:
        namespace['url'] = url
    latlngs = get_latlng.findall(record)
    for latlng in latlngs:
        namespace['latlng'] = latlng
    block_record.append(namespace)

print block_record
>
> The def of "namespace" would be clearer imo in a single line:
>    namespace = {title:t, url:url, lat:g}

i am not sure how this will fit into the code!

> This also reveals a kind of name confusion, doesn't it?
>
>
> Denis
>
>
>
>
> 
>
> la vita e estrany
>
> http://spir.wikidot.com/


Re: [Tutor] parse text file

2010-02-02 Thread spir
On Mon, 1 Feb 2010 16:30:02 +0100
Norman Khine  wrote:

> On Mon, Feb 1, 2010 at 1:19 PM, Kent Johnson  wrote:
> > On Mon, Feb 1, 2010 at 6:29 AM, Norman Khine  wrote:
> >
> >> thanks, what about the whitespace problem?
> >
> > \s* will match any amount of whitespace including newlines.
> 
> thank you, this worked well.
> 
> here is the code:
> 
> ###
> import re
> file=open('producers_google_map_code.txt', 'r')
> data =  repr( file.read().decode('utf-8') )
> 
> block = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""")
> b = block.findall(data)
> block_list = []
> for html in b:
>       namespace = {}
>       t = re.compile(r"""(.*)<\/strong>""")
>       title = t.findall(html)
>       for item in title:
>               namespace['title'] = item
>       u = re.compile(r"""a href=\"\/(.*)\">En savoir plus""")
>       url = u.findall(html)
>       for item in url:
>               namespace['url'] = item
>       g = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""")
>       lat = g.findall(html)
>       for item in lat:
>               namespace['LatLng'] = item
>       block_list.append(namespace)
> 
> ###
> 
> can this be made better?

The 3 regex patterns are constants: they can be put out of the loop.

You may also rename b to blocks, and find a more accurate name for 
block_list; eg block_records, where record = set of (named) fields.

A short desc and/or example of the overall and partial data formats can greatly 
help later review, since regex patterns alone are hard to decode.

The def of "namespace" would be clearer imo in a single line:
namespace = {title:t, url:url, lat:g}
This also reveals a kind of name confusion, doesn't it?


Denis






la vita e estrany

http://spir.wikidot.com/


Re: [Tutor] parse text file

2010-02-01 Thread Norman Khine
On Mon, Feb 1, 2010 at 1:19 PM, Kent Johnson  wrote:
> On Mon, Feb 1, 2010 at 6:29 AM, Norman Khine  wrote:
>
>> thanks, what about the whitespace problem?
>
> \s* will match any amount of whitespace including newlines.

thank you, this worked well.

here is the code:

###
import re
file=open('producers_google_map_code.txt', 'r')
data =  repr( file.read().decode('utf-8') )

block = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""")
b = block.findall(data)
block_list = []
for html in b:
    namespace = {}
    t = re.compile(r"""(.*)<\/strong>""")
    title = t.findall(html)
    for item in title:
        namespace['title'] = item
    u = re.compile(r"""a href=\"\/(.*)\">En savoir plus""")
    url = u.findall(html)
    for item in url:
        namespace['url'] = item
    g = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""")
    lat = g.findall(html)
    for item in lat:
        namespace['LatLng'] = item
    block_list.append(namespace)

###

can this be made better?

>
> Kent
>


Re: [Tutor] parse text file

2010-02-01 Thread Kent Johnson
On Mon, Feb 1, 2010 at 6:29 AM, Norman Khine  wrote:

> thanks, what about the whitespace problem?

\s* will match any amount of whitespace including newlines.
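
A quick sketch of this on a made-up sample where the comma is followed by a newline and a long run of spaces (the sample string is an assumption, not taken from the real file):

```python
import re

# Hypothetical sample: comma, newline, then many spaces before the longitude.
sample = "GLatLng(27.729912,\n                85.31559)"

# \s* absorbs the newline and all of the spaces in one go.
pattern = re.compile(r"GLatLng\((\d+\.\d*),\s*(\d+\.\d*)\)")
pairs = pattern.findall(sample)
```

pairs then holds a single (latitude, longitude) tuple of strings, regardless of how much whitespace separates the two numbers.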

Kent


Re: [Tutor] parse text file

2010-02-01 Thread Norman Khine
On Mon, Feb 1, 2010 at 10:57 AM, spir  wrote:
> On Mon, 1 Feb 2010 00:43:59 +0100
> Norman Khine  wrote:
>
>> but this does not take into account of data which has negative values
>
> just add \-? in front of \d+

thanks, what about the whitespace problem?

>
> Denis
> 
>
> la vita e estrany
>
> http://spir.wikidot.com/



-- 
%>>> "".join( [ {'*':'@','^':'.'}.get(c,None) or
chr(97+(ord(c)-83)%26) for c in ",adym,*)&uzq^zqf" ] )


Re: [Tutor] parse text file

2010-02-01 Thread spir
On Mon, 1 Feb 2010 00:43:59 +0100
Norman Khine  wrote:

> but this does not take into account of data which has negative values

just add \-? in front of \d+
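
For illustration, here is the difference on one of the negative coordinate pairs from the thread. The two pattern names are assumptions; inside a pattern a plain -? behaves the same as \-? here:

```python
import re

sample = "GLatLng(-18.889851, -66.770897)"

# Without an optional sign, the second number cannot start at the '-'.
unsigned = re.compile(r"(\d+\.\d*), (\d+\.\d*)")
# -? allows zero or one leading minus sign on each number.
signed = re.compile(r"(-?\d+\.\d*), (-?\d+\.\d*)")

unsigned_pairs = unsigned.findall(sample)
signed_pairs = signed.findall(sample)
```

unsigned_pairs comes back empty, while signed_pairs contains the pair with both signs preserved.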

Denis


la vita e estrany

http://spir.wikidot.com/


Re: [Tutor] parse text file

2010-01-31 Thread Norman Khine
Hello,
I am still unable to get this to work correctly!

In [1]: file=open('producers_google_map_code.txt', 'r')

In [2]: data =  repr( file.read().decode('utf-8') )

In [3]: from BeautifulSoup import BeautifulStoneSoup

In [4]: soup = BeautifulStoneSoup(data)

In [6]: soup

http://paste.lisp.org/display/94195

In [7]: import re

In [8]: p = re.compile(r"""GLatLng\((\d+\.\d*)\, \n (\d+\.\d*)\)""")

In [9]: r = p.findall(data)

In [10]: r
Out[10]: []

see http://paste.lisp.org/+20BO/1

i can't seem to get the regex correct

(r"""GLatLng\((\d+\.\d*)\, \n (\d+\.\d*)\)""")

the problem is that each entry looks like, for example:

GLatLng(27.729912,\\n  85.31559)
GLatLng(-18.889851,\\n  -66.770897)

i have a big whitespace, plus the group can have a negative value, so
if i do this:

In [31]: p = re.compile(r"""GLatLng\((\d+\.\d*)\,\\n
   (\d+\.\d*)\)""")

In [32]: r = p.findall(data)

In [33]: r
Out[33]:
[('27.729912', '85.31559'),
 ('9.696333', '122.985992'),
 ('17.964625', '102.60040'),
 ('21.046439', '105.853043'),

but this does not take into account data which has negative values,
also i am unsure how to pull it all together, i.e. to return a CSV
file such as:

"ACP", "acp.html", "9.696333", "122.985992"
"ALTER TRADE CORPORATION", "alter-trade-corporation.html",
"-18.889851", "-66.770897"

Thanks
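
One possible way to produce CSV output of that shape, sketched with the standard csv module and written for Python 3 (the thread itself uses Python 2). The records list is hypothetical sample data, and an in-memory buffer stands in for a real output file:

```python
import csv
import io

# Hypothetical records: title, url, and a (lat, lng) pair of strings.
records = [
    {"title": "ACP", "url": "acp.html",
     "latlng": ("9.696333", "122.985992")},
    {"title": "ALTER TRADE CORPORATION", "url": "alter-trade-corporation.html",
     "latlng": ("-18.889851", "-66.770897")},
]

buffer = io.StringIO()
# QUOTE_ALL puts double quotes around every field.
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
for rec in records:
    lat, lng = rec["latlng"]
    writer.writerow([rec["title"], rec["url"], lat, lng])

csv_text = buffer.getvalue()
```

Letting csv.writer do the quoting avoids having to escape commas or quotes that might appear inside a title.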


On Sat, Jan 23, 2010 at 12:50 AM, spir  wrote:
> On Sat, 23 Jan 2010 00:22:41 +0100
> Norman Khine  wrote:
>
>> Hi
>>
>> On Fri, Jan 22, 2010 at 7:44 PM, spir  wrote:
>> > On Fri, 22 Jan 2010 14:11:42 +0100
>> > Norman Khine  wrote:
>> >
>> >> but my problem comes when i try to list the GLatLng:
>> >>
>> >> GLatLng(9.696333, 122.985992);
>> >>
>> >> >>> StartingWithGLatLng = soup.findAll(re.compile('GLatLng'))
>> >> >>> StartingWithGLatLng
>> >> []
>> >
>> > Don't know about soup's findall. But the regex pattern string should rather be 
>> > something like (untested):
>> >   r"""GLatLng\(\(d+\.\d*)\, (d+\.\d*)\) """
>> > capturing both integers.
>> >
>> > Denis
>> >
>> > PS: finally tested:
>> >
>> > import re
>> > s = "GLatLng(9.696333, 122.985992)"
>> > p = re.compile(r"""GLatLng\((\d+\.\d*)\, (\d+\.\d*)\)""")
>> > r = p.match(s)
>> > print r.group()         # --> GLatLng(9.696333, 122.985992)
>> > print r.groups()        # --> ('9.696333', '122.985992')
>> >
>> > s = "xGLatLng(1.1, 11.22)xxxGLatLng(111.111, .)x"
>> > r = p.findall(s)
>> > print r                         # --> [('1.1', '11.22'), ('111.111', 
>> > '.')]
>>
>> Thanks for the help, but I can't seem to get the RegEx to work correctly.
>>
>> Here is my input and output:
>>
>> http://paste.lisp.org/+20BO/1
>
> See my previous examples...
> If you use match:
>
> In [6]: r = p.match(data)
>
> Then the result is a regex match object (unlike when using findall). To get 
> the string(s) matched, you need to use the group() and/or groups() methods.
>
> >>> import re
> >>> p = re.compile('x')
> >>> print p.match("xabcx")
> <_sre.SRE_Match object at 0xb74de6e8>
> >>> print p.findall("xabcx")
> ['x', 'x']
>
> Denis
> 
>
> la vita e estrany
>
> http://spir.wikidot.com/
>


Re: [Tutor] parse text file

2010-01-22 Thread Norman Khine
Hi

On Fri, Jan 22, 2010 at 7:44 PM, spir  wrote:
> On Fri, 22 Jan 2010 14:11:42 +0100
> Norman Khine  wrote:
>
>> but my problem comes when i try to list the GLatLng:
>>
>> GLatLng(9.696333, 122.985992);
>>
>> >>> StartingWithGLatLng = soup.findAll(re.compile('GLatLng'))
>> >>> StartingWithGLatLng
>> []
>
> Don't know about soup's findall. But the regex pattern string should rather be 
> something like (untested):
>   r"""GLatLng\(\(d+\.\d*)\, (d+\.\d*)\) """
> capturing both integers.
>
> Denis
>
> PS: finally tested:
>
> import re
> s = "GLatLng(9.696333, 122.985992)"
> p = re.compile(r"""GLatLng\((\d+\.\d*)\, (\d+\.\d*)\)""")
> r = p.match(s)
> print r.group()         # --> GLatLng(9.696333, 122.985992)
> print r.groups()        # --> ('9.696333', '122.985992')
>
> s = "xGLatLng(1.1, 11.22)xxxGLatLng(111.111, .)x"
> r = p.findall(s)
> print r                         # --> [('1.1', '11.22'), ('111.111', 
> '.')]

Thanks for the help, but I can't seem to get the RegEx to work correctly.

Here is my input and output:

http://paste.lisp.org/+20BO/1

> 
>
> la vita e estrany
>
> http://spir.wikidot.com/
>



-- 
%>>> "".join( [ {'*':'@','^':'.'}.get(c,None) or
chr(97+(ord(c)-83)%26) for c in ",adym,*)&uzq^zqf" ] )


Re: [Tutor] Parse Text File

2009-06-11 Thread Stefan Lesicnik
> > Hi Denis,
> >
> > Thanks for your input. So i decided i should use a pyparser and try it
> (im a
> > relative python noob though!)
>

Hi Everyone!

I have made some progress, although i believe it is mainly due to luck and not
a lot of understanding (vague understanding maybe).

Hopefully this can help someone else out...


> This is due to Combine(), that glues (back) together matched string bits. To
> work safely, it disables the default separator-skipping behaviour of
> pyparsing. So that
>   real = Combine(integral+fractional)
> would correctly not match "1 .2". Right?
> See a recent reply by Paul McGuire about this topic on the pyparsing list
> http://sourceforge.net/mailarchive/forum.php?thread_name=FE0E2B47198D4F73B01E263034BDCE3C%40AWA2&forum_name=pyparsing-users
> and the pointer he gives there.
> There are several ways to correctly cope with that.
>

^ was a useful link - I still sometimes struggle with the whitespaces and
combine / group...


Below is my code that works as I expect (i think...)


#!/usr/bin/python

import sys
from pyparsing import (alphas, nums, ZeroOrMore, Word, Group, Suppress,
                       Combine, Literal, OneOrMore, SkipTo, printables, White)

text='''
[04 Jun 2009] DSA-1812-1 apr-util - several vulnerabilities
{CVE-2009-0023 CVE-2009-1955 CVE-2009-1243}
[etch] - apr-util 1.2.7+dfsg-2+etch2
[lenny] - apr-util 1.2.12+dfsg-8+lenny2
[01 Jun 2009] DSA-1808-1 drupal6 - insufficient input sanitising
{CVE-2009-1844}
[lenny] - drupal6 6.6-3lenny2
[01 Jun 2009] DSA-1807-1 cyrus-sasl2 cyrus-sasl2-heimdal - arbitrary code execution
{CVE-2009-0688}
[lenny] - cyrus-sasl2-heimdal 2.1.22.dfsg1-23+lenny1
[lenny] - cyrus-sasl2 2.1.22.dfsg1-23+lenny1
[etch] - cyrus-sasl2 2.1.22.dfsg1-8+etch1
'''

lsquare = Literal('[')
rsquare = Literal(']')
lbrace = Literal('{')
rbrace = Literal('}')
dash = Literal('-')

space = White('\x20')
newline = White('\n')

spaceapp = White('\x20') + Literal('-') + White('\x20')
spaceseries = White('\t')

date = Combine(lsquare.suppress() + Word(nums, exact=2) + Word(alphas) +
Word(nums, exact=4) + rsquare.suppress(),adjacent=False,joinString='-')
dsa = Combine(Literal('DSA') + dash + Word(nums, exact=4) + dash +
Word(nums, exact=1))
app = Combine(Word(printables) + SkipTo(spaceapp))
desc = Combine(spaceapp.suppress() + ZeroOrMore(Word(alphas)) +
SkipTo(newline))
cve = Combine(lbrace.suppress() + OneOrMore(Literal('CVE') + dash +
Word(nums, exact=4) + dash + Word(nums, exact=4) + SkipTo(rbrace) +
Suppress(rbrace) + SkipTo(newline)))

series = OneOrMore(Group(lsquare.suppress() + OneOrMore(Literal('lenny') ^
Literal('etch') ^ Literal('sarge')) + rsquare.suppress() +
spaceapp.suppress() + Word(printables) + SkipTo(newline)))

record = date + dsa + app + desc + cve + series

def parse(text):
    for data,dataStart,dataEnd in record.scanString(text):
        yield data

for i in parse(text):
    print i



My output is as follows

['04-Jun-2009', 'DSA-1812-1', 'apr-util', 'several vulnerabilities',
'CVE-2009-0023 CVE-2009-1955 CVE-2009-1243', ['etch', 'apr-util',
'1.2.7+dfsg-2+etch2'], ['lenny', 'apr-util', '1.2.12+dfsg-8+lenny2']]
['01-Jun-2009', 'DSA-1808-1', 'drupal6', 'insufficient input sanitising',
'CVE-2009-1844', ['lenny', 'drupal6', '6.6-3lenny2']]
['01-Jun-2009', 'DSA-1807-1', 'cyrus-sasl2 cyrus-sasl2-heimdal', 'arbitrary
code execution', 'CVE-2009-0688', ['lenny', 'cyrus-sasl2-heimdal',
'2.1.22.dfsg1-23+lenny1'], ['lenny', 'cyrus-sasl2',
'2.1.22.dfsg1-23+lenny1'], ['etch', 'cyrus-sasl2', '2.1.22.dfsg1-8+etch1']]


Thanks to everyone who offered assistance and prodding in the right
directions.

Stefan


Re: [Tutor] Parse Text File

2009-06-11 Thread spir
[Hope you don't mind that I copy this to the list. Not only can it help others, but 
pyparsing users read tutor, including Paul McGuire (author).]

On Thu, 11 Jun 2009 11:53:31 +0200,
Stefan Lesicnik wrote:

[...]

I cannot really answer precisely because I haven't used pyparsing for a while (*).

So, below are only some hints.

> Hi Denis,
> 
> Thanks for your input. So i decided i should use a pyparser and try it (im a
> relative python noob though!)
> 
> This is what i have so far...
> 
> import sys
> from pyparsing import alphas, nums, ZeroOrMore, Word, Group, Suppress,
> Combine, Literal, alphanums, Optional, OneOrMore, SkipTo, printables
> 
> text='''
> [04 Jun 2009] DSA-1812-1 apr-util - several vulnerabilities
> {CVE-2009-0023 CVE-2009-1955}
> [etch] - apr-util 1.2.7+dfsg-2+etch2
> [lenny] - apr-util 1.2.12+dfsg-8+lenny2
> '''
> 
> date = Combine(Literal('[') + Word(nums, exact=2) + Word(alphas) +
> Word(nums, exact=4) + Literal(']'),adjacent=False)
> dsa = Combine(Word(alphanums) + Literal('-') + Word(nums, exact=4) +
> Literal('-') + Word(nums, exact=1),adjacent=False)
> app = Combine(OneOrMore(Word(printables)) + SkipTo(Literal('-')))
> desc = Combine(Literal('-') + ZeroOrMore(Word(alphas)) +
> SkipTo(Literal('\n')))
> cve = Combine(Literal('{') + OneOrMore(Literal('CVE') + Literal('-') +
> Word(nums, exact=4) + Literal('-') + Word(nums, exact=4)) )
> 
> record = date + dsa + app + desc + cve
> 
> fields = record.parseString(text)
> #fields = dsa.parseString(text)
> print fields
> 
> 
> What i get out of this is
> 
> ['[04Jun2009]', 'DSA-1812-1', 'apr-util ', '- several vulnerabilities',
> '{CVE-2009-0023']
> 
> Which i guess it heading towards the right track...

For sure! Rather impressive that you could write this so fast. Hope my little PEG 
grammar helped.
There seem to be some detail issues, such as in the app pattern I would write
   ...+ SkipTo(Literal(' - '))
Also, you could directly Suppress() probably useless delimiters such as [...] 
in date.

Think about post-parse funcs to transform and/or reformat nodes: search for 
setParseAction() and addParseAction() in the doc.

> I am unsure why I am not getting more than 1 CVE... I have the OneOrMore
> match for the CVE stuff...

This is due to Combine(), that glues (back) together matched string bits. To 
work safely, it disables the default separator-skipping behaviour of pyparsing. 
So that
   real = Combine(integral+fractional)
would correctly not match "1 .2". Right?
See a recent reply by Paul McGuire about this topic on the pyparsing list 
http://sourceforge.net/mailarchive/forum.php?thread_name=FE0E2B47198D4F73B01E263034BDCE3C%40AWA2&forum_name=pyparsing-users
 and the pointer he gives there.
There are several ways to correctly cope with that.

> That being said, how does the parser scale across multiple lines and how
> will it know that its finished?

Basically, you probably should express line breaks explicitly, esp. because 
they seem to be part of the source format.
Otherwise, there is a func or method to define which chars should be skipped as 
separators (default is sp/tab if I remember well).

> Should i maybe look at getting the list first into one entry per line? (must
> be easier to parse then?)

What makes sense I guess is Group()-ing items that *conceptually* build a list. 
In your case, I see:
* CVE items inside {...}
* version entry lines ("[etch]...", "[lenny]...", ...)
* whole records at a higher level
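
The same three-level grouping can be sketched without pyparsing, using plain re patterns on one line at a time. This is a swapped-in technique shown only to make the structure concrete, not the pyparsing grammar discussed here; the pattern names and the dictionary layout are assumptions, and the sample text follows the indented format of the original post:

```python
import re

text = """\
[04 Jun 2009] DSA-1812-1 apr-util - several vulnerabilities
    {CVE-2009-0023 CVE-2009-1955}
    [etch] - apr-util 1.2.7+dfsg-2+etch2
    [lenny] - apr-util 1.2.12+dfsg-8+lenny2
[01 Jun 2009] DSA-1808-1 drupal6 - insufficient input sanitising
    {CVE-2009-1844}
    [lenny] - drupal6 6.6-3lenny2
"""

# One pattern per kind of line (hypothetical names).
header = re.compile(r"^\[(\d{2} \w{3} \d{4})\] (DSA-\d+-\d+) (.+?) - (.+)$")
cve_line = re.compile(r"^\s*\{(.*)\}\s*$")
version_line = re.compile(r"^\s*\[(\w+)\] - (\S+) (\S+)\s*$")

records = []
for line in text.splitlines():
    m = header.match(line)
    if m:
        # A dated header line starts a new record.
        date, dsa, app, desc = m.groups()
        records.append({"date": date, "dsa": dsa, "app": app,
                        "desc": desc, "cves": [], "versions": []})
        continue
    if not records:
        continue  # ignore anything before the first header
    m = cve_line.match(line)
    if m:
        # CVE items inside {...} form one list.
        records[-1]["cves"] = m.group(1).split()
        continue
    m = version_line.match(line)
    if m:
        # Each version entry line becomes a (series, package, version) tuple.
        records[-1]["versions"].append(m.groups())
```

Each record ends up with a list of CVE ids and a list of version entries, mirroring the groups listed above.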

> This parsing is a mini language in itself!

Sure! A kind of rather big & complex parsing language. Hard to know it all well 
(and I don't even speak of all builtin helpers, and even less of all what you 
can do by mixing ordinary python code inside the grammar/parser: a whole new 
field in parsing/processing).

> Thanks for your input :)

My pleasure...

> Stefan

Denis

(*) The reason is I'm developing my own parsing tool; see 
http://spir.wikidot.com/pijnu.
The guide is also intended as a parsing tutorial, it may help, but is not 
exactly up-to-date.
--
la vita e estrany


Re: [Tutor] Parse Text File

2009-06-10 Thread Eduardo Vieira
On Wed, Jun 10, 2009 at 12:44 PM, Stefan Lesicnik wrote:
> Hi Guys,
>
> I have the following text
>
> [08 Jun 2009] DSA-1813-1 evolution-data-server - several vulnerabilities
>     {CVE-2009-0547 CVE-2009-0582 CVE-2009-0587}
>     [etch] - evolution-data-server 1.6.3-5etch2
>     [lenny] - evolution-data-server 2.22.3-1.1+lenny1
> [04 Jun 2009] DSA-1812-1 apr-util - several vulnerabilities
>     {CVE-2009-0023 CVE-2009-1955}
>     [etch] - apr-util 1.2.7+dfsg-2+etch2
>     [lenny] - apr-util 1.2.12+dfsg-8+lenny2
>
> ... (and a whole lot more)
>
> I would like to parse this so I can get it into a format I can work with.
>
> I don't know anything about parsers, and my brief google has made me think
> im not sure I wan't to know about them quite yet!  :)
> (It looks very complex)
>
> For previous fixed string things, i would normally split each line and
> address each element, but this is not the case as there could be multiple
> [lenny] or even other entries.
>
> I would like to parse from the date to the next date and treat that all as
> one element (if that makes sense)
>
> Does anyone have any suggestions - should I be learning a parser for doing
> this? Or is there perhaps an easier way.
>
> Tia!
>
> Stefan
Hello, maybe if you would show a sample of how you would like the
output to look, it could help us give more suggestions.

Regards,

Eduardo