Re: [Tutor] parse text file

2010-06-03 Thread Colin Talbert
Hello Steven,
Thanks for the reply.  Also this is my first post to tu...@python 
so I'll reply all in the future.


However, a file of that size changes things drastically. You can't 
expect to necessarily be able to read the entire 9.2 gigabyte BZ2 file 
into memory at once, let alone the unpacked 131 GB text file, EVEN if 
your computer has more than 9.2 GB of memory. So your tests need to 
take this into account.

I thought that when you did a "for uline in input_file" loop, each single line would go 
into memory independently, not the entire file.



I'm pretty sure that this is not your code, because you can't call len() 
on a bz2 file. If you try, you get an error:

You are so correct.  I'd been trying numerous things to read in this file 
and had deleted the code that I meant to put here and so wrote this from 
memory incorrectly.  The code that I wrote should have been:

import bz2
input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
str=input_file.read()
len(str)

Which indeed does return only 900,000.

Which is also the number returned when you sum the lengths of all the lines 
returned by a "for line in file" loop with:


import bz2
input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
lengthz = 0
for uline in input_file:
    lengthz = lengthz + len(uline)

print lengthz


Thanks again for your help and sorry for the bad code in the previous 
submittal.


Colin Talbert
GIS Specialist
US Geological Survey - Fort Collins Science Center
2150 Centre Ave. Bldg. C
Fort Collins, CO 80526

(970) 226-9425
talbe...@usgs.gov




From:
Steven D'Aprano st...@pearwood.info
To:
tutor@python.org
Date:
06/02/2010 03:42 PM
Subject:
Re: [Tutor] parse text file
Sent by:
tutor-bounces+talbertc=usgs@python.org



Hi Colin,

I'm taking the liberty of replying to your message back to the list, as 
others hopefully may be able to make constructive comments. When 
replying, please ensure that you reply to the tutor mailing list rather 
than the individual.


On Thu, 3 Jun 2010 12:20:10 am Colin Talbert wrote:

  Without seeing your text file, and the code you use to read the text
  file, there's no way of telling what is going on, but I can guess
  the most likely causes:

 Since the file is 9.2 gig it wouldn't make sense to send it to you. 

And I am very glad you didn't try *smiles*

However, a file of that size changes things drastically. You can't 
expect to necessarily be able to read the entire 9.2 gigabyte BZ2 file 
into memory at once, let alone the unpacked 131 GB text file, EVEN if 
your computer has more than 9.2 GB of memory. So your tests need to 
take this into account.

  (2) There's a bug in your code so that you stop reading after
  900,000 bytes.
 The code is simple enough that I'm pretty sure there is not a
 bug in it.

 import bz2
 input_file =
 bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb') print
 len(input_file)

 returns 900,000

I'm pretty sure that this is not your code, because you can't call len() 
on a bz2 file. If you try, you get an error:


>>> x = bz2.BZ2File('test.bz2', 'w')  # create a temporary file
>>> x.write("some data")
>>> x.close()
>>> input_file = bz2.BZ2File('test.bz2', 'r')  # open it
>>> print len(input_file)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'bz2.BZ2File' has no len()


So whatever your code actually is, I'm fairly sure it isn't what you say 
here.



-- 
Steven D'Aprano
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-06-03 Thread Alan Gauld


Colin Talbert talbe...@usgs.gov wrote

I thought when you did a for uline in input_file each single line 
would go

into memory independently, not the entire file.


That's true, but your code snippet showed you using read(),
which reads the whole file...

I'm pretty sure that this is not your code, because you can't call 
len()

on a bz2 file. If you try, you get an error:

You are so correct.  I'd been trying numerous things to read in this 
file
and had deleted the code that I meant to put here and so wrote this 
from

memory incorrectly.  The code that I wrote should have been:

import bz2
input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
str=input_file.read()
len(str)


This again uses read(), which reads the whole file.

Which is also the number returned when you sum the length of all the 
lines

returned in a for line in file with:

import bz2
input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
lengthz = 0
for uline in input_file:
   lengthz = lengthz + len(uline)


I'm not sure how

for line in file

will work for binary files. It may read the whole thing, since
the concept of lines really only applies to text. So it may
give the same result as using read().

Try looping using read(n) where n is some buffer size
(1024 might be a good value?).
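
For illustration, an untested sketch of that kind of loop (the path and the
chunk size below are only placeholders):

import bz2

input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2', 'rb')
total = 0
while True:
    chunk = input_file.read(1024)   # read the decompressed data in chunks
    if not chunk:                   # empty string means no more data
        break
    total += len(chunk)
input_file.close()
print total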

HTH,

--
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/


___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-06-03 Thread Dave Angel

Colin Talbert wrote:

snip
You are so correct.  I'd been trying numerous things to read in this file 
and had deleted the code that I meant to put here and so wrote this from 
memory incorrectly.  The code that I wrote should have been:


import bz2
input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
str=input_file.read()
len(str)

Which indeed does return only 900,000. 

Which is also the number returned when you sum the length of all the lines 
returned in a for line in file with:



import bz2
input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
lengthz = 0
for uline in input_file:
    lengthz = lengthz + len(uline)

print lengthz

snip
  

Seems to me for such a large file you'd have to use 
bz2.BZ2Decompressor.  I have no experience with it, but its purpose is 
for sequential decompression -- decompression where not all the data is 
simultaneously available in memory.
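
Purely as an illustration of that idea, an untested sketch (the path and the
chunk size are placeholders) could look like:

import bz2

decomp = bz2.BZ2Decompressor()
raw = open(r'C:\temp\planet-latest.osm.bz2', 'rb')
total = 0
while True:
    chunk = raw.read(1024 * 1024)          # compressed bytes, 1 MB at a time
    if not chunk:
        break
    total += len(decomp.decompress(chunk))
    if decomp.unused_data:
        # bytes past the end of the first bz2 stream -- stop here
        break
raw.close()
print total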


DaveA

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-06-03 Thread Colin Talbert
Dave,
I think you are probably right about using decompressor.  I 
couldn't find any example of it in use and wasn't having any luck getting 
it to work based on the documentation.  Maybe I should try harder on this 
front.

Colin Talbert
GIS Specialist
US Geological Survey - Fort Collins Science Center
2150 Centre Ave. Bldg. C
Fort Collins, CO 80526

(970) 226-9425
talbe...@usgs.gov




From:
Dave Angel da...@ieee.org
To:
Colin Talbert talbe...@usgs.gov
Cc:
Steven D'Aprano st...@pearwood.info, tutor@python.org
Date:
06/03/2010 12:36 PM
Subject:
Re: [Tutor] parse text file



Colin Talbert wrote:
 snip
 You are so correct.  I'd been trying numerous things to read in this 
file 
 and had deleted the code that I meant to put here and so wrote this from 

 memory incorrectly.  The code that I wrote should have been:

 import bz2
 input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
 str=input_file.read()
 len(str)

 Which indeed does return only 900,000.

 Which is also the number returned when you sum the length of all the 
lines 
 returned in a for line in file with:


 import bz2
 input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
 lengthz = 0
 for uline in input_file:
     lengthz = lengthz + len(uline)

 print lengthz

 snip
 

Seems to me for such a large file you'd have to use 
bz2.BZ2Decompressor.  I have no experience with it, but its purpose is 
for sequential decompression -- decompression where not all the data is 
simultaneously available in memory.

DaveA



___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-06-03 Thread Vincent Davis
On Thu, Jun 3, 2010 at 1:02 PM, Colin Talbert talbe...@usgs.gov wrote:


 Dave,
 I think you are probably right about using decompressor.  I
 couldn't find any example of it in use and wasn't having any luck getting it
 to work based on the documentation.  Maybe I should try harder on this
 front.


Is it possible to write a python script to transfer this to an hdf5 file?  Would
this help?
Thanks
Vincent


 Colin Talbert
 GIS Specialist
 US Geological Survey - Fort Collins Science Center
 2150 Centre Ave. Bldg. C
 Fort Collins, CO 80526

 (970) 226-9425
 talbe...@usgs.gov



 From: Dave Angel da...@ieee.org
 To: Colin Talbert talbe...@usgs.gov
 Cc: Steven D'Aprano st...@pearwood.info, tutor@python.org
 Date: 06/03/2010 12:36 PM
 Subject: Re: [Tutor] parse text file
 --



 Colin Talbert wrote:
  snip
  You are so correct.  I'd been trying numerous things to read in this file

  and had deleted the code that I meant to put here and so wrote this from
  memory incorrectly.  The code that I wrote should have been:
 
  import bz2
  input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
  str=input_file.read()
  len(str)
 
  Which indeed does return only 900,000.
 
  Which is also the number returned when you sum the length of all the
 lines
  returned in a for line in file with:
 
 
  import bz2
  input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
  lengthz = 0
  for uline in input_file:
      lengthz = lengthz + len(uline)
 
  print lengthz
 
  snip
 
 
 Seems to me for such a large file you'd have to use
 bz2.BZ2Decompressor.  I have no experience with it, but its purpose is
 for sequential decompression -- decompression where not all the data is
 simultaneously available in memory.

 DaveA




 ___
 Tutor maillist  -  Tutor@python.org
 To unsubscribe or change subscription options:
 http://mail.python.org/mailman/listinfo/tutor


Vincent Davis
720-301-3003
vinc...@vincentdavis.net
my blog http://vincentdavis.net | LinkedIn http://www.linkedin.com/in/vincentdavis
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-06-03 Thread Sander Sweers
On 3 June 2010 21:02, Colin Talbert talbe...@usgs.gov wrote:

 I couldn't find any example of it in use and wasn't having any luck getting
 it to work based on the documentation.


Good examples of the bz2 module can be found at [1].

greets
Sander

[1] http://www.doughellmann.com/PyMOTW/bz2/
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-06-03 Thread Steven D'Aprano
On Fri, 4 Jun 2010 12:45:52 am Colin Talbert wrote:

 I thought when you did a for uline in input_file each single line
 would go into memory independently, not the entire file.

for line in file:

reads one line at a time, but file.read() tries to read everything in 
one go. However, it should fail with MemoryError, not just stop 
silently.

 I'm pretty sure that this is not your code, because you can't call
 len() on a bz2 file. If you try, you get an error:

 You are so correct.  I'd been trying numerous things to read in this
 file and had deleted the code that I meant to put here and so wrote
 this from memory incorrectly.  The code that I wrote should have
 been:

 import bz2
 input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
 str=input_file.read()
 len(str)

 Which indeed does return only 900,000.


Unfortunately, I can't download your bz2 file myself to test it, but I 
think I *may* have found the problem. It looks like the current bz2 
module only supports files written as a single stream, and not multiple 
stream files. This is why the BZ2File class has no append mode. See 
this bug report:

http://bugs.python.org/issue1625

My hypothesis is that your bz2 file consists of either multiple streams, 
or multiple bz2 files concatenated together, and the BZ2File class 
stops reading after the first.

I can test my hypothesis:

>>> bz2.BZ2File('a.bz2', 'w').write('this is the first chunk of text')
>>> bz2.BZ2File('b.bz2', 'w').write('this is the second chunk of text')
>>> bz2.BZ2File('c.bz2', 'w').write('this is the third chunk of text')
>>> # concatenate the files
... d = file('concate.bz2', 'w')
>>> for name in "abc":
...     f = file('%c.bz2' % name, 'rb')
...     d.write(f.read())
...
>>> d.close()

>>> bz2.BZ2File('concate.bz2', 'r').read()
'this is the first chunk of text'

And sure enough, BZ2File only sees the first chunk of text!

But if I open it in a stand-alone bz2 utility (I use the Linux 
application Ark), I can see all three chunks of text. So I think we 
have a successful test of the hypothesis.


Assuming this is the problem you are having, you have a number of 
possible solutions:

(1) Re-create the bz2 file from a single stream.

(2) Use another application to expand the bz2 file and then read 
directly from that, skipping BZ2File altogether.

(3) Upgrade to Python 2.7 or 3.2, and hope the patch is applied.

(4) Backport the patch to your version of Python and apply it yourself.

(5) Write your own bz2 utility.

Not really a very appetising series of choices there, I must admit. 
Probably (1) or (2) are the least worst.
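
If the multi-stream hypothesis is right, one further (untested) work-around,
really a home-grown version of (5), is to drive bz2.BZ2Decompressor by hand
and start a fresh decompressor whenever one stream ends (the path and chunk
size below are just placeholders):

import bz2

raw = open(r'C:\temp\planet-latest.osm.bz2', 'rb')
decomp = bz2.BZ2Decompressor()
total = 0
while True:
    chunk = raw.read(1024 * 1024)
    if not chunk:
        break
    while chunk:
        try:
            total += len(decomp.decompress(chunk))
        except EOFError:
            # the previous stream ended exactly on a chunk boundary;
            # the whole chunk belongs to the next stream
            decomp = bz2.BZ2Decompressor()
            continue
        chunk = decomp.unused_data       # data belonging to the next stream
        if chunk:
            decomp = bz2.BZ2Decompressor()
raw.close()
print total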



-- 
Steven D'Aprano
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-06-02 Thread bob gailer

Please always reply-all so a copy goes to the list.

On 6/1/2010 6:49 PM, Colin Talbert wrote:


Bob thanks for your response,
The file is about 9.3 gig and no, I don't want to read the whole 
thing at once.  I want to read it in line by line.  Still it will read 
to the same point (900,000 characters) and then act as if it came to 
the end of the file.  Below is the code I am using for this:



import bz2

input_file = bz2.BZ2File(r"C:\temp\planet-latest.osm.bz2","rb")
for uline in input_file:
    print linecount
    linecount+=1








Colin Talbert
GIS Specialist
US Geological Survey - Fort Collins Science Center
2150 Centre Ave. Bldg. C
Fort Collins, CO 80526

(970) 226-9425
talbe...@usgs.gov



From:   bob gailer bgai...@gmail.com
To: Colin Talbert talbe...@usgs.gov
Cc: tutor@python.org
Date:   06/01/2010 04:43 PM
Subject:Re: [Tutor] parse text file






On 6/1/2010 5:40 PM, Colin Talbert wrote:

   I am also experiencing this same problem.  (Also on an OSM bz2 
file).  It appears to be working but then partway through reading a 
file it simply ends.  I did track down that file length is always 
900,000 so it appears to be related to some sort of buffer constraint.



Any other ideas?

How big is the file?

Is it necessary to read the entire thing at once?

Try opening with mode "rb"


import bz2

input_file = bz2.BZ2File(r"C:\temp\planet-latest.osm.bz2","r")
try:
   all_data = input_file.read()
   print str(len(all_data))
finally:
   input_file.close()


--
Bob Gailer
919-636-4239
Chapel Hill NC

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-06-02 Thread Steven D'Aprano
Hi Colin,

I'm taking the liberty of replying to your message back to the list, as 
others hopefully may be able to make constructive comments. When 
replying, please ensure that you reply to the tutor mailing list rather 
than the individual.


On Thu, 3 Jun 2010 12:20:10 am Colin Talbert wrote:

  Without seeing your text file, and the code you use to read the text
  file, there's no way of telling what is going on, but I can guess
  the most likely causes:

 Since the file is 9.2 gig it wouldn't make sense to send it to you. 

And I am very glad you didn't try *smiles*

However, a file of that size changes things drastically. You can't 
expect to necessarily be able to read the entire 9.2 gigabyte BZ2 file 
into memory at once, let alone the unpacked 131 GB text file, EVEN if 
your computer has more than 9.2 GB of memory. So your tests need to 
take this into account.

  (2) There's a bug in your code so that you stop reading after
  900,000 bytes.
 The code is simple enough that I'm pretty sure there is not a
 bug in it.

 import bz2
 input_file =
 bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb') print
 len(input_file)

 returns 900,000

I'm pretty sure that this is not your code, because you can't call len() 
on a bz2 file. If you try, you get an error:


>>> x = bz2.BZ2File('test.bz2', 'w')  # create a temporary file
>>> x.write("some data")
>>> x.close()
>>> input_file = bz2.BZ2File('test.bz2', 'r')  # open it
>>> print len(input_file)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'bz2.BZ2File' has no len()


So whatever your code actually is, I'm fairly sure it isn't what you say 
here.



-- 
Steven D'Aprano
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-06-01 Thread Colin Talbert
I am also experiencing this same problem.  (Also on an OSM bz2 
file).  It appears to be working but then partway through reading a file 
it simply ends.  I did track down that file length is always 900,000 so it 
appears to be related to some sort of buffer constraint.


Any other ideas?

import bz2

input_file = bz2.BZ2File(r"C:\temp\planet-latest.osm.bz2","r")
try:
    all_data = input_file.read()
    print str(len(all_data))
finally:
    input_file.close()






Colin Talbert
GIS Specialist
US Geological Survey - Fort Collins Science Center
2150 Centre Ave. Bldg. C
Fort Collins, CO 80526

(970) 226-9425
talbe...@usgs.gov
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-06-01 Thread bob gailer

On 6/1/2010 5:40 PM, Colin Talbert wrote:


I am also experiencing this same problem.  (Also on an OSM bz2 
file).  It appears to be working but then partway through reading a 
file it simply ends.  I did track down that file length is always 
900,000 so it appears to be related to some sort of buffer constraint.



Any other ideas?


How big is the file?

Is it necessary to read the entire thing at once?

Try opening with mode "rb"



import bz2

input_file = bz2.BZ2File(r"C:\temp\planet-latest.osm.bz2","r")
try:
    all_data = input_file.read()
    print str(len(all_data))
finally:
    input_file.close()



--
Bob Gailer
919-636-4239
Chapel Hill NC

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-06-01 Thread Steven D'Aprano
On Wed, 2 Jun 2010 07:40:33 am Colin Talbert wrote:
 I am also experiencing this same problem.  (Also on an OSM bz2
 file).  It appears to be working but then partway through reading a
 file it simply ends.  I did track down that file length is always
 900,000 so it appears to be related to some sort of buffer constraint.

Without seeing your text file, and the code you use to read the text 
file, there's no way of telling what is going on, but I can guess the 
most likely causes:

(1) Your text file is actually only 900,000 bytes long, and so there's 
no problem at all.
(2) There's a bug in your code so that you stop reading after 900,000 
bytes.
(3) You're on Windows, and the text file contains an End-Of-File 
character ^Z after 900,000 bytes, and Windows supports that for 
backward compatibility with DOS.

And a distant (VERY distant) number 4, there's a bug in the 
implementation of read() in Python which somehow nobody has noticed 
before now.

As for your second issue, reading bz2 files:

 import bz2

 input_file = bz2.BZ2File(r"C:\temp\planet-latest.osm.bz2","r")

You're opening a binary file in text mode. I'm pretty sure that is not 
going to work well. Try passing 'rb' as the mode instead.

 try:
     all_data = input_file.read()
     print str(len(all_data))

You don't need to call str() before calling print. print is perfectly 
happy to operate on integers:

print len(all_data)

will work.


-- 
Steven D'Aprano
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-02-03 Thread spir
On Tue, 2 Feb 2010 22:56:22 +0100
Norman Khine nor...@khine.net wrote:

 i am no expert, but there seems to be a bigger difference.
 
 with repr(), i get:
 Sat\\xe9re Maw\\xe9
 
 whereas you get
 
 Sat\xc3\xa9re Maw\xc3\xa9
 
 repr()'s
 é == \\xe9
 whereas on your version
 é == \xc3\xa9

This is a rather complicated issue mixing python str, unicode string, and their 
repr().
Kent is right in that the *python string* "\xc3\xa9" is the utf8 formatted 
representation of 'é' (2 bytes), while "\xe9" is the *unicode code point* for 'é', 
which should only appear in a unicode string.
So:
   unicode.encode(u"\u00e9", "utf8") == "\xc3\xa9"
or more simply:
   u"\u00e9".encode("utf8") == "\xc3\xa9"
Conversely:
   unicode("\xc3\xa9", "utf8") == u"\u00e9"  -- decoding

The question is: what do you want to do with the result? You'll need either the 
utf8 form "\xc3\xa9" (for output) or the unicode string u"\u00e9" (for 
processing). But what you actually get is a kind of mix, actually the (python 
str) repr of a unicode string.

 also, i still get an empty list when i run the code as suggested.

? Strange. Have you checked the re.DOTALL? (else regex patterns stop matching 
at \n by default)
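
A minimal illustration of that point, using the record pattern from this
thread on made-up input:

import re

text = "openInfoWindowHtml(first line\n\ticon: myIcon\n"
print re.findall(r"openInfoWindowHtml\(.*?\ticon: myIcon\n", text)
# -> []   (without DOTALL, .*? cannot cross the newline)
print re.findall(r"openInfoWindowHtml\(.*?\ticon: myIcon\n", text, re.DOTALL)
# -> ['openInfoWindowHtml(first line\n\ticon: myIcon\n']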


Denis


la vita e estrany

http://spir.wikidot.com/
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-02-03 Thread Norman Khine
On Tue, Feb 2, 2010 at 11:36 PM, Kent Johnson ken...@tds.net wrote:
 On Tue, Feb 2, 2010 at 4:56 PM, Norman Khine nor...@khine.net wrote:
 On Tue, Feb 2, 2010 at 10:11 PM, Kent Johnson ken...@tds.net wrote:

 Try this version:

 data = file.read()

 get_records = re.compile(r"openInfoWindowHtml\(.*?\ticon:
 myIcon\n", re.DOTALL).findall
 get_titles = re.compile(r"<strong>(.*)<\/strong>").findall
 get_urls = re.compile(r"<a href=\"\/(.*)\">En savoir plus").findall
 get_latlngs = 
 re.compile(r"GLatLng\((\-?\d+\.\d*)\,\n\s*(\-?\d+\.\d*)\)").findall

 then as before.

 Your repr() call is essentially removing newlines from the input by
 converting them to literal '\n' pairs. This allows your regex to work
 without the DOTALL modifier.

 Note you will get slightly different results with my version - it will
 give you correct utf-8 text for the titles whereas yours gives \
 escapes. For example one of the titles is CGTSM (Satére Mawé). Your
 version returns

 {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
 '-79.649735'), 'title': 'CGTSM (Sat\\xe9re Maw\\xe9)'}

 Mine gives
 {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
 '-79.649735'), 'title': 'CGTSM (Sat\xc3\xa9re Maw\xc3\xa9)'}

 This is showing the repr() of the title so they both have \ but note
 that yours has two \\ indicating that the \ is in the text; mine has
 only one \.

 i am no expert, but there seems to be a bigger difference.

 with repr(), i get:
 Sat\\xe9re Maw\\xe9

 whereas you get

 Sat\xc3\xa9re Maw\xc3\xa9

 repr()'s
 é == \\xe9
 whereas on your version
 é == \xc3\xa9

 Right. Your version has four actual characters in the result - \, x,
 e, 9. This is the escaped representation of the unicode representation
 of e-acute. (The \ is doubled in the repr display.)

 My version has two bytes in the result, with the values c3 and a9.
 This is the utf-8 representation of e-acute.

 If you want to accurately represent (i.e. print) the title at some
 later time you probably want the utf-8 representation.


 Kent


 also, i still get an empty list when i run the code as suggested.

 You didn't change the regexes. You have to change \\t and \\n to \t
 and \n because the source text now has actual tabs and newlines, not
 the escaped representations.

 I know this is confusing, I'm sorry I don't have time or patience to
 explain more.

thanks for your time, i did realise after i posted the email that the
regex needed to be changed.


 Kent

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-02-02 Thread spir
On Mon, 1 Feb 2010 16:30:02 +0100
Norman Khine nor...@khine.net wrote:

 On Mon, Feb 1, 2010 at 1:19 PM, Kent Johnson ken...@tds.net wrote:
  On Mon, Feb 1, 2010 at 6:29 AM, Norman Khine nor...@khine.net wrote:
 
  thanks, what about the whitespace problem?
 
  \s* will match any amount of whitespace including newlines.
 
 thank you, this worked well.
 
 here is the code:
 
 ###
 import re
 file=open('producers_google_map_code.txt', 'r')
 data =  repr( file.read().decode('utf-8') )
 
 block = re.compile(r"openInfoWindowHtml\(.*?\\ticon: myIcon\\n")
 b = block.findall(data)
 block_list = []
 for html in b:
     namespace = {}
     t = re.compile(r"<strong>(.*)<\/strong>")
     title = t.findall(html)
     for item in title:
         namespace['title'] = item
     u = re.compile(r"<a href=\"\/(.*)\">En savoir plus")
     url = u.findall(html)
     for item in url:
         namespace['url'] = item
     g = re.compile(r"GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)")
     lat = g.findall(html)
     for item in lat:
         namespace['LatLng'] = item
     block_list.append(namespace)
 
 ###
 
 can this be made better?

The 3 regex patterns are constants: they can be put out of the loop.

You may also rename b to blocks, and find a more accurate name for 
block_list; eg block_records, where record = set of (named) fields.

A short desc and/or example of the overall and partial data formats can greatly 
help later review, since regex patterns alone are hard to decode.

The def of namespace would be clearer imo in a single line:
namespace = {"title":t, "url":url, "lat":g}
This also reveals a kind of name confusion, doesn't it?


Denis






la vita e estrany

http://spir.wikidot.com/
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-02-02 Thread Norman Khine
thanks denis,

On Tue, Feb 2, 2010 at 9:30 AM, spir denis.s...@free.fr wrote:
 On Mon, 1 Feb 2010 16:30:02 +0100
 Norman Khine nor...@khine.net wrote:

 On Mon, Feb 1, 2010 at 1:19 PM, Kent Johnson ken...@tds.net wrote:
  On Mon, Feb 1, 2010 at 6:29 AM, Norman Khine nor...@khine.net wrote:
 
  thanks, what about the whitespace problem?
 
  \s* will match any amount of whitespace including newlines.

 thank you, this worked well.

 here is the code:

 ###
 import re
 file=open('producers_google_map_code.txt', 'r')
 data =  repr( file.read().decode('utf-8') )

 block = re.compile(r"openInfoWindowHtml\(.*?\\ticon: myIcon\\n")
 b = block.findall(data)
 block_list = []
 for html in b:
       namespace = {}
        t = re.compile(r"<strong>(.*)<\/strong>")
       title = t.findall(html)
       for item in title:
               namespace['title'] = item
        u = re.compile(r"<a href=\"\/(.*)\">En savoir plus")
       url = u.findall(html)
       for item in url:
               namespace['url'] = item
        g = re.compile(r"GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)")
       lat = g.findall(html)
       for item in lat:
               namespace['LatLng'] = item
       block_list.append(namespace)

 ###

 can this be made better?

 The 3 regex patterns are constants: they can be put out of the loop.

 You may also rename b to blocks, and find a more accurate name for 
 block_list; eg block_records, where record = set of (named) fields.

 A short desc and/or example of the overall and partial data formats can 
 greatly help later review, since regex patterns alone are hard to decode.

here are the changes:

import re
file=open('producers_google_map_code.txt', 'r')
data =  repr( file.read().decode('utf-8') )

get_record = re.compile(r"openInfoWindowHtml\(.*?\\ticon: myIcon\\n")
get_title = re.compile(r"<strong>(.*)<\/strong>")
get_url = re.compile(r"<a href=\"\/(.*)\">En savoir plus")
get_latlng = re.compile(r"GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)")

records = get_record.findall(data)
block_record = []
for record in records:
    namespace = {}
    titles = get_title.findall(record)
    for title in titles:
        namespace['title'] = title
    urls = get_url.findall(record)
    for url in urls:
        namespace['url'] = url
    latlngs = get_latlng.findall(record)
    for latlng in latlngs:
        namespace['latlng'] = latlng
    block_record.append(namespace)

print block_record

 The def of namespace would be clearer imo in a single line:
   namespace = {"title":t, "url":url, "lat":g}

i am not sure how this will fit into the code!

 This also reveals a kind of name confusion, doesn't it?


 Denis




 

 la vita e estrany

 http://spir.wikidot.com/
 ___
 Tutor maillist  -  tu...@python.org
 To unsubscribe or change subscription options:
 http://mail.python.org/mailman/listinfo/tutor

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-02-02 Thread Stefan Behnel
Norman Khine, 02.02.2010 10:16:
 get_record = re.compile(r"openInfoWindowHtml\(.*?\\ticon: myIcon\\n")
 get_title = re.compile(r"<strong>(.*)<\/strong>")
 get_url = re.compile(r"<a href=\"\/(.*)\">En savoir plus")
 get_latlng = re.compile(r"GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)")
 
 records = get_record.findall(data)
 block_record = []
 for record in records:
   namespace = {}
   titles = get_title.findall(record)
   for title in titles:
   namespace['title'] = title

I usually go one step further:

find_all_titles = re.compile(r"<strong>(.*)<\/strong>").findall
for record in records:
    titles = find_all_titles(record)

Both faster and more readable (as is so common in Python).

Stefan

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-02-02 Thread Dave Angel

Norman Khine wrote:

thanks denis,

On Tue, Feb 2, 2010 at 9:30 AM, spir denis.s...@free.fr wrote:
  

On Mon, 1 Feb 2010 16:30:02 +0100
Norman Khine nor...@khine.net wrote:



On Mon, Feb 1, 2010 at 1:19 PM, Kent Johnson ken...@tds.net wrote:
  

On Mon, Feb 1, 2010 at 6:29 AM, Norman Khine nor...@khine.net wrote:



thanks, what about the whitespace problem?
  

\s* will match any amount of whitespace including newlines.


thank you, this worked well.

here is the code:

###
import re
file=open('producers_google_map_code.txt', 'r')
data = repr( file.read().decode('utf-8') )

block = re.compile(r"openInfoWindowHtml\(.*?\\ticon: myIcon\\n")
b = block.findall(data)
block_list = []
for html in b:
    namespace = {}
    t = re.compile(r"<strong>(.*)<\/strong>")
    title = t.findall(html)
    for item in title:
        namespace['title'] = item
    u = re.compile(r"<a href=\"\/(.*)\">En savoir plus")
    url = u.findall(html)
    for item in url:
        namespace['url'] = item
    g = re.compile(r"GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)")
    lat = g.findall(html)
    for item in lat:
        namespace['LatLng'] = item
    block_list.append(namespace)

###

can this be made better?
  

The 3 regex patterns are constants: they can be put out of the loop.

You may also rename b to blocks, and find a more accurate name for 
block_list; eg block_records, where record = set of (named) fields.

A short desc and/or example of the overall and partial data formats can greatly 
help later review, since regex patterns alone are hard to decode.



here are the changes:

import re
file=open('producers_google_map_code.txt', 'r')
data = repr( file.read().decode('utf-8') )

get_record = re.compile(r"openInfoWindowHtml\(.*?\\ticon: myIcon\\n")
get_title = re.compile(r"<strong>(.*)<\/strong>")
get_url = re.compile(r"<a href=\"\/(.*)\">En savoir plus")
get_latlng = re.compile(r"GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)")

records = get_record.findall(data)
block_record = []
for record in records:
    namespace = {}
    titles = get_title.findall(record)
    for title in titles:
        namespace['title'] = title
    urls = get_url.findall(record)
    for url in urls:
        namespace['url'] = url
    latlngs = get_latlng.findall(record)
    for latlng in latlngs:
        namespace['latlng'] = latlng
    block_record.append(namespace)

print block_record
  

The def of namespace would be clearer imo in a single line:
   namespace = {"title":t, "url":url, "lat":g}



i am not sure how this will fit into the code!

  

This also reveals a kind of name confusion, doesn't it?


Denis




Your variable 'file' is hiding a built-in name for the file type.  No 
harm in this example, but it's a bad habit to get into.


What did you intend to happen if the number of titles, urls, and latlngs 
are not each exactly one?  As you have it now, if there's more than one, 
you spend time adding them all to the dictionary, but only the last one 
survives.  And if there aren't any, you don't make an entry in the 
dictionary.


If that's the exact behavior you want, then you could replace the loop 
with an if statement:   (untested)


if titles:
namespace['title'] = titles[-1]


On the other hand, if you want a None in your dictionary for missing 
information, then something like:  (untested)


for record in records:
    titles = get_title.findall(record)
    title = titles[-1] if titles else None
    urls = get_url.findall(record)
    url = urls[-1] if urls else None
    latlngs = get_latlng.findall(record)
    lating = latlngs[-1] if latlngs else None
    block_record.append( {'title':title, 'url':url, 'lating':lating} )


DaveA
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-02-02 Thread Kent Johnson
On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine nor...@khine.net wrote:

 here are the changes:

 import re
 file=open('producers_google_map_code.txt', 'r')
 data =  repr( file.read().decode('utf-8') )

Why do you use repr() here?

 get_record = re.compile(r"openInfoWindowHtml\(.*?\\ticon: myIcon\\n")
 get_title = re.compile(r"<strong>(.*)<\/strong>")
 get_url = re.compile(r"<a href=\"\/(.*)\">En savoir plus")
 get_latlng = re.compile(r"GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)")

 records = get_record.findall(data)
 block_record = []
 for record in records:
        namespace = {}
        titles = get_title.findall(record)
        for title in titles:
                namespace['title'] = title


This is odd, you don't need a loop to get the last title, just use
  namespace['title'] = get_title.findall(html)[-1]

and similarly for url and latings.

Kent


        urls = get_url.findall(record)
        for url in urls:
                namespace['url'] = url
        latlngs = get_latlng.findall(record)
        for latlng in latlngs:
                namespace['latlng'] = latlng
        block_record.append(namespace)

 print block_record

 The def of namespace would be clearer imo in a single line:
   namespace = {"title":t, "url":url, "lat":g}

 i am not sure how this will fit into the code!

 This also reveals a kind of name confusion, doesn't it?


 Denis




 

 la vita e estrany

 http://spir.wikidot.com/
 ___
 Tutor maillist  -  tu...@python.org
 To unsubscribe or change subscription options:
 http://mail.python.org/mailman/listinfo/tutor

 ___
 Tutor maillist  -  tu...@python.org
 To unsubscribe or change subscription options:
 http://mail.python.org/mailman/listinfo/tutor

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-02-02 Thread Norman Khine
hello,
thank you all for the advice, here is the updated version with the changes.

import re
file = open('producers_google_map_code.txt', 'r')
data = repr( file.read().decode('utf-8') )

get_records = re.compile(r"openInfoWindowHtml\(.*?\\ticon: myIcon\\n").findall
get_titles = re.compile(r"<strong>(.*)<\/strong>").findall
get_urls = re.compile(r"<a href=\"\/(.*)\">En savoir plus").findall
get_latlngs = re.compile(r"GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)").findall

records = get_records(data)
block_record = []
for record in records:
    namespace = {}
    titles = get_titles(record)
    title = titles[-1] if titles else None
    urls = get_urls(record)
    url = urls[-1] if urls else None
    latlngs = get_latlngs(record)
    latlng = latlngs[-1] if latlngs else None
    block_record.append( {'title':title, 'url':url, 'lating':latlng} )

print block_record


On Tue, Feb 2, 2010 at 1:27 PM, Kent Johnson ken...@tds.net wrote:
 On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine nor...@khine.net wrote:

 here are the changes:

 import re
 file=open('producers_google_map_code.txt', 'r')
 data =  repr( file.read().decode('utf-8') )

 Why do you use repr() here?

i have latin-1 chars in the producers_google_map_code.txt' file and
this is the only way to get it to read the data.

is this incorrect?


 get_record = re.compile(r"openInfoWindowHtml\(.*?\\ticon: myIcon\\n")
 get_title = re.compile(r"<strong>(.*)<\/strong>")
 get_url = re.compile(r"<a href=\"\/(.*)\">En savoir plus")
 get_latlng = re.compile(r"GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)")

 records = get_record.findall(data)
 block_record = []
 for record in records:
        namespace = {}
        titles = get_title.findall(record)
        for title in titles:
                namespace['title'] = title


 This is odd, you don't need a loop to get the last title, just use
  namespace['title'] = get_title.findall(html)[-1]

 and similarly for url and latings.

 Kent


        urls = get_url.findall(record)
        for url in urls:
                namespace['url'] = url
        latlngs = get_latlng.findall(record)
        for latlng in latlngs:
                namespace['latlng'] = latlng
        block_record.append(namespace)

 print block_record

 The def of namespace would be clearer imo in a single line:
    namespace = {"title":t, "url":url, "lat":g}

 i am not sure how this will fit into the code!

 This also reveals a kind of name confusion, doesn't it?


 Denis




 

 la vita e estrany

 http://spir.wikidot.com/
 ___
 Tutor maillist  -  tu...@python.org
 To unsubscribe or change subscription options:
 http://mail.python.org/mailman/listinfo/tutor

 ___
 Tutor maillist  -  tu...@python.org
 To unsubscribe or change subscription options:
 http://mail.python.org/mailman/listinfo/tutor


___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-02-02 Thread Kent Johnson
On Tue, Feb 2, 2010 at 9:33 AM, Norman Khine nor...@khine.net wrote:
 On Tue, Feb 2, 2010 at 1:27 PM, Kent Johnson ken...@tds.net wrote:
 On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine nor...@khine.net wrote:

 here are the changes:

 import re
 file=open('producers_google_map_code.txt', 'r')
 data =  repr( file.read().decode('utf-8') )

 Why do you use repr() here?

 i have latin-1 chars in the producers_google_map_code.txt' file and
 this is the only way to get it to read the data.

 is this incorrect?

Well, the repr() call is after the file read. If your data is latin-1
you should decode it as latin-1, not utf-8:
data = file.read().decode('latin-1')

Though if the decode('utf-8') succeeds, and you do have non-ascii
characters in the data, they are probably encoded in utf-8, not
latin-1. Are you sure you have latin-1?

The repr() call converts back to ascii text, maybe that is what you want?

Perhaps you put in the repr because you were having trouble printing?

It smells of programming by guess rather than a correct solution to
some problem. What happens if you take it out?

Kent
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-02-02 Thread Norman Khine
On Tue, Feb 2, 2010 at 4:19 PM, Kent Johnson ken...@tds.net wrote:
 On Tue, Feb 2, 2010 at 9:33 AM, Norman Khine nor...@khine.net wrote:
 On Tue, Feb 2, 2010 at 1:27 PM, Kent Johnson ken...@tds.net wrote:
 On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine nor...@khine.net wrote:

 here are the changes:

 import re
 file=open('producers_google_map_code.txt', 'r')
 data =  repr( file.read().decode('utf-8') )

 Why do you use repr() here?

 i have latin-1 chars in the producers_google_map_code.txt' file and
 this is the only way to get it to read the data.

 is this incorrect?

 Well, the repr() call is after the file read. If your data is latin-1
 you should decode it as latin-1, not utf-8:
 data = file.read().decode('latin-1')

 Though if the decode('utf-8') succeeds, and you do have non-ascii
 characters in the data, they are probably encoded in utf-8, not
 latin-1. Are you sure you have latin-1?

 The repr() call converts back to ascii text, maybe that is what you want?

 Perhaps you put in the repr because you were having trouble printing?

 It smells of programming by guess rather than a correct solution to
 some problem. What happens if you take it out?

when i take it out, i get an empty list.

whereas both
data = repr( file.read().decode('latin-1') )
and
data = repr( file.read().decode('utf-8') )

returns the full list.

here is the file
http://cdn.admgard.org/documents/producers_google_map_code.txt


 Kent

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-02-02 Thread Kent Johnson
On Tue, Feb 2, 2010 at 1:39 PM, Norman Khine nor...@khine.net wrote:
 On Tue, Feb 2, 2010 at 4:19 PM, Kent Johnson ken...@tds.net wrote:
 On Tue, Feb 2, 2010 at 9:33 AM, Norman Khine nor...@khine.net wrote:
 On Tue, Feb 2, 2010 at 1:27 PM, Kent Johnson ken...@tds.net wrote:
 On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine nor...@khine.net wrote:

 Why do you use repr() here?


 It smells of programming by guess rather than a correct solution to
 some problem. What happens if you take it out?

 when i take it out, i get an empty list.

 whereas both
 data = repr( file.read().decode('latin-1') )
 and
 data = repr( file.read().decode('utf-8') )

 returns the full list.

Try this version:

data = file.read()

get_records = re.compile(r"openInfoWindowHtml\(.*?\ticon: myIcon\n", re.DOTALL).findall
get_titles = re.compile(r"<strong>(.*)<\/strong>").findall
get_urls = re.compile(r"<a href=\"\/(.*)\">En savoir plus").findall
get_latlngs = re.compile(r"GLatLng\((\-?\d+\.\d*)\,\n\s*(\-?\d+\.\d*)\)").findall

then as before.

Your repr() call is essentially removing newlines from the input by
converting them to literal '\n' pairs. This allows your regex to work
without the DOTALL modifier.
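
A tiny interpreter session showing that point:

>>> s = "first line\nsecond line"
>>> print s            # a real newline in the data
first line
second line
>>> print repr(s)      # repr() turns it into the two characters \ and n
'first line\nsecond line'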

Note you will get slightly different results with my version - it will
give you correct utf-8 text for the titles whereas yours gives \
escapes. For example one of the titles is CGTSM (Satére Mawé). Your
version returns

{'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
'-79.649735'), 'title': 'CGTSM (Sat\\xe9re Maw\\xe9)'}

Mine gives
{'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
'-79.649735'), 'title': 'CGTSM (Sat\xc3\xa9re Maw\xc3\xa9)'}

This is showing the repr() of the title so they both have \ but note
that yours has two \\ indicating that the \ is in the text; mine has
only one \.

Kent
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-02-02 Thread Norman Khine
On Tue, Feb 2, 2010 at 10:11 PM, Kent Johnson ken...@tds.net wrote:
 On Tue, Feb 2, 2010 at 1:39 PM, Norman Khine nor...@khine.net wrote:
 On Tue, Feb 2, 2010 at 4:19 PM, Kent Johnson ken...@tds.net wrote:
 On Tue, Feb 2, 2010 at 9:33 AM, Norman Khine nor...@khine.net wrote:
 On Tue, Feb 2, 2010 at 1:27 PM, Kent Johnson ken...@tds.net wrote:
 On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine nor...@khine.net wrote:

 Why do you use repr() here?


 It smells of programming by guess rather than a correct solution to
 some problem. What happens if you take it out?

 when i take it out, i get an empty list.

 whereas both
 data = repr( file.read().decode('latin-1') )
 and
 data = repr( file.read().decode('utf-8') )

 returns the full list.

 Try this version:

 data = file.read()

 get_records = re.compile(r"openInfoWindowHtml\(.*?\ticon:
 myIcon\n", re.DOTALL).findall
 get_titles = re.compile(r"<strong>(.*)<\/strong>").findall
 get_urls = re.compile(r"<a href=\"\/(.*)\">En savoir plus").findall
 get_latlngs = 
 re.compile(r"GLatLng\((\-?\d+\.\d*)\,\n\s*(\-?\d+\.\d*)\)").findall

 then as before.

 Your repr() call is essentially removing newlines from the input by
 converting them to literal '\n' pairs. This allows your regex to work
 without the DOTALL modifier.

 Note you will get slightly different results with my version - it will
 give you correct utf-8 text for the titles whereas yours gives \
 escapes. For example one of the titles is CGTSM (Satére Mawé). Your
 version returns

 {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
 '-79.649735'), 'title': 'CGTSM (Sat\\xe9re Maw\\xe9)'}

 Mine gives
 {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
 '-79.649735'), 'title': 'CGTSM (Sat\xc3\xa9re Maw\xc3\xa9)'}

 This is showing the repr() of the title so they both have \ but note
 that yours has two \\ indicating that the \ is in the text; mine has
 only one \.

i am no expert, but there seems to be a bigger difference.

with repr(), i get:
Sat\\xe9re Maw\\xe9

whereas you get

Sat\xc3\xa9re Maw\xc3\xa9

repr()'s
é == \\xe9
whereas on your version
é == \xc3\xa9


 Kent


also, i still get an empty list when i run the code as suggested.
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-02-02 Thread Kent Johnson
On Tue, Feb 2, 2010 at 4:56 PM, Norman Khine nor...@khine.net wrote:
 On Tue, Feb 2, 2010 at 10:11 PM, Kent Johnson ken...@tds.net wrote:

 Try this version:

 data = file.read()

 get_records = re.compile(r"openInfoWindowHtml\(.*?\ticon:
 myIcon\n", re.DOTALL).findall
 get_titles = re.compile(r"<strong>(.*)<\/strong>").findall
 get_urls = re.compile(r"<a href=\"\/(.*)\">En savoir plus").findall
 get_latlngs = 
 re.compile(r"GLatLng\((\-?\d+\.\d*)\,\n\s*(\-?\d+\.\d*)\)").findall

 then as before.

 Your repr() call is essentially removing newlines from the input by
 converting them to literal '\n' pairs. This allows your regex to work
 without the DOTALL modifier.

 Note you will get slightly different results with my version - it will
 give you correct utf-8 text for the titles whereas yours gives \
 escapes. For example one of the titles is CGTSM (Satére Mawé). Your
 version returns

 {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
 '-79.649735'), 'title': 'CGTSM (Sat\\xe9re Maw\\xe9)'}

 Mine gives
 {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
 '-79.649735'), 'title': 'CGTSM (Sat\xc3\xa9re Maw\xc3\xa9)'}

 This is showing the repr() of the title so they both have \ but note
 that yours has two \\ indicating that the \ is in the text; mine has
 only one \.

 i am no expert, but there seems to be a bigger difference.

 with repr(), i get:
 Sat\\xe9re Maw\\xe9

 whereas you get

 Sat\xc3\xa9re Maw\xc3\xa9

 repr()'s
 é == \\xe9
 whereas on your version
 é == \xc3\xa9

Right. Your version has four actual characters in the result - \, x,
e, 9. This is the escaped representation of the unicode representation
of e-acute. (The \ is doubled in the repr display.)

My version has two bytes in the result, with the values c3 and a9.
This is the utf-8 representation of e-acute.

If you want to accurately represent (i.e. print) the title at some
later time you probably want the utf-8 representation.
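
A quick interpreter check of the difference (the last line assumes a utf-8
terminal):

>>> u'\u00e9'.encode('utf-8')          # two bytes, the utf-8 form
'\xc3\xa9'
>>> r'\xe9'                            # four literal characters: \ x e 9
'\\xe9'
>>> print u'\u00e9'.encode('utf-8')
é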


 Kent


 also, i still get an empty list when i run the code as suggested.

You didn't change the regexes. You have to change \\t and \\n to \t
and \n because the source text now has actual tabs and newlines, not
the escaped representations.

I know this is confusing, I'm sorry I don't have time or patience to
explain more.

Kent
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-02-01 Thread spir
On Mon, 1 Feb 2010 00:43:59 +0100
Norman Khine nor...@khine.net wrote:

 but this does not take into account data which has negative values

just add \-? in front of \d+
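
A quick before/after on one of the negative coordinate pairs from the thread
(illustration only, with a simplified pattern):

import re

s = "GLatLng(-18.889851,\n  -66.770897)"
print re.findall(r"GLatLng\((\d+\.\d*),\s*(\d+\.\d*)\)", s)
# -> []   (the leading minus signs break the match)
print re.findall(r"GLatLng\((\-?\d+\.\d*),\s*(\-?\d+\.\d*)\)", s)
# -> [('-18.889851', '-66.770897')]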

Denis


la vita e estrany

http://spir.wikidot.com/
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-02-01 Thread Norman Khine
On Mon, Feb 1, 2010 at 10:57 AM, spir denis.s...@free.fr wrote:
 On Mon, 1 Feb 2010 00:43:59 +0100
 Norman Khine nor...@khine.net wrote:

 but this does not take into account data which has negative values

 just add \-? in front of \d+

thanks, what about the whitespace problem?


 Denis
 

 la vita e estrany

 http://spir.wikidot.com/
 ___
 Tutor maillist  -  tu...@python.org
 To unsubscribe or change subscription options:
 http://mail.python.org/mailman/listinfo/tutor




-- 
% .join( [ {'*':'@','^':'.'}.get(c,None) or
chr(97+(ord(c)-83)%26) for c in ,adym,*)uzq^zqf ] )
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-02-01 Thread Kent Johnson
On Mon, Feb 1, 2010 at 6:29 AM, Norman Khine nor...@khine.net wrote:

 thanks, what about the whitespace problem?

\s* will match any amount of whitespace including newlines.
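
For example, with one of the coordinate pairs quoted earlier in the thread
(a small check, not a full solution):

import re

s = "GLatLng(27.729912,\n      85.31559)"
p = re.compile(r"GLatLng\((\-?\d+\.\d*)\,\s*(\-?\d+\.\d*)\)")
print p.findall(s)
# -> [('27.729912', '85.31559')]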

Kent
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-02-01 Thread Norman Khine
On Mon, Feb 1, 2010 at 1:19 PM, Kent Johnson ken...@tds.net wrote:
 On Mon, Feb 1, 2010 at 6:29 AM, Norman Khine nor...@khine.net wrote:

 thanks, what about the whitespace problem?

 \s* will match any amount of whitespace including newlines.

thank you, this worked well.

here is the code:

###
import re
file=open('producers_google_map_code.txt', 'r')
data =  repr( file.read().decode('utf-8') )

block = re.compile(r"openInfoWindowHtml\(.*?\\ticon: myIcon\\n")
b = block.findall(data)
block_list = []
for html in b:
    namespace = {}
    t = re.compile(r"<strong>(.*)<\/strong>")
    title = t.findall(html)
    for item in title:
        namespace['title'] = item
    u = re.compile(r"<a href=\"\/(.*)\">En savoir plus")
    url = u.findall(html)
    for item in url:
        namespace['url'] = item
    g = re.compile(r"GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)")
    lat = g.findall(html)
    for item in lat:
        namespace['LatLng'] = item
    block_list.append(namespace)

###

can this be made better?


 Kent

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-01-31 Thread Norman Khine
Hello,
I am still unable to get this to work correctly!

In [1]: file=open('producers_google_map_code.txt', 'r')

In [2]: data =  repr( file.read().decode('utf-8') )

In [3]: from BeautifulSoup import BeautifulStoneSoup

In [4]: soup = BeautifulStoneSoup(data)

In [6]: soup

http://paste.lisp.org/display/94195

In [7]: import re

In [8]: p = re.compile(r"GLatLng\((\d+\.\d*)\, \n (\d+\.\d*)\)")

In [9]: r = p.findall(data)

In [10]: r
Out[10]: []

see http://paste.lisp.org/+20BO/1

i can't seem to get the regex correct

(r"GLatLng\((\d+\.\d*)\, \n (\d+\.\d*)\)")

the problem is that, each for example is:

GLatLng(27.729912,\\n  85.31559)
GLatLng(-18.889851,\\n  -66.770897)

i have a big whitespace, plus the group can have a negative value, so
if i do this:

In [31]: p = re.compile(r"GLatLng\((\d+\.\d*)\,\\n
   (\d+\.\d*)\)")

In [32]: r = p.findall(data)

In [33]: r
Out[33]:
[('27.729912', '85.31559'),
 ('9.696333', '122.985992'),
 ('17.964625', '102.60040'),
 ('21.046439', '105.853043'),

but this does not take into account data which has negative values,
also i am unsure how to pull it all together. i.e. to return a CSV
file such as:

ACP, acp.html, 9.696333, 122.985992
ALTER TRADE CORPORATION, alter-trade-corporation.html,
-18.889851, -66.770897

Thanks


On Sat, Jan 23, 2010 at 12:50 AM, spir denis.s...@free.fr wrote:
 On Sat, 23 Jan 2010 00:22:41 +0100
 Norman Khine nor...@khine.net wrote:

 Hi

 On Fri, Jan 22, 2010 at 7:44 PM, spir denis.s...@free.fr wrote:
  On Fri, 22 Jan 2010 14:11:42 +0100
  Norman Khine nor...@khine.net wrote:
 
  but my problem comes when i try to list the GLatLng:
 
  GLatLng(9.696333, 122.985992);
 
   StartingWithGLatLng = soup.findAll(re.compile('GLatLng'))
   StartingWithGLatLng
  []
 
  Don't know about soup's findall. But the regex pattern string should rather be 
  something like (untested):
    r"GLatLng\(\(d+\.\d*)\, (d+\.\d*)\)" 
  capturing both integers.
 
  Denis
 
  PS: finally tested:
 
  import re
  s = "GLatLng(9.696333, 122.985992)"
  p = re.compile(r"GLatLng\((\d+\.\d*)\, (\d+\.\d*)\)")
  r = p.match(s)
  print r.group()         # --> GLatLng(9.696333, 122.985992)
  print r.groups()        # --> ('9.696333', '122.985992')
 
  s = "xGLatLng(1.1, 11.22)xxxGLatLng(111.111, .)x"
  r = p.findall(s)
  print r                         # --> [('1.1', '11.22'), ('111.111', 
  '.')]

 Thanks for the help, but I can't seem to get the RegEx to work correctly.

 Here is my input and output:

 http://paste.lisp.org/+20BO/1

 See my previous examples...
 If you use match:

 In [6]: r = p.match(data)

 Then the result is a regex match object (unlike when using findall). To get 
 the string(s) matched; you need to use the group() and/or groups() methods.

 import re
 p = re.compile('x')
 print p.match("xabcx")
 <_sre.SRE_Match object at 0xb74de6e8>
 print p.findall("xabcx")
 ['x', 'x']

 Denis
 

 la vita e estrany

 http://spir.wikidot.com/

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


[Tutor] parse text file

2010-01-22 Thread Norman Khine
Hello,
I have the following http://paste.lisp.org/display/93732 txt file.
From this I would like to extract

...
'<strong>ACP</strong>' +
'<br /><a href="/acp.html">En savoir plus</a>'
);
...
  map.addOverlay(marqueur[1]);var latlng = new GLatLng(9.696333,
  122.985992);

so that i get a CSV file:

ACP, acp.html , 9.69633, 122.985992

This is what I have so far:

>>> file=open('google_map_code.txt', 'r')
>>> data =  repr( file.read().decode('utf-8') )
>>> from BeautifulSoup import BeautifulStoneSoup
>>> soup = BeautifulStoneSoup(data)
>>> strongs = soup.findAll('strong')
>>> strongs
[<strong>ALTER TRADE CORPORATION</strong>, <strong>ANAPQUI</strong>,
<strong>APICOOP / VALVIDIA</strong>, <strong>APIKRI</strong>,
...

>>> path = soup.findAll('a')
>>> path
[<a href="/acp.html">En savoir plus</a>, <a
href="/alter-trade-corporation.html">En savoir plus</a>,
...

but my problem comes when i try to list the GLatLng:

GLatLng(9.696333, 122.985992);

>>> StartingWithGLatLng = soup.findAll(re.compile('GLatLng'))
>>> StartingWithGLatLng
[]

Thanks
Norman
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] parse text file

2010-01-22 Thread Norman Khine
Hi

On Fri, Jan 22, 2010 at 7:44 PM, spir denis.s...@free.fr wrote:
 On Fri, 22 Jan 2010 14:11:42 +0100
 Norman Khine nor...@khine.net wrote:

 but my problem comes when i try to list the GLatLng:

 GLatLng(9.696333, 122.985992);

  StartingWithGLatLng = soup.findAll(re.compile('GLatLng'))
  StartingWithGLatLng
 []

 Don't know about soup's findall. But the regex pattern string should rather be 
 something like (untested):
   r"GLatLng\(\(d+\.\d*)\, (d+\.\d*)\)" 
 capturing both integers.

 Denis

 PS: finally tested:

 import re
 s = "GLatLng(9.696333, 122.985992)"
 p = re.compile(r"GLatLng\((\d+\.\d*)\, (\d+\.\d*)\)")
 r = p.match(s)
 print r.group()         # --> GLatLng(9.696333, 122.985992)
 print r.groups()        # --> ('9.696333', '122.985992')

 s = "xGLatLng(1.1, 11.22)xxxGLatLng(111.111, .)x"
 r = p.findall(s)
 print r                         # --> [('1.1', '11.22'), ('111.111', 
 '.')]

Thanks for the help, but I can't seem to get the RegEx to work correctly.

Here is my input and output:

http://paste.lisp.org/+20BO/1

 

 la vita e estrany

 http://spir.wikidot.com/




-- 
% .join( [ {'*':'@','^':'.'}.get(c,None) or
chr(97+(ord(c)-83)%26) for c in ,adym,*)uzq^zqf ] )
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Parse Text File

2009-06-11 Thread spir
[Hope you don't mind I copy to the list. Not only can it help others, but 
pyparsing users read tutor, including Paul MacGuire (author).]

On Thu, 11 Jun 2009 11:53:31 +0200,
Stefan Lesicnik ste...@lsd.co.za wrote:

[...]

I cannot really answer precisely, since I haven't used pyparsing for a while (*).

So, below are only some hints.

 Hi Denis,
 
 Thanks for your input. So I decided I should use a pyparser and try it (I'm a
 relative python noob though!)
 
 This is what i have so far...
 
 import sys
 from pyparsing import (alphas, nums, ZeroOrMore, Word, Group, Suppress,
 Combine, Literal, alphanums, Optional, OneOrMore, SkipTo, printables)
 
 text='''
 [04 Jun 2009] DSA-1812-1 apr-util - several vulnerabilities
 {CVE-2009-0023 CVE-2009-1955}
 [etch] - apr-util 1.2.7+dfsg-2+etch2
 [lenny] - apr-util 1.2.12+dfsg-8+lenny2
 '''
 
 date = Combine(Literal('[') + Word(nums, exact=2) + Word(alphas) +
 Word(nums, exact=4) + Literal(']'),adjacent=False)
 dsa = Combine(Word(alphanums) + Literal('-') + Word(nums, exact=4) +
 Literal('-') + Word(nums, exact=1),adjacent=False)
 app = Combine(OneOrMore(Word(printables)) + SkipTo(Literal('-')))
 desc = Combine(Literal('-') + ZeroOrMore(Word(alphas)) +
 SkipTo(Literal('\n')))
 cve = Combine(Literal('{') + OneOrMore(Literal('CVE') + Literal('-') +
 Word(nums, exact=4) + Literal('-') + Word(nums, exact=4)) )
 
 record = date + dsa + app + desc + cve
 
 fields = record.parseString(text)
 #fields = dsa.parseString(text)
 print fields
 
 
 What i get out of this is
 
 ['[04Jun2009]', 'DSA-1812-1', 'apr-util ', '- several vulnerabilities',
 '{CVE-2009-0023']
 
 Which I guess is heading towards the right track...

For sure! Rather impressive that you could write this so fast. Hope my little PEG 
grammar helped.
There seem to be some detail issues; for instance, in the app pattern I would write
   ...+ SkipTo(Literal(' - '))
Also, you could directly Suppress() delimiters that are probably useless, such as 
the [...] around the date.

Think about post-parse funcs to transform and/or reformat nodes: search for 
setParseAction() and addParseAction() in the doc.
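
For instance, a tiny untested sketch of a parse action that joins the date parts 
(the bracketed date format is taken from your sample):

from pyparsing import Word, Literal, alphas, nums

date = (Literal('[').suppress() + Word(nums, exact=2) + Word(alphas) +
        Word(nums, exact=4) + Literal(']').suppress())
date.setParseAction(lambda tokens: '-'.join(tokens))
print date.parseString('[04 Jun 2009]')   # -- ['04-Jun-2009']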

 I am unsure why I am not getting more than 1 CVE... I have the OneOrMore
 match for the CVE stuff...

This is due to Combine(), which glues the matched string bits (back) together. To 
work safely, it disables the default separator-skipping behaviour of pyparsing. 
So that
   real = Combine(integral+fractional)
would correctly not match "1 .2". Right?
See a recent reply by Paul McGuire about this topic on the pyparsing list 
http://sourceforge.net/mailarchive/forum.php?thread_name=FE0E2B47198D4F73B01E263034BDCE3C%40AWA2&forum_name=pyparsing-users
 and the pointer he gives there.
There are several ways to correctly cope with that.
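
To make that concrete, a small untested sketch of the difference (integral and 
fractional are just illustrative names):

from pyparsing import Word, Combine, nums

integral = Word(nums)
fractional = '.' + Word(nums)

loose = integral + fractional           # skips whitespace between the parts
real = Combine(integral + fractional)   # requires the parts to be adjacent

print loose.parseString('1 .2')   # -- ['1', '.', '2']
print real.parseString('1.2')     # -- ['1.2']
# real.parseString('1 .2') raises a ParseException because of the gap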

 That being said, how does the parser scale across multiple lines and how
 will it know that its finished?

Basically, you probably should express line breaks explicitly, especially because 
they seem to be part of the source format.
Otherwise, there is a func or method to define which chars should be skipped as 
separators (the default is space/tab, if I remember correctly).
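
In pyparsing that is setDefaultWhitespaceChars(); a one-line untested example:

from pyparsing import ParserElement
# call this before building the grammar elements, so newlines stay significant
ParserElement.setDefaultWhitespaceChars(' \t')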

 Should I maybe look at getting the list first into one entry per line? (must
 be easier to parse then?)

What makes sense I guess is Group()-ing items that *conceptually* build a list 
(a small sketch follows after this list). In your case, I see:
* CVE items inside {...}
* version entry lines ([etch]..., [lenny]..., ...)
* whole records at a higher level
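
As a small untested illustration of Group()-ing the CVE list (using a Regex for 
the individual ids, to sidestep the Combine/whitespace issue above):

from pyparsing import Group, OneOrMore, Regex, Suppress

cve_id = Regex(r'CVE-\d{4}-\d{4}')
cve_block = Suppress('{') + Group(OneOrMore(cve_id)) + Suppress('}')

print cve_block.parseString('{CVE-2009-0023 CVE-2009-1955}')
# -- [['CVE-2009-0023', 'CVE-2009-1955']]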

 This parsing is a mini language in itself!

Sure! A kind of rather big & complex parsing language. Hard to know it all well 
(and I don't even speak of all the builtin helpers, let alone of all you can do 
by mixing ordinary python code into the grammar/parser: a whole new field in 
parsing/processing).

 Thanks for your input :)

My pleasure...

 Stefan

Denis

(*) The reason is that I'm developing my own parsing tool; see 
http://spir.wikidot.com/pijnu.
The guide is also intended as a parsing tutorial; it may help, but is not 
exactly up-to-date.
--
la vita e estrany
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Parse Text File

2009-06-11 Thread Stefan Lesicnik
  Hi Denis,
 
  Thanks for your input. So I decided I should use a pyparser and try it
  (I'm a relative python noob though!)


Hi Everyone!

I have made some progress, although I believe it is mainly due to luck and not
a lot of understanding (vague understanding, maybe).

Hopefully this can help someone else out...


 This is due to Combine(), which glues the matched string bits (back) together. To
 work safely, it disables the default separator-skipping behaviour of
 pyparsing. So that
   real = Combine(integral+fractional)
 would correctly not match "1 .2". Right?
 See a recent reply by Paul McGuire about this topic on the pyparsing list
 http://sourceforge.net/mailarchive/forum.php?thread_name=FE0E2B47198D4F73B01E263034BDCE3C%40AWA2&forum_name=pyparsing-users
 and the pointer he gives there.
 There are several ways to correctly cope with that.


^ was a useful link - I still sometimes struggle with the whitespace and
combine / group...


Below is my code that works as I expect (I think...)


#!/usr/bin/python

import sys
from pyparsing import (alphas, nums, ZeroOrMore, Word, Group, Suppress,
Combine, Literal, OneOrMore, SkipTo, printables, White)

text='''
[04 Jun 2009] DSA-1812-1 apr-util - several vulnerabilities
{CVE-2009-0023 CVE-2009-1955 CVE-2009-1243}
[etch] - apr-util 1.2.7+dfsg-2+etch2
[lenny] - apr-util 1.2.12+dfsg-8+lenny2
[01 Jun 2009] DSA-1808-1 drupal6 - insufficient input sanitising
{CVE-2009-1844}
[lenny] - drupal6 6.6-3lenny2
[01 Jun 2009] DSA-1807-1 cyrus-sasl2 cyrus-sasl2-heimdal - arbitrary code
execution
{CVE-2009-0688}
[lenny] - cyrus-sasl2-heimdal 2.1.22.dfsg1-23+lenny1
[lenny] - cyrus-sasl2 2.1.22.dfsg1-23+lenny1
[etch] - cyrus-sasl2 2.1.22.dfsg1-8+etch1
'''

lsquare = Literal('[')
rsquare = Literal(']')
lbrace = Literal('{')
rbrace = Literal('}')
dash = Literal('-')

space = White('\x20')
newline = White('\n')

spaceapp = White('\x20') + Literal('-') + White('\x20')
spaceseries = White('\t')

date = Combine(lsquare.suppress() + Word(nums, exact=2) + Word(alphas) +
Word(nums, exact=4) + rsquare.suppress(),adjacent=False,joinString='-')
dsa = Combine(Literal('DSA') + dash + Word(nums, exact=4) + dash +
Word(nums, exact=1))
app = Combine(Word(printables) + SkipTo(spaceapp))
desc = Combine(spaceapp.suppress() + ZeroOrMore(Word(alphas)) +
SkipTo(newline))
cve = Combine(lbrace.suppress() + OneOrMore(Literal('CVE') + dash +
Word(nums, exact=4) + dash + Word(nums, exact=4) + SkipTo(rbrace) +
Suppress(rbrace) + SkipTo(newline)))

series = OneOrMore(Group(lsquare.suppress() + OneOrMore(Literal('lenny') ^
Literal('etch') ^ Literal('sarge')) + rsquare.suppress() +
spaceapp.suppress() + Word(printables) + SkipTo(newline)))

record = date + dsa + app + desc + cve + series

def parse(text):
    for data, dataStart, dataEnd in record.scanString(text):
        yield data

for i in parse(text):
    print i



My output is as follows

['04-Jun-2009', 'DSA-1812-1', 'apr-util', 'several vulnerabilities',
'CVE-2009-0023 CVE-2009-1955 CVE-2009-1243', ['etch', 'apr-util',
'1.2.7+dfsg-2+etch2'], ['lenny', 'apr-util', '1.2.12+dfsg-8+lenny2']]
['01-Jun-2009', 'DSA-1808-1', 'drupal6', 'insufficient input sanitising',
'CVE-2009-1844', ['lenny', 'drupal6', '6.6-3lenny2']]
['01-Jun-2009', 'DSA-1807-1', 'cyrus-sasl2 cyrus-sasl2-heimdal', 'arbitrary
code execution', 'CVE-2009-0688', ['lenny', 'cyrus-sasl2-heimdal',
'2.1.22.dfsg1-23+lenny1'], ['lenny', 'cyrus-sasl2',
'2.1.22.dfsg1-23+lenny1'], ['etch', 'cyrus-sasl2', '2.1.22.dfsg1-8+etch1']]


Thanks to everyone who offered assistance and prodding in the right
directions.

Stefan
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


[Tutor] Parse Text File

2009-06-10 Thread Stefan Lesicnik
Hi Guys,

I have the following text

[08 Jun 2009] DSA-1813-1 evolution-data-server - several vulnerabilities
{CVE-2009-0547 CVE-2009-0582 CVE-2009-0587}
[etch] - evolution-data-server 1.6.3-5etch2
[lenny] - evolution-data-server 2.22.3-1.1+lenny1
[04 Jun 2009] DSA-1812-1 apr-util - several vulnerabilities
{CVE-2009-0023 CVE-2009-1955}
[etch] - apr-util 1.2.7+dfsg-2+etch2
[lenny] - apr-util 1.2.12+dfsg-8+lenny2

... (and a whole lot more)

I would like to parse this so I can get it into a format I can work with.

I don't know anything about parsers, and my brief Google search has made me think
I'm not sure I want to know about them quite yet!  :)
(It looks very complex.)

For previous fixed-string things, I would normally split each line and
address each element, but that won't work here as there could be multiple
[lenny] or even other entries.

I would like to parse from one date to the next date and treat all of that as
one element (if that makes sense).

Does anyone have any suggestions - should I be learning a parser for doing
this? Or is there perhaps an easier way?
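
For reference, one simple non-parser sketch (untested; the file name is an 
assumption): treat every line that starts with a [dd Mon yyyy] date as the start 
of a new record, and collect the lines that follow it:

import re

date_line = re.compile(r'^\[\d{2} \w{3} \d{4}\]')

def records(lines):
    current = []
    for line in lines:
        if date_line.match(line) and current:
            yield current
            current = []
        current.append(line.rstrip('\n'))
    if current:
        yield current

for rec in records(open('dsa-list.txt')):
    print rec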

Tia!

Stefan
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Parse Text File

2009-06-10 Thread Eduardo Vieira
On Wed, Jun 10, 2009 at 12:44 PM, Stefan Lesicnik ste...@lsd.co.za wrote:
 Hi Guys,

 I have the following text

 [08 Jun 2009] DSA-1813-1 evolution-data-server - several vulnerabilities
     {CVE-2009-0547 CVE-2009-0582 CVE-2009-0587}
     [etch] - evolution-data-server 1.6.3-5etch2
     [lenny] - evolution-data-server 2.22.3-1.1+lenny1
 [04 Jun 2009] DSA-1812-1 apr-util - several vulnerabilities
     {CVE-2009-0023 CVE-2009-1955}
     [etch] - apr-util 1.2.7+dfsg-2+etch2
     [lenny] - apr-util 1.2.12+dfsg-8+lenny2

 ... (and a whole lot more)

 I would like to parse this so I can get it into a format I can work with.

 I don't know anything about parsers, and my brief google has made me think
 im not sure I wan't to know about them quite yet!  :)
 (It looks very complex)

 For previous fixed string things, i would normally split each line and
 address each element, but this is not the case as there could be multiple
 [lenny] or even other entries.

 I would like to parse from the date to the next date and treat that all as
 one element (if that makes sense)

 Does anyone have any suggestions - should I be learning a parser for doing
 this? Or is there perhaps an easier way.

 Tia!

 Stefan
Hello, maybe if you would show a sample of how you would like the
output to look, it could help us give more suggestions.

Regards,

Eduardo
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor