Re: [Tutor] parse text file
On Fri, 4 Jun 2010 12:45:52 am Colin Talbert wrote:

> I thought when you did a for uline in input_file each single line
> would go into memory independently, not the entire file.

"for line in file:" reads one line at a time, but file.read() tries to read everything in one go. However, it should fail with MemoryError, not just stop silently.

> I'm pretty sure that this is not your code, because you can't call
> len() on a bz2 file. If you try, you get an error:
>
> You are so correct. I'd been trying numerous things to read in this
> file and had deleted the code that I meant to put here and so wrote
> this from memory incorrectly. The code that I wrote should have
> been:
>
> import bz2
> input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
> str=input_file.read()
> len(str)
>
> Which indeed does return only 900000.

Unfortunately, I can't download your bz2 file myself to test it, but I think I *may* have found the problem. It looks like the current bz2 module only supports files written as a single stream, and not multiple stream files. This is why the BZ2File class has no "append" mode. See this bug report:

http://bugs.python.org/issue1625

My hypothesis is that your bz2 file consists of either multiple streams, or multiple bz2 files concatenated together, and the BZ2File class stops reading after the first.

I can test my hypothesis:

>>> bz2.BZ2File('a.bz2', 'w').write('this is the first chunk of text')
>>> bz2.BZ2File('b.bz2', 'w').write('this is the second chunk of text')
>>> bz2.BZ2File('c.bz2', 'w').write('this is the third chunk of text')
>>> # concatenate the files
... d = file('concate.bz2', 'w')
>>> for name in "abc":
...     f = file('%c.bz2' % name, 'rb')
...     d.write(f.read())
...
>>> d.close()
>>>
>>> bz2.BZ2File('concate.bz2', 'r').read()
'this is the first chunk of text'

And sure enough, BZ2File only sees the first chunk of text! But if I open it in a stand-alone bz2 utility (I use the Linux application Ark), I can see all three chunks of text. So I think we have a successful test of the hypothesis.

Assuming this is the problem you are having, you have a number of possible solutions:

(1) Re-create the bz2 file from a single stream.

(2) Use another application to expand the bz2 file and then read directly from that, skipping BZ2File altogether.

(3) Upgrade to Python 2.7 or 3.2, and hope the patch is applied.

(4) Backport the patch to your version of Python and apply it yourself.

(5) Write your own bz2 utility.

Not really a very appetising series of choices there, I must admit. Probably (1) or (2) are the least worst.

--
Steven D'Aprano
___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
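[Editor's note] The concatenation experiment in this message can be reproduced in memory, and it also suggests a workaround: bz2.BZ2Decompressor stops at the end of one stream and leaves the remaining bytes in its unused_data attribute, so a fresh decompressor can be started on the leftover. The sketch below assumes Python 3 (the thread itself used Python 2), and decompress_all_streams is a hypothetical helper name, not part of the bz2 module. As far as I know, BZ2File in Python 3.3 and later reads multi-stream files directly, so this is mainly of interest for understanding the failure.

```python
import bz2


def decompress_all_streams(data):
    """Decompress every bz2 stream found back to back in `data` (bytes).

    Hypothetical helper for illustration; not part of the bz2 module.
    """
    chunks = []
    while data:
        decompressor = bz2.BZ2Decompressor()
        chunks.append(decompressor.decompress(data))
        # Whatever follows the end of this stream is the start of the next one.
        data = decompressor.unused_data
    return b"".join(chunks)


parts = [
    b"this is the first chunk of text",
    b"this is the second chunk of text",
    b"this is the third chunk of text",
]

# Three independent streams glued together, like concate.bz2 in the post.
concatenated = b"".join(bz2.compress(part) for part in parts)

# A single decompressor stops after the first stream ...
first_only = bz2.BZ2Decompressor().decompress(concatenated)

# ... while restarting on the leftover bytes recovers all three.
everything = decompress_all_streams(concatenated)
```

Holding the whole input in memory is fine for a demonstration but not for a 9.2 GB file; for that the same restart idea would have to be applied chunk by chunk.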
Re: [Tutor] parse text file
On 3 June 2010 21:02, Colin Talbert wrote:

> I couldn't find any example of it in use and wasn't having any luck getting
> it to work based on the documentation.

Good examples of the bz2 module can be found at [1].

greets
Sander

[1] http://www.doughellmann.com/PyMOTW/bz2/
Re: [Tutor] parse text file
On Thu, Jun 3, 2010 at 1:02 PM, Colin Talbert wrote: > > Dave, > I think you are probably right about using decompressor. I > couldn't find any example of it in use and wasn't having any luck getting it > to work based on the documentation. Maybe I should try harder on this > front. > Is it possible write a python script to transfer this to a hdf5 file? Would this help? Thanks Vincent > Colin Talbert > GIS Specialist > US Geological Survey - Fort Collins Science Center > 2150 Centre Ave. Bldg. C > Fort Collins, CO 80526 > > (970) 226-9425 > talbe...@usgs.gov > > > > From: Dave Angel To: > Colin Talbert > Cc: Steven D'Aprano , tutor@python.org Date: 06/03/2010 > 12:36 PM Subject: Re: [Tutor] parse text file > -- > > > > Colin Talbert wrote: > > > > You are so correct. I'd been trying numerous things to read in this file > > > and had deleted the code that I meant to put here and so wrote this from > > memory incorrectly. The code that I wrote should have been: > > > > import bz2 > > input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb') > > str=input_file.read() > > len(str) > > > > Which indeed does return only 90. > > > > Which is also the number returned when you sum the length of all the > lines > > returned in a for line in file with: > > > > > > import bz2 > > input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb') > > lengthz = 0 > > for uline in input_file: > > lengthz = lengthz + len(uline) > > > > print lengthz > > > > > > > > > Seems to me for such a large file you'd have to use > bz2.BZ2Decompressor. I have no experience with it, but its purpose is > for sequential decompression -- decompression where not all the data is > simultaneously available in memory. 
> > DaveA > > > > > ___ > Tutor maillist - Tutor@python.org > To unsubscribe or change subscription options: > http://mail.python.org/mailman/listinfo/tutor > > *Vincent Davis 720-301-3003 * vinc...@vincentdavis.net my blog <http://vincentdavis.net> | LinkedIn<http://www.linkedin.com/in/vincentdavis> ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] parse text file
Dave, I think you are probably right about using decompressor. I couldn't find any example of it in use and wasn't having any luck getting it to work based on the documentation. Maybe I should try harder on this front. Colin Talbert GIS Specialist US Geological Survey - Fort Collins Science Center 2150 Centre Ave. Bldg. C Fort Collins, CO 80526 (970) 226-9425 talbe...@usgs.gov From: Dave Angel To: Colin Talbert Cc: Steven D'Aprano , tutor@python.org Date: 06/03/2010 12:36 PM Subject: Re: [Tutor] parse text file Colin Talbert wrote: > > You are so correct. I'd been trying numerous things to read in this file > and had deleted the code that I meant to put here and so wrote this from > memory incorrectly. The code that I wrote should have been: > > import bz2 > input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb') > str=input_file.read() > len(str) > > Which indeed does return only 90. > > Which is also the number returned when you sum the length of all the lines > returned in a for line in file with: > > > import bz2 > input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb') > lengthz = 0 > for uline in input_file: > lengthz = lengthz + len(uline) > > print lengthz > > > > Seems to me for such a large file you'd have to use bz2.BZ2Decompressor. I have no experience with it, but its purpose is for sequential decompression -- decompression where not all the data is simultaneously available in memory. DaveA ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] parse text file
Colin Talbert wrote:

> You are so correct. I'd been trying numerous things to read in this file
> and had deleted the code that I meant to put here and so wrote this from
> memory incorrectly. The code that I wrote should have been:
>
> import bz2
> input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
> str=input_file.read()
> len(str)
>
> Which indeed does return only 900000.
>
> Which is also the number returned when you sum the length of all the lines
> returned in a for line in file with:
>
> import bz2
> input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
> lengthz = 0
> for uline in input_file:
>     lengthz = lengthz + len(uline)
>
> print lengthz

Seems to me for such a large file you'd have to use bz2.BZ2Decompressor. I have no experience with it, but its purpose is for sequential decompression -- decompression where not all the data is simultaneously available in memory.

DaveA
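[Editor's note] For readers who, like Colin, could not find an example of BZ2Decompressor in use, here is a minimal sketch of sequential decompression from a file in fixed-size chunks. It is written for Python 3 and is only one possible arrangement: iter_decompressed and the 512-byte chunk size are hypothetical choices for the demo, and the restart on unused_data is there so that concatenated streams are not silently dropped.

```python
import bz2
import os
import tempfile


def iter_decompressed(path, chunk_size=64 * 1024):
    """Yield decompressed blocks from a bz2 file without loading it whole.

    Hypothetical helper: reads the compressed bytes in chunks and starts a
    new decompressor whenever one stream ends and more data follows.
    """
    decompressor = bz2.BZ2Decompressor()
    with open(path, "rb") as raw:  # raw compressed bytes, not BZ2File
        while True:
            chunk = raw.read(chunk_size)
            if not chunk:
                break
            while chunk:
                yield decompressor.decompress(chunk)
                if decompressor.eof:
                    # End of one stream: feed the leftover to a fresh decompressor.
                    chunk = decompressor.unused_data
                    decompressor = bz2.BZ2Decompressor()
                else:
                    chunk = b""


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bz2")
    with open(path, "wb") as out:
        # Two separate streams written one after the other.
        out.write(bz2.compress(b"first line\n" * 1000))
        out.write(bz2.compress(b"second line\n" * 1000))
    total = sum(len(block) for block in iter_decompressed(path, chunk_size=512))
```

Only one compressed chunk and its decompressed output are held at a time, which is the property DaveA describes; the lengths are summed here simply to mirror Colin's lengthz test.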
Re: [Tutor] parse text file
"Colin Talbert" wrote I thought when you did a for uline in input_file each single line would go into memory independently, not the entire file. Thats true but your code snippet showed you using read() which reads the whole file... I'm pretty sure that this is not your code, because you can't call len() on a bz2 file. If you try, you get an error: You are so correct. I'd been trying numerous things to read in this file and had deleted the code that I meant to put here and so wrote this from memory incorrectly. The code that I wrote should have been: import bz2 input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb') str=input_file.read() len(str) This again usees read() which reads the whole file. Which is also the number returned when you sum the length of all the lines returned in a for line in file with: import bz2 input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb') lengthz = 0 for uline in input_file: lengthz = lengthz + len(uline) I'm not sure how for line in file will work for binary files. It may read the whole thing since the concept of lines really only applies to text. So it may be the same result as using read() Try looping using read(n) where n is some buffer size (1024 might be a good value?). HTH, -- Alan Gauld Author of the Learn to Program web site http://www.alan-g.me.uk/ ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] parse text file
Hello Steven, Thanks for the reply. Also this is my first post to tu...@python so I'll reply all in the future. However, a file of that size changes things drastically. You can't expect to necessarily be able to read the entire 9.2 gigabyte BZ2 file into memory at once, let along the unpacked 131 GB text file, EVEN if your computer has more than 9.2 GB of memory. So your tests need to take this into account. I thought when you did a for uline in input_file each single line would go into memory independently, not the entire file. I'm pretty sure that this is not your code, because you can't call len() on a bz2 file. If you try, you get an error: You are so correct. I'd been trying numerous things to read in this file and had deleted the code that I meant to put here and so wrote this from memory incorrectly. The code that I wrote should have been: import bz2 input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb') str=input_file.read() len(str) Which indeed does return only 90. Which is also the number returned when you sum the length of all the lines returned in a for line in file with: import bz2 input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb') lengthz = 0 for uline in input_file: lengthz = lengthz + len(uline) print lengthz Thanks again for you help and sorry for the bad code in the previous submittal. Colin Talbert GIS Specialist US Geological Survey - Fort Collins Science Center 2150 Centre Ave. Bldg. C Fort Collins, CO 80526 (970) 226-9425 talbe...@usgs.gov From: Steven D'Aprano To: tutor@python.org Date: 06/02/2010 03:42 PM Subject: Re: [Tutor] parse text file Sent by: tutor-bounces+talbertc=usgs@python.org Hi Colin, I'm taking the liberty of replying to your message back to the list, as others hopefully may be able to make constructive comments. When replying, please ensure that you reply to the tutor mailing list rather than then individual. 
On Thu, 3 Jun 2010 12:20:10 am Colin Talbert wrote: > > Without seeing your text file, and the code you use to read the text > > file, there's no way of telling what is going on, but I can guess > > the most likely causes: > > Since the file is 9.2 gig it wouldn't make sense to send it to you. And I am very glad you didn't try *smiles* However, a file of that size changes things drastically. You can't expect to necessarily be able to read the entire 9.2 gigabyte BZ2 file into memory at once, let along the unpacked 131 GB text file, EVEN if your computer has more than 9.2 GB of memory. So your tests need to take this into account. > > (2) There's a bug in your code so that you stop reading after > > 900,000 bytes. > The code is simple enough that I'm pretty sure there is not a > bug in it. > > import bz2 > input_file = > bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb') print > len(input_file) > > returns 90 I'm pretty sure that this is not your code, because you can't call len() on a bz2 file. If you try, you get an error: >>> x = bz2.BZ2File('test.bz2', 'w') # create a temporary file >>> x.write("some data") >>> x.close() >>> input_file = bz2.BZ2File('test.bz2', 'r') # open it >>> print len(input_file) Traceback (most recent call last): File "", line 1, in TypeError: object of type 'bz2.BZ2File' has no len() So whatever your code actually is, I'm fairly sure it isn't what you say here. -- Steven D'Aprano ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] parse text file
Hi Colin,

I'm taking the liberty of replying to your message back to the list, as others hopefully may be able to make constructive comments. When replying, please ensure that you reply to the tutor mailing list rather than the individual.

On Thu, 3 Jun 2010 12:20:10 am Colin Talbert wrote:

> > Without seeing your text file, and the code you use to read the text
> > file, there's no way of telling what is going on, but I can guess
> > the most likely causes:
>
> Since the file is 9.2 gig it wouldn't make sense to send it to you.

And I am very glad you didn't try *smiles*

However, a file of that size changes things drastically. You can't expect to necessarily be able to read the entire 9.2 gigabyte BZ2 file into memory at once, let alone the unpacked 131 GB text file, EVEN if your computer has more than 9.2 GB of memory. So your tests need to take this into account.

> > (2) There's a bug in your code so that you stop reading after
> > 900,000 bytes.
>
> The code is simple enough that I'm pretty sure there is not a
> bug in it.
>
> import bz2
> input_file = bz2.BZ2File(r'C:\temp\planet-latest.osm.bz2','rb')
> print len(input_file)
>
> returns 900000

I'm pretty sure that this is not your code, because you can't call len() on a bz2 file. If you try, you get an error:

>>> x = bz2.BZ2File('test.bz2', 'w')  # create a temporary file
>>> x.write("some data")
>>> x.close()
>>> input_file = bz2.BZ2File('test.bz2', 'r')  # open it
>>> print len(input_file)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'bz2.BZ2File' has no len()

So whatever your code actually is, I'm fairly sure it isn't what you say here.

--
Steven D'Aprano
Re: [Tutor] parse text file
Please always reply-all so a copy goes to the list. On 6/1/2010 6:49 PM, Colin Talbert wrote: Bob thanks for your response, The file is about 9.3 gig and no I don't want read the whole thing at once. I want to read it in line by line. Still it will read in to the same point (90 characters) and then act as if it came to the end of the file. Below is the code I using for this: import bz2 input_file = bz2.BZ2File(r"C:\temp\planet-latest.osm.bz2","rb") for uline in input_file: print linecount linecount+=1 Colin Talbert GIS Specialist US Geological Survey - Fort Collins Science Center 2150 Centre Ave. Bldg. C Fort Collins, CO 80526 (970) 226-9425 talbe...@usgs.gov From: bob gailer To: Colin Talbert Cc: tutor@python.org Date: 06/01/2010 04:43 PM Subject: Re: [Tutor] parse text file On 6/1/2010 5:40 PM, Colin Talbert wrote: I am also experiencing this same problem. (Also on a OSM bz2 file). It appears to be working but then partway through reading a file it simple ends. I did track down that file length is always 90 so it appears to be related to some sort of buffer constraint. Any other ideas? How big is the file? Is it necessary to read the entire thing at once? Try opening with mode rb import bz2 input_file = bz2.BZ2File(r"C:\temp\planet-latest.osm.bz2","r") try: all_data = input_file.read() print str(len(all_data)) finally: input_file.close() -- Bob Gailer 919-636-4239 Chapel Hill NC ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
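[Editor's note] The line-by-line loop quoted in this message never initialises linecount, so as posted it would raise a NameError before any reading problem showed up. A minimal corrected sketch follows, assuming Python 3; count_lines and the small sample file are hypothetical and exist only to make the example self-contained. Iterating over a BZ2File does yield one line at a time, split on newline bytes, even though the file is opened in binary mode.

```python
import bz2
import os
import tempfile


def count_lines(path):
    """Count lines in a bz2 file, holding only one line in memory at a time."""
    linecount = 0  # must exist before the loop increments it
    with bz2.BZ2File(path, "rb") as input_file:
        for uline in input_file:  # each uline is a bytes object ending in b"\n"
            linecount += 1
    return linecount


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.osm.bz2")
    with bz2.BZ2File(path, "wb") as out:
        out.write(b"<node/>\n" * 250)
    lines_seen = count_lines(path)
```

On the Python versions discussed in the thread this loop would still stop at the end of the first stream of a multi-stream file; fixing the counter does not change that.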
Re: [Tutor] parse text file
On Wed, 2 Jun 2010 07:40:33 am Colin Talbert wrote:

> I am also experiencing this same problem. (Also on a OSM bz2
> file). It appears to be working but then partway through reading a
> file it simple ends. I did track down that file length is always
> 900000 so it appears to be related to some sort of buffer constraint.

Without seeing your text file, and the code you use to read the text file, there's no way of telling what is going on, but I can guess the most likely causes:

(1) Your text file is actually only 900,000 bytes long, and so there's no problem at all.

(2) There's a bug in your code so that you stop reading after 900,000 bytes.

(3) You're on Windows, and the text file contains an End-Of-File character ^Z after 900,000 bytes, and Windows supports that for backward compatibility with DOS.

And a distant (VERY distant) number 4, there's a bug in the implementation of read() in Python which somehow nobody has noticed before now.

As for your second issue, reading bz2 files:

> import bz2
>
> input_file = bz2.BZ2File(r"C:\temp\planet-latest.osm.bz2","r")

You're opening a binary file in text mode. I'm pretty sure that is not going to work well. Try passing 'rb' as the mode instead.

> try:
>     all_data = input_file.read()
>     print str(len(all_data))

You don't need to call str() before calling print. print is perfectly happy to operate on integers:

print len(all_data)

will work.

--
Steven D'Aprano
Re: [Tutor] parse text file
On 6/1/2010 5:40 PM, Colin Talbert wrote: I am also experiencing this same problem. (Also on a OSM bz2 file). It appears to be working but then partway through reading a file it simple ends. I did track down that file length is always 90 so it appears to be related to some sort of buffer constraint. Any other ideas? How big is the file? Is it necessary to read the entire thing at once? Try opening with mode rb import bz2 input_file = bz2.BZ2File(r"C:\temp\planet-latest.osm.bz2","r") try: all_data = input_file.read() print str(len(all_data)) finally: input_file.close() -- Bob Gailer 919-636-4239 Chapel Hill NC ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] parse text file
I am also experiencing this same problem. (Also on a OSM bz2 file). It appears to be working but then partway through reading a file it simple ends. I did track down that file length is always 90 so it appears to be related to some sort of buffer constraint. Any other ideas? import bz2 input_file = bz2.BZ2File(r"C:\temp\planet-latest.osm.bz2","r") try: all_data = input_file.read() print str(len(all_data)) finally: input_file.close() Colin Talbert GIS Specialist US Geological Survey - Fort Collins Science Center 2150 Centre Ave. Bldg. C Fort Collins, CO 80526 (970) 226-9425 talbe...@usgs.gov ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] parse text file
On Tue, Feb 2, 2010 at 11:36 PM, Kent Johnson wrote: > On Tue, Feb 2, 2010 at 4:56 PM, Norman Khine wrote: >> On Tue, Feb 2, 2010 at 10:11 PM, Kent Johnson wrote: > >>> Try this version: >>> >>> data = file.read() >>> >>> get_records = re.compile(r"""openInfoWindowHtml\(.*?\ticon: >>> myIcon\n""", re.DOTALL).findall >>> get_titles = re.compile(r"""(.*)<\/strong>""").findall >>> get_urls = re.compile(r"""a href=\"\/(.*)\">En savoir plus""").findall >>> get_latlngs = >>> re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\n\s*(\-?\d+\.\d*)\)""").findall >>> >>> then as before. >>> >>> Your repr() call is essentially removing newlines from the input by >>> converting them to literal '\n' pairs. This allows your regex to work >>> without the DOTALL modifier. >>> >>> Note you will get slightly different results with my version - it will >>> give you correct utf-8 text for the titles whereas yours gives \ >>> escapes. For example one of the titles is "CGTSM (Satére Mawé)". Your >>> version returns >>> >>> {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804', >>> '-79.649735'), 'title': 'CGTSM (Sat\\xe9re Maw\\xe9)'} >>> >>> Mine gives >>> {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804', >>> '-79.649735'), 'title': 'CGTSM (Sat\xc3\xa9re Maw\xc3\xa9)'} >>> >>> This is showing the repr() of the title so they both have \ but note >>> that yours has two \\ indicating that the \ is in the text; mine has >>> only one \. >> >> i am no expert, but there seems to be a bigger difference. >> >> with repr(), i get: >> Sat\\xe9re Maw\\xe9 >> >> where as you get >> >> Sat\xc3\xa9re Maw\xc3\xa9 >> >> repr()'s >> é == \\xe9 >> whereas on your version >> é == \xc3\xa9 > > Right. Your version has four actual characters in the result - \, x, > e, 9. This is the escaped representation of the unicode representation > of e-acute. (The \ is doubled in the repr display.) > > My version has two bytes in the result, with the values c3 and a9. > This is the utf-8 representation of e-acute. 
> > If you want to accurately represent (i.e. print) the title at some > later time you probably want the utf-8 represetation. >> >>> >>> Kent >>> >> >> also, i still get an empty list when i run the code as suggested. > > You didn't change the regexes. You have to change \\t and \\n to \t > and \n because the source text now has actual tabs and newlines, not > the escaped representations. > > I know this is confusing, I'm sorry I don't have time or patience to > explain more. thanks for your time, i did realise after i posted the email that the regex needed to be changed. > > Kent > ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] parse text file
On Tue, 2 Feb 2010 22:56:22 +0100 Norman Khine wrote:

> i am no expert, but there seems to be a bigger difference.
>
> with repr(), i get:
> Sat\\xe9re Maw\\xe9
>
> where as you get
>
> Sat\xc3\xa9re Maw\xc3\xa9
>
> repr()'s
> é == \\xe9
> whereas on your version
> é == \xc3\xa9

This is a rather complicated issue mixing python str, unicode string, and their repr().

Kent is right in that the *python string* "\xc3\xa9" is the utf8 formatted representation of 'é' (2 bytes). While \xe9 is the *unicode code* for 'é', which should only appear in a unicode string.

So:
    unicode.encode(u"\u00e9", "utf8") == "\xc3\xa9"
or more simply:
    u"\u00e9".encode("utf8") == "\xc3\xa9"
Conversely:
    unicode("\xc3\xa9", "utf8") == u"\u00e9"    -- decoding

The question is: what do you want to do with the result? You'll need either the utf8 form "\xc3\xa9" (for output) or the unicode string u"\u00e9" (for processing). But what you actually get is a kind of mix, actually the (python str) repr of a unicode string.

> also, i still get an empty list when i run the code as suggested.

? Strange. Have you checked the re.DOTALL? (else regex patterns stop matching at \n by default)

Denis

la vita e estrany
http://spir.wikidot.com/
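[Editor's note] The three forms discussed in this message can be put side by side. The sketch below uses Python 3, where the Python 2 unicode type is called str and the Python 2 str type corresponds to bytes. The unicode_escape codec is used here only as a stand-in for what repr() did to the text in Norman's script; it is not what his code literally calls.

```python
text = "\u00e9"  # one character: e-acute

# Encoding to UTF-8 gives two bytes, 0xC3 0xA9 (the form Kent's version returns).
utf8_bytes = text.encode("utf-8")

# Decoding those two bytes gives the single character back.
round_trip = utf8_bytes.decode("utf-8")

# Escaping gives four ASCII characters: backslash, x, e, 9
# (the form that shows up after repr() has been applied to the text).
escaped = text.encode("unicode_escape")
```

The escaped form is safe to print on any terminal but is no longer the original text, which is why Kent and Denis both recommend keeping either the decoded string or its UTF-8 bytes.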
Re: [Tutor] parse text file
On Tue, Feb 2, 2010 at 4:56 PM, Norman Khine wrote: > On Tue, Feb 2, 2010 at 10:11 PM, Kent Johnson wrote: >> Try this version: >> >> data = file.read() >> >> get_records = re.compile(r"""openInfoWindowHtml\(.*?\ticon: >> myIcon\n""", re.DOTALL).findall >> get_titles = re.compile(r"""(.*)<\/strong>""").findall >> get_urls = re.compile(r"""a href=\"\/(.*)\">En savoir plus""").findall >> get_latlngs = >> re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\n\s*(\-?\d+\.\d*)\)""").findall >> >> then as before. >> >> Your repr() call is essentially removing newlines from the input by >> converting them to literal '\n' pairs. This allows your regex to work >> without the DOTALL modifier. >> >> Note you will get slightly different results with my version - it will >> give you correct utf-8 text for the titles whereas yours gives \ >> escapes. For example one of the titles is "CGTSM (Satére Mawé)". Your >> version returns >> >> {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804', >> '-79.649735'), 'title': 'CGTSM (Sat\\xe9re Maw\\xe9)'} >> >> Mine gives >> {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804', >> '-79.649735'), 'title': 'CGTSM (Sat\xc3\xa9re Maw\xc3\xa9)'} >> >> This is showing the repr() of the title so they both have \ but note >> that yours has two \\ indicating that the \ is in the text; mine has >> only one \. > > i am no expert, but there seems to be a bigger difference. > > with repr(), i get: > Sat\\xe9re Maw\\xe9 > > where as you get > > Sat\xc3\xa9re Maw\xc3\xa9 > > repr()'s > é == \\xe9 > whereas on your version > é == \xc3\xa9 Right. Your version has four actual characters in the result - \, x, e, 9. This is the escaped representation of the unicode representation of e-acute. (The \ is doubled in the repr display.) My version has two bytes in the result, with the values c3 and a9. This is the utf-8 representation of e-acute. If you want to accurately represent (i.e. print) the title at some later time you probably want the utf-8 represetation. 
> >> >> Kent >> > > also, i still get an empty list when i run the code as suggested. You didn't change the regexes. You have to change \\t and \\n to \t and \n because the source text now has actual tabs and newlines, not the escaped representations. I know this is confusing, I'm sorry I don't have time or patience to explain more. Kent ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] parse text file
On Tue, Feb 2, 2010 at 10:11 PM, Kent Johnson wrote: > On Tue, Feb 2, 2010 at 1:39 PM, Norman Khine wrote: >> On Tue, Feb 2, 2010 at 4:19 PM, Kent Johnson wrote: >>> On Tue, Feb 2, 2010 at 9:33 AM, Norman Khine wrote: On Tue, Feb 2, 2010 at 1:27 PM, Kent Johnson wrote: > On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine wrote: > > Why do you use repr() here? > >>> >>> It smells of programming by guess rather than a correct solution to >>> some problem. What happens if you take it out? >> >> when i take it out, i get an empty list. >> >> whereas both >> data = repr( file.read().decode('latin-1') ) >> and >> data = repr( file.read().decode('utf-8') ) >> >> returns the full list. > > Try this version: > > data = file.read() > > get_records = re.compile(r"""openInfoWindowHtml\(.*?\ticon: > myIcon\n""", re.DOTALL).findall > get_titles = re.compile(r"""(.*)<\/strong>""").findall > get_urls = re.compile(r"""a href=\"\/(.*)\">En savoir plus""").findall > get_latlngs = > re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\n\s*(\-?\d+\.\d*)\)""").findall > > then as before. > > Your repr() call is essentially removing newlines from the input by > converting them to literal '\n' pairs. This allows your regex to work > without the DOTALL modifier. > > Note you will get slightly different results with my version - it will > give you correct utf-8 text for the titles whereas yours gives \ > escapes. For example one of the titles is "CGTSM (Satére Mawé)". Your > version returns > > {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804', > '-79.649735'), 'title': 'CGTSM (Sat\\xe9re Maw\\xe9)'} > > Mine gives > {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804', > '-79.649735'), 'title': 'CGTSM (Sat\xc3\xa9re Maw\xc3\xa9)'} > > This is showing the repr() of the title so they both have \ but note > that yours has two \\ indicating that the \ is in the text; mine has > only one \. i am no expert, but there seems to be a bigger difference. 
with repr(), i get: Sat\\xe9re Maw\\xe9 where as you get Sat\xc3\xa9re Maw\xc3\xa9 repr()'s é == \\xe9 whereas on your version é == \xc3\xa9 > > Kent > also, i still get an empty list when i run the code as suggested. ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] parse text file
On Tue, Feb 2, 2010 at 1:39 PM, Norman Khine wrote: > On Tue, Feb 2, 2010 at 4:19 PM, Kent Johnson wrote: >> On Tue, Feb 2, 2010 at 9:33 AM, Norman Khine wrote: >>> On Tue, Feb 2, 2010 at 1:27 PM, Kent Johnson wrote: On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine wrote: Why do you use repr() here? >> >> It smells of programming by guess rather than a correct solution to >> some problem. What happens if you take it out? > > when i take it out, i get an empty list. > > whereas both > data = repr( file.read().decode('latin-1') ) > and > data = repr( file.read().decode('utf-8') ) > > returns the full list. Try this version: data = file.read() get_records = re.compile(r"""openInfoWindowHtml\(.*?\ticon: myIcon\n""", re.DOTALL).findall get_titles = re.compile(r"""(.*)<\/strong>""").findall get_urls = re.compile(r"""a href=\"\/(.*)\">En savoir plus""").findall get_latlngs = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\n\s*(\-?\d+\.\d*)\)""").findall then as before. Your repr() call is essentially removing newlines from the input by converting them to literal '\n' pairs. This allows your regex to work without the DOTALL modifier. Note you will get slightly different results with my version - it will give you correct utf-8 text for the titles whereas yours gives \ escapes. For example one of the titles is "CGTSM (Satére Mawé)". Your version returns {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804', '-79.649735'), 'title': 'CGTSM (Sat\\xe9re Maw\\xe9)'} Mine gives {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804', '-79.649735'), 'title': 'CGTSM (Sat\xc3\xa9re Maw\xc3\xa9)'} This is showing the repr() of the title so they both have \ but note that yours has two \\ indicating that the \ is in the text; mine has only one \. Kent ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
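[Editor's note] The interaction between repr(), newlines, and re.DOTALL described in this message can be seen on a tiny input. The sample string below is made up for illustration and is much simpler than the real producers_google_map_code.txt; the sketch is Python 3.

```python
import re

# Made-up stand-in for one record; it contains real tabs and newlines.
data = "openInfoWindowHtml(\n\tsome text\n\ticon: myIcon\n"

# Without DOTALL, "." does not match a newline, so the record is not found.
no_dotall = re.findall(r"openInfoWindowHtml\(.*?\ticon: myIcon\n", data)

# With DOTALL, "." also matches newlines and the whole record is found.
with_dotall = re.findall(
    r"openInfoWindowHtml\(.*?\ticon: myIcon\n", data, re.DOTALL
)

# repr() replaces each real tab/newline with two characters (backslash plus
# a letter), so the pattern then has to match the escaped forms \\t and \\n.
on_repr = re.findall(r"openInfoWindowHtml\(.*?\\ticon: myIcon\\n", repr(data))
```

This is why removing repr() without also changing \\t and \\n to \t and \n, and adding re.DOTALL, produces an empty list.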
Re: [Tutor] parse text file
On Tue, Feb 2, 2010 at 4:19 PM, Kent Johnson wrote: > On Tue, Feb 2, 2010 at 9:33 AM, Norman Khine wrote: >> On Tue, Feb 2, 2010 at 1:27 PM, Kent Johnson wrote: >>> On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine wrote: >>> here are the changes: import re file=open('producers_google_map_code.txt', 'r') data = repr( file.read().decode('utf-8') ) >>> >>> Why do you use repr() here? >> >> i have latin-1 chars in the producers_google_map_code.txt' file and >> this is the only way to get it to read the data. >> >> is this incorrect? > > Well, the repr() call is after the file read. If your data is latin-1 > you should decode it as latin-1, not utf-8: > data = file.read().decode('latin-1') > > Though if the decode('utf-8') succeeds, and you do have non-ascii > characters in the data, they are probably encoded in utf-8, not > latin-1. Are you sure you have latin-1? > > The repr() call converts back to ascii text, maybe that is what you want? > > Perhaps you put in the repr because you were having trouble printing? > > It smells of programming by guess rather than a correct solution to > some problem. What happens if you take it out? when i take it out, i get an empty list. whereas both data = repr( file.read().decode('latin-1') ) and data = repr( file.read().decode('utf-8') ) returns the full list. here is the file http://cdn.admgard.org/documents/producers_google_map_code.txt > > Kent > ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] parse text file
On Tue, Feb 2, 2010 at 9:33 AM, Norman Khine wrote: > On Tue, Feb 2, 2010 at 1:27 PM, Kent Johnson wrote: >> On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine wrote: >> >>> here are the changes: >>> >>> import re >>> file=open('producers_google_map_code.txt', 'r') >>> data = repr( file.read().decode('utf-8') ) >> >> Why do you use repr() here? > > i have latin-1 chars in the producers_google_map_code.txt' file and > this is the only way to get it to read the data. > > is this incorrect? Well, the repr() call is after the file read. If your data is latin-1 you should decode it as latin-1, not utf-8: data = file.read().decode('latin-1') Though if the decode('utf-8') succeeds, and you do have non-ascii characters in the data, they are probably encoded in utf-8, not latin-1. Are you sure you have latin-1? The repr() call converts back to ascii text, maybe that is what you want? Perhaps you put in the repr because you were having trouble printing? It smells of programming by guess rather than a correct solution to some problem. What happens if you take it out? Kent ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
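The reasoning "if decode('utf-8') succeeds, the data is probably utf-8" can be illustrated with the title that comes up later in the thread. This is a sketch in Python 3 syntax, where bytes and text are separate types; the variable names are only for illustration.

```python
# One title from the thread, written with explicit code points.
title = "CGTSM (Sat\u00e9re Maw\u00e9)"

utf8_bytes = title.encode("utf-8")      # e-acute becomes two bytes: C3 A9
latin1_bytes = title.encode("latin-1")  # e-acute becomes one byte: E9

# Decoding with the matching codec gives the original text back.
assert utf8_bytes.decode("utf-8") == title
assert latin1_bytes.decode("latin-1") == title

# latin-1 accepts any byte sequence, so decoding UTF-8 data as latin-1
# never fails; it silently yields the wrong characters instead.
wrong = utf8_bytes.decode("latin-1")
print(wrong == title)   # False

# latin-1 data with non-ASCII bytes is usually not valid UTF-8,
# so a successful utf-8 decode is a strong hint about the encoding.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)       # False
```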
Re: [Tutor] parse text file
hello, thank you all for the advise, here is the updated version with the changes. import re file = open('producers_google_map_code.txt', 'r') data = repr( file.read().decode('utf-8') ) get_records = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""").findall get_titles = re.compile(r"""(.*)<\/strong>""").findall get_urls = re.compile(r"""a href=\"\/(.*)\">En savoir plus""").findall get_latlngs = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""").findall records = get_records(data) block_record = [] for record in records: namespace = {} titles = get_titles(record) title = titles[-1] if titles else None urls = get_urls(record) url = urls[-1] if urls else None latlngs = get_latlngs(record) latlng = latlngs[-1] if latlngs else None block_record.append( {'title':title, 'url':url, 'lating':latlng} ) print block_record On Tue, Feb 2, 2010 at 1:27 PM, Kent Johnson wrote: > On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine wrote: > >> here are the changes: >> >> import re >> file=open('producers_google_map_code.txt', 'r') >> data = repr( file.read().decode('utf-8') ) > > Why do you use repr() here? i have latin-1 chars in the producers_google_map_code.txt' file and this is the only way to get it to read the data. is this incorrect? > >> get_record = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""") >> get_title = re.compile(r"""(.*)<\/strong>""") >> get_url = re.compile(r"""a href=\"\/(.*)\">En savoir plus""") >> get_latlng = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""") >> >> records = get_record.findall(data) >> block_record = [] >> for record in records: >> namespace = {} >> titles = get_title.findall(record) >> for title in titles: >> namespace['title'] = title > > > This is odd, you don't need a loop to get the last title, just use > namespace['title'] = get_title.findall(html)[-1] > > and similarly for url and latings. 
> > Kent > > >> urls = get_url.findall(record) >> for url in urls: >> namespace['url'] = url >> latlngs = get_latlng.findall(record) >> for latlng in latlngs: >> namespace['latlng'] = latlng >> block_record.append(namespace) >> >> print block_record >>> >>> The def of "namespace" would be clearer imo in a single line: >>> namespace = {title:t, url:url, lat:g} >> >> i am not sure how this will fit into the code! >> >>> This also reveals a kind of name confusion, doesn't it? >>> >>> >>> Denis >>> >>> >>> >>> >>> >>> >>> la vita e estrany >>> >>> http://spir.wikidot.com/ >>> ___ >>> Tutor maillist - tu...@python.org >>> To unsubscribe or change subscription options: >>> http://mail.python.org/mailman/listinfo/tutor >>> >> ___ >> Tutor maillist - tu...@python.org >> To unsubscribe or change subscription options: >> http://mail.python.org/mailman/listinfo/tutor >> > ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] parse text file
On Tue, Feb 2, 2010 at 4:16 AM, Norman Khine wrote: > here are the changes: > > import re > file=open('producers_google_map_code.txt', 'r') > data = repr( file.read().decode('utf-8') ) Why do you use repr() here? > get_record = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""") > get_title = re.compile(r"""(.*)<\/strong>""") > get_url = re.compile(r"""a href=\"\/(.*)\">En savoir plus""") > get_latlng = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""") > > records = get_record.findall(data) > block_record = [] > for record in records: > namespace = {} > titles = get_title.findall(record) > for title in titles: > namespace['title'] = title This is odd, you don't need a loop to get the last title, just use namespace['title'] = get_title.findall(html)[-1] and similarly for url and latings. Kent > urls = get_url.findall(record) > for url in urls: > namespace['url'] = url > latlngs = get_latlng.findall(record) > for latlng in latlngs: > namespace['latlng'] = latlng > block_record.append(namespace) > > print block_record >> >> The def of "namespace" would be clearer imo in a single line: >> namespace = {title:t, url:url, lat:g} > > i am not sure how this will fit into the code! > >> This also reveals a kind of name confusion, doesn't it? >> >> >> Denis >> >> >> >> >> >> >> la vita e estrany >> >> http://spir.wikidot.com/ >> ___ >> Tutor maillist - tu...@python.org >> To unsubscribe or change subscription options: >> http://mail.python.org/mailman/listinfo/tutor >> > ___ > Tutor maillist - tu...@python.org > To unsubscribe or change subscription options: > http://mail.python.org/mailman/listinfo/tutor > ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] parse text file
Norman Khine wrote:
thanks denis,

On Tue, Feb 2, 2010 at 9:30 AM, spir wrote:
On Mon, 1 Feb 2010 16:30:02 +0100 Norman Khine wrote:
On Mon, Feb 1, 2010 at 1:19 PM, Kent Johnson wrote:
On Mon, Feb 1, 2010 at 6:29 AM, Norman Khine wrote:

thanks, what about the whitespace problem?

\s* will match any amount of whitespace including newlines.

thank you, this worked well. here is the code:

###
import re
file=open('producers_google_map_code.txt', 'r')
data = repr( file.read().decode('utf-8') )

block = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""")
b = block.findall(data)
block_list = []
for html in b:
    namespace = {}
    t = re.compile(r"""(.*)<\/strong>""")
    title = t.findall(html)
    for item in title:
        namespace['title'] = item
    u = re.compile(r"""a href=\"\/(.*)\">En savoir plus""")
    url = u.findall(html)
    for item in url:
        namespace['url'] = item
    g = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""")
    lat = g.findall(html)
    for item in lat:
        namespace['LatLng'] = item
    block_list.append(namespace)
###

can this be made better?

The 3 regex patterns are constants: they can be put out of the loop.

You may also rename b to blocks, and find a more accurate name for block_list; eg block_records, where record = set of (named) fields.

A short desc and/or example of the overall and partial data formats can greatly help later review, since regex patterns alone are hard to decode.

here are the changes:

import re
file=open('producers_google_map_code.txt', 'r')
data = repr( file.read().decode('utf-8') )

get_record = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""")
get_title = re.compile(r"""(.*)<\/strong>""")
get_url = re.compile(r"""a href=\"\/(.*)\">En savoir plus""")
get_latlng = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""")

records = get_record.findall(data)
block_record = []
for record in records:
    namespace = {}
    titles = get_title.findall(record)
    for title in titles:
        namespace['title'] = title
    urls = get_url.findall(record)
    for url in urls:
        namespace['url'] = url
    latlngs = get_latlng.findall(record)
    for latlng in latlngs:
        namespace['latlng'] = latlng
    block_record.append(namespace)

print block_record

The def of "namespace" would be clearer imo in a single line:
    namespace = {title:t, url:url, lat:g}

i am not sure how this will fit into the code!

This also reveals a kind of name confusion, doesn't it?

Denis

Your variable 'file' is hiding a built-in name for the file type. No harm in this example, but it's a bad habit to get into.

What did you intend to happen if the number of titles, urls, and latlngs are not each exactly one? As you have it now, if there's more than one, you spend time adding them all to the dictionary, but only the last one survives. And if there aren't any, you don't make an entry in the dictionary.
If that's the exact behavior you want, then you could replace the loop with an if statement:

(untested)
    if titles:
        namespace['title'] = titles[-1]

On the other hand, if you want a None in your dictionary for missing information, then something like:

(untested)
for record in records:
    titles = get_title.findall(record)
    title = titles[-1] if titles else None
    urls = get_url.findall(record)
    url = urls[-1] if urls else None
    latlngs = get_latlng.findall(record)
    latlng = latlngs[-1] if latlngs else None
    block_record.append( {'title':title, 'url':url, 'lating':latlng} )

DaveA
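The "last match or None" idea can be pulled into a small helper so that the three fields are handled the same way. This is only a sketch in Python 3 syntax: last_or_none and parse_record are names made up here, the url and lat/lng patterns are adapted from the thread, and the title pattern with an opening <strong> tag is an assumption about the data, as are both sample records.

```python
import re

# url and lat/lng patterns adapted from the thread; the title pattern
# (with an opening <strong> tag) is an assumption about the data.
find_titles = re.compile(r"<strong>(.*?)</strong>").findall
find_urls = re.compile(r'a href="/(.*?)">En savoir plus').findall
find_latlngs = re.compile(r"GLatLng\((-?\d+\.\d*),\s*(-?\d+\.\d*)\)").findall

def last_or_none(matches):
    """Return the last item of a list of matches, or None if it is empty."""
    return matches[-1] if matches else None

def parse_record(record):
    """Build one dictionary per record; missing fields become None."""
    return {
        'title': last_or_none(find_titles(record)),
        'url': last_or_none(find_urls(record)),
        'latlng': last_or_none(find_latlngs(record)),
    }

# Invented records: one complete, one with only a title.
complete = ('<strong>ACP</strong> <a href="/acp.html">En savoir plus</a> '
            'GLatLng(9.696333,\n   122.985992)')
partial = '<strong>No coordinates</strong>'

print(parse_record(complete))
print(parse_record(partial))
```

Every dictionary then has the same three keys, which makes later steps (printing, writing a file) simpler than checking for missing keys.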
Re: [Tutor] parse text file
Norman Khine, 02.02.2010 10:16: > get_record = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""") > get_title = re.compile(r"""(.*)<\/strong>""") > get_url = re.compile(r"""a href=\"\/(.*)\">En savoir plus""") > get_latlng = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""") > > records = get_record.findall(data) > block_record = [] > for record in records: > namespace = {} > titles = get_title.findall(record) > for title in titles: > namespace['title'] = title I usually go one step further: find_all_titles = re.compile(r"""(.*)<\/strong>""").findall for record in records: titles = find_all_titles(record) Both faster and more readable (as is so common in Python). Stefan ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
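A self-contained version of the same trick, as a sketch in Python 3 syntax: the compiled pattern is created once and only its bound findall method is kept, which saves an attribute lookup on every loop iteration. The url pattern is adapted from the thread and the two records are invented.

```python
import re

# Compile once and keep only the bound findall method.
find_all_urls = re.compile(r'a href="/(.*?)">En savoir plus').findall

# Invented records for illustration.
records = ['<a href="/acp.html">En savoir plus</a>', 'no link in this one']

results = [find_all_urls(record) for record in records]
print(results)   # [['acp.html'], []]
```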
Re: [Tutor] parse text file
thanks denis, On Tue, Feb 2, 2010 at 9:30 AM, spir wrote: > On Mon, 1 Feb 2010 16:30:02 +0100 > Norman Khine wrote: > >> On Mon, Feb 1, 2010 at 1:19 PM, Kent Johnson wrote: >> > On Mon, Feb 1, 2010 at 6:29 AM, Norman Khine wrote: >> > >> >> thanks, what about the whitespace problem? >> > >> > \s* will match any amount of whitespace includin newlines. >> >> thank you, this worked well. >> >> here is the code: >> >> ### >> import re >> file=open('producers_google_map_code.txt', 'r') >> data = repr( file.read().decode('utf-8') ) >> >> block = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""") >> b = block.findall(data) >> block_list = [] >> for html in b: >> namespace = {} >> t = re.compile(r"""(.*)<\/strong>""") >> title = t.findall(html) >> for item in title: >> namespace['title'] = item >> u = re.compile(r"""a href=\"\/(.*)\">En savoir plus""") >> url = u.findall(html) >> for item in url: >> namespace['url'] = item >> g = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""") >> lat = g.findall(html) >> for item in lat: >> namespace['LatLng'] = item >> block_list.append(namespace) >> >> ### >> >> can this be made better? > > The 3 regex patterns are constants: they can be put out of the loop. > > You may also rename b to blocks, and find a more a more accurate name for > block_list; eg block_records, where record = set of (named) fields. > > A short desc and/or example of the overall and partial data formats can > greatly help later review, since regex patterns alone are hard to decode. 
here are the changes: import re file=open('producers_google_map_code.txt', 'r') data = repr( file.read().decode('utf-8') ) get_record = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""") get_title = re.compile(r"""(.*)<\/strong>""") get_url = re.compile(r"""a href=\"\/(.*)\">En savoir plus""") get_latlng = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""") records = get_record.findall(data) block_record = [] for record in records: namespace = {} titles = get_title.findall(record) for title in titles: namespace['title'] = title urls = get_url.findall(record) for url in urls: namespace['url'] = url latlngs = get_latlng.findall(record) for latlng in latlngs: namespace['latlng'] = latlng block_record.append(namespace) print block_record > > The def of "namespace" would be clearer imo in a single line: > namespace = {title:t, url:url, lat:g} i am not sure how this will fit into the code! > This also reveals a kind of name confusion, doesn't it? > > > Denis > > > > > > > la vita e estrany > > http://spir.wikidot.com/ > ___ > Tutor maillist - tu...@python.org > To unsubscribe or change subscription options: > http://mail.python.org/mailman/listinfo/tutor > ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] parse text file
On Mon, 1 Feb 2010 16:30:02 +0100
Norman Khine wrote:

> On Mon, Feb 1, 2010 at 1:19 PM, Kent Johnson wrote:
> > On Mon, Feb 1, 2010 at 6:29 AM, Norman Khine wrote:
> >
> >> thanks, what about the whitespace problem?
> >
> > \s* will match any amount of whitespace including newlines.
>
> thank you, this worked well.
>
> here is the code:
>
> ###
> import re
> file=open('producers_google_map_code.txt', 'r')
> data = repr( file.read().decode('utf-8') )
>
> block = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""")
> b = block.findall(data)
> block_list = []
> for html in b:
>     namespace = {}
>     t = re.compile(r"""(.*)<\/strong>""")
>     title = t.findall(html)
>     for item in title:
>         namespace['title'] = item
>     u = re.compile(r"""a href=\"\/(.*)\">En savoir plus""")
>     url = u.findall(html)
>     for item in url:
>         namespace['url'] = item
>     g = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""")
>     lat = g.findall(html)
>     for item in lat:
>         namespace['LatLng'] = item
>     block_list.append(namespace)
>
> ###
>
> can this be made better?

The 3 regex patterns are constants: they can be moved out of the loop.

You may also rename b to blocks, and find a more accurate name for block_list; e.g. block_records, where record = set of (named) fields.

A short description and/or example of the overall and partial data formats can greatly help later review, since regex patterns alone are hard to decode.

The definition of "namespace" would be clearer imo in a single line:
    namespace = {title:t, url:url, lat:g}

This also reveals a kind of name confusion, doesn't it?

Denis

la vita e estrany

http://spir.wikidot.com/
Re: [Tutor] parse text file
On Mon, Feb 1, 2010 at 1:19 PM, Kent Johnson wrote: > On Mon, Feb 1, 2010 at 6:29 AM, Norman Khine wrote: > >> thanks, what about the whitespace problem? > > \s* will match any amount of whitespace includin newlines. thank you, this worked well. here is the code: ### import re file=open('producers_google_map_code.txt', 'r') data = repr( file.read().decode('utf-8') ) block = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""") b = block.findall(data) block_list = [] for html in b: namespace = {} t = re.compile(r"""(.*)<\/strong>""") title = t.findall(html) for item in title: namespace['title'] = item u = re.compile(r"""a href=\"\/(.*)\">En savoir plus""") url = u.findall(html) for item in url: namespace['url'] = item g = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""") lat = g.findall(html) for item in lat: namespace['LatLng'] = item block_list.append(namespace) ### can this be made better? > > Kent > ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] parse text file
On Mon, Feb 1, 2010 at 6:29 AM, Norman Khine wrote:
> thanks, what about the whitespace problem?

\s* will match any amount of whitespace, including newlines.

Kent
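A quick check of this on a coordinate pair split over two lines, as a sketch in Python 3 syntax. The sample string is invented and uses a real newline, not the backslash-n text that repr() produces elsewhere in the thread.

```python
import re

# Invented sample: a newline and a run of spaces after the comma.
text = "GLatLng(27.729912,\n                 85.31559)"

# A pattern with one literal space after the comma does not match.
fixed = re.compile(r"GLatLng\((\d+\.\d*), (\d+\.\d*)\)")
print(fixed.findall(text))      # []

# \s* accepts any run of spaces, tabs and newlines (including none).
flexible = re.compile(r"GLatLng\((\d+\.\d*),\s*(\d+\.\d*)\)")
print(flexible.findall(text))   # [('27.729912', '85.31559')]
```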
Re: [Tutor] parse text file
On Mon, Feb 1, 2010 at 10:57 AM, spir wrote: > On Mon, 1 Feb 2010 00:43:59 +0100 > Norman Khine wrote: > >> but this does not take into account of data which has negative values > > just add \-? in front of \d+ thanks, what about the whitespace problem? > > Denis > > > la vita e estrany > > http://spir.wikidot.com/ > ___ > Tutor maillist - tu...@python.org > To unsubscribe or change subscription options: > http://mail.python.org/mailman/listinfo/tutor > -- %>>> "".join( [ {'*':'@','^':'.'}.get(c,None) or chr(97+(ord(c)-83)%26) for c in ",adym,*)&uzq^zqf" ] ) ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] parse text file
On Mon, 1 Feb 2010 00:43:59 +0100
Norman Khine wrote:

> but this does not take into account of data which has negative values

just add \-? in front of \d+

Denis

la vita e estrany

http://spir.wikidot.com/
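With the optional minus sign in place, the matches can be written out in the CSV shape asked for earlier in the thread. This is a sketch in Python 3 syntax using the standard csv module; the (title, url, snippet) triples are invented stand-ins for already extracted records, with names and coordinates taken from the examples in the thread.

```python
import csv
import io
import re

# An optional '-' in front of each number, \s* for the whitespace.
latlng = re.compile(r"GLatLng\((-?\d+\.\d*),\s*(-?\d+\.\d*)\)")

# Invented input: (title, url, snippet) triples standing in for records.
records = [
    ("ACP", "acp.html", "GLatLng(9.696333,\n   122.985992)"),
    ("ALTER TRADE CORPORATION", "alter-trade-corporation.html",
     "GLatLng(-18.889851,\n   -66.770897)"),
]

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
for title, url, snippet in records:
    match = latlng.search(snippet)
    lat, lng = match.groups() if match else ("", "")
    writer.writerow([title, url, lat, lng])

print(buffer.getvalue())
```

Letting the csv module do the quoting avoids problems with titles that contain commas or double quotes.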
Re: [Tutor] parse text file
Hello, I am still unable to get this to work correctly! In [1]: file=open('producers_google_map_code.txt', 'r') In [2]: data = repr( file.read().decode('utf-8') ) In [3]: from BeautifulSoup import BeautifulStoneSoup In [4]: soup = BeautifulStoneSoup(data) In [6]: soup http://paste.lisp.org/display/94195 In [7]: import re In [8]: p = re.compile(r"""GLatLng\((\d+\.\d*)\, \n (\d+\.\d*)\)""") In [9]: r = p.findall(data) In [10]: r Out[10]: [] see http://paste.lisp.org/+20BO/1 i can't seem to get the regex correct (r"""GLatLng\((\d+\.\d*)\, \n (\d+\.\d*)\)""") the problem is that, each for example is: GLatLng(27.729912,\\n 85.31559) GLatLng(-18.889851,\\n -66.770897) i have a big whitespace, plus the group can have a negative value, so if i do this: In [31]: p = re.compile(r"""GLatLng\((\d+\.\d*)\,\\n (\d+\.\d*)\)""") In [32]: r = p.findall(data) In [33]: r Out[33]: [('27.729912', '85.31559'), ('9.696333', '122.985992'), ('17.964625', '102.60040'), ('21.046439', '105.853043'), but this does not take into account of data which has negative values, also i am unsure how to pull it all together. i.e. to return a CSV file such as: "ACP", "acp.html", "9.696333", "122.985992" "ALTER TRADE CORPORATION", "alter-trade-corporation.html", "-18.889851", "-66.770897" Thanks On Sat, Jan 23, 2010 at 12:50 AM, spir wrote: > On Sat, 23 Jan 2010 00:22:41 +0100 > Norman Khine wrote: > >> Hi >> >> On Fri, Jan 22, 2010 at 7:44 PM, spir wrote: >> > On Fri, 22 Jan 2010 14:11:42 +0100 >> > Norman Khine wrote: >> > >> >> but my problem comes when i try to list the GLatLng: >> >> >> >> GLatLng(9.696333, 122.985992); >> >> >> >> >>> StartingWithGLatLng = soup.findAll(re.compile('GLatLng')) >> >> >>> StartingWithGLatLng >> >> [] >> > >> > Don't about soup's findall. But the regex pattern string should rather be >> > something like (untested): >> > r"""GLatLng\(\(d+\.\d*)\, (d+\.\d*)\) """ >> > capturing both integers. 
>> > >> > Denis >> > >> > PS: finally tested: >> > >> > import re >> > s = "GLatLng(9.696333, 122.985992)" >> > p = re.compile(r"""GLatLng\((\d+\.\d*)\, (\d+\.\d*)\)""") >> > r = p.match(s) >> > print r.group() # --> GLatLng(9.696333, 122.985992) >> > print r.groups() # --> ('9.696333', '122.985992') >> > >> > s = "xGLatLng(1.1, 11.22)xxxGLatLng(111.111, .)x" >> > r = p.findall(s) >> > print r # --> [('1.1', '11.22'), ('111.111', >> > '.')] >> >> Thanks for the help, but I can't seem to get the RegEx to work correctly. >> >> Here is my input and output: >> >> http://paste.lisp.org/+20BO/1 > > See my previous examples... > If you use match: > > In [6]: r = p.match(data) > > Then the result is a regex match object (unlike when using findall). To get > the string(s) matched; you need to use the group() and/or groups() methods. > import re p = re.compile('x') print p.match("xabcx") > <_sre.SRE_Match object at 0xb74de6e8> print p.findall("xabcx") > ['x', 'x'] > > Denis > > > la vita e estrany > > http://spir.wikidot.com/ > ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] parse text file
Hi On Fri, Jan 22, 2010 at 7:44 PM, spir wrote: > On Fri, 22 Jan 2010 14:11:42 +0100 > Norman Khine wrote: > >> but my problem comes when i try to list the GLatLng: >> >> GLatLng(9.696333, 122.985992); >> >> >>> StartingWithGLatLng = soup.findAll(re.compile('GLatLng')) >> >>> StartingWithGLatLng >> [] > > Don't about soup's findall. But the regex pattern string should rather be > something like (untested): > r"""GLatLng\(\(d+\.\d*)\, (d+\.\d*)\) """ > capturing both integers. > > Denis > > PS: finally tested: > > import re > s = "GLatLng(9.696333, 122.985992)" > p = re.compile(r"""GLatLng\((\d+\.\d*)\, (\d+\.\d*)\)""") > r = p.match(s) > print r.group() # --> GLatLng(9.696333, 122.985992) > print r.groups() # --> ('9.696333', '122.985992') > > s = "xGLatLng(1.1, 11.22)xxxGLatLng(111.111, .)x" > r = p.findall(s) > print r # --> [('1.1', '11.22'), ('111.111', > '.')] Thanks for the help, but I can't seem to get the RegEx to work correctly. Here is my input and output: http://paste.lisp.org/+20BO/1 > > > la vita e estrany > > http://spir.wikidot.com/ > -- %>>> "".join( [ {'*':'@','^':'.'}.get(c,None) or chr(97+(ord(c)-83)%26) for c in ",adym,*)&uzq^zqf" ] ) ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Parse Text File
> > Hi Denis, > > > > Thanks for your input. So i decided i should use a pyparser and try it > (im a > > relative python noob though!) > Hi Everyone! I have made some progress, although i believe it mainly due to luck and not a lot of understanding (vague understanding maybe). Hopefully this can help someone else out... This is due to Combine(), that glues (back) together matched string bits. To > work safely, it disables the default separator-skipping behaviour of > pyparsing. So that > real = Combine(integral+fractional) > would correctly not match "1 .2". Right? > See a recent reply by Paul MacGuire about this topic on the pyparsing list > http://sourceforge.net/mailarchive/forum.php?thread_name=FE0E2B47198D4F73B01E263034BDCE3C%40AWA2&forum_name=pyparsing-usersand > the pointer he gives there. > There are several ways to correctly cope with that. > ^ was a useful link - I still sometime struggle with the whitespaces and combine / group... Below is my code that works as I expect (i think...) 
#!/usr/bin/python import sys from pyparsing import alphas, nums, ZeroOrMore, Word, Group, Suppress, Combine, Literal, OneOrMore, SkipTo, printables, White text=''' [04 Jun 2009] DSA-1812-1 apr-util - several vulnerabilities {CVE-2009-0023 CVE-2009-1955 CVE-2009-1243} [etch] - apr-util 1.2.7+dfsg-2+etch2 [lenny] - apr-util 1.2.12+dfsg-8+lenny2 [01 Jun 2009] DSA-1808-1 drupal6 - insufficient input sanitising {CVE-2009-1844} [lenny] - drupal6 6.6-3lenny2 [01 Jun 2009] DSA-1807-1 cyrus-sasl2 cyrus-sasl2-heimdal - arbitrary code execution {CVE-2009-0688} [lenny] - cyrus-sasl2-heimdal 2.1.22.dfsg1-23+lenny1 [lenny] - cyrus-sasl2 2.1.22.dfsg1-23+lenny1 [etch] - cyrus-sasl2 2.1.22.dfsg1-8+etch1 ''' lsquare = Literal('[') rsquare = Literal(']') lbrace = Literal('{') rbrace = Literal('}') dash = Literal('-') space = White('\x20') newline = White('\n') spaceapp = White('\x20') + Literal('-') + White('\x20') spaceseries = White('\t') date = Combine(lsquare.suppress() + Word(nums, exact=2) + Word(alphas) + Word(nums, exact=4) + rsquare.suppress(),adjacent=False,joinString='-') dsa = Combine(Literal('DSA') + dash + Word(nums, exact=4) + dash + Word(nums, exact=1)) app = Combine(Word(printables) + SkipTo(spaceapp)) desc = Combine(spaceapp.suppress() + ZeroOrMore(Word(alphas)) + SkipTo(newline)) cve = Combine(lbrace.suppress() + OneOrMore(Literal('CVE') + dash + Word(nums, exact=4) + dash + Word(nums, exact=4) + SkipTo(rbrace) + Suppress(rbrace) + SkipTo(newline))) series = OneOrMore(Group(lsquare.suppress() + OneOrMore(Literal('lenny') ^ Literal('etch') ^ Literal('sarge')) + rsquare.suppress() + spaceapp.suppress() + Word(printables) + SkipTo(newline))) record = date + dsa + app + desc + cve + series def parse(text): for data,dataStart,dataEnd in record.scanString(text): yield data for i in parse(text): print i My output is as follows ['04-Jun-2009', 'DSA-1812-1', 'apr-util', 'several vulnerabilities', 'CVE-2009-0023 CVE-2009-1955 CVE-2009-1243', ['etch', 'apr-util', 
'1.2.7+dfsg-2+etch2'], ['lenny', 'apr-util', '1.2.12+dfsg-8+lenny2']] ['01-Jun-2009', 'DSA-1808-1', 'drupal6', 'insufficient input sanitising', 'CVE-2009-1844', ['lenny', 'drupal6', '6.6-3lenny2']] ['01-Jun-2009', 'DSA-1807-1', 'cyrus-sasl2 cyrus-sasl2-heimdal', 'arbitrary code execution', 'CVE-2009-0688', ['lenny', 'cyrus-sasl2-heimdal', '2.1.22.dfsg1-23+lenny1'], ['lenny', 'cyrus-sasl2', '2.1.22.dfsg1-23+lenny1'], ['etch', 'cyrus-sasl2', '2.1.22.dfsg1-8+etch1']] Thanks for everyone that offered assistance and prodding in right directions. Stefan ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Parse Text File
[Hope you don't mind I copy to the list. Not only it can help others, but pyparsing users read tutor, including Paul MacGuire (author).] Le Thu, 11 Jun 2009 11:53:31 +0200, Stefan Lesicnik s'exprima ainsi: [...] I cannot really answer precisely for haven't used pyparsing for a while (*). So, below are only some hints. > Hi Denis, > > Thanks for your input. So i decided i should use a pyparser and try it (im a > relative python noob though!) > > This is what i have so far... > > import sys > from pyparsing import alphas, nums, ZeroOrMore, Word, Group, Suppress, > Combine, Literal, alphanums, Optional, OneOrMore, SkipTo, printables > > text=''' > [04 Jun 2009] DSA-1812-1 apr-util - several vulnerabilities > {CVE-2009-0023 CVE-2009-1955} > [etch] - apr-util 1.2.7+dfsg-2+etch2 > [lenny] - apr-util 1.2.12+dfsg-8+lenny2 > ''' > > date = Combine(Literal('[') + Word(nums, exact=2) + Word(alphas) + > Word(nums, exact=4) + Literal(']'),adjacent=False) > dsa = Combine(Word(alphanums) + Literal('-') + Word(nums, exact=4) + > Literal('-') + Word(nums, exact=1),adjacent=False) > app = Combine(OneOrMore(Word(printables)) + SkipTo(Literal('-'))) > desc = Combine(Literal('-') + ZeroOrMore(Word(alphas)) + > SkipTo(Literal('\n'))) > cve = Combine(Literal('{') + OneOrMore(Literal('CVE') + Literal('-') + > Word(nums, exact=4) + Literal('-') + Word(nums, exact=4)) ) > > record = date + dsa + app + desc + cve > > fields = record.parseString(text) > #fields = dsa.parseString(text) > print fields > > > What i get out of this is > > ['[04Jun2009]', 'DSA-1812-1', 'apr-util ', '- several vulnerabilities', > '{CVE-2009-0023'] > > Which i guess it heading towards the right track... For sure! Rather impressing you could write this so fast. Hope my littel PEG grammar helped. There seems to be some detail issues, such as in the app pattern I would write ...+ SkipTo(Literal(' - ')) Also, you could directly Suppress() probably useless delimiters such as [...] in date. 
Think at post-parse funcs to transform and/or reformat nodes: search for setParseAction() and addParseAction() in the doc. > I am unsure why I am not getting more than 1 CVE... I have the OneOrMore > match for the CVE stuff... This is due to Combine(), that glues (back) together matched string bits. To work safely, it disables the default separator-skipping behaviour of pyparsing. So that real = Combine(integral+fractional) would correctly not match "1 .2". Right? See a recent reply by Paul MacGuire about this topic on the pyparsing list http://sourceforge.net/mailarchive/forum.php?thread_name=FE0E2B47198D4F73B01E263034BDCE3C%40AWA2&forum_name=pyparsing-users and the pointer he gives there. There are several ways to correctly cope with that. > That being said, how does the parser scale across multiple lines and how > will it know that its finished? Basically, you probably should express line breaks explicitely, esp. because they seem to be part of the source format. Otherwise, there is a func or method to define which chars should be skipped as separators (default is sp/tab if I remember well). > Should i maybe look at getting the list first into one entry per line? (must > be easier to parse then?) What makes sense I guess is Group()-ing items that *conceptually* build a list. In your case, I see: * CVS items inside {...} * version entry lines ("[etch]...", "[lenny]...", ...) * whole records at a higher level > This parsing is a mini language in itself! Sure! A kind of rather big & complex parsing language. Hard to know it all well (and I don't even speak of all builtin helpers, and even less of all what you can do by mixing ordinary python code inside the grammar/parser: a whole new field in parsing/processing). > Thanks for your input :) My pleasure... > Stefan Denis (*) The reason is I'm developping my own parsing tool; see http://spir.wikidot.com/pijnu. The guide is also intended as a parsing tutorial, it may help, but is not exactly up-to-date. 
-- la vita e estrany ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Parse Text File
On Wed, Jun 10, 2009 at 12:44 PM, Stefan Lesicnik wrote: > Hi Guys, > > I have the following text > > [08 Jun 2009] DSA-1813-1 evolution-data-server - several vulnerabilities > {CVE-2009-0547 CVE-2009-0582 CVE-2009-0587} > [etch] - evolution-data-server 1.6.3-5etch2 > [lenny] - evolution-data-server 2.22.3-1.1+lenny1 > [04 Jun 2009] DSA-1812-1 apr-util - several vulnerabilities > {CVE-2009-0023 CVE-2009-1955} > [etch] - apr-util 1.2.7+dfsg-2+etch2 > [lenny] - apr-util 1.2.12+dfsg-8+lenny2 > > ... (and a whole lot more) > > I would like to parse this so I can get it into a format I can work with. > > I don't know anything about parsers, and my brief google has made me think > im not sure I wan't to know about them quite yet! :) > (It looks very complex) > > For previous fixed string things, i would normally split each line and > address each element, but this is not the case as there could be multiple > [lenny] or even other entries. > > I would like to parse from the date to the next date and treat that all as > one element (if that makes sense) > > Does anyone have any suggestions - should I be learning a parser for doing > this? Or is there perhaps an easier way. > > Tia! > > Stefan Hello, maybe if you would show a sample on how you would like the ouput to look like it could help us give more suggestions. Regards, Eduardo ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
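For a format this regular, a full parser is not the only option. As one possible "easier way", here is a sketch that uses only the standard re module (a different technique from the pyparsing approach discussed in the replies), in Python 3 syntax. It assumes each record starts with a "[DD Mon YYYY]" line at the start of a line and that the lines which follow belong to that record; the indentation of the sample and the dictionary keys are choices made for this example.

```python
import re

# Sample from the question; the indentation is an assumption.
text = """\
[08 Jun 2009] DSA-1813-1 evolution-data-server - several vulnerabilities
        {CVE-2009-0547 CVE-2009-0582 CVE-2009-0587}
        [etch] - evolution-data-server 1.6.3-5etch2
        [lenny] - evolution-data-server 2.22.3-1.1+lenny1
[04 Jun 2009] DSA-1812-1 apr-util - several vulnerabilities
        {CVE-2009-0023 CVE-2009-1955}
        [etch] - apr-util 1.2.7+dfsg-2+etch2
        [lenny] - apr-util 1.2.12+dfsg-8+lenny2
"""

# Header line: [date] DSA-id application(s) - description
header = re.compile(r"^\[(\d{2} \w{3} \d{4})\] (DSA-\d+-\d+) (.+?) - (.+)$")
# Release line: [release] - package version
release = re.compile(r"^\s*\[(\w+)\] - (\S+) (\S+)$")
cve = re.compile(r"CVE-\d{4}-\d{4,}")

def parse(text):
    """Group lines into records, starting a new record at each date line."""
    records = []
    for line in text.splitlines():
        m = header.match(line)
        if m:
            date, dsa, app, desc = m.groups()
            records.append({'date': date, 'dsa': dsa, 'app': app,
                            'desc': desc, 'cves': [], 'releases': []})
            continue
        if not records:
            continue  # ignore anything before the first date line
        m = release.match(line)
        if m:
            records[-1]['releases'].append(m.groups())
        else:
            records[-1]['cves'].extend(cve.findall(line))
    return records

records = parse(text)
for record in records:
    print(record)
```

Each record ends up as one dictionary with a list of CVE ids and a list of (release, package, version) tuples, so the number of [etch] or [lenny] lines per record does not matter.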