[issue18946] HTMLParser should ignore errors when parsing text in script tags
New submission from James Lu:

It will show invalid HTML inside of script tags, for example, at the learners dictionary:

function output_creative (id) { document.write (div id=' + id + ' + scr + ipt type='text/javascript'\r\n + googletag.cmd.push(function() { googletag.display(' + id + '); });\r\n + /sc + ript + invalid end tag /div); };

It thinks /sc + ript is an actual end tag.

----------
messages: 197077
nosy: James.Lu
priority: normal
severity: normal
status: open
title: HTMLParser should ignore errors when parsing text in script tags

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18946
___
___
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue18946] HTMLParser should ignore errors when parsing text in script tags
Ezio Melotti added the comment:

This should be fixed in 2.7 and 3.2+. Try with a more recent version of Python, and if you still have problems feel free to reopen the issue.

----------
components: +Library (Lib)
resolution: -> out of date
stage: -> committed/rejected
status: open -> closed
type: -> behavior
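
A minimal sketch of how the fixed behaviour can be checked on Python 3, assuming the standard `html.parser` module. The `TagLogger` class and the `page` string are made up for illustration; they are not taken from the report. On a fixed version, text inside `<script>` is passed through as data, so a closing tag assembled from string pieces is not reported as an end tag.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every end tag and text chunk the parser reports."""

    def __init__(self):
        super().__init__()
        self.end_tags = []
        self.text = []

    def handle_endtag(self, tag):
        self.end_tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


# Hypothetical page: the script builds a closing tag from two string pieces.
page = "<script>document.write('<div>' + '</sc' + 'ript>' + '</div>');</script>"

logger = TagLogger()
logger.feed(page)
logger.close()
print(logger.end_tags)
```

If the list printed contains anything other than the real closing script tag, the interpreter in use probably predates the fix mentioned above.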
[issue18946] HTMLParser should ignore errors when parsing text in script tags
Ezio Melotti added the comment:

What version of Python are you using?

----------
nosy: +ezio.melotti
[issue18946] HTMLParser should ignore errors when parsing text in script tags
James Lu added the comment:

2.5, but I don't think the library has changed since.

james

On Fri, Sep 6, 2013 at 12:29 PM, Ezio Melotti rep...@bugs.python.org wrote:
> Ezio Melotti added the comment:
>
> What version of Python are you using?
Parsing Text file
I have a text file like this:

Sometext
Somemore
Somemore
maskit
Sometext
Somemore
Somemore
Somemore
maskit
Sometext
Somemore
maskit

I want to search for the string maskit in this file and also print the Sometext above it. The location of Sometext can vary, as you can see above: in the first instance it is 3 lines above maskit, in the second instance it is 4 lines above it, and so on.

Please help with how to do it.
-- http://mail.python.org/mailman/listinfo/python-list
Re: Parsing Text file
On 2013-07-02, sas4...@gmail.com <sas4...@gmail.com> wrote:
> I want to search for the string maskit in this file and also
> need to print Sometext above it..SOmetext location can vary as
> you can see above. In the first instance it is 3 lines above
> mask it, in the second instance it is 4 lines above it and so
> on..
>
> Please help how to do it?

How can you tell the difference between Sometext and Somemore?

-- 
Neil Cerutti
Re: Parsing Text file
Somemore can be anything, for instance:

Sometext
mail
maskit

Sometext
rupee
dollar
maskit

and so on.

Is there a way I can achieve this?

On Tuesday, July 2, 2013 2:24:26 PM UTC-5, Neil Cerutti wrote:
> How can you tell the difference between Sometext and Somemore?
Re: Parsing Text file
On 07/02/2013 12:30 PM, sas4...@gmail.com wrote:
> Somemore can be anything for instance:
>
> Sometext
> mail
> maskit
>
> Sometext
> rupee
> dollar
> maskit
>
> and so on..
>
> Is there a way I can achieve this?

How do we know whether we have Sometext?
If it's really just a literal 'Sometext', then
just print that when you hit maskit.

Otherwise:

for line in open('file.txt').readlines():

    if is_sometext(line):
        memory = line

    if line == 'maskit':
        print memory
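
A runnable Python 3 sketch of the same idea, under stated assumptions: lines read from a file keep their trailing newline, so a bare comparison with 'maskit' would not match and the line is stripped first. The function name `headers_before_marker`, the sample lines, and the predicate passed as `is_sometext` are all hypothetical, since the thread never says how a Sometext line can be recognised.

```python
def headers_before_marker(lines, is_sometext, marker="maskit"):
    """Return the most recently seen header for each marker line."""
    found = []
    memory = None
    for raw in lines:
        line = raw.strip()  # drop the trailing newline before comparing
        if is_sometext(line):
            memory = line
        if line == marker and memory is not None:
            found.append(memory)
    return found


# Hypothetical sample: header lines are assumed to start with "Header".
sample = [
    "Header A\n", "mail\n", "maskit\n",
    "Header B\n", "rupee\n", "dollar\n", "maskit\n",
]
result = headers_before_marker(sample, lambda s: s.startswith("Header"))
print(result)
```

Passing an open file object instead of the list works the same way, because iterating a file yields one line at a time.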
Re: Parsing Text file
On 2013-07-02, Tobiah <t...@tobiah.org> wrote:
> How do we know whether we have Sometext?
> If it's really just a literal 'Sometext', then
> just print that when you hit maskit.
>
> Otherwise:
>
> for line in open('file.txt').readlines():
>
>     if is_sometext(line):
>         memory = line
>
>     if line == 'maskit':
>         print memory

Tobiah's solution fits what little we can make of your problem.

My feeling is that you've simplified your question a little too
much in hopes that it would help us provide a better solution.
Can you provide more context?

-- 
Neil Cerutti
Re: Parsing Text file
On 2 July 2013 20:50, Tobiah <t...@tobiah.org> wrote:
> How do we know whether we have Sometext?
> If it's really just a literal 'Sometext', then
> just print that when you hit maskit.
>
> Otherwise:
>
> for line in open('file.txt').readlines():
>
>     if is_sometext(line):
>         memory = line
>
>     if line == 'maskit':
>         print memory

My understanding of the question follows more like:

# Python 3, UNTESTED

memory = []
for line in open('file.txt').readlines():
    if line == 'maskit':
        print(*memory, sep='')

    elif line:
        memory.append(line)

    else:
        memory = []
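
A tested variant of this second reading, as a sketch only. It assumes blocks are separated by blank lines (the thread does not confirm this) and strips each line, because lines read from a file end in a newline and would otherwise never equal 'maskit' or be empty. The name `blocks_before_marker` and the sample data are hypothetical.

```python
def blocks_before_marker(lines, marker="maskit"):
    """Collect the lines remembered so far each time the marker is seen."""
    blocks = []
    memory = []
    for raw in lines:
        line = raw.strip()
        if line == marker:
            blocks.append(list(memory))  # copy, so later appends do not alter it
        elif line:
            memory.append(line)
        else:
            memory = []  # blank line: start a new block
    return blocks


sample = [
    "Sometext\n", "mail\n", "maskit\n",
    "\n",
    "Sometext\n", "rupee\n", "dollar\n", "maskit\n",
]
blocks = blocks_before_marker(sample)
print(blocks)
```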
Re: Parsing Text file
Ok, here is a snippet of the text file I have:

config/meal/governor_mode_config.h
#define GOVERNOR_MODE_TASK_RATE SSS_TID_0015MSEC
#define GOVERNOR_MODE_WORK_MODE_MASK (CEAL_MODE_WORK_MASK_GEAR | \
 CEAL_MODE_WORK_MASK_PARK_BRAKE | \
 CEAL_MODE_WORK_MASK_VEHICLE_SPEED)
#define GOVERNOR_MODE_IDLE_CHECK FALSE
#define GOVERNOR_MODE_SPD_THRES 50
#define GOVERNOR_MODE_SPDDES_THRES 10
config/meal/components/source/kso_aic_core_config.h
#define CEAL_KSO_AIC_CORE_TASK_RATE SSS_TID_0120MSEC
#define CEAL_KSO_AIC_LOAD_FAC_AVG_TIME 300
#define CEAL_KSO_AIC_LOAD_FAC_HYST_TIME 30
#define CEAL_KSO_AIC_TEMP_DPF_INSTALLED TRUE
#define CEAL_KSO_AIC_TEMP_DPF_ENABLE 450
#define CEAL_KSO_AIC_TEMP_DPF_HYST 25
#define CEAL_KSO_AIC_DPF_ROC_TIME 10
#define CEAL_KSO_AIC_TEMP_EXHAUST_INSTALLED FALSE
#define CEAL_KSO_AIC_TEMP_EXHAUST_ENABLE 275
#define CEAL_KSO_AIC_TEMP_EXHAUST_HYST 25
#define CEAL_KSO_AIC_EXHAUST_ROC_TIME 10
#define CEAL_KSO_AIC_WORK_MODE_MASK (CEAL_MODE_WORK_MASK_GEAR | \
 CEAL_MODE_WORK_MASK_PARK_BRAKE | \
 CEAL_MODE_WORK_MASK_VEHICLE_SPEED)
#define CEAL_KSO_AIC_OV_TIME 15

Here I am looking for the line that contains WORK_MODE_MASK. I want to print that line as well as the file name above it: config/meal/governor_mode_config.h or config/meal/components/source/ceal_PackD_kso_aic_core_config.h.

So the output should be something like this:

config/meal/governor_mode_config.h
#define GOVERNOR_MODE_WORK_MODE_MASK (CEAL_MODE_WORK_MASK_GEAR | \
 CEAL_MODE_WORK_MASK_PARK_BRAKE | \
 CEAL_MODE_WORK_MASK_VEHICLE_SPEED)
config/meal/components/source/kso_aic_core_config.h
#define CEAL_KSO_AIC_WORK_MODE_MASK (CEAL_MODE_WORK_MASK_GEAR | \
 CEAL_MODE_WORK_MASK_PARK_BRAKE | \
 CEAL_MODE_WORK_MASK_VEHICLE_SPEED)

I hope this helps. Thanks for your help.

On Tuesday, July 2, 2013 3:12:55 PM UTC-5, Neil Cerutti wrote:
> Tobiah's solution fits what little we can make of your problem.
>
> My feeling is that you've simplified your question a little too
> much in hopes that it would help us provide a better solution.
> Can you provide more context?
Re: Parsing Text file
On 2 July 2013 21:28, sas4...@gmail.com wrote:
> Here I am looking for the line that contains: WORK_MODE_MASK, I want to
> print that line as well as the file name above it:
> config/meal/governor_mode_config.h
> or config/meal/components/source/ceal_PackD_kso_aic_core_config.h.

(Please don't top-post.)

filename = None

with open("tmp.txt") as file:
    nonblanklines = (line for line in file if line)

    for line in nonblanklines:
        if line.lstrip().startswith("#define"):
            defn, name, *other = line.split()
            if name.endswith("WORK_MODE_MASK"):
                print(filename, line, sep="")

        else:
            filename = line

Basically, you loop through remembering what lines you need, match a
little bit and ignore blank lines. If this isn't a solid
specification, you'll 'ave to tell me more about the edge-cases.

You said that

    #define CEAL_KSO_AIC_WORK_MODE_MASK (CEAL_MODE_WORK_MASK_GEAR | \
     CEAL_MODE_WORK_MASK_PARK_BRAKE | \
     CEAL_MODE_WORK_MASK_VEHICLE_SPEED)

was one line. If it is not, I suggest doing a pre-process to wrap
lines with trailing backslashes before running the algorithm:

def wrapped(lines):
    wrap = ""
    for line in lines:
        if line.rstrip().endswith("\\"):
            wrap += line
        else:
            yield wrap + line
            wrap = ""

...
    nonblanklines = (line for line in wrapped(file) if line)
...

This doesn't handle all wrapped lines properly, as it leaves the "\"
in so may interfere with matching. That's easily fixable, and there
are many other ways to do this.

What did you try?
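
One possible way to make the continuation handling drop the backslashes, offered as a sketch rather than as the poster's own fix. The generator name `join_continuations` and the sample lines are hypothetical. Each physical line is stripped, a trailing backslash is removed, and the pieces of one logical line are joined with single spaces.

```python
def join_continuations(lines):
    """Yield logical lines, merging physical lines that end in a backslash."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if line.endswith("\\"):
            parts.append(line[:-1].strip())  # keep the text, drop the backslash
        else:
            parts.append(line)
            yield " ".join(parts)
            parts = []
    if parts:  # input ended in the middle of a continued line
        yield " ".join(parts)


sample = [
    "config/a.h\n",
    "#define X_WORK_MODE_MASK (A | \\\n",
    "    B | \\\n",
    "    C)\n",
    "#define Y 1\n",
]
logical = [line for line in join_continuations(sample) if line]
print(logical)
```

Blank input lines come out as empty strings, which is why the list comprehension filters on truthiness after joining.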
Re: Parsing Text file
On Tue, 02 Jul 2013 13:28:33 -0700, sas429s wrote:
> Ok here is a snippet of the text file I have:
> I hope this helps..
> Thanks for your help

OK, so you need to figure out how best to distinguish the filename,
then loop through the file, remember each filename as you find it, and
when you find lines containing your target text, print the current value
of filename and the target text line.

Filenames might be distinguished by one or more of the following:

They always start in column 0 and nothing else starts in column 0
They never contain spaces and all other lines contain spaces or are blank
They always contain at least one / character
They always terminate with a . followed by one or more characters
All the characters in them are lower case

Then loop through the file in something like the following manner:

open input file;
open output file;
for each line in input file: {
    if line is a filename: {
        thisfile = line; }
    elif line matches search term: {
        print thisfile in output file;
        print line in output file; } }
close input file;
close output file;

(Note this is an algorithm written in a sort of pythonic manner, rather
than actual python code. Also, because some newsreaders may break
indenting etc., I've used ; as line terminators and {} to group blocks.)

-- 
Denis McMahon, denismfmcma...@gmail.com
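
The algorithm above could be sketched in Python 3 roughly as follows. The helper names `looks_like_filename` and `find_matches` are hypothetical, and the filename test combines only three of the suggested heuristics (starts in column 0, contains a "/", contains no spaces); whether those hold for the real file is an assumption. Results are returned as a list instead of being written to an output file.

```python
def looks_like_filename(line):
    """Heuristic: starts in column 0, contains '/', contains no spaces."""
    return bool(line) and not line[0].isspace() and "/" in line and " " not in line


def find_matches(lines, term="WORK_MODE_MASK"):
    """Return (filename, matching line) pairs in the order they appear."""
    results = []
    thisfile = None
    for raw in lines:
        line = raw.rstrip("\n")
        if looks_like_filename(line):
            thisfile = line
        elif term in line:
            results.append((thisfile, line.strip()))
    return results


sample = [
    "config/meal/governor_mode_config.h\n",
    "#define GOVERNOR_MODE_TASK_RATE SSS_TID_0015MSEC\n",
    "#define GOVERNOR_MODE_WORK_MODE_MASK (CEAL_MODE_WORK_MASK_GEAR | \\\n",
    "    CEAL_MODE_WORK_MASK_PARK_BRAKE)\n",
    "config/meal/components/source/kso_aic_core_config.h\n",
    "#define CEAL_KSO_AIC_OV_TIME 15\n",
]
matches = find_matches(sample)
for name, text in matches:
    print(name)
    print(text)
```

Only the first physical line of a continued #define is reported here; continuation lines would need to be joined beforehand if the whole definition is wanted.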
Re: parsing text from ethtool command
On Nov 1, 7:35 pm, Ian Kelly <ian.g.ke...@gmail.com> wrote:
> On Tue, Nov 1, 2011 at 5:19 PM, Miki Tebeka <miki.teb...@gmail.com> wrote:
> > In my box, there are some spaces (tabs?) before "Speed". IMO
> > re.search("Speed", line) will be a more robust.
>
> Or simply:
>
> if "Speed" in line:
>
> There is no need for a regular expression here.  This would also work
> and be a bit more discriminating:
>
> if line.strip().startswith("Speed")
>
> BTW, to the OP, note that your condition (line[0:6] == "Speed") cannot
> match, since line[0:6] is a 6-character slice, while "Speed" is a
> 5-character string.
>
> Cheers,
> Ian

Ian,

Replacing my regular expression with line.strip().startswith did the
trick.  Thanks for the tip!

Paul
Re: parsing text from ethtool command
extraspecialbitter wrote:
> I'm still trying to write that seemingly simple Python script to print
> out network interfaces (as found in the ifconfig -a command) and their
> speed (ethtool interface).  The idea is to loop for each interface
> and print out its speed.  I'm looping correctly, but have some
> issues parsing the output for all interfaces except for the pan0
> interface.  I'm running on eth1, and the ifconfig -a command also
> shows an eth0, and of course lo.  My script is trying to match on the
> string "Speed", but I never seem to successfully enter the if
> clause.
Hi,

Without starting a flamewar about regular expressions, they sometimes can become useful and really simplify code:

s1 = """eth0      Link encap:Ethernet  HWaddr 00:1d:09:2b:d2:be
          inet addr:192.168.200.176  Bcast:192.168.200.255  Mask:255.255.255.0
          inet6 addr: fe80::21d:9ff:fe2b:d2be/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:297475688 errors:0 dropped:7 overruns:0 frame:2
          TX packets:248662722 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2795194692 (2.6 GiB)  TX bytes:2702265420 (2.5 GiB)
          Interrupt:17

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:5595504 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5595504 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1601266268 (1.4 GiB)  TX bytes:1601266268 (1.4 GiB)
"""

import re

itfs = [section for section in s1.split('\n\n') if section and section != '\n'] # list of interface sections, filter the empty sections

for itf in itfs:
    match = re.search('^(\w+)', itf) # search the word at the beginning of the section
    interface = match and match.group(1)
    match = re.search('MTU:(\d+)', itf) # search for the field MTU: and capture its numeric value
    mtu = (match and match.group(1)) or 'MTU not found'
    print interface, mtu

eth0 1500
lo 16436

If you're not familiar with Python regexps, I would advise using kodos.py (google it); it really does help.
The strong point of the code above is that it removes all the tedious if/then/else logic and the arbitrary slice indexes.

JM

PS: I cannot test the 'Speed' because it's absent from my ifconfig display, but you should be able to figure it out :o)
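
Applying the same regular-expression approach to the Speed field might look like the sketch below, written for Python 3. The function name `parse_speed` and the shortened `ETHTOOL_SAMPLE` text are hypothetical; the sample imitates the layout of ethtool output quoted in this thread, with a leading tab on each field line.

```python
import re

# Shortened, made-up excerpt in the style of ethtool output.
ETHTOOL_SAMPLE = """Settings for eth1:
\tSupported ports: [ TP ]
\tSpeed: 100Mb/s
\tDuplex: Full
"""


def parse_speed(ethtool_output, default="(Speed Unknown)"):
    """Return the value of the Speed field, or a default if it is absent."""
    # re.MULTILINE lets ^ match at the start of every line, not just the first.
    match = re.search(r"^\s*Speed:\s*(\S+)", ethtool_output, re.MULTILINE)
    return match.group(1) if match else default


print(parse_speed(ETHTOOL_SAMPLE))
print(parse_speed("Settings for lo:\n"))
```

Because the pattern allows leading whitespace, it does not depend on the field starting at a fixed column.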
parsing text from ethtool command
I'm still trying to write that seemingly simple Python script to print
out network interfaces (as found in the ifconfig -a command) and their
speed (ethtool interface).  The idea is to loop for each interface
and print out its speed.  I'm looping correctly, but have some
issues parsing the output for all interfaces except for the pan0
interface.  I'm running on eth1, and the ifconfig -a command also
shows an eth0, and of course lo.  My script is trying to match on the
string "Speed", but I never seem to successfully enter the if
clause.

First, here is the output of "ethtool eth1":

=====

Settings for eth1:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: off
        Supports Wake-on: pumbag
        Wake-on: g
        Current message level: 0x0001 (1)
        Link detected: yes

=====

The script *should* match on the string "Speed" and then assign "100Mb/s"
to a variable, but is never getting past the second if statement
below:

=====

#!/usr/bin/python

# Quick and dirty script to print out available interfaces and their speed

# Initializations

output = "Interface: %s Speed: %s"
noinfo = "(Speed Unknown)"
speed = noinfo

import os, socket, types, subprocess

fp = os.popen("ifconfig -a")
dat=fp.read()
dat=dat.split('\n')
for line in dat:
    if line[10:20] == "Link encap":
        interface=line[:9]
        cmd = "ethtool " + interface
        gp = os.popen(cmd)
        fat=gp.read()
        fat=fat.split('\n')

        for line in fat:
            if line[0:6] == "Speed":
                try:
                    speed=line[8:]
                except:
                    speed=noinfo

        print output % (interface, speed)

=====

Again, I appreciate everyone's patience, as I'm obviously a Python
newbie.  Thanks in advance!
Re: parsing text from ethtool command
In my box, there are some spaces (tabs?) before "Speed". IMO re.search("Speed", line) will be more robust.
Re: parsing text from ethtool command
On Tue, Nov 1, 2011 at 5:19 PM, Miki Tebeka <miki.teb...@gmail.com> wrote:
> In my box, there are some spaces (tabs?) before "Speed". IMO
> re.search("Speed", line) will be a more robust.

Or simply:

if "Speed" in line:

There is no need for a regular expression here.  This would also work
and be a bit more discriminating:

if line.strip().startswith("Speed")

BTW, to the OP, note that your condition (line[0:6] == "Speed") cannot
match, since line[0:6] is a 6-character slice, while "Speed" is a
5-character string.

Cheers,
Ian
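
The slice problem and the suggested replacement can be seen side by side in a short Python 3 sketch. The sample `line` (with a leading tab) and the helper name `speed_from_line` are hypothetical.

```python
line = "\tSpeed: 100Mb/s"

# A six-character slice can never equal the five-character string "Speed".
six_chars = line[0:6]
print(len(six_chars), six_chars == "Speed")


def speed_from_line(line):
    """Return the text after 'Speed:' or None if this is another field."""
    stripped = line.strip()  # ignore leading tabs or spaces
    if stripped.startswith("Speed:"):
        return stripped.split(":", 1)[1].strip()
    return None


print(speed_from_line(line))
```

Checking for "Speed:" with the colon is slightly stricter than checking for "Speed" alone, which is a choice made here and not something the thread requires.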
Re: Parsing text
iainemsley <iainemsley at googlemail.com> writes:
> I'm trying to refine this and get the relevant scene number when the
> split is made but I keep getting an NoneType error trying to read the
> block inside the for loop and nothing is being returned.
>
> for scene in text.split('Scene'):
>     num = re.compile("^\s\[0-9, i{1,4}, v]", re.I)
>     textNum = num.match(scene)

Are you trying to match Roman numerals? As others have said, it is difficult to make any suggestions without knowing the input to your program.

You may want to look at PyParsing (http://pyparsing.wikispaces.com/) to parse the text file without messing with regular expressions.

Regards,
Suraj
Parsing text
Hi,
I'm trying to write a fairly basic text parser to split up scenes and
acts in plays to put them into XML. I've managed to get the text split
into the blocks of scenes and acts and returned correctly, but I'm
trying to refine this and get the relevant scene number when the split
is made. I keep getting a NoneType error trying to read the block
inside the for loop, and nothing is being returned. I'd be grateful for
some suggestions as to how to get this working.

for scene in text.split('Scene'):
    num = re.compile("^\s\[0-9, i{1,4}, v]", re.I)
    textNum = num.match(scene)
    if textNum:
        print textNum
    else:
        print "No scene number"
    m = '<div type="scene>'
    m += scene
    m += '<\div>'
    print m

Thanks,
Iain
Re: Parsing text
On Wed, May 6, 2009 at 2:32 PM, iainemsley <iainems...@googlemail.com> wrote:
> I keep getting an NoneType error trying to read the
> block inside the for loop and nothing is being returned.
>
>     m = '<div type="scene>'
>     m += scene
>     m += '<\div>'
>     print m

Can you provide some sample input so we can recreate the problem?

Also, consider something like this instead of the concatenation:

m = '<div type="scene">%s</div>' % (scene,)
Re: Parsing text
iainemsley wrote:
> Hi,
> I'm trying to write a fairly basic text parser to split up scenes and
> acts in plays to put them into XML. I've managed to get the text split
> into the blocks of scenes and acts and returned correctly but I'm
> trying to refine this and get the relevant scene number when the split
> is made but I keep getting an NoneType error trying to read the block
> inside the for loop and nothing is being returned. I'd be grateful for
> some suggestions as to how to get this working.
> ...(some code)...

You'll get a lot better help if you:
(1) Include enough code to run and encounter the problem.
    Edit this down to something small (in the process,
    you may discover what was wrong).
(2) Include actual sample data demonstrating the problem.
and
(3) Cut and paste the _actual_ error message and traceback
    from your output when running the sample code with the
    sample data.

For extra points, identify the Python version you are using.

--Scott David Daniels
scott.dani...@acm.org
Re: Parsing text
iainemsley wrote:
> for scene in text.split('Scene'):
>     num = re.compile("^\s\[0-9, i{1,4}, v]", re.I)
>     textNum = num.match(scene)
>     if textNum:
>         print textNum
>     else:
>         print "No scene number"

The problem is with your regular expression. Unfortunately, I can't tell
what you're trying to match. Could you provide some examples of the
scene numbers?
Re: Parsing text
> I'm trying to write a fairly basic text parser to split up scenes and
> acts in plays to put them into XML. I've managed to get the text split
> into the blocks of scenes and acts and returned correctly but I'm
> trying to refine this and get the relevant scene number when the split
> is made but I keep getting an NoneType error trying to read the block
> inside the for loop and nothing is being returned. I'd be grateful for
> some suggestions as to how to get this working.
>
> for scene in text.split('Scene'):
>     num = re.compile("^\s\[0-9, i{1,4}, v]", re.I)

The first thing that occurs to me is that this should likely be a raw string to get those backslashes into the regexp. Compare:

  print "^\s\[0-9, i{1,4}, v]"
  print r"^\s\[0-9, i{1,4}, v]"

Without an excerpt of the actual text (or at least the lead-in for each scene), it's hard to tell whether this regex finds what you expect. It doesn't look like your regexp finds what you may think it does (it looks like you're using commas .

Just so you're aware, your split is a bit fragile too, in case any lines contain "Scene". However, with a proper regexp, you can even use it to split the scenes *and* tag the scene-number. Something like

  >>> import re
  >>> s = """Scene [42]
  ... this is stuff in the 42nd scene
  ... Scene [IIV]
  ... stuff in the other scene
  ... """
  >>> r = re.compile(r"Scene\s+\[(\d+|[ivx]+)]", re.I)
  >>> r.split(s)[1:]
  ['42', '\nthis is stuff in the 42nd scene\n', 'IIV', '\nstuff in the other scene\n']
  >>> def grouper(iterable, groupby):
  ...     iterable = iter(iterable)
  ...     while True:
  ...         yield [iterable.next() for _ in range(groupby)]
  ...
  >>> for scene, content in grouper(r.split(s)[1:], 2):
  ...     print "<div class='scene'><h1>%s</h1><p>%s</p></div>" % (scene, content)
  ...
  <div class='scene'><h1>42</h1><p>
  this is stuff in the 42nd scene
  </p></div>
  <div class='scene'><h1>IIV</h1><p>
  stuff in the other scene
  </p></div>

Play accordingly.

-tkc
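
For readers on Python 3, the session above needs small changes: `print` is a function, iterators have no `.next()` method, and a `StopIteration` escaping inside a generator is turned into a `RuntimeError` from Python 3.7 on. The sketch below keeps the same regular expression but pairs the pieces with slicing and `zip`; the function name `scenes_to_divs` and the sample text are hypothetical, and the text is passed through `html.escape` as an added precaution.

```python
import re
from html import escape

SCENE_RE = re.compile(r"Scene\s+\[(\d+|[ivx]+)\]", re.I)


def scenes_to_divs(text):
    """Split on scene headings and wrap each (number, body) pair in a div."""
    parts = SCENE_RE.split(text)
    # parts[0] is whatever came before the first heading; after that the
    # list alternates between captured scene number and scene body.
    numbers, bodies = parts[1::2], parts[2::2]
    return [
        "<div class='scene'><h1>%s</h1><p>%s</p></div>"
        % (escape(number), escape(body.strip()))
        for number, body in zip(numbers, bodies)
    ]


sample = "Scene [42]\nthis is stuff in the 42nd scene\nScene [IV]\nstuff in the other scene\n"
divs = scenes_to_divs(sample)
for div in divs:
    print(div)
```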
Re: Parsing text
iainemsley wrote:
> for scene in text.split('Scene'):
>     num = re.compile("^\s\[0-9, i{1,4}, v]", re.I)
>     textNum = num.match(scene)

Not related to your problem, but to your code - I'd write this as follows:

    match_scene_num = re.compile("^\s\[0-9, i{1,4}, v]", re.I).match

    for scene_section in text.split('Scene'):
        text_num = match_scene_num(scene_section)

This makes the code more readable and avoids unnecessary work inside the loop.

Stefan
Re: Parsing text
On Wed, 06 May 2009 19:32:28 +0100, iainemsley <iainems...@googlemail.com> wrote:

> Hi,
> I'm trying to write a fairly basic text parser to split up scenes and
> acts in plays to put them into XML. I've managed to get the text split
> into the blocks of scenes and acts and returned correctly but I'm
> trying to refine this and get the relevant scene number when the split
> is made but I keep getting an NoneType error trying to read the block
> inside the for loop and nothing is being returned. I'd be grateful for
> some suggestions as to how to get this working.

With neither a sample of your data nor the traceback you get, this is
going to require some crystal ball work. Assuming that all you've got
is running text, I should warn you now that getting this right is a
hard task. Getting it apparently right and having it fall over in a
heap or badly mangle the text is, unfortunately, very easy.

> for scene in text.split('Scene'):

Not a safe start. This will split on the word "Scenery" as well, for
example, and doesn't guarantee you the start of a scene by a long way.

>     num = re.compile("^\s\[0-9, i{1,4}, v]", re.I)

This is almost certainly not going to do what you expect, because all
those backslashes in the string are going to get processed as escape
characters before the string is ever passed to re.compile. Even if you
fix that (by doubling the backslashes or making it a raw string), I
sincerely doubt that this is the regular expression you want. As
escaped, it matches in sequence:

* the start of the string
* a space, tab, newline or other whitespace character.  Just the one.
* the literal string "[0-9, "
* either "i" or "I" repeated between 1 and four times
* the literal string ", "
* either "v" or "V"
* the literal string "]"

Assuming you didn't mean to escape the open square bracket doesn't help:

* the start of the string
* one whitespace character
* one of the following characters: 0123456789,iI{}vV

Also, what the heck is this doing *inside* the for loop?

>     textNum = num.match(scene)

If you're using re.match(), the "^" on the regular expression is redundant.

>     if textNum:
>         print textNum

textNum is the match object, so printing it won't tell you much. In
particular, it isn't going to produce well-formed XML.

>     else:
>         print "No scene number"

Nor will this.

>     m = '<div type="scene>'

Missing close double quotes after 'scene'.

>     m += scene
>     m += '<\div>'
>     print m

I'm seeing nothing here that should produce an error message that has
anything to do with NoneType. Any chance of (a) a more accurate code
sample, (b) the traceback, or (c) sample data?

-- 
Rhodri James *-* Wildebeeste Herder to the Masses
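
The reading of the pattern given above can be checked directly in Python 3. In this sketch the pattern is written as a raw string; the test inputs are made up. One detail worth knowing: `\s` and `\[` are not recognised string escapes, so Python leaves those particular backslashes in place even without the `r` prefix (newer versions warn about it), whereas an escape such as `\b` really is converted, which is why raw strings remain the safer habit for regular expressions.

```python
import re

# The pattern from the post, written as a raw string.
posted = re.compile(r"^\s\[0-9, i{1,4}, v]", re.I)

# It matches this odd literal text ...
literal_hit = posted.match(" [0-9, ii, v]")
# ... but not anything that looks like a scene number.
digit_hit = posted.match(" 3")
roman_hit = posted.match(" iv")

print(literal_hit is not None, digit_hit, roman_hit)

# "\b" is a recognised escape (backspace), so raw and non-raw strings differ.
print(len("\b"), len(r"\b"))
```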
Re: Parsing text
> Hi,
> I'm trying to write a fairly basic text parser to split up scenes and
> acts in plays to put them into XML. I've managed to get the text split
> into the blocks of scenes and acts and returned correctly but I'm
> trying to refine this and get the relevant scene number when the split
> is made but I keep getting an NoneType error trying to read the block
> inside the for loop and nothing is being returned. I'd be grateful for
> some suggestions as to how to get this working.
>
> for scene in text.split('Scene'):
>     num = re.compile("^\s\[0-9, i{1,4}, v]", re.I)
>     textNum = num.match(scene)
>     if textNum:
>         print textNum
>     else:
>         print "No scene number"
>     m = '<div type="scene>'
>     m += scene
>     m += '<\div>'
>     print m
>
> Thanks,
> Iain

Don't forget that when you split the text, the first piece you get is what came *before* the thing you split on, so there won't be a scene number in the first piece.

###
>>> print 'this foo 1 and that foo 2 and the end'.split('foo')
['this ', ' 1 and that ', ' 2 and the end']
###

If you have material before the first occurrence of the word 'Scene', you will want to print that out without decoration.

Also, it looks like you are trying to say with your regex that the scene number will come after some space and be a digit followed by a roman numeral of some kind(?). If the number looks like 1iii or 2iv, then you could split your text with a regex rather than str.split:

###
>>> scene=re.compile('Scene\s+([0-9iIvV]+)')
>>> scene.split('The front matter Scene 1i The beginning was the best. Scene 1ii And then came the next act.')
['The front matter ', '1i', ' The beginning was the best. ', '1ii', ' And then came the next act.']
###

The \s+ indicates that there will be at least one space character and maybe more; the human error factor predicts that you will use more than one space after the word Scene, so \s+ just allows for that possibility.

The 0-9iIvV indicates the possible characters that might be part of your scene number. Since it's unlikely that you will have any word appearing after Scene that matches that pattern, it isn't written to be exact in specifying what should come next. [1]

The parentheses tell what (besides the pieces left by removing the split target) should be presented. In this case, the parentheses were put around the pattern that (maybe) represented your scene number, and so those are interspersed with the list of pieces.

/chris

[1] If it were more precise it might be '([1-9][0-9]*(iv|v?i{0,3}))', which recognizes that a number should start with 1 or above and perhaps be followed by 0 or more digits (including 0), and then come the roman numeral possibilities (for up to viii) [2]. The | indicates "or", and the parentheses go around the roman numeral part to indicate that the "or" doesn't extend back to the decimal digits. That extra set of parentheses also means that the split will now contain TWO captured pieces between each piece of script. If you put a ? after the scene number part, meaning that it may or may not be there, None will be returned for the patterns that are not there:

###
>>> scene=re.compile('Scene\s+([1-9][0-9]*(iv|v?i{0,3}))?')
>>> scene.split('The front matter Scene 1i The beginning was the best. Scene 1ii And then came the next act. Scene The last one has no number.')
['The front matter ', '1i', 'i', ' The beginning was the best. ', '1ii', 'ii', ' And then came the next act. ', None, None, 'The last one has no number.']
###

[2] http://diveintopython.org/regular_expressions/roman_numerals.html
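
If the second captured piece is unwanted, one option (a variation on the pattern in footnote [1], not something the post itself proposes) is to make the inner group non-capturing with `(?:...)`. The Python 3 sketch below uses the same sample sentence; the variable names are hypothetical.

```python
import re

text = ("The front matter Scene 1i The beginning was the best. "
        "Scene 1ii And then came the next act. "
        "Scene The last one has no number.")

# (?:...) groups the roman-numeral alternatives without capturing them,
# so re.split inserts only one captured piece per scene heading.
scene = re.compile(r"Scene\s+([1-9][0-9]*(?:iv|v?i{0,3}))?")
parts = scene.split(text)

front_matter, rest = parts[0], parts[1:]
pairs = list(zip(rest[0::2], rest[1::2]))  # (scene number or None, body)

print(front_matter)
for number, body in pairs:
    print(number, body)
```

A scene heading without a number still shows up in the pairs, with `None` in place of the number, so it can be handled explicitly.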
Re: parsing text from a file
Wes James wrote:
> If I read a windows registry file with a line like this:
>
> {C15039B5-C47C-47BD-A698-A462F4148F52}=v2.0|Action=Allow|Active=TRUE|Dir=In|Protocol=6|Profile=Public|App=C:\\Program Files\\LANDesk\\LDClient\\tmcsvc.exe|Name=LANDesk Targeted Multicast|Edge=FALSE|

Watch out. .reg files exported from the registry are typically in UTF-16. Notepad and other editors will recognise this and display what you see above, but if you were to, say, do this:

    print repr (open ("blah.reg").read ())

you might see a different picture. If that's the case, you'll have to use the codecs module or decode the string you read.

TJG
Re: parsing text from a file
On Jan 30, 7:39 pm, Tim Golden m...@timgolden.me.uk wrote:
> Wes James wrote:
> > If I read a windows registry file with a line like this:
> >
> > {C15039B5-C47C-47BD-A698-A462F4148F52}=v2.0|Action=Allow|Active=TRUE|Dir=In|Protocol=6|Profile=Public|App=C:\\Program Files\\LANDesk\\LDClient\\tmcsvc.exe|Name=LANDesk Targeted Multicast|Edge=FALSE|
>
> Watch out. .reg files exported from the registry are typically in UTF-16. Notepad and other editors will recognise this and display what you see above, but if you were to, say, do this:
>
>     print repr (open ("blah.reg").read ())
>
> You might see a different picture. If that's the case, you'll have to use the codecs module or decode the string you read.

Ha! That's why it appeared to print LAND instead of LANDesk -- it found and was printing L\0A\0N\0D.
parsing text from a file
If I read a windows registry file with a line like this:

{C15039B5-C47C-47BD-A698-A462F4148F52}=v2.0|Action=Allow|Active=TRUE|Dir=In|Protocol=6|Profile=Public|App=C:\\Program Files\\LANDesk\\LDClient\\tmcsvc.exe|Name=LANDesk Targeted Multicast|Edge=FALSE|

with this code:

    f=open('fwrules.reg2.txt')
    for s in f:
        if s.find('LANDesk') < 0:
            print s,

LANDesk is not found. Also this does not work:

    for s in f:
        try:
            i=s.index('L')
            print s[i:i+7]
        except:
            pass

all it prints is LAND

how do I find LANDesk in a string like this. is the \\ messing things up?

thx, -wj
Re: parsing text from a file
2009/1/29 Wes James compte...@gmail.com:
> If I read a windows registry file with a line like this:
> ...
> with this code:
>
>     f=open('fwrules.reg2.txt')
>     for s in f:
>         if s.find('LANDesk') < 0:
>             print s,
>
> LANDesk is not found.
> how do I find LANDesk in a string like this. is the \\ messing things up?
> ...
> thx, -wj

Hi,

    if s.find('LANDesk') < 0:

is True for a line which doesn't contain LANDesk; if you want the opposite, try

    if s.find('LANDesk') > -1:

hth
vbr
Re: parsing text from a file
On Jan 30, 8:54 am, Wes James compte...@gmail.com wrote:
> If I read a windows registry file with a line like this:
>
> {C15039B5-C47C-47BD-A698-A462F4148F52}=v2.0|Action=Allow|Active=TRUE|Dir=In|Protocol=6|Profile=Public|App=C:\\Program Files\\LANDesk\\LDClient\\tmcsvc.exe|Name=LANDesk Targeted Multicast|Edge=FALSE|
>
> with this code:
>
>     f=open('fwrules.reg2.txt')
>     for s in f:
>         if s.find('LANDesk') < 0:
>             print s,
>
> LANDesk is not found.

You mean it's not printed. That code prints all lines that don't contain LANDesk.

> Also this does not work:
>
>     for s in f:
>         try:
>             i=s.index('L')
>             print s[i:i+7]
>         except:

Using "except ValueError:" would be safer.

>             pass
>
> all it prints is LAND

AFAICT your reported outcome is impossible given that such a line exists in the file.

> how do I find LANDesk in a string like this.

What you were trying (second time, or first time (with >=)) should work. I suggest that to diagnose your problem you change the second snippet as follows:
1. use "except ValueError:"
2. print s, len(s), i, and s.find('L') for all lines

> is the \\ messing things up?

Each \\ is presumably just the repr() of a single backslash. In any case, whether there are 0, 1, 2 or many backslashes in a line or the repr() thereof has nothing to do with your problem.

HTH,
John
Re: parsing text from a file
>     if s.find('LANDesk') < 0:
>
> is True for a line which doesn't contain LANDesk; if you want the opposite, try
>
>     if s.find('LANDesk') > -1:

Or more pythonically, just use

    if 'LANDesk' in s:

-tkc
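For anyone finding this thread later, here is a small Python 3 sketch of the membership test suggested above. The helper name lines_containing and the sample lines are invented for the example.

```python
def lines_containing(lines, needle):
    """Return the lines that contain needle, using the `in` operator."""
    return [s for s in lines if needle in s]

sample = [
    'App=C:\\Program Files\\LANDesk\\LDClient\\tmcsvc.exe',
    'Name=Something Else',
]

hits = lines_containing(sample, 'LANDesk')
print(hits)

# str.find returns an index (or -1 when there is no match), so it must be
# compared against -1 explicitly; a match at index 0 would be falsy and
# "not found" (-1) would be truthy if you tested the result directly.
print(sample[1].find('LANDesk'))
```

The `in` form sidesteps the off-by-one comparison mistakes that started this sub-thread.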
Re: parsing text from a file
Wes James wrote:
> If I read a windows registry file with a line like this:
>
> {C15039B5-C47C-47BD-A698-A462F4148F52}=v2.0|Action=Allow|Active=TRUE|Dir=In|Protocol=6|Profile=Public|App=C:\\Program Files\\LANDesk\\LDClient\\tmcsvc.exe|Name=LANDesk Targeted Multicast|Edge=FALSE|
>
> with this code:
>
>     f=open('fwrules.reg2.txt')
>     for s in f:
>         if s.find('LANDesk') < 0:
>             print s,
>
> LANDesk is not found. Also this does not work:
>
>     for s in f:
>         try:
>             i=s.index('L')
>             print s[i:i+7]
>         except:
>             pass
>
> all it prints is LAND
>
> how do I find LANDesk in a string like this. is the \\ messing things up?

How do you know what's in the file? Did you use an editor? It might be that the file contents are encoded in, say, UTF-16 and the editor is detecting that and decoding it for you, but Python's open() function is just returning the contents as a bytestring (Python 2.x). Try:

    import codecs

    f = codecs.open('fwrules.reg2.txt', encoding='UTF-16')
    for s in f:
        if u'LANDesk' in s:
            print s,
    f.close()
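To see the UTF-16 effect described above without a real registry export, the following Python 3 sketch writes a small UTF-16 file and then searches it both ways. The sample line and the temporary file are stand-ins, not the poster's actual data; in Python 3 the built-in open() takes an encoding argument, so codecs.open is not needed.

```python
import os
import tempfile

line = ('App=C:\\Program Files\\LANDesk\\LDClient\\tmcsvc.exe'
        '|Name=LANDesk Targeted Multicast|\n')

# Write a small UTF-16 file standing in for an exported .reg file.
fd, path = tempfile.mkstemp(suffix='.reg')
os.close(fd)
with open(path, 'w', encoding='utf-16') as f:
    f.write(line)

# Raw bytes: each ASCII character is stored as two bytes, one of them NUL,
# so the plain byte sequence b'LANDesk' never appears.
with open(path, 'rb') as f:
    raw = f.read()
found_raw = b'LANDesk' in raw

# Decoded text: the search behaves as expected.
with open(path, encoding='utf-16') as f:
    found_decoded = any('LANDesk' in s for s in f)

os.remove(path)
print(found_raw, found_decoded)
```

This matches the "L\0A\0N\0D" observation earlier in the thread: seven bytes starting at the first 'L' only cover the first four characters.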
Re: Parsing text file with #include and #define directives
Arnaud,

Wow!!! That's beautiful. Thank you very much!

Malcolm

<snip>
I think it's straightforward enough to be dealt with simply. Here is a solution that doesn't handle errors but should work with well-formed input and handles recursive expansions. expand(filename) returns an iterator over expanded lines in the file, inserting lines of included files.

    import re

    def expand(filename):
        defines = {}
        def define_repl(matchobj):
            return defines[matchobj.group(1)]
        define_regexp = re.compile('#(.+?)#')
        for line in open(filename):
            if line.startswith('#include '):
                recfilename = line.strip().split(None, 1)[1]
                for recline in expand(recfilename):
                    yield recline
            elif line.startswith('#define '):
                _, name, value = line.strip().split(None, 2)
                defines[name] = value
            else:
                yield define_regexp.sub(define_repl, line)

It would be easy to modify it to keep track of line numbers and file names.
</snip>
Parsing text file with #include and #define directives
I'm parsing a text file for a proprietary product that has the following 2 directives: #include somefile #define name value Defined constants are referenced via #name# syntax. I'm looking for a single text stream that results from processing a file containing these directives. Even better would be an iterator(?) type object that tracked file names and line numbers as it returns individual lines. Is there a Python parsing library to handle this type of task or am I better off writing my own? The effort to write one from scratch doesn't seem too difficult (minus recursive file and constant loops), but I wanted to avoid re-inventing the wheel if this type of component already exists. Thank you, Malcolm -- http://mail.python.org/mailman/listinfo/python-list
Re: Parsing text file with #include and #define directives
[EMAIL PROTECTED] writes:
> I'm parsing a text file for a proprietary product that has the following 2 directives:
>
> #include somefile
> #define name value
>
> Defined constants are referenced via #name# syntax. I'm looking for a single text stream that results from processing a file containing these directives. Even better would be an iterator(?) type object that tracked file names and line numbers as it returns individual lines. Is there a Python parsing library to handle this type of task or am I better off writing my own? The effort to write one from scratch doesn't seem too difficult (minus recursive file and constant loops), but I wanted to avoid re-inventing the wheel if this type of component already exists.
>
> Thank you,
> Malcolm

I think it's straightforward enough to be dealt with simply. Here is a solution that doesn't handle errors but should work with well-formed input and handles recursive expansions. expand(filename) returns an iterator over expanded lines in the file, inserting lines of included files.

    import re

    def expand(filename):
        defines = {}
        def define_repl(matchobj):
            return defines[matchobj.group(1)]
        define_regexp = re.compile('#(.+?)#')
        for line in open(filename):
            if line.startswith('#include '):
                recfilename = line.strip().split(None, 1)[1]
                for recline in expand(recfilename):
                    yield recline
            elif line.startswith('#define '):
                _, name, value = line.strip().split(None, 2)
                defines[name] = value
            else:
                yield define_regexp.sub(define_repl, line)

It would be easy to modify it to keep track of line numbers and file names.

HTH

-- Arnaud
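One possible shape for the "keep track of line numbers and file names" variant mentioned above, as a Python 3 sketch. It departs from the posted code in two places, both of which are assumptions rather than anything stated in the thread: the defines dictionary is shared with included files, and include paths are resolved relative to the including file. Like the original, it does no error handling (an undefined #name# raises KeyError, and include loops are not detected).

```python
import os
import re
import tempfile

DEFINE_REF = re.compile(r'#(.+?)#')

def expand(filename, defines=None):
    """Yield (filename, lineno, expanded_line) for every output line."""
    if defines is None:
        defines = {}
    with open(filename) as f:
        for lineno, line in enumerate(f, 1):
            if line.startswith('#include '):
                included = line.strip().split(None, 1)[1]
                # Assumption: paths are relative to the including file.
                included = os.path.join(os.path.dirname(filename), included)
                yield from expand(included, defines)
            elif line.startswith('#define '):
                _, name, value = line.strip().split(None, 2)
                defines[name] = value
            else:
                expanded = DEFINE_REF.sub(lambda m: defines[m.group(1)], line)
                yield filename, lineno, expanded

# Tiny demonstration with two throwaway files.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, 'inner.txt'), 'w') as f:
    f.write('#define who world\n')
with open(os.path.join(tmp, 'main.txt'), 'w') as f:
    f.write('#include inner.txt\nhello #who#\n')

result = [(os.path.basename(name), lineno, text)
          for name, lineno, text in expand(os.path.join(tmp, 'main.txt'))]
print(result)
```

Yielding tuples keeps the generator style of the original while giving the caller enough context to report where a line came from.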
parsing text in blocks and line too
Goodmorning people :) I have just started to learn this language and i have a logical problem. I need to write a program to parse various file of text. Here two sample: --- trial text bla bla bla bla error bla bla bla bla bla bla bla bla on more lines trial text bla bla bla bla warning bla bla bla more bla to be grouped with warning bla bla bla on more lines could be one two or ten lines also withouth the tab beginning again text text can contain also blank lines text no delimiters -- Apr 8 04:02:08 machine text on one line Apr 8 04:02:09 machine this is an error Apr 8 04:02:10 machine this is a warning -- parsing the file, I'll need to decide if the line/group is an error, warning or to skip. Mine problem if how logical do it: if i read line by line, I'll catch the error/warning on first and the second/third/more will be skipped by control. Reading a group of line i could lose the order on the output: my idea is to have an output in html with the line in the color of the check (yellow for warning, red for error). And i have also many rules to be followed so if i read one rule and then i search on the entire file, the check will be really slow. Hope someone could give me some tips. Thanks in advance -- http://mail.python.org/mailman/listinfo/python-list
Re: parsing text in blocks and line too
On 2007-04-12, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Goodmorning people :) I have just started to learn this language and i have a logical problem. I need to write a program to parse various file of text. Here two sample: --- trial text bla bla bla bla error bla bla bla bla bla bla bla bla on more lines trial text bla bla bla bla warning bla bla bla more bla to be grouped with warning bla bla bla on more lines could be one two or ten lines also withouth the tab beginning again text text can contain also blank lines text no delimiters -- Apr 8 04:02:08 machine text on one line Apr 8 04:02:09 machine this is an error Apr 8 04:02:10 machine this is a warning -- I would first read groups of lines that belong together, then decide on each group whether it is an error, warning, or whatever. To preserve order in a group of lines, you can use lists. From your example you could first compute a list of lists, like [ [ trial text bla bla bla bla error, bla bla bla bla bla, bla bla bla on more lines ], [ trial text bla bla bla bla warning bla, bla bla more bla to be grouped with warning, bla bla bla on more lines, could be one two or ten lines also withouth the tab beginning ], [ again text ], [ text can contain also blank lines ], [ ], [ text no delimiters ] ] Just above the text no delimiters line I have added an empty line, and I translated that to an empty group of lines (denoted with the empty list). By traversing the groups (ie over the outermost list), you can now decide for each group what type of output it is, and act accordingly. Hope someone could give me some tips. Sure, however, in general it is appreciated if you first show your own efforts before asking the list for a solution. Albert -- http://mail.python.org/mailman/listinfo/python-list
Re: parsing text in blocks and line too
A.T.Hofkamp wrote: On 2007-04-12, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Goodmorning people :) I have just started to learn this language and i have a logical problem. I need to write a program to parse various file of text. Here two sample: --- trial text bla bla bla bla error bla bla bla bla bla bla bla bla on more lines trial text bla bla bla bla warning bla bla bla more bla to be grouped with warning bla bla bla on more lines could be one two or ten lines also withouth the tab beginning again text text can contain also blank lines text no delimiters -- Apr 8 04:02:08 machine text on one line Apr 8 04:02:09 machine this is an error Apr 8 04:02:10 machine this is a warning -- I would first read groups of lines that belong together, then decide on each group whether it is an error, warning, or whatever. To preserve order in a group of lines, you can use lists. From your example you could first compute a list of lists, like [ [ trial text bla bla bla bla error, bla bla bla bla bla, bla bla bla on more lines ], [ trial text bla bla bla bla warning bla, bla bla more bla to be grouped with warning, bla bla bla on more lines, could be one two or ten lines also withouth the tab beginning ], [ again text ], [ text can contain also blank lines ], [ ], [ text no delimiters ] ] Just above the text no delimiters line I have added an empty line, and I translated that to an empty group of lines (denoted with the empty list). By traversing the groups (ie over the outermost list), you can now decide for each group what type of output it is, and act accordingly. Hope someone could give me some tips. Sure, however, in general it is appreciated if you first show your own efforts before asking the list for a solution. 
Albert

If groups have 0 indent first line and other lines in the group are indented, group the lines:

    blocks = []
    block = []
    for line in lines:
        if not line.startswith(' '):
            if block:
                blocks.append(block)
            block = []
        block.append(line)
    if block:
        blocks.append(block)

But if 0 indent doesn't start a new block, don't expect this to work, but that is what I infer from your limited sample. You can then look for warnings, etc., in the blocks -- either in the loop to save memory or in the constructed blocks list.

James
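Putting the grouping loop above together with the "decide per group" step from earlier in the thread might look like the Python 3 sketch below. The function names, the sample lines, and the keyword-based classification rule are all assumptions made for illustration; the real rules would come from the original poster's requirements.

```python
def group_blocks(lines):
    """Group lines so that each unindented line starts a new block."""
    blocks, block = [], []
    for line in lines:
        if not line.startswith((' ', '\t')) and block:
            blocks.append(block)
            block = []
        block.append(line)
    if block:
        blocks.append(block)
    return blocks

def classify(block):
    """Hypothetical rule: look for a keyword anywhere in the block."""
    text = ' '.join(block).lower()
    if 'error' in text:
        return 'error'
    if 'warning' in text:
        return 'warning'
    return 'skip'

lines = [
    'trial text bla bla error',
    '\tbla bla bla',
    '\ton more lines',
    'trial text bla warning bla',
    '\tmore bla to be grouped with warning',
    'again text',
]

blocks = group_blocks(lines)
kinds = [classify(b) for b in blocks]
for kind, block in zip(kinds, blocks):
    print(kind, block)
```

Since the blocks are kept in a list, output order follows input order, which addresses the ordering worry in the original question.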
Re: Parsing text
On 19 Dec 2005 15:15:10 -0800, sicvic [EMAIL PROTECTED] wrote: I was wondering if theres a way where python can read through the lines of a text file searching for a key phrase then writing that line and all lines following it up to a certain point, such as until it sees a string of - Right now I can only have python write just the line the key phrase is found in. This sounds like homework, so just a (big) hint: have a look at itertools dropwhile and takewhile. The solution is potentially a one-liner, depending on your matching criteria (e.g., case-sensitive fixed string vs regular expression). Regards, Bengt Richter -- http://mail.python.org/mailman/listinfo/python-list
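For readers who want to see where the dropwhile/takewhile hint leads, here is one possible Python 3 sketch. The function name section, the '--' stop marker, and the sample lines are assumptions based on the data shown later in the thread. It extracts only the first matching section.

```python
from itertools import dropwhile, takewhile

def section(lines, key, stop='--'):
    """Lines from the first one containing key up to (not including) stop."""
    # Skip everything before the key phrase ...
    from_key = dropwhile(lambda line: key not in line, lines)
    # ... then keep lines until the separator shows up.
    return list(takewhile(lambda line: not line.startswith(stop), from_key))

lines = [
    'a b',
    'Person: Jimmy',
    'Current Location: Denver',
    'Next Location: Chicago',
    '--',
    'a b',
    'Person: Sarah',
    'Current Location: San Diego',
    '--',
]

jimmy = section(lines, 'Person: Jimmy')
print(jimmy)
```

Both helpers are lazy, so the same function works on an open file object without reading it all into memory first.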
Re: Parsing text
Not homework...not even in school (do any universities even teach classes using python?). Just not a programmer. Anyways I should probably be more clear about what I'm trying to do. Since I cant show the actual output file lets say I had an output file that looked like this: a b Person: Jimmy Current Location: Denver Next Location: Chicago -- a b Person: Sarah Current Location: San Diego Next Location: Miami Next Location: New York -- Now I want to put (and all recurrences of Person: Jimmy) Person: Jimmy Current Location: Denver Next Location: Chicago in a file called jimmy.txt and the same for Sarah in sarah.txt The code I currently have looks something like this: import re import sys person_jimmy = open('jimmy.txt', 'w') #creates jimmy.txt person_sarah = open('sarah.txt', 'w') #creates sarah.txt f = open(sys.argv[1]) #opens output file #loop that goes through all lines and parses specified text for line in f.readlines(): if re.search(r'Person: Jimmy', line): person_jimmy.write(line) elif re.search(r'Person: Sarah', line): person_sarah.write(line) #closes all files person_jimmy.close() person_sarah.close() f.close() However this only would produces output files that look like this: jimmy.txt: a b Person: Jimmy sarah.txt: a b Person: Sarah My question is what else do I need to add (such as an embedded loop where the if statements are?) so the files look like this a b Person: Jimmy Current Location: Denver Next Location: Chicago and a b Person: Sarah Current Location: San Diego Next Location: Miami Next Location: New York Basically I need to add statements that after finding that line copy all the lines following it and stopping when it sees '--' Any help is greatly appreciated. -- http://mail.python.org/mailman/listinfo/python-list
Re: Parsing text
sicvic [EMAIL PROTECTED] wrote in news:[EMAIL PROTECTED]: Not homework...not even in school (do any universities even teach classes using python?). Just not a programmer. Anyways I should probably be more clear about what I'm trying to do. Since I cant show the actual output file lets say I had an output file that looked like this: a b Person: Jimmy Current Location: Denver Next Location: Chicago -- a b Person: Sarah Current Location: San Diego Next Location: Miami Next Location: New York -- Now I want to put (and all recurrences of Person: Jimmy) Person: Jimmy Current Location: Denver Next Location: Chicago in a file called jimmy.txt and the same for Sarah in sarah.txt The code I currently have looks something like this: import re import sys person_jimmy = open('jimmy.txt', 'w') #creates jimmy.txt person_sarah = open('sarah.txt', 'w') #creates sarah.txt f = open(sys.argv[1]) #opens output file #loop that goes through all lines and parses specified text for line in f.readlines(): if re.search(r'Person: Jimmy', line): person_jimmy.write(line) elif re.search(r'Person: Sarah', line): person_sarah.write(line) #closes all files person_jimmy.close() person_sarah.close() f.close() However this only would produces output files that look like this: jimmy.txt: a b Person: Jimmy sarah.txt: a b Person: Sarah My question is what else do I need to add (such as an embedded loop where the if statements are?) so the files look like this a b Person: Jimmy Current Location: Denver Next Location: Chicago and a b Person: Sarah Current Location: San Diego Next Location: Miami Next Location: New York Basically I need to add statements that after finding that line copy all the lines following it and stopping when it sees '--' Any help is greatly appreciated. Something like this, maybe? This iterates through a file, with subloops to handle the special cases. I'm assuming that Jimmy and Sarah are not the only people of interest. 
I'm also assuming (for no very good reason) that you do want the separator lines, but do not want the Person: lines in the output file. It is easy enough to adjust those assumptions to taste. Each Person: line will cause a file to be opened (if it is not already open), and the subsequent lines will be written to it until the separator is found. Be aware that all files remain open until the loop at the end closes them all.

    outfs = {}
    f = open('shouldBeDatabase.txt')
    for line in f:
        if line.find('Person:') >= 0:
            ofkey = line[line.find('Person:')+7:].strip()
            if not ofkey in outfs:
                outfs[ofkey] = open('%s.txt' % ofkey, 'w')
            outf = outfs[ofkey]
            while line.find('-') < 0:
                line = f.next()
                outf.write('%s' % line)
    f.close()
    for k,v in outfs.items():
        v.close()

-- rzed
Re: Parsing text
sicvic wrote:
> Since I cant show the actual output file lets say I had an output file that looked like this:
>
> a b
> Person: Jimmy
> Current Location: Denver

It may be the output of another process but it's the input file as far as the parsing code is concerned. The code below gives the following output, if that's any help (just adapting Noah's idea above). Note that it deals with the input as a single string rather than line by line.

Jimmy
Jimmy.txt
Current Location: Denver
Next Location: Chicago
Sarah
Sarah.txt
Current Location: San Diego
Next Location: Miami
Next Location: New York

    data='''
    a b
    Person: Jimmy
    Current Location: Denver
    Next Location: Chicago
    --
    a b
    Person: Sarah
    Current Location: San Diego
    Next Location: Miami
    Next Location: New York
    --
    '''

    import StringIO
    import re

    src = StringIO.StringIO(data)
    for name in ['Jimmy', 'Sarah']:
        exp = "(?s)Person: %s(.*?)--" % name
        filename = "%s.txt" % name
        info = re.findall(exp, src.getvalue())[0]
        print name
        print filename
        print info

hth
Gerard
Re: Parsing text
sicvic wrote:
> Not homework...not even in school (do any universities even teach classes using python?).

Yup, at least 6, and 20 wouldn't surprise me.

> The code I currently have looks something like this:
> ...
>     f = open(sys.argv[1]) #opens output file
>     #loop that goes through all lines and parses specified text
>     for line in f.readlines():
>         if re.search(r'Person: Jimmy', line):
>             person_jimmy.write(line)
>         elif re.search(r'Person: Sarah', line):
>             person_sarah.write(line)

Using re here seems pretty excessive. How about:

    ...
    f = open(sys.argv[1])  # opens input file ### get comments right
    source = iter(f)  # files serve lines at their own pace. Let them
    for line in source:
        if line.endswith('Person: Jimmy\n'):
            dest = person_jimmy
        elif line.endswith('Person: Sarah\n'):
            dest = person_sarah
        else:
            continue
        while line != '---\n':
            dest.write(line)
            line = source.next()
    f.close()
    person_jimmy.close()
    person_sarah.close()

--Scott David Daniels [EMAIL PROTECTED]
Re: Parsing text
Thank you everyone!!! I got a lot more information then I expected. You guys got my brain thinking in the right direction and starting to like programming. You've got a great community here. Keep it up. Thanks, Victor -- http://mail.python.org/mailman/listinfo/python-list
Re: Parsing text
On 20 Dec 2005 08:06:39 -0800, sicvic [EMAIL PROTECTED] wrote: Not homework...not even in school (do any universities even teach classes using python?). Just not a programmer. Anyways I should probably be more clear about what I'm trying to do. Ok, not homework. Since I cant show the actual output file lets say I had an output file that looked like this: a b Person: Jimmy Current Location: Denver Next Location: Chicago -- a b Person: Sarah Current Location: San Diego Next Location: Miami Next Location: New York -- Now I want to put (and all recurrences of Person: Jimmy) Person: Jimmy Current Location: Denver Next Location: Chicago in a file called jimmy.txt and the same for Sarah in sarah.txt The code I currently have looks something like this: import re import sys person_jimmy = open('jimmy.txt', 'w') #creates jimmy.txt person_sarah = open('sarah.txt', 'w') #creates sarah.txt f = open(sys.argv[1]) #opens output file #loop that goes through all lines and parses specified text for line in f.readlines(): if re.search(r'Person: Jimmy', line): person_jimmy.write(line) elif re.search(r'Person: Sarah', line): person_sarah.write(line) #closes all files person_jimmy.close() person_sarah.close() f.close() However this only would produces output files that look like this: jimmy.txt: a b Person: Jimmy sarah.txt: a b Person: Sarah My question is what else do I need to add (such as an embedded loop where the if statements are?) so the files look like this a b Person: Jimmy Current Location: Denver Next Location: Chicago and a b Person: Sarah Current Location: San Diego Next Location: Miami Next Location: New York Basically I need to add statements that after finding that line copy all the lines following it and stopping when it sees '--' Any help is greatly appreciated. Ok, I generalized on your theme of extracting file chunks to named files, where the beginning line has the file name. I made '.txt' hardcoded extension. 
I provided a way to direct the output to a (I guess not necessarily sub) directory Not tested beyond what you see. Tweak to suit. extractfilesegs.py Usage: [python] extractfilesegs [source [outdir [startpat [endpat where source is -tf for test file, a file name, or an open file outdir is a directory prefix that will be joined to output file names startpat is a regular expression with group 1 giving the extracted file name endpat is a regular expression whose match line is excluded and ends the segment import re, os def extractFileSegs(linesrc, outdir='extracteddata', start=r'Person:\s+(\w+)', stop='-'*30): rxstart = re.compile(start) rxstop = re.compile(stop) if isinstance(linesrc, basestring): linesrc = open(linesrc) lineit = iter(linesrc) files = [] for line in lineit: match = rxstart.search(line) if not match: continue name = match.group(1) filename = name.lower() + '.txt' filename = os.path.join(outdir, filename) #print 'opening file %r'%filename files.append(filename) fout = open(filename, 'a') # append in case repeats? fout.write(match.group(0)+'\n') # did you want aaa bbb stuff? for data_line in lineit: if rxstop.search(data_line): #print 'closing file %r'%filename fout.close() # don't write line with ending mark fout = None break else: fout.write(data_line) if fout: fout.close() print 'file %r ended with source file EOF, not stop mark'%filename return files def get_testfile(): from StringIO import StringIO return StringIO(\ ...irrelevant leading stuff ... a b Person: Jimmy Current Location: Denver Next Location: Chicago -- a b Person: Sarah Current Location: San Diego Next Location: Miami Next Location: New York -- irrelevant trailing stuff ... 
with a blank line ) if __name__ == '__main__': import sys args = sys.argv[1:] if not args: raise SystemExit(__doc__) tf = args.pop(0) if tf=='-tf': fin = get_testfile() else: fin = tf if not args: files = extractFileSegs(fin) elif len(args)==1: files = extractFileSegs(fin, args[0]) elif len(args)==2: files = extractFileSegs(fin, args[0], args[1], '^$') # stop on blank line? else: files = extractFileSegs(fin, args[0], '|'.join(args[1:-1]), args[-1]) print '\nFiles created:' for fname in files: print '%s'% fname if tf == '-tf': for fpath in files: print ' %s
Parsing text
I was wondering if theres a way where python can read through the lines of a text file searching for a key phrase then writing that line and all lines following it up to a certain point, such as until it sees a string of - Right now I can only have python write just the line the key phrase is found in. Thanks, Victor -- http://mail.python.org/mailman/listinfo/python-list
Re: Parsing text
sicvic wrote: I was wondering if theres a way where python can read through the lines of a text file searching for a key phrase then writing that line and all lines following it up to a certain point, such as until it sees a string of - Right now I can only have python write just the line the key phrase is found in. That's a good start. Maybe you could post the code that you've already got that does this, and people could comment on it and help you along. (I'm suggesting that partly because this almost sounds like homework, but you'll benefit more by doing it this way than just by having an answer handed to you whether this is homework or not.) -Peter -- http://mail.python.org/mailman/listinfo/python-list
Re: Parsing text
sicvic wrote:
> I was wondering if theres a way where python can read through the lines of a text file searching for a key phrase then writing that line and all lines following it up to a certain point, such as until it sees a string of -
> ...
> Thanks, Victor

You did not specify the key phrase that you are looking for, so for the sake of this example I will assume that it is "key phrase". I assume that you don't want "key phrase" or "-" to be returned as part of your match, so we use minimal group matching (.*?). You also want your regular expression to use the re.DOTALL flag, because this is how you match across multiple lines. The simplest way to set this flag is to put it at the front of your regular expression using the (?s) notation. This gives you something like this:

    print re.findall ("(?s)key phrase(.*?)-", your_string_to_search) [0]

So what that basically says is:
1. Match multiline -- that is, match across lines: (?s)
2. Match "key phrase"
3. Capture the group matching everything, minimally: (.*?)
4. Match "-"
5. Print the first match in the list: [0]

Yours,
Noah
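A Python 3 sketch of the same DOTALL idea, extended to collect every person's section in one pass. The sample data, the "--" separator, and the records dictionary are assumptions modelled on the example posted elsewhere in this thread, not on the real file.

```python
import re

data = (
    'a b\n'
    'Person: Jimmy\n'
    'Current Location: Denver\n'
    'Next Location: Chicago\n'
    '--\n'
    'a b\n'
    'Person: Sarah\n'
    'Current Location: San Diego\n'
    'Next Location: Miami\n'
    'Next Location: New York\n'
    '--\n'
)

# re.S lets "." cross newlines; re.M lets "^" match at each line start,
# so the lazy group stops at the first line that begins with "--".
pattern = re.compile(r'Person: (\w+)\n(.*?)^--', re.S | re.M)

records = {}
for name, body in pattern.findall(data):
    # A person may appear more than once, so collect a list per name.
    records.setdefault(name, []).append(body)

for name, bodies in records.items():
    print(name, bodies)
```

Writing each records[name] to name.lower() + '.txt' would then reproduce the jimmy.txt / sarah.txt split that was asked for.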
Re: Parsing text into dates?
On Tue, 17 May 2005 16:44:12 -0500, Mike Meyer [EMAIL PROTECTED] wrote: Thomas W [EMAIL PROTECTED] writes: I'm developing a web-application where the user sometimes has to enter dates in plain text, allthough a format may be provided to give clues. On the server side this piece of text has to be parsed into a datetime python-object. Does anybody have any pointers on this? Why are you making it possible for the users to screw this up? Don't give them a text widget to fill in and you have to figure out what the format is, give them three widgets so you *know* what's what. In doing that, you can also go to dropdown widgets for month, with month names (in a locale appropriate for the page language), and for the days in the month. My experience: drop-down lists generate off-by-one errors. They also annoy the bejaysus out of users -- e.g. year of birth, a 60+ element list. It's quite possible of course that YMMV :-) BTW: I have seen a web page with a drop-down list for year of birth where the first 18 entries were current year, current year - 1, etc for a transaction that wasn't for minors. -- http://mail.python.org/mailman/listinfo/python-list
Parsing text into dates?
I'm developing a web-application where the user sometimes has to enter dates in plain text, although a format may be provided to give clues. On the server side this piece of text has to be parsed into a datetime python-object. Does anybody have any pointers on this?

Besides the actual parsing, my main concern is the different locale date formats and how to be able to parse those strange us-like month/day/year compared to the clever and intuitive european-style day/month/year etc.

I've searched google, but haven't found any good references that helped me solve this problem, especially with regards to the locale date format issues.

Best regards,
Thomas
Re: Parsing text into dates?
On 16 May 2005 13:59:31 -0700, Thomas W [EMAIL PROTECTED] wrote:
> I'm developing a web-application where the user sometimes has to enter dates in plain text, although a format may be provided to give clues. On the server side this piece of text has to be parsed into a datetime python-object. Does anybody have any pointers on this?
>
> Besides the actual parsing, my main concern is the different locale date formats and how to be able to parse those strange us-like month/day/year compared to the clever and intuitive european-style day/month/year etc.

<rant>
Well I'm from a locale that uses the dd/mm/yyyy style and I think it's only marginally less stupid than the mm/dd/yyyy style.
</rant>

How much intuition is required to determine in an international context what was meant by 01/12/2004? First of December or 12th of January? The consequences of misinterpretation can be enormous.

If this application is being deployed from a central server where the users can be worldwide, you have two options:

(a) try to work out somehow what the user's locale is, and then work with dates in the legacy format "appropriate" to the locale.

(b) Use the considerably-less-stupid ISO 8601 standard format yyyy-mm-dd (e.g. 2004-12-01) -- throughout your web-application, not just in your data entry.

Having said all of that, [bottom-up question] how are you handling locale differences in language, script, currency symbol, decimal "point", thousands separator, postal address formats, surname / given-name order, etc etc etc? [top-down question] What *is* your target audience?
Re: Parsing text into dates?
John Machin wrote:
> If this application is being deployed from a central server where the users can be worldwide, you have two options:
>
> (a) try to work out somehow what the user's locale is, and then work with dates in the legacy format "appropriate" to the locale.

And this inevitably screws a large number of Canadians (and probably others), those poor conflicted folk caught between their European roots and their American neighbours, some of whom use mm/dd/yy and others of whom use dd/mm/yy on a regular basis. And some of us who switch willy-nilly, much as we do between metric and imperial. :-(

> (b) Use the considerably-less-stupid ISO 8601 standard format yyyy-mm-dd (e.g. 2004-12-01) -- throughout your web-application, not just in your data entry.

+1 (emphatically!)

(I almost always use this form even on government submissions, and nobody has complained yet. Of course, they haven't started changing the forms yet, either...)

-Peter
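To make the ambiguity discussed above concrete, here is a minimal Python 3 sketch using only the standard library. The function name parse_date and its usa flag are invented for this example (the flag just picks between the two legacy orders); ISO 8601 input is tried first because it has only one reading.

```python
from datetime import date, datetime

def parse_date(text, usa=False):
    """Parse yyyy-mm-dd, or dd/mm/yyyy (mm/dd/yyyy when usa is True)."""
    legacy = '%m/%d/%Y' if usa else '%d/%m/%Y'
    for fmt in ('%Y-%m-%d', legacy):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next candidate format
    raise ValueError('unrecognised date: %r' % (text,))

iso = parse_date('2004-12-01')
european = parse_date('01/12/2004')
american = parse_date('01/12/2004', usa=True)
print(iso, european, american)
```

The same string yields two different dates depending on the flag, which is exactly the argument for asking users for the ISO form in the first place.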
Re: Parsing text into dates?
Thomas W wrote:

> I'm developing a web-application where the user sometimes has to enter dates in plain text, although a format may be provided to give clues. On the server side this piece of text has to be parsed into a datetime python-object. Does anybody have any pointers on this?
>
> Besides the actual parsing, my main concern is the different locale date formats and how to be able to parse those strange us-like month/day/year compared to the clever and intuitive european-style day/month/year etc.
>
> I've searched Google, but haven't found any good references that helped me solve this problem, especially with regard to the locale date format issues.
>
> Best regards, Thomas

Although it is not a solution to the general localization problem, you may try the mx.DateTime.DateTimeFrom() factory function (http://www.egenix.com/files/python/mxDateTime.html#DateTime) for the parsing part. Some time ago I also wrote a more robust, customized version of such a parser. The ambiguous US/European style dates are disambiguated by the optional argument USA (False by default <wink>). Below are the doctest and the documentation (with epydoc tags); mail me offlist if you'd like to check it out.

George

#===

def parseDateTime(string, USA=False, implyCurrentDate=False,
                  yearHeuristic=_20thcenturyHeuristic):
    '''Tries to parse a string as a valid date and/or time.

    It recognizes most common (and less common) date and time formats.

    Examples:
        >>> # doctest was run successfully on...
        >>> str(datetime.date.today())
        '2005-05-16'
        >>> str(parseDateTime('21:23:39.91'))
        '21:23:39.91'
        >>> str(parseDateTime('16:15'))
        '16:15:00'
        >>> str(parseDateTime('10am'))
        '10:00:00'
        >>> str(parseDateTime('2:7:18.'))
        '02:07:18'
        >>> str(parseDateTime('08:32:40 PM'))
        '20:32:40'
        >>> str(parseDateTime('11:59pm'))
        '23:59:00'
        >>> str(parseDateTime('12:32:9'))
        '12:32:09'
        >>> str(parseDateTime('12:32:9', implyCurrentDate=True))
        '2005-05-16 12:32:09'
        >>> str(parseDateTime('93/7/18'))
        '1993-07-18'
        >>> str(parseDateTime('15.6.2001'))
        '2001-06-15'
        >>> str(parseDateTime('6.15.2001'))
        '2001-06-15'
        >>> str(parseDateTime('1980, November 20'))
        '1980-11-20'
        >>> str(parseDateTime('4 Mar 79'))
        '1979-03-04'
        >>> str(parseDateTime('July 4'))
        '2005-07-04'
        >>> str(parseDateTime('15/08'))
        '2005-08-15'
        >>> str(parseDateTime('5 Mar 3:45pm'))
        '2005-03-05 15:45:00'
        >>> str(parseDateTime('01 02 2003'))
        '2003-02-01'
        >>> str(parseDateTime('01 02 2003', USA=True))
        '2003-01-02'
        >>> str(parseDateTime('3/4/92'))
        '1992-04-03'
        >>> str(parseDateTime('3/4/92', USA=True))
        '1992-03-04'
        >>> str(parseDateTime('12:32:09 1-2-2003'))
        '2003-02-01 12:32:09'
        >>> str(parseDateTime('12:32:09 1-2-2003', USA=True))
        '2003-01-02 12:32:09'
        >>> str(parseDateTime('3:45pm 5 12 2001'))
        '2001-12-05 15:45:00'
        >>> str(parseDateTime('3:45pm 5 12 2001', USA=True))
        '2001-05-12 15:45:00'

    @param USA: Disambiguates strings that are valid dates in both
        (month, day, year) and (day, month, year) order (e.g. 05/03/2002).
        If True, the first format is assumed.
    @param implyCurrentDate: If True and the date is not given, the
        current date is implied.
    @param yearHeuristic: If not None, a callable f(year) that transforms
        the value of the given year. The default heuristic transforms
        2-digit years to 4-digit years assuming they are in the 20th
        century::
            lambda year: (year >= 100 and year
                          or year >= 10 and 1900 + year
                          or None)
        The heuristic should return None if the year is not considered
        valid. If yearHeuristic is None, no year transformation takes
        place.
    @return:
        - C{datetime.date} if only the date is recognized.
        - C{datetime.time} if only the time is recognized and
          implyCurrentDate is False.
        - C{datetime.datetime} if both date and time are recognized.
    @raise ValueError: If the string cannot be parsed successfully.
    '''
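The default year heuristic described in the docstring can be written out as an ordinary function. This is only a sketch: the name `twentieth_century_heuristic` is hypothetical, and it assumes the comparisons in the docstring's lambda are `>=`, since the operators did not survive in the archived copy. It is not George's actual implementation, which was not posted.

```python
def twentieth_century_heuristic(year):
    """Hypothetical stand-in for the default yearHeuristic described above."""
    if year >= 100:
        return year          # already a full year, keep it unchanged
    if year >= 10:
        return 1900 + year   # two-digit year, assumed to be 20th century
    return None              # single-digit years are treated as invalid


print(twentieth_century_heuristic(93))    # 1993
print(twentieth_century_heuristic(2001))  # 2001
print(twentieth_century_heuristic(5))     # None
```

Under this reading, '93/7/18' and '4 Mar 79' come out as 1993 and 1979, which agrees with the doctest output above.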
Re: Parsing text into dates?
On 16 May 2005 17:51:31 -0700, George Sakkis [EMAIL PROTECTED] wrote:

> #===
> def parseDateTime(string, USA=False, implyCurrentDate=False,
>                   yearHeuristic=_20thcenturyHeuristic):
>     '''Tries to parse a string as a valid date and/or time.
>     It recognizes most common (and less common) date and time formats.

Impressive!

>     Examples:
> [snip]
>         >>> str(parseDateTime('15.6.2001'))
>         '2001-06-15'
>         >>> str(parseDateTime('6.15.2001'))
>         '2001-06-15'

A dangerous heuristic -- 6.12.2001 (meaning 2001-12-06) can be easily typoed into 6.13.2001 or 6.15.2001 on the numeric keypad.
Re: Parsing text into dates?
John Machin [EMAIL PROTECTED] wrote:

> On 16 May 2005 17:51:31 -0700, George Sakkis [EMAIL PROTECTED] wrote:
>
>> #===
>> def parseDateTime(string, USA=False, implyCurrentDate=False,
>>                   yearHeuristic=_20thcenturyHeuristic):
>>     '''Tries to parse a string as a valid date and/or time.
>>     It recognizes most common (and less common) date and time formats.
>
> Impressive!
>
>>     Examples:
>> [snip]
>>         >>> str(parseDateTime('15.6.2001'))
>>         '2001-06-15'
>>         >>> str(parseDateTime('6.15.2001'))
>>         '2001-06-15'
>
> A dangerous heuristic -- 6.12.2001 (meaning 2001-12-06) can be easily typoed into 6.13.2001 or 6.15.2001 on the numeric keypad.

Sure, but how is this different from a typo of 2001-12-07 instead of 2001-12-06? There's no way you can catch all typos anyway by parsing alone. Besides, 6.15.2001 is to be interpreted as 2001-06-15 in US format. Currently the 'USA' flag is used only for ambiguous dates, but that's easy to change to apply to all dates. Essentially you would gain a little extra safety at the expense of a little lost recall over the set of parseable dates.

George
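The trade-off George describes can be sketched as follows. The function `resolve_numeric_date` and its `strict` parameter are hypothetical names chosen for illustration, not George's code: with strict=False the flag decides only the ambiguous cases (the behaviour the doctests show), and with strict=True the flag applies to every date, so 6.15.2001 is rejected under the day-first convention.

```python
from datetime import date


def resolve_numeric_date(first, second, year, usa=False, strict=False):
    """Build a date from two leading numbers plus a year.

    usa=True prefers (month, day) order; usa=False prefers (day, month).
    strict=False falls back to the other order when the preferred one is
    not a valid date; strict=True raises ValueError instead.
    """
    if usa:
        month, day = first, second
    else:
        day, month = first, second
    try:
        return date(year, month, day)
    except ValueError:
        if strict:
            raise
        # The preferred order is impossible (e.g. month 15); swap the roles.
        return date(year, day, month)


print(resolve_numeric_date(6, 15, 2001))           # 2001-06-15 (fallback)
print(resolve_numeric_date(3, 4, 1992))            # 1992-04-03
print(resolve_numeric_date(3, 4, 1992, usa=True))  # 1992-03-04
```

In strict mode a keypad slip from 6.12.2001 to 6.13.2001 raises ValueError instead of silently becoming 13 June, at the cost of refusing some inputs the lenient mode would accept.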
Re: Parsing text into dates?
Thomas W [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

> I'm developing a web-application where the user sometimes has to enter dates in plain text, although a format may be provided to give clues. On the server side this piece of text has to be parsed into a datetime python-object. Does anybody have any pointers on this?
>
> Besides the actual parsing, my main concern is the different locale date formats and how to be able to parse those strange us-like month/day/year compared to the clever and intuitive european-style day/month/year etc.
>
> I've searched Google, but haven't found any good references that helped me solve this problem, especially with regard to the locale date format issues.

There is no easy answer if you want to be able to enter three numbers. There are two answers that work, although there will be a lot of complaining. One is to use the international yyyy-mm-dd form, and the other is to accept a 4-digit year, an alphabetic month and a two-digit day in any order.

Otherwise, if you get 4 digits as the first component, and it passes your validation (whatever that is) for reasonable years, you're probably pretty safe to assume that you've got yyyy-mm-dd. Otherwise, if you can't get a clean answer (one is > 31, one is 12 < x < 32 and one is <= 12), just give them a list of possibilities and politely suggest that they enter it as yyyy-mm-dd next time.

I don't validate separators. As long as there is something that isn't a number or a letter, it's a separator, and which one doesn't matter. At times I've even taken the transition between a digit and a letter as a separator.

John Roth

> Best regards, Thomas
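The approach John Roth outlines might be sketched as below. All names here (`tokenize`, `candidate_dates`, `min_year`, `max_year`) are hypothetical, and the year check is a simple range test standing in for "whatever that is"; two-digit years are deliberately not handled. The sketch treats any non-alphanumeric run, or a digit/letter transition, as a separator, and returns every valid reading so the caller can show the list when there is more than one.

```python
import re
from datetime import date
from itertools import permutations


def tokenize(text):
    # Any run of non-alphanumerics is a separator; a digit/letter
    # transition also splits, so "4Mar1979" gives ["4", "Mar", "1979"].
    return re.findall(r"[0-9]+|[A-Za-z]+", text)


def candidate_dates(text, min_year=1900, max_year=2100):
    """Return every valid (year, month, day) reading of three numbers."""
    tokens = tokenize(text)
    if len(tokens) != 3 or not all(t.isdigit() for t in tokens):
        raise ValueError("expected exactly three numeric components")
    nums = [int(t) for t in tokens]
    # Four leading digits that pass the year check: assume yyyy-mm-dd.
    if len(tokens[0]) == 4 and min_year <= nums[0] <= max_year:
        return [date(nums[0], nums[1], nums[2])]
    found = set()
    for year, month, day in permutations(nums):
        if not min_year <= year <= max_year:
            continue
        try:
            found.add(date(year, month, day))
        except ValueError:
            pass  # this assignment is not a real calendar date
    return sorted(found)


print(candidate_dates("2004-12-01"))       # one reading
print(candidate_dates("25/12/2004"))       # one reading: 25 cannot be a month
print(len(candidate_dates("01.12.2004")))  # 2, so ask the user
```

A result with exactly one entry is the "clean answer" case; anything longer is the list of possibilities to hand back to the user.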
Re: Parsing text into dates?
The beautiful brand-new Cookbook 2 has "Fuzzy parsing of Dates" using dateutil.parser, which you run once you have a decent guess at the locale (page 127 of the cookbook).

John Roth wrote:

> Thomas W [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]
>
>> I'm developing a web-application where the user sometimes has to enter dates in plain text, although a format may be provided to give clues. On the server side this piece of text has to be parsed into a datetime python-object. Does anybody have any pointers on this?
>>
>> Besides the actual parsing, my main concern is the different locale date formats and how to be able to parse those strange us-like month/day/year compared to the clever and intuitive european-style day/month/year etc.
>>
>> I've searched Google, but haven't found any good references that helped me solve this problem, especially with regard to the locale date format issues.
>
> There is no easy answer if you want to be able to enter three numbers. There are two answers that work, although there will be a lot of complaining. One is to use the international yyyy-mm-dd form, and the other is to accept a 4-digit year, an alphabetic month and a two-digit day in any order.
>
> Otherwise, if you get 4 digits as the first component, and it passes your validation (whatever that is) for reasonable years, you're probably pretty safe to assume that you've got yyyy-mm-dd. Otherwise, if you can't get a clean answer (one is > 31, one is 12 < x < 32 and one is <= 12), just give them a list of possibilities and politely suggest that they enter it as yyyy-mm-dd next time.
>
> I don't validate separators. As long as there is something that isn't a number or a letter, it's a separator, and which one doesn't matter. At times I've even taken the transition between a digit and a letter as a separator.
>
> John Roth
>
>> Best regards, Thomas