subject:"Parsing text"

[issue18946] HTMLParser should ignore errors when parsing text in script tags

2013-09-06 Thread James Lu


New submission from James Lu:

HTMLParser reports invalid HTML inside of script tags, for example, at the
learner's dictionary:
function output_creative (id)
{   document.write
("<div id='" + id + "'>" + 
"<scr" + "ipt type='text/javascript'>\r\n" + 
"googletag.cmd.push(function() { googletag.display('" + id + "'); });\r\n" +
"</sc" + "ript>" +      <-- invalid end tag
"</div>");
};
It thinks </sc" + "ript> is an actual end tag.

--
messages: 197077
nosy: James.Lu
priority: normal
severity: normal
status: open
title: HTMLParser should ignore errors when parsing text in script tags

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18946
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue18946] HTMLParser should ignore errors when parsing text in script tags

2013-09-06 Thread Ezio Melotti


Ezio Melotti added the comment:

This should be fixed in 2.7 and 3.2+.
Try with a more recent version of Python and if you still have problems feel 
free to reopen the issue.
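
One way to check the behaviour on a given interpreter is a small logging parser. This is only a sketch for Python 3's html.parser; the TagLogger class and the page string are made up for illustration and are a reduced stand-in for the page in the report, not the page itself.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record end tags and text so we can see what the parser reported."""

    def __init__(self):
        super().__init__()
        self.end_tags = []
        self.text = []

    def handle_endtag(self, tag):
        self.end_tags.append(tag)

    def handle_data(self, data):
        # Script content may arrive in several chunks, so collect them all.
        self.text.append(data)


# Hypothetical reduction of the reported page: the closing tag is split
# across two JavaScript string literals.
page = '<script>document.write("</sc" + "ript>");</script>'

logger = TagLogger()
logger.feed(page)
logger.close()

print(logger.end_tags)
print("".join(logger.text))
```

If the split "</sc" + "ript>" shows up in the printed text rather than as an extra end tag, the parser is treating the script body as plain data.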

--
components: +Library (Lib)
resolution:  -> out of date
stage:  -> committed/rejected
status: open -> closed
type:  -> behavior


[issue18946] HTMLParser should ignore errors when parsing text in script tags

2013-09-06 Thread Ezio Melotti


Ezio Melotti added the comment:

What version of Python are you using?

--
nosy: +ezio.melotti


[issue18946] HTMLParser should ignore errors when parsing text in script tags

2013-09-06 Thread James Lu


James Lu added the comment:

2.5, but I don't think the library has changed since.

james

On Fri, Sep 6, 2013 at 12:29 PM, Ezio Melotti rep...@bugs.python.org wrote:

> Ezio Melotti added the comment:
>
> What version of Python are you using?


Parsing Text file

2013-07-02 Thread sas429s

I have a text file like this:

Sometext
Somemore
Somemore
maskit

Sometext
Somemore
Somemore
Somemore
maskit

Sometext
Somemore
maskit

I want to search for the string "maskit" in this file and also print the
"Sometext" line above it. The location of Sometext can vary, as you can see above.

In the first instance it is 3 lines above maskit, in the second instance it is 
4 lines above it, and so on.

Could you please help me with how to do this?
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Parsing Text file

2013-07-02 Thread Neil Cerutti

How can you tell the difference between Sometext and Somemore?

-- 
Neil Cerutti

Re: Parsing Text file

2013-07-02 Thread sas429s

Somemore can be anything for instance:

Sometext
mail
maskit

Sometext
rupee
dollar
maskit

and so on..

Is there a way I can achieve this?


Re: Parsing Text file

2013-07-02 Thread Tobiah



How do we know whether we have Sometext?
If it's really just a literal 'Sometext', then
just print that when you hit maskit.

Otherwise:


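for line in open('file.txt'):
    line = line.rstrip('\n')  # drop the newline so the comparison below can match

    # is_sometext() is a placeholder: it must recognise a "Sometext" line
    if is_sometext(line):
        memory = line

    if line == 'maskit':
        print(memory)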



Re: Parsing Text file

2013-07-02 Thread Neil Cerutti

Tobiah's solution fits what little we can make of your problem.

My feeling is that you've simplified your question a little too
much in hopes that it would help us provide a better solution.
Can you provide more context? 

-- 
Neil Cerutti

Re: Parsing Text file

2013-07-02 Thread Joshua Landau

My understanding of the question follows more like:

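# Python 3, UNTESTED

memory = []
with open('file.txt') as file:
    for raw in file:
        line = raw.rstrip('\n')  # compare without the trailing newline
        if line == 'maskit':
            # print everything remembered since the last blank line
            print(*memory, sep='\n')

        elif line:
            memory.append(line)

        else:
            # a blank line starts a new block
            memory = []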

Re: Parsing Text file

2013-07-02 Thread sas429s

Ok here is a snippet of the text file I have:

config/meal/governor_mode_config.h
  #define GOVERNOR_MODE_TASK_RATE       SSS_TID_0015MSEC
  #define GOVERNOR_MODE_WORK_MODE_MASK  (CEAL_MODE_WORK_MASK_GEAR       | \
                                         CEAL_MODE_WORK_MASK_PARK_BRAKE | \
                                         CEAL_MODE_WORK_MASK_VEHICLE_SPEED)
  #define GOVERNOR_MODE_IDLE_CHECK      FALSE
  #define GOVERNOR_MODE_SPD_THRES       50
  #define GOVERNOR_MODE_SPDDES_THRES    10

config/meal/components/source/kso_aic_core_config.h
  #define CEAL_KSO_AIC_CORE_TASK_RATE          SSS_TID_0120MSEC
  #define CEAL_KSO_AIC_LOAD_FAC_AVG_TIME       300
  #define CEAL_KSO_AIC_LOAD_FAC_HYST_TIME      30
  #define CEAL_KSO_AIC_TEMP_DPF_INSTALLED      TRUE
  #define CEAL_KSO_AIC_TEMP_DPF_ENABLE         450
  #define CEAL_KSO_AIC_TEMP_DPF_HYST           25
  #define CEAL_KSO_AIC_DPF_ROC_TIME            10
  #define CEAL_KSO_AIC_TEMP_EXHAUST_INSTALLED  FALSE
  #define CEAL_KSO_AIC_TEMP_EXHAUST_ENABLE     275
  #define CEAL_KSO_AIC_TEMP_EXHAUST_HYST       25
  #define CEAL_KSO_AIC_EXHAUST_ROC_TIME        10
  #define CEAL_KSO_AIC_WORK_MODE_MASK          (CEAL_MODE_WORK_MASK_GEAR       | \
                                                CEAL_MODE_WORK_MASK_PARK_BRAKE | \
                                                CEAL_MODE_WORK_MASK_VEHICLE_SPEED)
  #define CEAL_KSO_AIC_OV_TIME                 15

Here I am looking for the line that contains WORK_MODE_MASK. I want to print 
that line as well as the file name above it: config/meal/governor_mode_config.h
or config/meal/components/source/ceal_PackD_kso_aic_core_config.h.

So the output should be something like this:
config/meal/governor_mode_config.h

#define GOVERNOR_MODE_WORK_MODE_MASK  (CEAL_MODE_WORK_MASK_GEAR       | \
                                       CEAL_MODE_WORK_MASK_PARK_BRAKE | \
                                       CEAL_MODE_WORK_MASK_VEHICLE_SPEED)

config/meal/components/source/kso_aic_core_config.h
#define CEAL_KSO_AIC_WORK_MODE_MASK   (CEAL_MODE_WORK_MASK_GEAR       | \
                                       CEAL_MODE_WORK_MASK_PARK_BRAKE | \
                                       CEAL_MODE_WORK_MASK_VEHICLE_SPEED)

I hope this helps..

Thanks for your help



Re: Parsing Text file

2013-07-02 Thread Joshua Landau

(Please don't top-post.)

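filename = None

with open("tmp.txt") as file:
    # lines read from a file keep their "\n", so test the stripped text
    nonblanklines = (line for line in file if line.strip())

    for line in nonblanklines:
        if line.lstrip().startswith("#define"):
            defn, name, *other = line.split()
            if name.endswith("WORK_MODE_MASK"):
                print(filename, line, sep="")

        else:
            # anything that is not a #define is taken to be a file name
            filename = line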

Basically, you loop through remembering what lines you need, match a
little bit and ignore blank lines. If this isn't a solid
specification, you'll 'ave to tell me more about the edge-cases.

You said that

 #define CEAL_KSO_AIC_WORK_MODE_MASK   (CEAL_MODE_WORK_MASK_GEAR   | \
CEAL_MODE_WORK_MASK_PARK_BRAKE | \
CEAL_MODE_WORK_MASK_VEHICLE_SPEED)

was one line. If it is not, I suggest doing a pre-process to wrap
lines with trailing \s before running the algorithm:

def wrapped(lines):
    wrap = ""
    for line in lines:
        if line.rstrip().endswith("\\"):
            wrap += line

        else:
            yield wrap + line
            wrap = ""

...
    nonblanklines = (line for line in wrapped(file) if line.strip())
...


This doesn't handle all wrapped lines properly, as it leaves the \
in so may interfere with matching. That's easily fixable, and there
are many other ways to do this.
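
Putting the two pieces together might look like the sketch below. It is not the code from the message above: join_continuations and find_masks are hypothetical names, the function takes any iterable of lines instead of opening tmp.txt, and it assumes (as the loop above does) that every non-blank line that is not a #define is a file name.

```python
def join_continuations(lines):
    """Yield logical lines, merging physical lines that end in a backslash."""
    pending = ""
    for line in lines:
        pending += line
        if not line.rstrip().endswith("\\"):
            yield pending
            pending = ""
    if pending:  # input ended in the middle of a continuation
        yield pending


def find_masks(lines, suffix="WORK_MODE_MASK"):
    """Return (filename, definition) pairs for #define names ending in suffix."""
    filename = None
    found = []
    for line in join_continuations(lines):
        if not line.strip():
            continue  # skip blank lines
        if line.lstrip().startswith("#define"):
            parts = line.split()
            # parts[1] is the macro name; guard against a bare "#define"
            if len(parts) > 1 and parts[1].endswith(suffix):
                found.append((filename, line))
        else:
            filename = line.strip()
    return found


# Made-up sample in the same shape as the posted file.
sample = [
    "config/a.h\n",
    "  #define A_RATE 15\n",
    "  #define A_WORK_MODE_MASK (X | \\\n",
    "      Y)\n",
    "\n",
    "config/b.h\n",
    "  #define B_TIME 10\n",
]

for name, definition in find_masks(sample):
    print(name)
    print(definition, end="")
```

Joining the continuation lines first means a wrapped definition travels as one string, so the continuation lines are never mistaken for file names.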

What did you try?

Re: Parsing Text file

2013-07-02 Thread Denis McMahon

On Tue, 02 Jul 2013 13:28:33 -0700, sas429s wrote:

> Ok here is a snippet of the text file I have:
> I hope this helps..
> .
> Thanks for your help

ok ... so you need to figure out how best to distinguish the filename, 
then loop through the file, remember each filename as you find it, and 
when you find lines containing your target text, print the current value 
of filename and the target text line.

Filenames might be distinguished by one or more of the following:

They always start in column 0 and nothing else starts in column 0
They never contain spaces and all other lines contain spaces or are blank
They always contain at least one / character
They always terminate with a . followed by one or more characters
All the characters in them are lower case

Then loop through the file in something like the following manner:

open input file;
open output file;
for each line in input file: {
    if line is a filename: {
        thisfile = line; }
    elif line matches search term: {
        print thisfile in output file;
        print line in output file; } }
close input file;
close output file;

(Note this is an algorithm written in a sort of pythonic manner, rather 
than actual python code - also because some newsreaders may break 
indenting etc, I've used ; as line terminators and {} to group blocks)
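
For reference, the algorithm above could be turned into Python roughly as follows. This is a sketch under assumptions: is_filename and report are hypothetical names, the filename test combines only three of the listed criteria (starts in column 0, no spaces, contains a /), and the result is returned as a list rather than written to an output file.

```python
def is_filename(line):
    """Heuristic: starts in column 0, has no spaces, contains a '/'."""
    text = line.rstrip("\n")
    return bool(text) and not text[0].isspace() and " " not in text and "/" in text


def report(lines, target):
    """Return each matching line, preceded by the filename it belongs to."""
    out = []
    thisfile = None
    for line in lines:
        if is_filename(line):
            thisfile = line.rstrip("\n")
        elif target in line:
            out.append(thisfile)
            out.append(line.rstrip("\n"))
    return out


# Made-up sample in the same shape as the posted file.
lines = [
    "config/a.h\n",
    "  #define A_RATE 15\n",
    "  #define A_WORK_MODE_MASK 3\n",
    "\n",
    "config/b.h\n",
    "  #define B_TIME 10\n",
]

print("\n".join(report(lines, "WORK_MODE_MASK")))
```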

-- 
Denis McMahon, denismfmcma...@gmail.com

Re: parsing text from ethtool command

2011-11-02 Thread extraspecialbitter

Ian,

Replacing my regular expression with line.strip().startswith did the
trick.  Thanks for the tip!

Paul

Re: parsing text from ethtool command

2011-11-02 Thread Jean-Michel Pichavant


Hi, without starting a flamewar about regular expressions, they can 
sometimes be useful and really simplify code:


s1 = """eth0      Link encap:Ethernet  HWaddr 00:1d:09:2b:d2:be
          inet addr:192.168.200.176  Bcast:192.168.200.255  Mask:255.255.255.0
          inet6 addr: fe80::21d:9ff:fe2b:d2be/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:297475688 errors:0 dropped:7 overruns:0 frame:2
          TX packets:248662722 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2795194692 (2.6 GiB)  TX bytes:2702265420 (2.5 GiB)
          Interrupt:17

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:5595504 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5595504 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1601266268 (1.4 GiB)  TX bytes:1601266268 (1.4 GiB)
"""


import re

# list of interface sections, filtering out the empty ones
itfs = [section for section in s1.split('\n\n') if section and section != '\n']

for itf in itfs:
    match = re.search(r'^(\w+)', itf)  # the word at the beginning of the section
    interface = match and match.group(1)
    match = re.search(r'MTU:(\d+)', itf)  # the MTU: field and its numeric value
    mtu = (match and match.group(1)) or 'MTU not found'
    print('%s %s' % (interface, mtu))  # this form works on Python 2 and 3


 eth0 1500
 lo 16436

If you're not familiar with python regexp, I would advise to use 
kodos.py (google it), it really does help.
The strong point about the code above, is that it removes all the 
tedious if then else logic and the arbitrary slice indexes.


JM

PS : I cannot test the 'Speed' because it's absent from my ifconfig 
display, but you should be able to figure it out :o)
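
The same idea applied to the 'Speed' field might look like the sketch below. It is untested against a real ethtool; parse_speed is a hypothetical helper and the sample string only imitates the output quoted in this thread.

```python
import re


def parse_speed(ethtool_output):
    """Return the text after 'Speed:' (for example '100Mb/s'), or None."""
    # re.MULTILINE lets ^ match at the start of every line, and \s* skips
    # whatever indentation ethtool puts in front of the field name.
    match = re.search(r'^\s*Speed:\s*(\S+)', ethtool_output, re.MULTILINE)
    return match.group(1) if match else None


# Made-up excerpt in the shape of the ethtool output shown in this thread.
sample = "Settings for eth1:\n\tSpeed: 100Mb/s\n\tDuplex: Full\n"

print(parse_speed(sample))
print(parse_speed("Settings for lo:\n\tLink detected: yes\n"))
```

Returning None when the field is missing leaves it to the caller to substitute a default such as "(Speed Unknown)".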


parsing text from ethtool command

2011-11-01 Thread extraspecialbitter

I'm still trying to write that seemingly simple Python script to print
out network interfaces (as found in the "ifconfig -a" command) and
their speed ("ethtool <interface>").  The idea is to loop over each
interface and print out its speed.  I'm looping correctly, but have some
issues parsing the output for all interfaces except for the "pan0"
interface.  I'm running on eth1, and the "ifconfig -a" command also
shows an eth0, and of course lo.  My script is trying to match on the
string "Speed", but I never seem to successfully enter the "if"
clause.

First, here is the output of ethtool eth1:

=

Settings for eth1:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: off
        Supports Wake-on: pumbag
        Wake-on: g
        Current message level: 0x0001 (1)
        Link detected: yes

=

The script *should* match on the string "Speed" and then assign "100Mb/s"
to a variable, but is never getting past the second "if" statement
below:

=

#!/usr/bin/python

# Quick and dirty script to print out available interfaces and their speed

# Initializations

output = " Interface: %s Speed: %s"
noinfo = "(Speed Unknown)"
speed  = noinfo

import os, socket, types, subprocess

fp = os.popen("ifconfig -a")
dat=fp.read()
dat=dat.split('\n')
for line in dat:
    if line[10:20] == "Link encap":
       interface=line[:9]
       cmd = "ethtool " + interface
       gp = os.popen(cmd)
       fat=gp.read()
       fat=fat.split('\n')
       for line in fat:
           if line[0:6] == "Speed":
               try:
                   speed=line[8:]
               except:
                   speed=noinfo
    print output % (interface, speed)

=

Again, I appreciate everyone's patience, as I'm obviously a Python
newbie.  Thanks in advance!

Re: parsing text from ethtool command

2011-11-01 Thread Miki Tebeka

In my box, there are some spaces (tabs?) before "Speed". IMO
re.search("Speed", line) will be more robust.

Re: parsing text from ethtool command

2011-11-01 Thread Ian Kelly

On Tue, Nov 1, 2011 at 5:19 PM, Miki Tebeka miki.teb...@gmail.com wrote:
> In my box, there are some spaces (tabs?) before "Speed". IMO
> re.search("Speed", line) will be more robust.

Or simply:

if "Speed" in line:

There is no need for a regular expression here.  This would also work
and be a bit more discriminating:

if line.strip().startswith("Speed"):

BTW, to the OP, note that your condition (line[0:6] == "Speed") cannot
match, since line[0:6] is a 6-character slice, while "Speed" is a
5-character string.

Cheers,
Ian
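
A short sketch of the difference, assuming a line shaped like the ethtool output quoted in this thread (the leading tab is a guess, and speed_from_line is a hypothetical helper, not something from the original script):

```python
def speed_from_line(line):
    """Return the value after "Speed:" or None if this is not the Speed line."""
    stripped = line.strip()
    if stripped.startswith("Speed:"):
        return stripped[len("Speed:"):].strip()
    return None


# Hypothetical line shaped like the ethtool output, with a leading tab.
line = "\tSpeed: 100Mb/s"

# Six characters can never equal the five-character string "Speed".
print(line[0:6] == "Speed")
print(speed_from_line(line))
```

Stripping first also removes the need to know how many characters of indentation precede the field name.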

Re: Parsing text

2009-05-07 Thread Suraj Barkale

Are you trying to match Roman numerals? As others have said, it is difficult to
make any suggestions without knowing the input to your program.

You may want to look at PyParsing (http://pyparsing.wikispaces.com/) to parse
the text file without messing with regular expressions.

Regards,
Suraj


Parsing text

2009-05-06 Thread iainemsley

Hi,
I'm trying to write a fairly basic text parser to split up scenes and
acts in plays to put them into XML. I've managed to get the text split
into the blocks of scenes and acts and returned correctly but I'm
trying to refine this and get the relevant scene number when the split
is made but I keep getting a NoneType error trying to read the block
inside the for loop and nothing is being returned. I'd be grateful for
some suggestions as to how to get this working.

for scene in text.split('Scene'):
    num = re.compile("^\s\[0-9, i{1,4}, v]", re.I)
    textNum = num.match(scene)
    if textNum:
        print textNum
    else:
        print "No scene number"
    m = '<div type="scene>'
    m += scene
    m += '<\div>'
    print m

Thanks, Iain

Re: Parsing text

2009-05-06 Thread Shawn Milochik


Can you provide some sample input so we can recreate the problem?

Also, consider something like this instead of the concatenation:

m = '<div type="scene">%s</div>' % (scene,)

Re: Parsing text

2009-05-06 Thread Scott David Daniels



You'll get a lot better help if you:
(1) Include enough code to run and encounter the problem.
Edit this down to something small (in the process,
you may discover what was wrong).
(2) Include actual sample data demonstrating the problem.
and (3) Cut and paste the _actual_ error message and traceback
from your output when running the sample code with the
sample data.
For extra points, identify the Python version you are using.

--Scott David Daniels
scott.dani...@acm.org

Re: Parsing text

2009-05-06 Thread MRAB



The problem is with your regular expression. Unfortunately, I can't tell
what you're trying to match. Could you provide some examples of the
scene numbers?

Re: Parsing text

2009-05-06 Thread Tim Chase


> I'm trying to write a fairly basic text parser to split up scenes and
> acts in plays to put them into XML. I've managed to get the text split
> into the blocks of scenes and acts and returned correctly but I'm
> trying to refine this and get the relevant scene number when the split
> is made but I keep getting an NoneType error trying to read the block
> inside the for loop and nothing is being returned. I'd be grateful for
> some suggestions as to how to get this working.
>
> for scene in text.split('Scene'):
>     num = re.compile("^\s\[0-9, i{1,4}, v]", re.I)


The first thing that occurs to me is that this should likely be a 
raw string to get those backslashes into the regexp.  Compare:


  >>> print "^\s\[0-9, i{1,4}, v]"
  >>> print r"^\s\[0-9, i{1,4}, v]"

Without an excerpt of the actual text (or at least the lead-in 
for each scene), it's hard to tell whether this regex finds what 
you expect.  It doesn't look like your regexp finds what you may 
think it does (it looks like you're using commas .
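
A note on the raw-string advice: as far as I can tell, "\s" and "\[" are not escape sequences Python recognises, so for this particular pattern both forms reach the regex engine unchanged (newer interpreters warn about such escapes). The difference shows up with an escape that Python does recognise, such as "\b". The sketch below uses a made-up pattern, not the one from the post.

```python
import re

plain = "\bScene\b"   # "\b" is a backspace character in an ordinary string
raw = r"\bScene\b"    # the raw string keeps backslash + "b" for the regex engine

print(len(plain), len(raw))
print(re.search(plain, "Act I Scene 2"))   # looks for literal backspaces
print(re.search(raw, "Act I Scene 2"))     # looks for word boundaries
```

Writing every pattern as a raw string avoids having to remember which escapes Python interprets itself.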


Just so you're aware, your split is a bit fragile too, in case 
any lines contain "Scene".  However, with a proper regexp, you 
can even use it to split the scenes *and* tag the scene-number. 
Something like


  >>> import re
  >>> s = """Scene [42]
  ... this is stuff in the 42nd scene
  ... Scene [IIV]
  ... stuff in the other scene
  ... """
  >>> r = re.compile(r"Scene\s+\[(\d+|[ivx]+)]", re.I)
  >>> r.split(s)[1:]
  ['42', '\nthis is stuff in the 42nd scene\n', 'IIV', '\nstuff in the other scene\n']

  >>> def grouper(iterable, groupby):
  ...     iterable = iter(iterable)
  ...     while True:
  ...         yield [iterable.next() for _ in range(groupby)]
  ...
  >>> for scene, content in grouper(r.split(s)[1:], 2):
  ...     print "<div class='scene'><h1>%s</h1><p>%s</p></div>" % (scene, content)
  ...
  <div class='scene'><h1>42</h1><p>
  this is stuff in the 42nd scene
  </p></div>
  <div class='scene'><h1>IIV</h1><p>
  stuff in the other scene
  </p></div>

Play accordingly.

-tkc
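
The session above is Python 2 (print statement, iterable.next()). A rough Python 3 equivalent is sketched below; scenes_to_html is a hypothetical name, and slicing replaces the grouper helper because the split result simply alternates between scene number and scene content.

```python
import re

SCENE = re.compile(r"Scene\s+\[(\d+|[ivx]+)\]", re.I)


def scenes_to_html(text):
    """Return one <div> string per scene found in text."""
    parts = SCENE.split(text)[1:]  # drop whatever precedes the first heading
    # parts alternates: number, content, number, content, ...
    pairs = zip(parts[0::2], parts[1::2])
    return ["<div class='scene'><h1>%s</h1><p>%s</p></div>" % (number, content)
            for number, content in pairs]


text = "Scene [42]\nfirst\nScene [IIV]\nsecond\n"
for div in scenes_to_html(text):
    print(div)
```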





Re: Parsing text

2009-05-06 Thread Stefan Behnel

iainemsley wrote:
> for scene in text.split('Scene'):
>     num = re.compile("^\s\[0-9, i{1,4}, v]", re.I)
>     textNum = num.match(scene)

Not related to your problem, but to your code - I'd write this as follows:

    match_scene_num = re.compile("^\s\[0-9, i{1,4}, v]", re.I).match

    for scene_section in text.split('Scene'):
        text_num = match_scene_num(scene_section)

This makes the code more readable and avoids unnecessary work inside the loop.

Stefan

Re: Parsing text

2009-05-06 Thread Rhodri James

On Wed, 06 May 2009 19:32:28 +0100, iainemsley iainems...@googlemail.com wrote:

> Hi,
> I'm trying to write a fairly basic text parser to split up scenes and
> acts in plays to put them into XML. I've managed to get the text split
> into the blocks of scenes and acts and returned correctly but I'm
> trying to refine this and get the relevant scene number when the split
> is made but I keep getting an NoneType error trying to read the block
> inside the for loop and nothing is being returned. I'd be grateful for
> some suggestions as to how to get this working.


With neither a sample of your data nor the traceback you get, this is
going to require some crystal ball work.  Assuming that all you've got
is running text, I should warn you now that getting this right is a
hard task.  Getting it apparently right and having it fall over in a
heap or badly mangle the text is, unfortunately, very easy.


> for scene in text.split('Scene'):

Not a safe start.  This will split on the word "Scenery" as well, for
example, and doesn't guarantee you the start of a scene by a long way.

>     num = re.compile("^\s\[0-9, i{1,4}, v]", re.I)

This is almost certainly not going to do what you expect, because all
those backslashes in the string are going to get processed as escape
characters before the string is ever passed to re.compile.  Even if
you fix that (by doubling the backslashes or making it a raw string),
I sincerely doubt that this is the regular expression you want.  As
escaped, it matches in sequence:

  * the start of the string
  * a space, tab, newline or other whitespace character.  Just the one.
  * the literal string "[0-9, "
  * either "i" or "I" repeated between one and four times
  * the literal string ", "
  * either "v" or "V"
  * the literal string "]"

Assuming you didn't mean to escape the open square bracket doesn't help:

  * the start of the string
  * one whitespace character
  * one of the following characters: "0123456789, iI{}vV"
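
The two readings listed above can be checked directly. This is only a sketch: the test strings are made up, and both patterns are written as raw strings so that Python's own escape handling stays out of the picture.

```python
import re

# The pattern as posted, and the same pattern without the backslash before "[".
escaped = re.compile(r"^\s\[0-9, i{1,4}, v]", re.I)
unescaped = re.compile(r"^\s[0-9, i{1,4}, v]", re.I)

# As escaped, only the odd literal text "[0-9, " ... ", v]" can match.
print(bool(escaped.match(" [0-9, II, v]")))
print(bool(escaped.match(" 3")))

# Unescaped, it is one whitespace character plus one character from the set.
print(bool(unescaped.match(" 3")))
print(bool(unescaped.match(" x")))
```

Neither form captures a scene number such as "3" or "iv" as a group, which supports the doubt expressed above that this is the expression that was wanted.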

Also, what the heck is this doing *inside* the for loop?


>     textNum = num.match(scene)

If you're using re.match(), the ^ on the regular expression is
redundant.

>     if textNum:
>         print textNum

textNum is the match object, so printing it won't tell you much.  In
particular, it isn't going to produce well-formed XML.

>     else:
>         print "No scene number"

Nor will this.

>     m = '<div type="scene>'

Missing close double quotes after 'scene'.

>     m += scene
>     m += '<\div>'
>     print m


I'm seeing nothing here that should produce an error message that
has anything to do with NoneType.  Any chance of (a) a more accurate
code sample, (b) the traceback, or (c) sample data?

--
Rhodri James *-* Wildebeeste Herder to the Masses

Re: Parsing text

2009-05-06 Thread C or L Smith


Don't forget that when you split the text, the first piece you get is what came 
*before* the thing you split on so there won't be a scene number in the first 
piece.

###
>>> print 'this foo 1 and that foo 2 and the end'.split('foo')
['this ', ' 1 and that ', ' 2 and the end']
###

If you have material before the first occurrence of the word 'Scene' you will 
want to print that out without decoration.

Also, it looks like you are trying to say with your regex that the scene number 
will come after some space and be a digit followed by a roman numeral of some 
kind(?). If the number looks like 1iii or 2iv, then you could split your 
text with a regex rather than with str.split:

###
>>> scene=re.compile('Scene\s+([0-9iIvV]+)')
>>> scene.split('The front matter Scene 1i The beginning was the best. Scene 1ii And then came the next act.')
['The front matter ', '1i', ' The beginning was the best. ', '1ii', ' And then came the next act.']
###

The \s+ indicates that there will be at least one whitespace character and 
maybe more; the human error factor predicts that you will sometimes use more 
than one space after the word Scene, so \s+ just allows for that possibility.

The 0-9iIvV lists the characters that might be part of your scene number. 
Since it's unlikely that any word appearing after Scene will match that 
pattern, it isn't written to be exact in specifying what should come next. [1] 
The parentheses tell split what (besides the pieces left by removing the split 
target) should be returned. In this case, the parentheses were put around the 
pattern that (maybe) represents your scene number, and so the scene numbers 
are interspersed with the list of pieces.
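
Here is a minimal sketch of how the interleaved list could be turned back into scenes, assuming Python 3. The n attribute on the div and the names front_matter and scenes are assumptions made for illustration, not something from the original post.

```python
import re

scene_re = re.compile(r'Scene\s+([0-9iIvV]+)')
text = ('The front matter Scene 1i The beginning was the best. '
        'Scene 1ii And then came the next act.')

parts = scene_re.split(text)
front_matter = parts[0]                        # whatever came before the first Scene
scenes = list(zip(parts[1::2], parts[2::2]))   # (scene number, scene text) pairs

for number, body in scenes:
    # Hypothetical markup; adjust to whatever XML the project really needs.
    print('<div type="scene" n="%s">%s</div>' % (number, body.strip()))
```

Because the captured numbers always sit at the odd positions of the list, pairing them with the even positions keeps each number next to its own text.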

/chris

[1] If it were more precise it might be '([1-9][0-9]*(iv|v?i{0,3}))', which 
recognizes that a number should start with 1 or above and perhaps be followed 
by 0 or more digits (including 0), and then come the roman numeral 
possibilities (for up to viii) [2].  The | means "or", and the parentheses go 
around the roman numeral part to indicate that the "or" doesn't extend back to 
the decimal digits. That extra set of parentheses also means that the split 
will now contain TWO captured pieces between each piece of script. If you put 
a ? after the scene number part, meaning that it may or may not be there, None 
will be returned for the groups that did not match:

###
>>> scene=re.compile('Scene\s+([1-9][0-9]*(iv|v?i{0,3}))?')
>>> scene.split('The front matter Scene 1i The beginning was the best. Scene 1ii And then came the next act. Scene The last one has no number.')
['The front matter ', '1i', 'i', ' The beginning was the best. ', '1ii', 'ii', ' And then came the next act. ', None, None, 'The last one has no number.']
###
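
If the second captured piece is unwanted, one option is a non-capturing group, (?:...), around the roman numeral alternatives. This is a sketch assuming Python 3; it is a variation on the pattern above, not something from the original post.

```python
import re

# (?:...) groups the alternatives without adding an extra item to the split result.
scene_re = re.compile(r'Scene\s+([1-9][0-9]*(?:iv|v?i{0,3}))?')
text = ('The front matter Scene 1i The beginning was the best. '
        'Scene 1ii And then came the next act. '
        'Scene The last one has no number.')

parts = scene_re.split(text)
for number, body in zip(parts[1::2], parts[2::2]):
    # number is None when the optional scene number was not present
    print(number, repr(body))
```

With only one capturing group, each scene again contributes exactly one number (or None) to the list.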

[2] http://diveintopython.org/regular_expressions/roman_numerals.html

Re: parsing text from a file

2009-01-30 Thread Tim Golden


Wes James wrote:

If I read a windows registry file with a line like this:

{C15039B5-C47C-47BD-A698-A462F4148F52}=v2.0|Action=Allow|Active=TRUE|Dir=In|Protocol=6|Profile=Public|App=C:\\Program
Files\\LANDesk\\LDClient\\tmcsvc.exe|Name=LANDesk Targeted
Multicast|Edge=FALSE|



Watch out. .reg files exported from the registry are typically
in UTF16. Notepad and other editors will recognise this and
display what you see above, but if you were to, say, do this:


print repr (open ("blah.reg").read ())

You might see a different picture. If that's the case, you'll
have to use the codecs module or decode the string you read.


TJG

Re: parsing text from a file

2009-01-30 Thread John Machin

On Jan 30, 7:39 pm, Tim Golden m...@timgolden.me.uk wrote:
 Wes James wrote:
  If I read a windows registry file with a line like this:

  {C15039B5-C47C-47BD-A698-A462F4148F52}=v2.0|Action=Allow|Active=TRUE|Dir=In|Protocol=6|Profile=Public|App=C:\\Program
  Files\\LANDesk\\LDClient\\tmcsvc.exe|Name=LANDesk Targeted
  Multicast|Edge=FALSE|

 Watch out. .reg files exported from the registry are typically
 in UTF16. Notepad and other editors will recognise this and
 display what you see above, but if you were to, say, do this:

 print repr (open ("blah.reg").read ())

 You might see a different picture. If that's the case, you'll
 have to use the codecs module or decode the string you read.


Ha! That's why it appeared to print LAND instead of LANDesk -- it
found and was printing L\0A\0N\0D.
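
A small sketch of why seven bytes starting at the first L come out as LAND, assuming Python 3 and assuming the file is little-endian UTF-16 (the byte order mark is left out here for brevity). The sample line is shortened from the one in the thread.

```python
line = u'App=C:\\Program Files\\LANDesk\\LDClient\\tmcsvc.exe'
raw = line.encode('utf-16-le')    # the bytes as they would sit in the file

print(b'LANDesk' in raw)          # False: every letter is followed by a NUL byte
i = raw.index(b'L')
print(raw[i:i + 7])               # b'L\x00A\x00N\x00D'

decoded = raw.decode('utf-16-le')
print('LANDesk' in decoded)       # True once the bytes are decoded
```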

parsing text from a file

2009-01-29 Thread Wes James

If I read a windows registry file with a line like this:

{C15039B5-C47C-47BD-A698-A462F4148F52}=v2.0|Action=Allow|Active=TRUE|Dir=In|Protocol=6|Profile=Public|App=C:\\Program
Files\\LANDesk\\LDClient\\tmcsvc.exe|Name=LANDesk Targeted
Multicast|Edge=FALSE|

with this code:

f=open('fwrules.reg2.txt')

for s in f:
  if s.find('LANDesk') <0:
    print s,


LANDesk is not found.

Also this does not work:

for s in f:
  try:
    i=s.index('L')
    print s[i:i+7]
  except:
    pass

all it prints is LAND

how do I find LANDesk in a string like this.  is the \\ messing things up?

thx,

-wj

Re: parsing text from a file

2009-01-29 Thread Vlastimil Brom

2009/1/29 Wes James compte...@gmail.com:
 If I read a windows registry file with a line like this:

...

 with this code:

 f=open('fwrules.reg2.txt')

 for s in f:
   if s.find('LANDesk') <0:
     print s,


 LANDesk is not found.

 how do I find LANDesk in a string like this.  is the \\ messing things up?
...

 thx,

 -wj

Hi,
 if s.find('LANDesk') <0:
is True for a line which doesn't contain LANDesk; if you want the
opposite, try
 if s.find('LANDesk') >-1:
hth
  vbr

Re: parsing text from a file

2009-01-29 Thread John Machin

On Jan 30, 8:54 am, Wes James compte...@gmail.com wrote:
 If I read a windows registry file with a line like this:

 {C15039B5-C47C-47BD-A698-A462F4148F52}=v2.0|Action=Allow|Active=TRUE|Dir=In|Protocol=6|Profile=Public|App=C:\\Program
 Files\\LANDesk\\LDClient\\tmcsvc.exe|Name=LANDesk Targeted
 Multicast|Edge=FALSE|

 with this code:

 f=open('fwrules.reg2.txt')

 for s in f:
   if s.find('LANDesk') <0:
     print s,

 LANDesk is not found.

You mean it's not printed. That code prints all lines that don't
contain LANDesk


 Also this does not work:

 for s in f:
   try:
     i=s.index('L')
     print s[i:i+7]
  except:

Using except ValueError: would be safer.

    pass

 all it prints is LAND


AFAICT your reported outcome is impossible given that such a line
exists in the file.

 how do I find LANDesk in a string like this.

What you were trying (second time, or first time with >= instead) should
work. I suggest that to diagnose your problem you change the second
snippet as follows:
1. use except ValueError:
2. print s, len(s), i, and s.find('L') for all lines
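
The two diagnostic steps might look like the sketch below, assuming Python 3. The function name describe_lines and the shape of the report are made up for illustration; the idea is only to collect repr(s), len(s), the index() result and the find() result for every line.

```python
def describe_lines(lines, needle='L'):
    """Collect (repr(line), length, index-or-None, find result) for each line."""
    report = []
    for s in lines:
        try:
            i = s.index(needle)
        except ValueError:      # raised only when needle is absent
            i = None
        report.append((repr(s), len(s), i, s.find(needle)))
    return report

for row in describe_lines(['LANDesk\n', 'none']):
    print(row)
```

Printing repr(s) rather than s is what makes stray NUL bytes or other surprises visible.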

  is the \\ messing things up?

Each \\ is presumably just the repr() of a single backslash. In any
case whether there are 0,1,2 or many backslashes in a line or the repr
() thereof has nothing to do with your problem.

HTH,
John

Re: parsing text from a file

2009-01-29 Thread Tim Chase


 if s.find('LANDesk') <0:
is True for a line which doesn't contain LANDesk; if you want the
opposite, try
 if s.find('LANDesk') >-1:


Or more pythonically, just use

  if 'LANDesk' in s:
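
A short sketch comparing the two tests, assuming Python 3; the helper names are only for illustration. Note that find() returns 0 for a match at the very start of the string, so a test written with > 0 would miss that case as well.

```python
def has_landesk_find(line):
    return line.find('LANDesk') >= 0    # find() returns -1 when absent

def has_landesk_in(line):
    return 'LANDesk' in line            # same answer, easier to read

for line in ['LANDesk Targeted Multicast', 'Name=LANDesk', 'Edge=FALSE']:
    print(line.find('LANDesk'), has_landesk_find(line), has_landesk_in(line))
```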

-tkc




Re: parsing text from a file

2009-01-29 Thread MRAB


Wes James wrote:

If I read a windows registry file with a line like this:

{C15039B5-C47C-47BD-A698-A462F4148F52}=v2.0|Action=Allow|Active=TRUE|Dir=In|Protocol=6|Profile=Public|App=C:\\Program
Files\\LANDesk\\LDClient\\tmcsvc.exe|Name=LANDesk Targeted
Multicast|Edge=FALSE|

with this code:

f=open('fwrules.reg2.txt')

for s in f:
  if s.find('LANDesk') <0:
    print s,


LANDesk is not found.

Also this does not work:

for s in f:
  try:
    i=s.index('L')
    print s[i:i+7]
  except:
    pass

all it prints is LAND

how do I find LANDesk in a string like this.  is the \\ messing things up?

How do you know what's in the file? Did you use an editor? It might be 
that the file contents are encoded in, say, UTF-16 and the editor is 
detecting that and decoding it for you, but Python's open() function is 
just returning the contents as a bytestring (Python 2.x).


Try:

import codecs
f = codecs.open('fwrules.reg2.txt', encoding='UTF-16')

for s in f:
    if u'LANDesk' in s:
        print s,

f.close()
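
For readers on Python 3, the built-in open() accepts an encoding argument, so codecs.open is not needed there. This is a minimal sketch under that assumption; lines_containing is a hypothetical helper name, and 'utf-16' assumes the file starts with a byte order mark, as exported .reg files typically do.

```python
def lines_containing(path, needle, encoding='utf-16'):
    """Return the decoded lines of the file at path that contain needle."""
    with open(path, encoding=encoding) as f:
        return [line for line in f if needle in line]
```

Calling lines_containing('fwrules.reg2.txt', 'LANDesk') would then return the matching lines as text rather than raw bytes.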

Re: Parsing text file with #include and #define directives

2008-04-25 Thread python

Arnaud,

Wow!!! That's beautiful. Thank you very much!

Malcolm

snip

I think it's straightforward enough to be dealt with simply.  Here is
a solution that doesn't handle errors but should work with well-formed
input and handles recursive expansions.

expand(filename) returns an iterator over expanded lines in the file,
inserting lines of included files.

import re

def expand(filename):
    defines = {}
    def define_repl(matchobj):
        return defines[matchobj.group(1)]
    define_regexp = re.compile('#(.+?)#')
    for line in open(filename):
        if line.startswith('#include '):
            recfilename = line.strip().split(None, 1)[1]
            for recline in expand(recfilename):
                yield recline
        elif line.startswith('#define '):
            _, name, value = line.strip().split(None, 2)
            defines[name] = value
        else:
            yield define_regexp.sub(define_repl, line)

It would be easy to modify it to keep track of line numbers and file
names.

/snip

Parsing text file with #include and #define directives

2008-04-24 Thread python

I'm parsing a text file for a proprietary product that has the following
2 directives:

#include somefile
#define name value

Defined constants are referenced via #name# syntax.

I'm looking for a single text stream that results from processing a file
containing these directives. Even better would be an iterator(?) type
object that tracked file names and line numbers as it returns individual
lines.

Is there a Python parsing library to handle this type of task or am I
better off writing my own?

The effort to write one from scratch doesn't seem too difficult (minus
recursive file and constant loops), but I wanted to avoid re-inventing
the wheel if this type of component already exists.

Thank you,

Malcolm

Re: Parsing text file with #include and #define directives

2008-04-24 Thread Arnaud Delobelle

[EMAIL PROTECTED] writes:

 I'm parsing a text file for a proprietary product that has the following
 2 directives:

 #include somefile
 #define name value

 Defined constants are referenced via #name# syntax.

 I'm looking for a single text stream that results from processing a file
 containing these directives. Even better would be an iterator(?) type
 object that tracked file names and line numbers as it returns individual
 lines.

 Is there a Python parsing library to handle this type of task or am I
 better off writing my own?

 The effort to write one from scratch doesn't seem too difficult (minus
 recursive file and constant loops), but I wanted to avoid re-inventing
 the wheel if this type of component already exists.

 Thank you,

 Malcolm

I think it's straightforward enough to be dealt with simply.  Here is
a solution that doesn't handle errors but should work with well-formed
input and handles recursive expansions.

expand(filename) returns an iterator over expanded lines in the file,
inserting lines of included files.

import re

def expand(filename):
    defines = {}
    def define_repl(matchobj):
        return defines[matchobj.group(1)]
    define_regexp = re.compile('#(.+?)#')
    for line in open(filename):
        if line.startswith('#include '):
            recfilename = line.strip().split(None, 1)[1]
            for recline in expand(recfilename):
                yield recline
        elif line.startswith('#define '):
            _, name, value = line.strip().split(None, 2)
            defines[name] = value
        else:
            yield define_regexp.sub(define_repl, line)

It would be easy to modify it to keep track of line numbers and file
names.
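
One way that modification might look, as a sketch assuming Python 3. The name expand_with_positions is hypothetical, and passing a single defines dictionary down into included files is a design choice made here (so that constants defined in an include are visible afterwards), not something stated in the original.

```python
import re

DEFINE_REF = re.compile(r'#(.+?)#')

def expand_with_positions(filename, defines=None):
    """Yield (filename, line_number, expanded_line) for each output line."""
    if defines is None:
        defines = {}
    with open(filename) as source:
        for lineno, line in enumerate(source, 1):
            if line.startswith('#include '):
                included = line.strip().split(None, 1)[1]
                # Recurse, sharing the same table of defined constants.
                for item in expand_with_positions(included, defines):
                    yield item
            elif line.startswith('#define '):
                _, name, value = line.strip().split(None, 2)
                defines[name] = value
            else:
                expanded = DEFINE_REF.sub(lambda m: defines[m.group(1)], line)
                yield filename, lineno, expanded
```

Like the original, this does no error handling: an undefined #name# raises KeyError and a file that includes itself recurses until Python gives up.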

HTH

-- 
Arnaud

parsing text in blocks and line too

2007-04-12 Thread flyzone

Good morning people :)
I have just started to learn this language and I have a logic
problem.
I need to write a program to parse various text files.
Here are two samples:

---
trial text bla bla bla bla error
  bla bla bla bla bla
  bla bla bla on more lines
trial text bla bla bla bla warning bla
  bla bla more bla to be grouped with warning
  bla bla bla on more lines
  could be one two or ten lines also withouth the tab beginning
again text
text can contain also blank lines
text no delimiters
--
Apr  8 04:02:08 machine text on one line
Apr  8 04:02:09 machine this is an error
Apr  8 04:02:10 machine this is a warning
--
When parsing the file, I need to decide whether each line/group is an
error, a warning, or should be skipped.
My problem is how to do this logically: if I read line by line, I'll
catch the error/warning on the first line, but the second, third and
later lines of the group will be skipped by the check.
If I read a group of lines, I could lose the order of the output: my
idea is to produce HTML output with each line in the color of its check
(yellow for warning, red for error).
I also have many rules to follow, so if I read one rule and then search
the entire file for it, the check will be really slow.

Hope someone could give me some tips.
Thanks in advance


Re: parsing text in blocks and line too

2007-04-12 Thread A.T.Hofkamp

On 2007-04-12, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 Goodmorning people :)
 I have just started to learn this language and i have a logical
 problem.
 I need to write a program to parse various file of text.
 Here two sample:

 ---
 trial text bla bla bla bla error
   bla bla bla bla bla
   bla bla bla on more lines
 trial text bla bla bla bla warning bla
   bla bla more bla to be grouped with warning
   bla bla bla on more lines
   could be one two or ten lines also withouth the tab beginning
 again text
 text can contain also blank lines
 text no delimiters
 --
 Apr  8 04:02:08 machine text on one line
 Apr  8 04:02:09 machine this is an error
 Apr  8 04:02:10 machine this is a warning
 --

I would first read groups of lines that belong together, then decide on each
group whether it is an error, warning, or whatever.
To preserve order in a group of lines, you can use lists.

From your example you could first compute a list of lists, like

[ [ "trial text bla bla bla bla error",
    "bla bla bla bla bla",
    "bla bla bla on more lines" ],
  [ "trial text bla bla bla bla warning bla",
    "bla bla more bla to be grouped with warning",
    "bla bla bla on more lines",
    "could be one two or ten lines also withouth the tab beginning" ],
  [ "again text" ],
  [ "text can contain also blank lines" ],
  [ ],
  [ "text no delimiters" ]
]

Just above the text no delimiters line I have added an empty line, and I
translated that to an empty group of lines (denoted with the empty list).

By traversing the groups (ie over the outermost list), you can now decide for
each group what type of output it is, and act accordingly.

 Hope someone could give me some tips.

Sure, however, in general it is appreciated if you first show your own efforts
before asking the list for a solution.

Albert

Re: parsing text in blocks and line too

2007-04-12 Thread James Stroud

A.T.Hofkamp wrote:
 On 2007-04-12, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 Goodmorning people :)
 I have just started to learn this language and i have a logical
 problem.
 I need to write a program to parse various file of text.
 Here two sample:

 ---
 trial text bla bla bla bla error
   bla bla bla bla bla
   bla bla bla on more lines
 trial text bla bla bla bla warning bla
   bla bla more bla to be grouped with warning
   bla bla bla on more lines
   could be one two or ten lines also withouth the tab beginning
 again text
 text can contain also blank lines
 text no delimiters
 --
 Apr  8 04:02:08 machine text on one line
 Apr  8 04:02:09 machine this is an error
 Apr  8 04:02:10 machine this is a warning
 --
 
 I would first read groups of lines that belong together, then decide on each
 group whether it is an error, warning, or whatever.
 To preserve order in a group of lines, you can use lists.
 
 From your example you could first compute a list of lists, like
 
 [ [ trial text bla bla bla bla error,
   bla bla bla bla bla,
   bla bla bla on more lines ],
   [ trial text bla bla bla bla warning bla,
   bla bla more bla to be grouped with warning,
   bla bla bla on more lines,
   could be one two or ten lines also withouth the tab beginning ],
   [ again text ],
   [ text can contain also blank lines ],
   [ ],
   [ text no delimiters ]
 ]
 
 Just above the text no delimiters line I have added an empty line, and I
 translated that to an empty group of lines (denoted with the empty list).
 
 By traversing the groups (ie over the outermost list), you can now decide for
 each group what type of output it is, and act accordingly.
 
 Hope someone could give me some tips.
 
 Sure, however, in general it is appreciated if you first show your own efforts
 before asking the list for a solution.
 
 Albert

If the first line of each group has zero indent and the other lines in 
the group are indented, group the lines like this:

blocks = []
block = []
for line in lines:
    if not line.startswith(' '):
        if block:
            blocks.append(block)
        block = []
    block.append(line)
if block:
    blocks.append(block)

But if 0 indent doesn't start a new block, don't expect this to work, 
but that is what I infer from your limited sample.

You can then look for warnings, etc., in the blocks--either in the loop 
to save memory or in the constructed blocks list.
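
Putting the grouping together with the colored HTML output the original poster asked for could look like the sketch below, assuming Python 3. The keyword test in classify, the COLORS table and the div markup are all assumptions for illustration; the real rules would replace classify.

```python
import html

COLORS = {'error': 'red', 'warning': 'yellow'}

def group_blocks(lines):
    """Start a new block at every line that is not indented."""
    blocks, block = [], []
    for line in lines:
        if not line.startswith((' ', '\t')) and block:
            blocks.append(block)
            block = []
        block.append(line)
    if block:
        blocks.append(block)
    return blocks

def classify(block):
    """Very naive rule: look for a keyword anywhere in the block."""
    text = ' '.join(block).lower()
    if 'error' in text:
        return 'error'
    if 'warning' in text:
        return 'warning'
    return None          # skip this block

def to_html(lines):
    out = []
    for block in group_blocks(lines):
        kind = classify(block)
        if kind is None:
            continue
        for line in block:   # original line order is preserved
            out.append('<div style="color: %s">%s</div>'
                       % (COLORS[kind], html.escape(line.strip())))
    return out
```

Each block is classified once and its lines are written out in file order, so the whole file is read a single time no matter how many rules classify ends up checking.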

James



Re: Parsing text

2005-12-20 Thread Bengt Richter

On 19 Dec 2005 15:15:10 -0800, sicvic [EMAIL PROTECTED] wrote:

I was wondering if theres a way where python can read through the lines
of a text file searching for a key phrase then writing that line and
all lines following it up to a certain point, such as until it sees a
string of -

Right now I can only have python write just the line the key phrase is
found in.

This sounds like homework, so just a (big) hint: have a look at itertools
dropwhile and takewhile. The solution is potentially a one-liner, depending
on your matching criteria (e.g., case-sensitive fixed string vs regular 
expression).
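
For the record, the hint might be spelled out roughly as follows. This is a sketch assuming Python 3, a fixed-string key and a separator line that starts with two dashes; the function name is made up for illustration.

```python
from itertools import dropwhile, takewhile

def lines_from_key_to_separator(lines, key, separator_prefix='--'):
    """Return the line containing key and the lines after it, up to the separator."""
    after_key = dropwhile(lambda line: key not in line, lines)
    return list(takewhile(lambda line: not line.startswith(separator_prefix),
                          after_key))
```

dropwhile skips everything before the key phrase and takewhile stops at the first separator, so only the first matching chunk is returned.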

Regards,
Bengt Richter

Re: Parsing text

2005-12-20 Thread sicvic

Not homework...not even in school (do any universities even teach
classes using python?). Just not a programmer. Anyways I should
probably be more clear about what I'm trying to do.

Since I can't show the actual output file, let's say I had an output file
that looked like this:

a b Person: Jimmy
Current Location: Denver
Next Location: Chicago
--
a b Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
--

Now I want to put (and all recurrences of Person: Jimmy)

Person: Jimmy
Current Location: Denver
Next Location: Chicago

in a file called jimmy.txt

and the same for Sarah in sarah.txt

The code I currently have looks something like this:

import re
import sys

person_jimmy = open('jimmy.txt', 'w') #creates jimmy.txt
person_sarah = open('sarah.txt', 'w') #creates sarah.txt

f = open(sys.argv[1]) #opens output file
#loop that goes through all lines and parses specified text
for line in f.readlines():
    if re.search(r'Person: Jimmy', line):
        person_jimmy.write(line)
    elif re.search(r'Person: Sarah', line):
        person_sarah.write(line)

#closes all files

person_jimmy.close()
person_sarah.close()
f.close()

However, this only produces output files that look like this:

jimmy.txt:

a b Person: Jimmy

sarah.txt:

a b Person: Sarah

My question is what else do I need to add (such as an embedded loop
where the if statements are?) so the files look like this

a b Person: Jimmy
Current Location: Denver
Next Location: Chicago

and

a b Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York


Basically I need to add statements that after finding that line copy
all the lines following it and stopping when it sees
'--'

Any help is greatly appreciated.


Re: Parsing text

2005-12-20 Thread rzed

sicvic [EMAIL PROTECTED] wrote in
news:[EMAIL PROTECTED]: 

 Not homework...not even in school (do any universities even
 teach classes using python?). Just not a programmer. Anyways I
 should probably be more clear about what I'm trying to do.
 
 Since I cant show the actual output file lets say I had an
 output file that looked like this:
 
 a b Person: Jimmy
 Current Location: Denver
 Next Location: Chicago
 --
 a b Person: Sarah
 Current Location: San Diego
 Next Location: Miami
 Next Location: New York
 --
 
 Now I want to put (and all recurrences of Person: Jimmy)
 
 Person: Jimmy
 Current Location: Denver
 Next Location: Chicago
 
 in a file called jimmy.txt
 
 and the same for Sarah in sarah.txt
 
 The code I currently have looks something like this:
 
 import re
 import sys
 
 person_jimmy = open('jimmy.txt', 'w') #creates jimmy.txt
 person_sarah = open('sarah.txt', 'w') #creates sarah.txt
 
 f = open(sys.argv[1]) #opens output file
 #loop that goes through all lines and parses specified text
 for line in f.readlines():
 if  re.search(r'Person: Jimmy', line):
  person_jimmy.write(line)
 elif re.search(r'Person: Sarah', line):
  person_sarah.write(line)
 
 #closes all files
 
 person_jimmy.close()
 person_sarah.close()
 f.close()
 
 However this only would produces output files that look like
 this: 
 
 jimmy.txt:
 
 a b Person: Jimmy
 
 sarah.txt:
 
 a b Person: Sarah
 
 My question is what else do I need to add (such as an embedded
 loop where the if statements are?) so the files look like this
 
 a b Person: Jimmy
 Current Location: Denver
 Next Location: Chicago
 
 and
 
 a b Person: Sarah
 Current Location: San Diego
 Next Location: Miami
 Next Location: New York
 
 
 Basically I need to add statements that after finding that line
 copy all the lines following it and stopping when it sees
 '--'
 
 Any help is greatly appreciated.
 

Something like this, maybe?


This iterates through a file, with subloops to handle the 
special cases. I'm assuming that Jimmy and Sarah are not the
only people of interest. I'm also assuming (for no very good
reason) that you do want the separator lines, but do not want 
the Person: lines in the output file. It is easy enough to 
adjust those assumptions to taste.

Each Person: line will cause a file to be opened (if it is 
not already open), and the subsequent lines will be written to 
it until the separator is found. Be aware that all files remain 
open until the loop at the end closes them all.


outfs = {}
f = open('shouldBeDatabase.txt')
for line in f:
    if line.find('Person:') >= 0:
        ofkey = line[line.find('Person:')+7:].strip()
        if not ofkey in outfs:
            outfs[ofkey] = open('%s.txt' % ofkey, 'w')
        outf = outfs[ofkey]
        while line.find('-') < 0:
            line = f.next()
            outf.write('%s' % line)
f.close()
for k,v in outfs.items():
    v.close()

-- 
rzed

Re: Parsing text

2005-12-20 Thread Gerard Flanagan

sicvic wrote:

 Since I cant show the actual output file lets say I had an output file
 that looked like this:

 a b Person: Jimmy
 Current Location: Denver

It may be the output of another process but it's the input file as far
as the parsing code is concerned.

The code below gives the following output, if that's any help ( just
adapting Noah's idea above).  Note that it deals with the input as a
single string rather than line by line.


Jimmy
Jimmy.txt

Current Location: Denver
Next Location: Chicago

Sarah
Sarah.txt

Current Location: San Diego
Next Location: Miami
Next Location: New York



data='''
a b Person: Jimmy
Current Location: Denver
Next Location: Chicago
--
a b Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
--
'''

import StringIO
import re


src = StringIO.StringIO(data)

for name in ['Jimmy', 'Sarah']:
    exp = "(?s)Person: %s(.*?)--" % name
    filename = "%s.txt" % name
    info = re.findall(exp, src.getvalue())[0]
    print name
    print filename
    print info



hth

Gerard


Re: Parsing text

2005-12-20 Thread Scott David Daniels

sicvic wrote:
 Not homework...not even in school (do any universities even teach
 classes using python?). 
Yup, at least 6, and 20 wouldn't surprise me.

 The code I currently have looks something like this:
 ...
 f = open(sys.argv[1]) #opens output file
 #loop that goes through all lines and parses specified text
 for line in f.readlines():
 if  re.search(r'Person: Jimmy', line):
   person_jimmy.write(line)
 elif re.search(r'Person: Sarah', line):
   person_sarah.write(line)
Using re here seems pretty excessive.
How about:
 ...
 f = open(sys.argv[1])  # opens input file ### get comments right
 source = iter(f)  # files serve lines at their own pace.  Let them
 for line in source:
     if line.endswith('Person: Jimmy\n'):
         dest = person_jimmy
     elif line.endswith('Person: Sarah\n'):
         dest = person_sarah
     else:
         continue
     while line != '---\n':
         dest.write(line)
         line = source.next()
 f.close()
 person_jimmy.close()
 person_sarah.close()
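
The same idea can be written without hard-coding the names. This is a sketch assuming Python 3 (where source.next() becomes next(source)) and a separator line that starts with two dashes. The pattern Person:\s+(\w+) is borrowed from elsewhere in this thread; split_records and is_separator are hypothetical names.

```python
import re

PERSON = re.compile(r'Person:\s+(\w+)')

def split_records(lines, is_separator=lambda line: line.startswith('--')):
    """Map each person's name to the lines of all of that person's records."""
    records = {}
    current = None                  # list receiving lines, or None between records
    for line in lines:
        match = PERSON.search(line)
        if match:
            current = records.setdefault(match.group(1), [])
            current.append(line)
        elif is_separator(line):
            current = None
        elif current is not None:
            current.append(line)
    return records
```

Each value can then be written out with something like open(name.lower() + '.txt', 'w').writelines(lines_for_name), which also covers a person who appears more than once.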

--Scott David Daniels
[EMAIL PROTECTED]

Re: Parsing text

2005-12-20 Thread sicvic

Thank you everyone!!!

I got a lot more information than I expected. You guys got my brain
thinking in the right direction and starting to like programming.
You've got a great community here. Keep it up.

Thanks,
Victor


Re: Parsing text

2005-12-20 Thread Bengt Richter

On 20 Dec 2005 08:06:39 -0800, sicvic [EMAIL PROTECTED] wrote:

Not homework...not even in school (do any universities even teach
classes using python?). Just not a programmer. Anyways I should
probably be more clear about what I'm trying to do.
Ok, not homework.


Since I cant show the actual output file lets say I had an output file
that looked like this:

a b Person: Jimmy
Current Location: Denver
Next Location: Chicago
--
a b Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
--

Now I want to put (and all recurrences of Person: Jimmy)

Person: Jimmy
Current Location: Denver
Next Location: Chicago

in a file called jimmy.txt

and the same for Sarah in sarah.txt

The code I currently have looks something like this:

import re
import sys

person_jimmy = open('jimmy.txt', 'w') #creates jimmy.txt
person_sarah = open('sarah.txt', 'w') #creates sarah.txt

f = open(sys.argv[1]) #opens output file
#loop that goes through all lines and parses specified text
for line in f.readlines():
if  re.search(r'Person: Jimmy', line):
   person_jimmy.write(line)
elif re.search(r'Person: Sarah', line):
   person_sarah.write(line)

#closes all files

person_jimmy.close()
person_sarah.close()
f.close()

However this only would produces output files that look like this:

jimmy.txt:

a b Person: Jimmy

sarah.txt:

a b Person: Sarah

My question is what else do I need to add (such as an embedded loop
where the if statements are?) so the files look like this

a b Person: Jimmy
Current Location: Denver
Next Location: Chicago

and

a b Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York


Basically I need to add statements that after finding that line copy
all the lines following it and stopping when it sees
'--'

Any help is greatly appreciated.

Ok, I generalized on your theme of extracting file chunks to named files,
where the beginning line has the file name. I made '.txt' a hardcoded extension.
I provided a way to direct the output to a (I guess not necessarily sub)
directory. Not tested beyond what you see. Tweak to suit.

 extractfilesegs.py 


"""
Usage: [python] extractfilesegs [source [outdir [startpat [endpat]]]]
where source is -tf for test file, a file name, or an open file
      outdir is a directory prefix that will be joined to output file names
      startpat is a regular expression with group 1 giving the extracted file name
      endpat is a regular expression whose match line is excluded and ends the segment
"""
import re, os

def extractFileSegs(linesrc, outdir='extracteddata', start=r'Person:\s+(\w+)', stop='-'*30):
    rxstart = re.compile(start)
    rxstop = re.compile(stop)
    if isinstance(linesrc, basestring): linesrc = open(linesrc)
    lineit = iter(linesrc)
    files = []
    for line in lineit:
        match = rxstart.search(line)
        if not match: continue
        name = match.group(1)
        filename = name.lower() + '.txt'
        filename = os.path.join(outdir, filename)
        #print 'opening file %r'%filename
        files.append(filename)
        fout = open(filename, 'a') # append in case repeats?
        fout.write(match.group(0)+'\n') # did you want aaa bbb stuff?
        for data_line in lineit:
            if rxstop.search(data_line):
                #print 'closing file %r'%filename
                fout.close() # don't write line with ending mark
                fout = None
                break
            else:
                fout.write(data_line)
        if fout:
            fout.close()
            print 'file %r ended with source file EOF, not stop mark'%filename
    return files

def get_testfile():
    from StringIO import StringIO
    return StringIO("""\
...irrelevant leading
stuff ...
a b Person: Jimmy
Current Location: Denver
Next Location: Chicago
--
a b Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
--
irrelevant
trailing stuff ...

with a blank line
""")

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]
    if not args: raise SystemExit(__doc__)
    tf = args.pop(0)
    if tf=='-tf': fin = get_testfile()
    else: fin = tf
    if not args:
        files = extractFileSegs(fin)
    elif len(args)==1:
        files = extractFileSegs(fin, args[0])
    elif len(args)==2:
        files = extractFileSegs(fin, args[0], args[1], '^$') # stop on blank line?
    else:
        files = extractFileSegs(fin, args[0], '|'.join(args[1:-1]), args[-1])
    print '\nFiles created:'
    for fname in files:
        print '%s'% fname
    if tf == '-tf':
        for fpath in files:
            print ' %s

Parsing text

2005-12-19 Thread sicvic

I was wondering if theres a way where python can read through the lines
of a text file searching for a key phrase then writing that line and
all lines following it up to a certain point, such as until it sees a
string of -

Right now I can only have python write just the line the key phrase is
found in.

Thanks,
Victor


Re: Parsing text

2005-12-19 Thread Peter Hansen

sicvic wrote:
 I was wondering if theres a way where python can read through the lines
 of a text file searching for a key phrase then writing that line and
 all lines following it up to a certain point, such as until it sees a
 string of -
 
 Right now I can only have python write just the line the key phrase is
 found in.

That's a good start.  Maybe you could post the code that you've already 
got that does this, and people could comment on it and help you along. 
(I'm suggesting that partly because this almost sounds like homework, 
but you'll benefit more by doing it this way than just by having an 
answer handed to you whether this is homework or not.)

-Peter


Re: Parsing text

2005-12-19 Thread Noah

sicvic wrote:
 I was wondering if theres a way where python can read through the lines
 of a text file searching for a key phrase then writing that line and
 all lines following it up to a certain point, such as until it sees a
 string of -
...
 Thanks,
 Victor

You did not specify the key phrase that you are looking for, so for
the sake of this example I will assume that it is "key phrase".
I assume that you don't want "key phrase" or "-" to be returned
as part of your match, so we use minimal group matching (.*?)
You also want your regular expression to use the re.DOTALL flag, because
this is how you match across multiple lines. The simplest way to set this
flag is to put it at the front of your regular expression using the (?s)
notation.

This gives you something like this:
print re.findall ("(?s)key phrase(.*?)-", your_string_to_search) [0]

So what that basically says is:
1. Match multiline -- that is, match across lines (?s)
2. Match key phrase
3. Capture the group matching everything (.*?)
4. Match -
5. Print the first match in the list [0]

Yours,
Noah
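
As a self-contained illustration of the approach above, here is a minimal
sketch. The sample text and the choice of "three or more dashes" as the end
marker are assumptions made for the example, not part of the original
question. It uses search() rather than findall()[0], so that a missing match
gives None instead of an IndexError.

```python
import re

text = """intro line
key phrase first wanted line
second wanted line
----------
trailing line
"""

# Assumption: the end marker is a run of three or more dashes.
# re.DOTALL lets "." match newlines; "(.*?)" captures as little as possible.
pattern = re.compile(r"key phrase(.*?)-{3,}", re.DOTALL)

match = pattern.search(text)
captured = match.group(1) if match else None
```

Here captured holds everything between the key phrase and the dashes,
including the newline characters in between.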


Re: Parsing text into dates?

2005-05-17 Thread John Machin

On Tue, 17 May 2005 16:44:12 -0500, Mike Meyer [EMAIL PROTECTED] wrote:

> Thomas W [EMAIL PROTECTED] writes:
>
> > I'm developing a web-application where the user sometimes has to enter
> > dates in plain text, although a format may be provided to give clues.
> > On the server side this piece of text has to be parsed into a datetime
> > Python object. Does anybody have any pointers on this?
>
> Why are you making it possible for the users to screw this up? Don't
> give them a text widget to fill in, leaving you to figure out what the
> format is; give them three widgets so you *know* what's what.
>
> In doing that, you can also go to dropdown widgets for month, with
> month names (in a locale appropriate for the page language), and for
> the days in the month.

My experience: drop-down lists generate off-by-one errors. They also
annoy the bejaysus out of users -- e.g. year of birth, a 60+ element
list. It's quite possible of course that YMMV :-)

BTW: I have seen a web page with a drop-down list for year of birth
where the first 18 entries were current year, current year - 1,
etc for a transaction that wasn't for minors.





Parsing text into dates?

2005-05-16 Thread Thomas W

I'm developing a web-application where the user sometimes has to enter
dates in plain text, although a format may be provided to give clues.
On the server side this piece of text has to be parsed into a datetime
Python object. Does anybody have any pointers on this?

Besides the actual parsing, my main concern is the different locale
date formats and how to be able to parse those strange US-style
month/day/year dates compared to the clever and intuitive European-style
day/month/year etc.

I've searched Google, but haven't found any good references that helped
me solve this problem, especially with regard to the locale date
format issues.

Best regards,
Thomas


Re: Parsing text into dates?

2005-05-16 Thread John Machin

On 16 May 2005 13:59:31 -0700, Thomas W [EMAIL PROTECTED]
wrote:

> I'm developing a web-application where the user sometimes has to enter
> dates in plain text, although a format may be provided to give clues.
> On the server side this piece of text has to be parsed into a datetime
> Python object. Does anybody have any pointers on this?
>
> Besides the actual parsing, my main concern is the different locale
> date formats and how to be able to parse those strange US-style
> month/day/year dates compared to the clever and intuitive European-style
> day/month/year etc.

<rant>
Well I'm from a locale that uses the dd/mm/yyyy style and I think it's
only marginally less stupid than the mm/dd/yyyy style.
</rant>

How much intuition is required to determine in an international
context what was meant by 01/12/2004? First of December or 12th of
January? The consequences of misinterpretation can be enormous.

If this application is being deployed from a central server where the
users can be worldwide, you have two options:

(a) try to work out somehow what the user's locale is, and then work
with dates in the legacy format appropriate to the locale.

(b) Use the considerably-less-stupid ISO 8601 standard format
yyyy-mm-dd (e.g. 2004-12-01) -- throughout your web-application, not
just in your data entry.

Having said all of that, [bottom-up question] how are you handling
locale differences in language, script, currency symbol, decimal
point, thousands separator, postal address formats, surname /
given-name order, etc etc etc? [top-down question] What *is* your
target audience?
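
To make the ambiguity concrete, here is a small sketch using only the
standard library. The variable names are illustrative; the two format
strings simply encode the day-first and month-first readings of the same
text, while the ISO 8601 form has only one reading.

```python
from datetime import datetime

ambiguous = "01/12/2004"

# The same text yields two different dates depending on the assumed order.
day_first = datetime.strptime(ambiguous, "%d/%m/%Y").date()
month_first = datetime.strptime(ambiguous, "%m/%d/%Y").date()

# ISO 8601 text has a single reading, and isoformat() writes it back out.
iso_date = datetime.strptime("2004-12-01", "%Y-%m-%d").date()
iso_text = iso_date.isoformat()
```

Both strptime() calls succeed, which is exactly the problem: nothing in the
input says which reading the user intended.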



Re: Parsing text into dates?

2005-05-16 Thread Peter Hansen

John Machin wrote:
> If this application is being deployed from a central server where the
> users can be worldwide, you have two options:
>
> (a) try to work out somehow what the user's locale is, and then work
> with dates in the legacy format appropriate to the locale.

And this inevitably screws a large number of Canadians (and probably 
others), those poor conflicted folk caught between their European roots 
and their American neighbours, some of whom use mm/dd/yy and others of 
whom use dd/mm/yy on a regular basis.  And some of us who switch 
willy-nilly, much as we do between metric and imperial. :-(

> (b) Use the considerably-less-stupid ISO 8601 standard format
> yyyy-mm-dd (e.g. 2004-12-01) -- throughout your web-application, not
> just in your data entry.

+1 (emphatically!)  (I almost always use this form even on government 
submissions, and nobody has complained yet.  Of course, they haven't 
started changing the forms yet, either...)

-Peter

Re: Parsing text into dates?

2005-05-16 Thread George Sakkis

Thomas W wrote:

> I'm developing a web-application where the user sometimes has to enter
> dates in plain text, although a format may be provided to give clues.
> On the server side this piece of text has to be parsed into a datetime
> Python object. Does anybody have any pointers on this?
>
> Besides the actual parsing, my main concern is the different locale
> date formats and how to be able to parse those strange US-style
> month/day/year dates compared to the clever and intuitive European-style
> day/month/year etc.
>
> I've searched Google, but haven't found any good references that helped
> me solve this problem, especially with regard to the locale date
> format issues.
>
> Best regards,
> Thomas

Although it is not a solution to the general localization problem, you
may try the mx.DateTime.DateTimeFrom() factory function
(http://www.egenix.com/files/python/mxDateTime.html#DateTime) for the
parsing part. I also wrote, some time ago, a more robust and
customized version of such a parser. The ambiguous US/European style
dates are disambiguated by the provided optional argument USA (False by
default <wink>). Below is the doctest and the documentation (with
epydoc tags); mail me offlist if you'd like to check it out.

George

#===

def parseDateTime(string, USA=False, implyCurrentDate=False,
                  yearHeuristic=_20thcenturyHeuristic):
    '''Tries to parse a string as a valid date and/or time.

    It recognizes most common (and less common) date and time formats.

    Examples:
    >>> # doctest was run successfully on...
    >>> str(datetime.date.today())
    '2005-05-16'
    >>> str(parseDateTime('21:23:39.91'))
    '21:23:39.91'
    >>> str(parseDateTime('16:15'))
    '16:15:00'
    >>> str(parseDateTime('10am'))
    '10:00:00'
    >>> str(parseDateTime('2:7:18.'))
    '02:07:18'
    >>> str(parseDateTime('08:32:40 PM'))
    '20:32:40'
    >>> str(parseDateTime('11:59pm'))
    '23:59:00'
    >>> str(parseDateTime('12:32:9'))
    '12:32:09'
    >>> str(parseDateTime('12:32:9', implyCurrentDate=True))
    '2005-05-16 12:32:09'
    >>> str(parseDateTime('93/7/18'))
    '1993-07-18'
    >>> str(parseDateTime('15.6.2001'))
    '2001-06-15'
    >>> str(parseDateTime('6.15.2001'))
    '2001-06-15'
    >>> str(parseDateTime('1980, November 20'))
    '1980-11-20'
    >>> str(parseDateTime('4 Mar 79'))
    '1979-03-04'
    >>> str(parseDateTime('July 4'))
    '2005-07-04'
    >>> str(parseDateTime('15/08'))
    '2005-08-15'
    >>> str(parseDateTime('5 Mar 3:45pm'))
    '2005-03-05 15:45:00'
    >>> str(parseDateTime('01 02 2003'))
    '2003-02-01'
    >>> str(parseDateTime('01 02 2003', USA=True))
    '2003-01-02'
    >>> str(parseDateTime('3/4/92'))
    '1992-04-03'
    >>> str(parseDateTime('3/4/92', USA=True))
    '1992-03-04'
    >>> str(parseDateTime('12:32:09 1-2-2003'))
    '2003-02-01 12:32:09'
    >>> str(parseDateTime('12:32:09 1-2-2003', USA=True))
    '2003-01-02 12:32:09'
    >>> str(parseDateTime('3:45pm 5 12 2001'))
    '2001-12-05 15:45:00'
    >>> str(parseDateTime('3:45pm 5 12 2001', USA=True))
    '2001-05-12 15:45:00'

    @param USA: Disambiguates strings that are valid dates in both (month,
        day, year) and (day, month, year) order (e.g. 05/03/2002). If True,
        the first format is assumed.
    @param implyCurrentDate: If True and the date is not given, the current
        date is implied.
    @param yearHeuristic: If not None, a callable f(year) that transforms the
        value of the given year. The default heuristic transforms 2-digit
        years to 4-digit years assuming they are in the 20th century::
            lambda year: (year >= 100 and year
                          or year >= 10 and 1900 + year
                          or None)
        The heuristic should return None if the year is not considered valid.
        If yearHeuristic is None, no year transformation takes place.
    @return:
        - C{datetime.date} if only the date is recognized.
        - C{datetime.time} if only the time is recognized and implyCurrentDate
          is False.
        - C{datetime.datetime} if both date and time are recognized.
    @raise ValueError: If the string cannot be parsed successfully.
    '''
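
The default year heuristic described in the docstring can be spelled out as
an ordinary function. This is only a sketch: the name
twentieth_century_heuristic is hypothetical, and it assumes the comparison
operators lost from the archived lambda were ">=" in both places, which is
consistent with the '93/7/18' and '4 Mar 79' examples.

```python
def twentieth_century_heuristic(year):
    """Hypothetical stand-in for the default yearHeuristic described above."""
    if year >= 100:
        return year          # already a full year, keep it unchanged
    if year >= 10:
        return 1900 + year   # two-digit year, assumed to mean 19xx
    return None              # anything smaller is treated as invalid
```

Writing it with explicit if statements avoids the "and ... or" idiom, which
silently misbehaves whenever the intended result is a false value such as 0.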


Re: Parsing text into dates?

2005-05-16 Thread John Machin

On 16 May 2005 17:51:31 -0700, George Sakkis [EMAIL PROTECTED]
wrote:


> #===
>
> def parseDateTime(string, USA=False, implyCurrentDate=False,
>                   yearHeuristic=_20thcenturyHeuristic):
>     '''Tries to parse a string as a valid date and/or time.
>
>     It recognizes most common (and less common) date and time formats.

Impressive!

>     Examples:
>     [snip]
>     >>> str(parseDateTime('15.6.2001'))
>     '2001-06-15'
>     >>> str(parseDateTime('6.15.2001'))
>     '2001-06-15'

A dangerous heuristic -- 6.12.2001 (meaning 2001-12-06) can be easily
typoed into 6.13.2001 or 6.15.2001 on the numeric keypad.



Re: Parsing text into dates?

2005-05-16 Thread George Sakkis

John Machin [EMAIL PROTECTED] wrote:

> On 16 May 2005 17:51:31 -0700, George Sakkis [EMAIL PROTECTED]
> wrote:
>
> > #===
> >
> > def parseDateTime(string, USA=False, implyCurrentDate=False,
> >                   yearHeuristic=_20thcenturyHeuristic):
> >     '''Tries to parse a string as a valid date and/or time.
> >
> >     It recognizes most common (and less common) date and time formats.
>
> Impressive!
>
> >     Examples:
> >     [snip]
> >     >>> str(parseDateTime('15.6.2001'))
> >     '2001-06-15'
> >     >>> str(parseDateTime('6.15.2001'))
> >     '2001-06-15'
>
> A dangerous heuristic -- 6.12.2001 (meaning 2001-12-06) can be easily
> typoed into 6.13.2001 or 6.15.2001 on the numeric keypad.

Sure, but how is this different from a typo of 2001-12-07 instead of
2001-12-06 ? There's no way you can catch all typos anyway by parsing
alone. Besides, 6.15.2001 is to be interpreted as 2001-06-15 in US
format. Currently the 'USA' flag is used only for ambiguous dates, but
that's easy to change to apply to all dates. Essentially you would gain
a little extra safety at the expense of a little lost recall over the
set of parseable dates.

George


Re: Parsing text into dates?

2005-05-16 Thread John Roth

Thomas W [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
> I'm developing a web-application where the user sometimes has to enter
> dates in plain text, although a format may be provided to give clues.
> On the server side this piece of text has to be parsed into a datetime
> Python object. Does anybody have any pointers on this?
>
> Besides the actual parsing, my main concern is the different locale
> date formats and how to be able to parse those strange US-style
> month/day/year dates compared to the clever and intuitive European-style
> day/month/year etc.
>
> I've searched Google, but haven't found any good references that helped
> me solve this problem, especially with regard to the locale date
> format issues.

There is no easy answer if you want to be able to enter three
numbers. There are two answers that work, although there will
be a lot of complaining. One is to use the international yyyy-mm-dd
form, and the other is to accept a 4 digit year, an alphabetic month
and a two digit day in any order.

Otherwise, if you get 4 digits as the first component, and it passes your
validation (whatever that is) for reasonable years, you're probably
pretty safe to assume that you've got yyyy-mm-dd. Otherwise,
if you can't get a clean answer (one is > 31, one is 12 < x < 32
and one is <= 12), just give them a list of possibilities and politely
suggest that they enter it as yyyy-mm-dd next time.

I don't validate separators. As long as there is something that isn't a
number or a letter, it's a separator and which one doesn't matter. At
times I've even taken the transition between a digit and a letter as
a separator.

John  Roth
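
A rough sketch of the "list of possibilities" idea follows. The function
name candidate_dates is hypothetical, and the sketch makes simplifying
assumptions: it handles only three purely numeric fields, applies no
"reasonable year" validation, and expects a full year (two-digit years
would be taken literally).

```python
import re
from datetime import date


def candidate_dates(text):
    """Return every calendar date that three numeric fields could mean."""
    # Any run of non-alphanumeric characters counts as a separator.
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", text.strip()) if p]
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return []
    a, b, c = (int(p) for p in parts)
    if len(parts[0]) == 4:
        orders = [(a, b, c)]                 # yyyy-mm-dd
    else:
        orders = [(c, b, a), (c, a, b)]      # dd/mm/yyyy, then mm/dd/yyyy
    found = []
    for year, month, day in orders:
        try:
            d = date(year, month, day)
        except ValueError:
            continue                         # e.g. month 15 does not exist
        if d not in found:
            found.append(d)
    return found
```

A result with exactly one entry is a clean answer; two entries mean the
input is ambiguous and the choices can be shown back to the user.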

> Best regards,
> Thomas
 


Re: Parsing text into dates?

2005-05-16 Thread gene . tani

The beautiful brand new cookbook2 has Fuzzy parsing of Dates using
dateutil.parser, which you run once you have a decent guess at locale
(page 127 of cookbook)

John Roth wrote:
> Thomas W [EMAIL PROTECTED] wrote in message
> news:[EMAIL PROTECTED]
> > I'm developing a web-application where the user sometimes has to enter
> > dates in plain text, although a format may be provided to give clues.
> > On the server side this piece of text has to be parsed into a datetime
> > Python object. Does anybody have any pointers on this?
> >
> > Besides the actual parsing, my main concern is the different locale
> > date formats and how to be able to parse those strange US-style
> > month/day/year dates compared to the clever and intuitive European-style
> > day/month/year etc.
> >
> > I've searched Google, but haven't found any good references that helped
> > me solve this problem, especially with regard to the locale date
> > format issues.
>
> There is no easy answer if you want to be able to enter three
> numbers. There are two answers that work, although there will
> be a lot of complaining. One is to use the international yyyy-mm-dd
> form, and the other is to accept a 4 digit year, an alphabetic month
> and a two digit day in any order.
>
> Otherwise, if you get 4 digits as the first component, and it passes your
> validation (whatever that is) for reasonable years, you're probably
> pretty safe to assume that you've got yyyy-mm-dd. Otherwise,
> if you can't get a clean answer (one is > 31, one is 12 < x < 32
> and one is <= 12), just give them a list of possibilities and politely
> suggest that they enter it as yyyy-mm-dd next time.
>
> I don't validate separators. As long as there is something that isn't a
> number or a letter, it's a separator and which one doesn't matter. At
> times I've even taken the transition between a digit and a letter as
> a separator.
>
> John  Roth
>
> > Best regards,
> > Thomas
