Re: [Tutor] advice on regex matching for dates?

2008-12-12 Thread Lie Ryan
On Thu, 11 Dec 2008 23:38:52 +0100, spir wrote:

 Serdar Tumgoren a écrit :
 Hey everyone,
 
 I was wondering if there is a way to use the datetime module to check
 for variations on a month name when performing a regex match?
 
 In the script below, I created a regex pattern that checks for dates in
 the following pattern:  August 31, 2007. If there's a match, I can
 then print the capture date and the line from which it was extracted.
 
 While it works in this isolated case, it struck me as not very
 flexible. What happens when I inevitably get data that has dates
 formatted in a different way? Do I have to create some type of library
 that contains variations on each month name (i.e. - January, Jan., 01,
 1...) and use that to parse each line?
 
 Or is there some way to use datetime to check for date patterns when
 using regex? Is there a best practice in this area that I'm unaware
 of in this area?
 
 Apologies if this question has been answered elsewhere. I wasn't sure
 how to research this topic (beyond standard datetime docs), but I'd be
 happy to RTM if someone can point me to some resources.
 
 Any suggestions are welcome (including optimizations of the code
 below).
 
 Regards,
 Serdar
 
 #!/usr/bin/env python
 
 import re, sys
 
 sourcefile = open(sys.argv[1],'r')
 
 pattern =
 re.compile(r'(?PmonthJanuary|February|March|April|May|June|July|
August|September|October|November|December)\s(?Pday\d{1,2}),\s(?Pyear
\d{4})')
 
 pattern2 = re.compile(r'Return to List')
 
 counter = 0
 
 for line in sourcefile:
 x = pattern.search(line)
 break_point = pattern2.match(line)
 
 if x:
 counter +=1
 print %s %d, %d == %s % ( x.group('month'),
 int(x.group('day')),
 int(x.group('year')), line ),
 elif break_point:
 break
 
 print counter
 sourcefile.close()
 
 I just found a simple, but nice, trick to make regexes less unlegible.
 Using substrings to represent sub-patterns. E.g. instead of:
 
 p =
 re.compile(r'(?PmonthJanuary|February|March|April|May|June|July|
August|September|October|November|December)\s(?Pday\d{1,2}),\s(?Pyear
\d{4})')
 
 write first:
 
 month =
 r'(?PmonthJanuary|February|March|April|May|June|July|August|September|
October|November|December)'
 day = r'(?Pday\d{1,2})'
 year = r'(?Pyear\d{4})'
 
 then:
 p = re.compile( r%s\s%s,\s%s % (month,day,year) ) or even:
 p = re.compile( r%(month)s\s%(day)s,\s%(year)s %
 {'month':month,'day':day,'year':year} )
 

You might realize that that wouldn't help much and might even hide bugs 
(or make it harder to see). regex IS hairy in the first place. 

I'd vote for namespace in regex to allow a pattern inside a special pair 
tags to be completely ignorant of anything outside it. And also I'd vote 
for a compiled-pattern to be used as a subpattern. But then, with all 
those limbs, I might as well use a full-blown syntax parser instead.

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] advice on regex matching for dates?

2008-12-11 Thread Kent Johnson
On Thu, Dec 11, 2008 at 2:31 PM, Serdar Tumgoren zstumgo...@gmail.com wrote:
 Hey everyone,

 I was wondering if there is a way to use the datetime module to check for
 variations on a month name when performing a regex match?

Parsing arbitrary dates is hard, in fact the most general case is not
solvable since 7/8/07 could refer to July 8 or August 7 and the
century must be guessed.

The only approach I know of is to try the formats you expect until you
find one that works. Mark Pilgrim's feedparser has a pretty robust
date parser you might look at; look at the _parse_date() function in
this module:
http://code.google.com/p/feedparser/source/browse/trunk/feedparser/feedparser.py

Kent
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] advice on regex matching for dates?

2008-12-11 Thread spir

Serdar Tumgoren a écrit :

Hey everyone,

I was wondering if there is a way to use the datetime module to check for
variations on a month name when performing a regex match?

In the script below, I created a regex pattern that checks for dates in the
following pattern:  August 31, 2007. If there's a match, I can then print
the capture date and the line from which it was extracted.

While it works in this isolated case, it struck me as not very flexible.
What happens when I inevitably get data that has dates formatted in a
different way? Do I have to create some type of library that contains
variations on each month name (i.e. - January, Jan., 01, 1...) and use that
to parse each line?

Or is there some way to use datetime to check for date patterns when using
regex? Is there a best practice in this area that I'm unaware of in this
area?

Apologies if this question has been answered elsewhere. I wasn't sure how to
research this topic (beyond standard datetime docs), but I'd be happy to RTM
if someone can point me to some resources.

Any suggestions are welcome (including optimizations of the code below).

Regards,
Serdar

#!/usr/bin/env python

import re, sys

sourcefile = open(sys.argv[1],'r')

pattern =
re.compile(r'(?PmonthJanuary|February|March|April|May|June|July|August|September|October|November|December)\s(?Pday\d{1,2}),\s(?Pyear\d{4})')

pattern2 = re.compile(r'Return to List')

counter = 0

for line in sourcefile:
x = pattern.search(line)
break_point = pattern2.match(line)

if x:
counter +=1
print %s %d, %d == %s % ( x.group('month'), int(x.group('day')),
int(x.group('year')), line ),
elif break_point:
break

print counter
sourcefile.close()


I just found a simple, but nice, trick to make regexes less unlegible. Using 
substrings to represent sub-patterns. E.g. instead of:


p = 
re.compile(r'(?PmonthJanuary|February|March|April|May|June|July|August|September|October|November|December)\s(?Pday\d{1,2}),\s(?Pyear\d{4})')


write first:

month = 
r'(?PmonthJanuary|February|March|April|May|June|July|August|September|October|November|December)'

day = r'(?Pday\d{1,2})'
year = r'(?Pyear\d{4})'

then:
p = re.compile( r%s\s%s,\s%s % (month,day,year) )
or even:
p = re.compile( r%(month)s\s%(day)s,\s%(year)s % 
{'month':month,'day':day,'year':year} )


denis


___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] advice on regex matching for dates?

2008-12-11 Thread John Fouhy
On 12/12/2008, spir denis.s...@free.fr wrote:
  I just found a simple, but nice, trick to make regexes less unlegible.
 Using substrings to represent sub-patterns. E.g. instead of:
[...]

Another option is to use the re.VERBOSE flag.  This allows you to put
comments in your regular expression and use whitespace for formatting.
 See this article for an example:

http://diveintopython.org/regular_expressions/verbose.html

HTH!

-- 
John.
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] advice on regex matching for dates?

2008-12-11 Thread Serdar Tumgoren
Thanks to all for the replies!

The FeedParser example is mind-bending (at least for this noob), but of
course looks very powerful.  And thanks for the tips on cleaning up the
appearance of the code with sup-patterns. Those regex's can get hairy pretty
fast.
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor