Re: [Tutor] advice on regex matching for dates?
On Thu, 11 Dec 2008 23:38:52 +0100, spir wrote: Serdar Tumgoren a écrit : Hey everyone, I was wondering if there is a way to use the datetime module to check for variations on a month name when performing a regex match? In the script below, I created a regex pattern that checks for dates in the following pattern: August 31, 2007. If there's a match, I can then print the capture date and the line from which it was extracted. While it works in this isolated case, it struck me as not very flexible. What happens when I inevitably get data that has dates formatted in a different way? Do I have to create some type of library that contains variations on each month name (i.e. - January, Jan., 01, 1...) and use that to parse each line? Or is there some way to use datetime to check for date patterns when using regex? Is there a best practice in this area that I'm unaware of in this area? Apologies if this question has been answered elsewhere. I wasn't sure how to research this topic (beyond standard datetime docs), but I'd be happy to RTM if someone can point me to some resources. Any suggestions are welcome (including optimizations of the code below). Regards, Serdar #!/usr/bin/env python import re, sys sourcefile = open(sys.argv[1],'r') pattern = re.compile(r'(?PmonthJanuary|February|March|April|May|June|July| August|September|October|November|December)\s(?Pday\d{1,2}),\s(?Pyear \d{4})') pattern2 = re.compile(r'Return to List') counter = 0 for line in sourcefile: x = pattern.search(line) break_point = pattern2.match(line) if x: counter +=1 print %s %d, %d == %s % ( x.group('month'), int(x.group('day')), int(x.group('year')), line ), elif break_point: break print counter sourcefile.close() I just found a simple, but nice, trick to make regexes less unlegible. Using substrings to represent sub-patterns. E.g. instead of: p = re.compile(r'(?PmonthJanuary|February|March|April|May|June|July| August|September|October|November|December)\s(?Pday\d{1,2}),\s(?Pyear \d{4})') write first: month = r'(?PmonthJanuary|February|March|April|May|June|July|August|September| October|November|December)' day = r'(?Pday\d{1,2})' year = r'(?Pyear\d{4})' then: p = re.compile( r%s\s%s,\s%s % (month,day,year) ) or even: p = re.compile( r%(month)s\s%(day)s,\s%(year)s % {'month':month,'day':day,'year':year} ) You might realize that that wouldn't help much and might even hide bugs (or make it harder to see). regex IS hairy in the first place. I'd vote for namespace in regex to allow a pattern inside a special pair tags to be completely ignorant of anything outside it. And also I'd vote for a compiled-pattern to be used as a subpattern. But then, with all those limbs, I might as well use a full-blown syntax parser instead. ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] advice on regex matching for dates?
On Thu, Dec 11, 2008 at 2:31 PM, Serdar Tumgoren zstumgo...@gmail.com wrote: Hey everyone, I was wondering if there is a way to use the datetime module to check for variations on a month name when performing a regex match? Parsing arbitrary dates is hard, in fact the most general case is not solvable since 7/8/07 could refer to July 8 or August 7 and the century must be guessed. The only approach I know of is to try the formats you expect until you find one that works. Mark Pilgrim's feedparser has a pretty robust date parser you might look at; look at the _parse_date() function in this module: http://code.google.com/p/feedparser/source/browse/trunk/feedparser/feedparser.py Kent ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] advice on regex matching for dates?
Serdar Tumgoren a écrit : Hey everyone, I was wondering if there is a way to use the datetime module to check for variations on a month name when performing a regex match? In the script below, I created a regex pattern that checks for dates in the following pattern: August 31, 2007. If there's a match, I can then print the capture date and the line from which it was extracted. While it works in this isolated case, it struck me as not very flexible. What happens when I inevitably get data that has dates formatted in a different way? Do I have to create some type of library that contains variations on each month name (i.e. - January, Jan., 01, 1...) and use that to parse each line? Or is there some way to use datetime to check for date patterns when using regex? Is there a best practice in this area that I'm unaware of in this area? Apologies if this question has been answered elsewhere. I wasn't sure how to research this topic (beyond standard datetime docs), but I'd be happy to RTM if someone can point me to some resources. Any suggestions are welcome (including optimizations of the code below). Regards, Serdar #!/usr/bin/env python import re, sys sourcefile = open(sys.argv[1],'r') pattern = re.compile(r'(?PmonthJanuary|February|March|April|May|June|July|August|September|October|November|December)\s(?Pday\d{1,2}),\s(?Pyear\d{4})') pattern2 = re.compile(r'Return to List') counter = 0 for line in sourcefile: x = pattern.search(line) break_point = pattern2.match(line) if x: counter +=1 print %s %d, %d == %s % ( x.group('month'), int(x.group('day')), int(x.group('year')), line ), elif break_point: break print counter sourcefile.close() I just found a simple, but nice, trick to make regexes less unlegible. Using substrings to represent sub-patterns. E.g. instead of: p = re.compile(r'(?PmonthJanuary|February|March|April|May|June|July|August|September|October|November|December)\s(?Pday\d{1,2}),\s(?Pyear\d{4})') write first: month = r'(?PmonthJanuary|February|March|April|May|June|July|August|September|October|November|December)' day = r'(?Pday\d{1,2})' year = r'(?Pyear\d{4})' then: p = re.compile( r%s\s%s,\s%s % (month,day,year) ) or even: p = re.compile( r%(month)s\s%(day)s,\s%(year)s % {'month':month,'day':day,'year':year} ) denis ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] advice on regex matching for dates?
On 12/12/2008, spir denis.s...@free.fr wrote: I just found a simple, but nice, trick to make regexes less unlegible. Using substrings to represent sub-patterns. E.g. instead of: [...] Another option is to use the re.VERBOSE flag. This allows you to put comments in your regular expression and use whitespace for formatting. See this article for an example: http://diveintopython.org/regular_expressions/verbose.html HTH! -- John. ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] advice on regex matching for dates?
Thanks to all for the replies! The FeedParser example is mind-bending (at least for this noob), but of course looks very powerful. And thanks for the tips on cleaning up the appearance of the code with sup-patterns. Those regex's can get hairy pretty fast. ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor