On 3/13/06, Martin <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am trying out to get a an attribute value from the given text. Kindly
> help me out in this regard.
>
> Input text:
>
> The Bill for this Act of the Scottish Parliament was passed by the
> Parliament on 15th December 2005 and received Royal Assent on 20th
> January 2006
>
> Output needed:
>
> <assent date="20060120">The Bill for this Act of the Scottish Parliament was
> passed by the Parliament on 15th December 2005 and received Royal Assent
> on 20th January 2006</assent>
>
> Note: The date attribute should be recovered from the highlighted text.
>
> Regards,
> Martin.
snip

You should avoid sending formatted text to a mailing list (there is no
highlighted text). I assume you mean to get the date from the text
"received Royal Assent on 20th January 2006". This is easy enough to do
with a regular expression and a couple of hashes to map text like
"January" to a formatted number like 01. Your largest problem (if this
is a real world application and not a puzzle or homework assignment) is
that normal text is never this clean.

Here is how I would go about writing the regex needed to pull out the
information.

First we need to identify the parts of the string:

1. "received Royal Assent on "
2. "20"
3. "th"
4. " "
5. "January"
6. " "
7. "2006"

Part 1 seems to be constant.
Part 2 seems to be a one or two digit number representing the day of
the month.
Part 3 seems to be irrelevant; I think we just want to get rid of it.
Part 4 seems to be constant.
Part 5 seems to be the month spelled out.
Part 6 seems to be constant.
Part 7 seems to be a four digit year.

Next we need to identify the parts we want to capture. We want the day,
month, and year, so that would be parts 2, 5, and 7. These will need to
be capture groups in the eventual regex.

Now that we know what the parts are, let's start writing some regexes to
match them.

Part 1 should just match the literal string:

    /received\sRoyal\sAssent\son\s/

Part 2 needs to match one or two digits (I am going to assume that these
will fall into the right range, 1 - 31, but the regex could take this
into account as well):

    /\d{1,2}/

Part 3 seems to be two characters long and we don't care about it. It
should be

    /../

or possibly

    /.*?/

if it isn't always two characters long.

Part 4 is just a space:

    /\s/

Part 5 is either a set of constants or a word, depending on how much
validation you want to do:

    /January|February|March|April|May|June|July|August|September|October|November|December/

or

    /\w+/

respectively.

Part 6 is another space:

    /\s/

Part 7 is a four digit year:

    /\d{4}/

Now we just need to combine the individual regexes and capture the parts
we want:

    /received\sRoyal\sAssent\son\s(\d{1,2})..\s(\w+)\s(\d{4})/

Now that we know what the regex is, we can add the tags:

    # map month names to two-digit month numbers
    my %month = (
        January   => '01',
        February  => '02',
        March     => '03',
        April     => '04',
        May       => '05',
        June      => '06',
        July      => '07',
        August    => '08',
        September => '09',
        October   => '10',
        November  => '11',
        December  => '12'
    );

    # map 1 .. 31 to zero-padded two-digit days
    my %day = map { $_ => sprintf("%2.2d", $_) } 1 .. 31;

    if ($str =~ /received\sRoyal\sAssent\son\s(\d{1,2})..\s(\w+)\s(\d{4})/) {
        $str = qq(<assent date="$3$month{$2}$day{$1}">$str</assent>);
    }
    else {
        print STDERR "could not find date this bill received Royal Assent: $str\n";
    }
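
If you want more validation than /\w+/ and /../ give you, a stricter
variant might look something like the sketch below. It assumes the
suffix is always one of st, nd, rd, or th (or missing), and that months
are spelled out in full with a capital first letter. The name
tag_assent() is made up for this example, and the day check is only a
crude 1 - 31 range test; it does not know that February is short.

```perl
use strict;
use warnings;

# Month names mapped to two-digit numbers.
my %month = (
    January => '01', February => '02', March     => '03',
    April   => '04', May      => '05', June      => '06',
    July    => '07', August   => '08', September => '09',
    October => '10', November => '11', December  => '12',
);

# Alternation of the known month names, so a misspelled month will not match.
my $month_re = join '|', map { quotemeta } keys %month;

# tag_assent() is a hypothetical helper name used only in this sketch.
# Returns the tagged string, or undef if no plausible date was found.
sub tag_assent {
    my ($str) = @_;

    return unless $str =~ /
        received \s+ Royal \s+ Assent \s+ on \s+
        (\d{1,2})              # day of month
        (?:st|nd|rd|th)?       # optional ordinal suffix, discarded
        \s+
        ($month_re)            # month name, validated against the hash
        \s+
        (\d{4})                # four digit year
    /x;

    my ($day, $name, $year) = ($1, $2, $3);
    return if $day < 1 or $day > 31;    # crude range check only

    return sprintf '<assent date="%s%s%02d">%s</assent>',
        $year, $month{$name}, $day, $str;
}

my $text = 'The Bill for this Act of the Scottish Parliament was passed by '
         . 'the Parliament on 15th December 2005 and received Royal Assent '
         . 'on 20th January 2006';

my $tagged = tag_assent($text);
print defined $tagged ? "$tagged\n" : "no assent date found\n";
```

Using sprintf for the day means you no longer need the %day hash, and
it copes with a day that is already written as "01". Whether the extra
strictness is worth it depends on how messy your real input is.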