On 3/13/06, Martin <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am trying to get an attribute value from the given text. Kindly help me
> out in this regard.
>
> Input text:
>
> The Bill for this Act of the Scottish Parliament was passed by the Parliament
> on 15th December 2005 and received Royal Assent on 20th January 2006
>
> Output needed:
>
> <assent date="20060120">The Bill for this Act of the Scottish Parliament was
> passed by the Parliament on 15th December 2005 and received Royal Assent
> on 20th January 2006</assent>
>
> Note: The date attribute should be recovered from the highlighted text.
>
> Regards,
> Martin.
snip

You should avoid trying to send formatted text to a mailing list
(there is no highlighted text).  I assume you mean to get the date
from the text "received Royal Assent on 20th January 2006".  This is
easy enough to do with a regular expression and a couple of hashes to
map text like "January" to a formatted number like 01.  Your largest
problem (if this is a real world application and not a puzzle or
homework assignment) is that normal text is never this clean.

Here is how I would go about writing the regex needed to pull out the
information.

First we need to identify the parts of the string:
1. "received Royal Assent on "
2. "20"
3. "th"
4. " "
5. "January"
6. " "
7. "2006"

Part 1 seems to be constant.
Part 2 seems to be a one- or two-digit number representing the day of the month.
Part 3 seems to be irrelevant; I think we just want to get rid of it.
Part 4 seems to be constant.
Part 5 seems to be the month spelled out.
Part 6 seems to be constant.
Part 7 seems to be a four-digit year.

Next we need to identify the parts we want to capture.  We want the
day, month, and year so that would be parts 2, 5, and 7.  These will
need to be capture groups in the eventual regex.

Now that we know what the parts are, let's start writing some regexes to
match them.

Part 1 should just match the literal text: /received\sRoyal\sAssent\son\s/
Part 2 needs to match one or two digits (I am going to assume that
these will fall into the right range, 1 - 31, but the regex could take
this into account as well): /\d{1,2}/
Part 3 seems to be two characters long and we don't care about it.  It
should be /../ or possibly /.*?/ if it isn't always two characters
long.
Part 4 is just a space /\s/
Part 5 is either a set of constants or a word, depending on how much
validation you want to do:
/January|February|March|April|May|June|July|August|September|October|November|December/
or /\w+/ respectively.
Part 6 is another space: /\s/
Part 7 is a four digit year: /\d{4}/
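
Before gluing the pieces together it can help to check each one against
the part it is supposed to match.  Here is a minimal sketch of that,
assuming the sample parts from the breakdown above; the variable names
($prefix, $suffix, @checks, and so on) are mine and only there for
illustration.

```perl
use strict;
use warnings;

# Each piece is compiled on its own with qr// so it can be tested in isolation.
my $prefix = qr/received\sRoyal\sAssent\son\s/;
my $day    = qr/\d{1,2}/;
my $suffix = qr/../;        # the "th", "st", "nd" or "rd" we throw away
my $space  = qr/\s/;
my $month  = qr/\w+/;
my $year   = qr/\d{4}/;

# Sample parts from the breakdown, paired with the piece that should match.
my @checks = (
    [ 'received Royal Assent on ', $prefix ],
    [ '20',                        $day    ],
    [ 'th',                        $suffix ],
    [ ' ',                         $space  ],
    [ 'January',                   $month  ],
    [ ' ',                         $space  ],
    [ '2006',                      $year   ],
);

for my $check (@checks) {
    my ($sample, $re) = @$check;
    # \A and \z force the piece to match the whole sample, not just part of it
    my $status = $sample =~ /\A$re\z/ ? 'ok' : 'FAILED';
    print "$status: '$sample'\n";
}
```

Every line should print "ok"; a "FAILED" line points at the piece that
needs another look.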

Now we just need to combine the individual regexes and capture the
parts we want:

/received\sRoyal\sAssent\son\s(\d{1,2})..\s(\w+)\s(\d{4})/
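
As a rough sketch of how the combined regex behaves on the full input
text, here is the same pattern written with the /x modifier so that each
part can carry a comment.  The variable names are my own; the pattern
itself is unchanged.

```perl
use strict;
use warnings;

my $str = 'The Bill for this Act of the Scottish Parliament was passed by '
        . 'the Parliament on 15th December 2005 and received Royal Assent '
        . 'on 20th January 2006';

# In list context a successful match returns the captured groups in order.
my ($day, $month, $year) = $str =~ /
    received\sRoyal\sAssent\son\s
    (\d{1,2})   # part 2: day of the month, captured
    ..          # part 3: ordinal suffix, discarded
    \s
    (\w+)       # part 5: month name, captured
    \s
    (\d{4})     # part 7: year, captured
/x;

print "day=$day month=$month year=$year\n";   # prints day=20 month=January year=2006
```

Note that the earlier date in the text (15th December 2005) is ignored
because it is not preceded by "received Royal Assent on ".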

Now that we know what the regex is we can add the tags:

my %month = (
    January => '01',
    February => '02',
    March => '03',
    April => '04',
    May => '05',
    June => '06',
    July => '07',
    August => '08',
    September => '09',
    October => '10',
    November => '11',
    December => '12'
);

# map day numbers to zero-padded strings: 1 => '01' ... 31 => '31'
my %day = map { $_ => sprintf("%02d", $_) } 1 .. 31;

if ($str =~ /received\sRoyal\sAssent\son\s(\d{1,2})..\s(\w+)\s(\d{4})/) {
    # $1 is the day, $2 the month name, and $3 the year
    $str = qq(<assent date="$3$month{$2}$day{$1}">$str</assent>);
} else {
    print STDERR "could not find date this bill received Royal Assent: $str\n";
}
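
Since /\w+/ will accept any word as a month and /\d{1,2}/ will accept 0
or 99 as a day, you may want to validate the captures before building
the tag.  Below is one possible sketch of that, wrapped in a sub.  The
name tag_assent and the hash name %month_number are my own inventions
for illustration, and returning undef on failure is just one way to
signal "no date found".

```perl
use strict;
use warnings;

my %month_number = (
    January   => '01', February => '02', March    => '03', April    => '04',
    May       => '05', June     => '06', July     => '07', August   => '08',
    September => '09', October  => '10', November => '11', December => '12',
);

# tag_assent is a hypothetical helper name, not part of any module.
# It returns the wrapped string, or undef when no usable date is found.
sub tag_assent {
    my ($str) = @_;

    return unless $str =~ /received\sRoyal\sAssent\son\s(\d{1,2})..\s(\w+)\s(\d{4})/;
    my ($day, $month, $year) = ($1, $2, $3);

    # \w+ accepts any word, so reject anything that is not a known month name
    return unless exists $month_number{$month};

    # \d{1,2} accepts 0 and 32-99, so check the range explicitly
    return unless $day >= 1 && $day <= 31;

    my $date = sprintf '%04d%s%02d', $year, $month_number{$month}, $day;
    return qq(<assent date="$date">$str</assent>);
}

my $input  = 'passed on 15th December 2005 and received Royal Assent on 20th January 2006';
my $tagged = tag_assent($input);
print defined $tagged ? "$tagged\n" : "no assent date found\n";
```

This still does not check that the day is valid for the given month
(30th February would pass), so treat it as a starting point.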


