On Fri, 28 May 2004, Andrew Gaffney wrote:

> I'm trying to write a regex to parse the following data. Each group is a string
> to parse.
>
> <td class="f3" colspan="2" width="48">05/28/04</td>
> <td class="f3" colspan="2" width="60"></td>
> <td class="f3" colspan="2" width="186">Purchase With Pin Pin</td>
> <td class="f3" colspan="2" align="right" width="78"></td>
> <td class="f3" colspan="2" align="right" width="78">$10.00<br>(pending)<a
> href='javascript: ShowHelp("PENDING TRANSACTION")'><img src="usb
> ank_files/help.gif" valign="middle" alt="Pending Transaction Help"
> border="0"></a></td>
> <td class="f3" align="right">$1,224.45</td>
>
> <td class="f3" colspan="2" width="48">05/27/04</td>
> <td class="f3" colspan="2" width="60"></td>
> <td class="f3" colspan="2" width="186">Purchase With Pin Shell Service Stlake
> St. Loumo</td>
> <td class="f3" colspan="2" align="right" width="78"></td>
> <td class="f3" colspan="2" align="right" width="78">$1.78</td>
> <td class="f3" align="right">$1,234.45</td>
>
> <td class="f3" colspan="2" width="48">05/21/04</td>
> <td class="f3" colspan="2" width="60"></td>
> <td class="f3" colspan="2" width="186">Atm Withdrawal One O'fallon Squo'fallon
> Mo 1</td>
> <td class="f3" colspan="2" align="right" width="78"></td>
> <td class="f3" colspan="2" align="right" width="78">$20.00</td>
> <td class="f3" align="right"><a href='javascript:
> ShowHelp("NOTE","RESTRICTEDFUNDSAMOUNT=$2.00","AVAILABLETRANSACTIONAMOUNT=$1,132.79")'>$
> 1,134.79</a></td>
>
> This is the regex I put together:
>
>      my $regex = '<td[^>]+?>(\d{2})/(\d{2})/(\d{2})</td>.+?';
>      $regex   .= '<td[^>]+?>(.*?)</td>.+?';
>      $regex   .= '<td[^>]+?>(.+?)</td>.+?';
>      $regex   .= '<td[^>]+?>(?:\$(\d+\.\d{2})).*?</td>.+?';
>      $regex   .= '<td[^>]+?>(?:\$(\d+\.\d{2})).*?</td>.+?';
>      $regex   .= '<td[^>]+?>.*?(?:\$(\d+\.\d{2})).*?</td>';
>
> The first field will always be in the form 'mm/dd/yy'. The second and third
> field need to be captured as they are. As for the fourth and fifth fields, only
> one will contain a value. The other one will be empty (nothing between
> <td></td>). The format is '$123.45' with the possibility of trailing HTML before
> the </td>. I only want the number without the $. The sixth field will contain a
> dollar amount like the fourth and fifth fields. It could be surrounded by HTML.
> Again, I only need the number without the $. What is wrong with the above regex?
> I am using it with the 's' modifier.
>
> --
> Andrew Gaffney
> Network Administrator
> Skyline Aeronautics, LLC.
> 636-357-1548
>
>
> --
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> <http://learn.perl.org/> <http://learn.perl.org/first-response>
>
>
>
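Two things seem to be missing, and the 6th field needs one more adjustment:

1) A '?' after the amount group in the 4th and 5th fields, because one of
the two is always empty.
2) A ',' in the character class that matches the amounts (to match
'1,234.45', for example). '\d+' alone stops at the comma.
3) In the 6th field, skip any leading tags with '(?:\s*<[^>]+>)*\s*' instead
of '.*?'. In your third sample the href contains '$2.00', so '.*?' would stop
there and capture '2.00' rather than '1,134.79'.

So the regex would be:

my $regex = '<td[^>]+?>(\d{2})/(\d{2})/(\d{2})</td>.+?';
$regex   .= '<td[^>]+?>(.*?)</td>.+?';
$regex   .= '<td[^>]+?>(.+?)</td>.+?';
$regex   .= '<td[^>]+?>(?:\$([\d,]+\.\d{2}))?.*?</td>.+?';
$regex   .= '<td[^>]+?>(?:\$([\d,]+\.\d{2}))?.*?</td>.+?';
$regex   .= '<td[^>]+?>(?:\s*<[^>]+>)*\s*\$([\d,]+\.\d{2}).*?</td>';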
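Here is a rough sketch of how such a pattern might be applied, in case it helps. The sample rows are two of yours with the mail line wraps joined back together. The variable names, the hash layout and the comma stripping are only my assumptions about what you want to do with the captures. The last piece of the pattern skips leading tags before the '$' (also an assumption of this sketch), so that the '$2.00' inside the href of the 05/21/04 row is not taken for the balance.

```perl
use strict;
use warnings;

# Two sample rows from the question, with the mail line wraps removed.
my $html = <<'END_HTML';
<td class="f3" colspan="2" width="48">05/27/04</td>
<td class="f3" colspan="2" width="60"></td>
<td class="f3" colspan="2" width="186">Purchase With Pin Shell Service Stlake St. Loumo</td>
<td class="f3" colspan="2" align="right" width="78"></td>
<td class="f3" colspan="2" align="right" width="78">$1.78</td>
<td class="f3" align="right">$1,234.45</td>

<td class="f3" colspan="2" width="48">05/21/04</td>
<td class="f3" colspan="2" width="60"></td>
<td class="f3" colspan="2" width="186">Atm Withdrawal One O'fallon Squo'fallon Mo 1</td>
<td class="f3" colspan="2" align="right" width="78"></td>
<td class="f3" colspan="2" align="right" width="78">$20.00</td>
<td class="f3" align="right"><a href='javascript: ShowHelp("NOTE","RESTRICTEDFUNDSAMOUNT=$2.00","AVAILABLETRANSACTIONAMOUNT=$1,132.79")'>$1,134.79</a></td>
END_HTML

my $regex = '<td[^>]+?>(\d{2})/(\d{2})/(\d{2})</td>.+?';
$regex   .= '<td[^>]+?>(.*?)</td>.+?';
$regex   .= '<td[^>]+?>(.+?)</td>.+?';
$regex   .= '<td[^>]+?>(?:\$([\d,]+\.\d{2}))?.*?</td>.+?';
$regex   .= '<td[^>]+?>(?:\$([\d,]+\.\d{2}))?.*?</td>.+?';
# Sixth field: skip leading tags such as <a href=...> before the amount.
$regex   .= '<td[^>]+?>(?:\s*<[^>]+>)*\s*\$([\d,]+\.\d{2}).*?</td>';

my @rows;
# /g walks through every record in the string, /s lets '.' cross newlines.
while ($html =~ /$regex/gs) {
    my ($mm, $dd, $yy, $second, $desc, $fourth, $fifth, $balance)
        = ($1, $2, $3, $4, $5, $6, $7, $8);

    # Only one of the 4th/5th fields holds an amount; the other capture is undef.
    my $amount = defined $fourth ? $fourth : $fifth;

    # Strip thousands separators so the values can be used as numbers.
    for my $value ($amount, $balance) {
        $value =~ tr/,//d if defined $value;
    }

    push @rows, {
        date    => "$mm/$dd/$yy",
        second  => $second,
        desc    => $desc,
        amount  => $amount,
        balance => $balance,
    };
}

printf "%s  %-45s %10s %10s\n", @{$_}{qw(date desc amount balance)} for @rows;
```

For anything beyond a quick scrape, a real HTML parser from CPAN is likely to be more robust than one long regex, since a small change in the page layout will make the whole pattern stop matching.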

