On Aug 20, 2013, at 7:02 AM, Natxo Asenjo wrote:

> hi,
> 
> for a nagios (monitoring system) check I need to scrape a web site
> (this is for a network device, a UPS, whatever). This particular
> device only offers some functionality through a web interface.
> 
> I get the content of the site using WWW::Mechanize after logging in
> (this is really simple using the submit_form method, by the way).
> 
> Then I save the text of the website in a variable like this:
> 
> my $text = $mech->text();
> 
> if ( $text =~ /critical alarm/i ) {
>    print "Bingo\n";
> }
> 
> This works, if I unplug something I get the critical alarm, I replug
> the stuff and the string does not match anymore.
> 
> $text has this (very long line):
> 
> APC | UPS Network Management Card 2Skip to Main ContentUPS Network
> Management Card 2Smart-UPS/Matrix Application 1user | English | Log
> Off | Help |    HomeStatusUPSNetworkControlUPSSecuritySession
> ManagementNetworkReset/Reboot ConfigurationPower SettingsShutdownUPS
> GeneralSelf-Test ScheduleSchedulingPowerChute ClientsSync ControlThird
> Party SupportEnergyWiseSecuritySession ManagementPing ResponseLocal
> UsersManagementDefault SettingsRemote
> UsersAuthenticationRADIUSFirewall ConfigurationActive PolicyActive
> RulesCreate/Edit PolicyLoad PolicyTestNetworkTCP/IPIPv4 SettingsIPv6
> SettingsPort SpeedDNS ConfigurationTestWebAccessSSL
> CertificateConsoleAccessSSH Host KeySNMPv1AccessAccess
> ControlSNMPv3AccessUser ProfilesAccess ControlFTP
> serverNotificationEvent ActionsBy EventBy
> GroupE-mailServerRecipientsSSL CertificatesTestSNMP TrapsTrap
> ReceiversTestRemote
> MonitoringGeneralIdentificationDate/TimeModeDaylight SavingsUser
> Config FileQuick LinksLogsSyslogServersSettingsTestTestsUPSNetworkLed
> BlinkLogsEventsLogReverse
> LookupSizeDataLogIntervalRotationSizeFirewallAboutUPSNetworkSupport
> Smart-UPS 1400 RM: 1 Critical Alarm PresentA site wiring fault exists.
> Recent Device Events  DateTimeEventMore Events ›   Knowledge Base |
> Schneider Electric Product Center | Schneider Electric Downloads ©
> 2012, Schneider Electric. All rights reserved.
> 
> I am only interested in the text '1 Critical Alarm PresentA site
> wiring fault exists'; is it possible to match this in a simple way? (In
> fact, the text after 'Critical Alarm Present' may vary; it would be
> awesome to be able to get that easily.) Otherwise I am afraid I will
> have to start parsing HTML with HTML::TableExtract.


The text in $text is not HTML, so it looks like $mech->text() is stripping out 
the HTML tags, extracting just the text, and returning that to your program. 
That is fine, but as you have discovered, it loses some of the structure of the 
original HTML page.

If you want to extract the text following 'Critical Alarm Present' no matter 
what it is, you can do this:

  if( $text =~ /Critical Alarm Present(.*)/i ) {
    my $message = $1;
    # process or print message
  }

That will give you the rest of the string, which is more than what you want. If 
the alarm message is always followed by 'Recent Device Events', then you can 
improve the above to this:

  if( $text =~ /Critical Alarm Present\s*(.*?)\s*Recent Device Events/i ) {
    my $message = $1;
    # process or print message
  }

The non-greedy .*? stops at the first 'Recent Device Events', and the \s* on 
either side keeps stray whitespace out of $1. That will give you only 
'A site wiring fault exists.' in your sample case.
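
To try the extraction outside the monitoring script, here is a minimal sketch that runs it against a shortened copy of your sample text. The helper name extract_alarm is made up for illustration, and the /s modifier is an extra precaution in case the real page text contains newlines.

```perl
use strict;
use warnings;

# Abbreviated copy of the sample text from the original post (one long line).
my $text = 'LookupSizeDataLogIntervalRotationSizeFirewallAboutUPSNetworkSupport '
         . 'Smart-UPS 1400 RM: 1 Critical Alarm PresentA site wiring fault exists. '
         . 'Recent Device Events  DateTimeEventMore Events';

# extract_alarm() is a hypothetical helper name, not part of WWW::Mechanize.
sub extract_alarm {
    my ($page_text) = @_;

    # .*? is non-greedy, so the capture stops at the first 'Recent Device Events';
    # the \s* on both sides keeps surrounding whitespace out of the capture.
    if ( $page_text =~ /Critical Alarm Present\s*(.*?)\s*Recent Device Events/is ) {
        return $1;
    }
    return;    # no match: undef in scalar context
}

my $message = extract_alarm($text);
print defined $message ? "Alarm: $message\n" : "No alarm text found\n";
```

Keeping the match in a small sub makes it easy to feed it saved copies of the page text and see what it captures before wiring it into the check.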

If that regular expression does not match, it could mean one of two things:

1. 'Critical Alarm Present' does not occur in the string, or

2. 'Critical Alarm Present' does occur, but it is not followed by 'Recent 
Device Events'

Because of the second case, if the regular expression does not match, I would 
then test for just the string 'Critical Alarm Present' (the first if statement) 
so that I would know if that is present in the text regardless of what follows.
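
The two-step test described above could be sketched like this. The helper name check_alarm and the wording of the status lines are assumptions for illustration; the exit codes 0 (OK) and 2 (CRITICAL) are the standard Nagios plugin values.

```perl
use strict;
use warnings;

# Standard Nagios plugin exit codes.
use constant { OK => 0, CRITICAL => 2 };

# check_alarm() is a hypothetical helper; it returns (exit code, status line).
sub check_alarm {
    my ($text) = @_;

    # First try: alarm present and followed by the expected marker.
    if ( $text =~ /Critical Alarm Present\s*(.*?)\s*Recent Device Events/is ) {
        return ( CRITICAL, "CRITICAL - $1" );
    }

    # Fallback: alarm present, but the page layout is not what we expected.
    if ( $text =~ /Critical Alarm Present/i ) {
        return ( CRITICAL, 'CRITICAL - alarm present, details could not be extracted' );
    }

    return ( OK, 'OK - no critical alarm' );
}

my ( $code, $line ) = check_alarm(
    '1 Critical Alarm PresentA site wiring fault exists. Recent Device Events'
);
print "$line\n";
# A real plugin would finish with: exit $code;
```

The fallback branch means a change in the page layout still raises the alarm instead of silently reporting OK.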

You might actually have better results if you try parsing the original HTML, 
which $mech->content() returns. If the string you are looking for ('Critical 
Alarm Present') is part of a table, for example, then you can extract just the 
parts of that table that provide relevant information. It will be more work on 
your part, but you will end up with a more reliable solution.

Good luck.

