Re: Complex regex help

2006-12-01 Thread D. Bolliger
Omega -1911 wrote on Friday, 1 December 2006, 06:05:
 Hello all,

 I am trying to parse calendar events for a rss feed into variables. Can
 someone help with building the following regex or point me in the direction
 of some good examples? Thanks in advance.

 Here is what I have tried:  (I don't know much about complex regex's as you
 see)
 $mystring =~ /.+(<p><li><b>)(\w+) (<FONT COLOR=\"\#99\">)(\w+)(\[Ref
 \#(\d+\])(.+)$/);


 Here is a sample string:
 <p><li><b> DATE <FONT COLOR=#99>TITLE</FONT></b> EVENT <a href=
 "http://www.mysite.com" target=_new;>www.mysite.com</a> [Ref #67579]</li>

 What I would like to pull out is the TITLE & EVENT information. The sample
 string is the format for each event. Any takers on this? Again, thanks for
 any help.

If you *really* want to do it with a regex and not a parser (XML::LibXML,
XML::Simple, etc.), here is one possibility.

However, note that a regex is very fragile when the format changes or the
input contains unexpected characters. In the regex below, I try to be
flexible concerning white space in the input; one could also be more specific
in the part following the info to extract.

There are generally two somewhat contradictory aims:
- be specific enough not to match unwanted content
- be liberal enough to handle format changes

How did you develop the regex? It does not seem to match the way you wanted.
One way is to build it step by step: start by matching the strings between
<p> and </p>, check the result, make the pattern more specific, check again,
and so on.
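
A minimal sketch of that step-by-step approach, assuming a record shaped like the sample string in the question. The variable names and the intermediate patterns are illustrative only, not the regex developed further down:

```perl
use strict;
use warnings;

# Assumed sample record, modelled on the string quoted in the question.
my $input = '<p><li><b> DATE <FONT COLOR=#99>TITLE1</FONT></b> EVENT1 '
          . '<a href="http://www.mysite.com">www.mysite.com</a> '
          . '[Ref #67579]</li></p>';

# Step 1: the loosest useful pattern, one capture per <p>...</p> record.
my @records = $input =~ m{<p>(.*?)</p>}gsi;
print scalar(@records), " record(s) found\n";

# Steps 2 and 3: tighten one piece at a time and check each result.
my @pairs;
for my $rec (@records) {
    my ($title) = $rec =~ m{<font[^>]*>\s*(.*?)\s*</font>}si;  # inside <font>
    my ($event) = $rec =~ m{</b>\s*(.*?)\s*<a\b}si;             # </b> to <a
    push @pairs, [ $title, $event ];
    print "title=[$title] event=[$event]\n";
}
```

Printing after every step makes it obvious which refinement stops matching.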

Note that I escape the '#' in the regex because of the /x modifier that allows 
comments.
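
To illustrate the point about '#' under /x, here is a small sketch; the sample tag is an assumption based on the string in the question. Without the backslash, everything after the '#' on that line would be treated as a comment:

```perl
use strict;
use warnings;

my $tag = '<FONT COLOR=#99>TITLE</FONT>';

# Under /x an unescaped '#' starts a comment, so a literal hash sign
# has to be written as \# (a character class, [#], also works).
my ($title) = $tag =~ m{
    <font \s+ color \s* = \s* \#99 >   # opening tag with a literal hash
    (.*?)                              # the text inside the tag
    </font>
}xi;

print "$title\n";
```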

BEWARE: I did not spend hours on this. It just extracts what you want from
the sample $input below.

#!/usr/bin/perl
use strict; use warnings;

my $input='
<p><li><b> DATE <FONT COLOR=#99>TITLE1</FONT></b> EVENT1
<a href="http://www.mysite.com" target=_new;>www.mysite.com</a>
[Ref #67579]</li></p>
<p><li><b> DATE <FONT COLOR=#99>TITLE2</FONT></b> EVENT2
<a href="http://www.mysite.com" target=_new;>www.mysite.com</a>
[Ref #67579]</li></p>
';


# In list context with /g, every match yields a (title, event) pair.
my %info = $input =~ m;
  <p>\s*
<li>\s*
  <b>.*?
<font\s*color\s*=\s*\#99[^>]*?>\s*(.*?)\s*</font>\s*
  </b>\s*(.*?)\s*<a.*?</a>\s*\[ref[^\]]+?\]\s*
</li>\s*
  </p>
;mgxsi;

print map { "$_ = $info{$_}\n" } sort keys %info;

__END__

Dani

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response




Re: Complex regex help

2006-12-01 Thread Rob Dixon

Omega -1911 wrote:

Hello all,

I am trying to parse calendar events for a rss feed into variables. Can
someone help with building the following regex or point me in the direction
of some good examples? Thanks in advance.

Here is what I have tried:  (I don't know much about complex regex's as you
see)
$mystring =~ /.+(<p><li><b>)(\w+) (<FONT COLOR=\"\#99\">)(\w+)(\[Ref
\#(\d+\])(.+)$/);


Here is a sample string:
<p><li><b> DATE <FONT COLOR=#99>TITLE</FONT></b> EVENT <a href=
"http://www.mysite.com" target=_new;>www.mysite.com</a> [Ref #67579]</li>

What I would like to pull out is the TITLE & EVENT information. The sample
string is the format for each event. Any takers on this? Again, thanks for
any help.


Hi Dave

Better than using regexes to extract the information, which are notoriously poor
at processing HTML, would be to use one of the bespoke HTML parsing modules.
My preference is HTML::TreeBuilder, which builds a structure of HTML::Element
objects to represent the original document. From that it is easy to extract the
parts you need according to their context.

Can you let us have a URL for the information so that we can help you a little
better? Or at least an example with several records that you need to process.

Rob






Re: Complex regex help

2006-12-01 Thread Omega -1911

Hi Rob & Dani,

Thanks for your help!!! I will try the suggestion you made Rob and as soon
as I finish typing this, I'll try Dani's code. I had someone by the name of
Chen Ken contact me off-list and provided me with the following regex that
appeared to work. Please let me know what you think:

my( $title,  $event) = $data_string =~ m|([^>]*)(?:</FONT></b>)([^\]]*)([^<]*)|;


Re: Complex regex help

2006-12-01 Thread D. Bolliger
Omega -1911 wrote on Friday, 1 December 2006, 19:01:
 Hi Rob & Dani,

Hello Omega

 Thanks for your help!!! I will try the suggestion you made Rob and as soon
 as I finish typing this, I'll try Dani's code. I had someone 
 contact me off-list and provided me with the following regex that 
 appeared to work. Please let me know what you think:

 my( $title,  $event) = $data_string =~
 m|([^>]*)(?:</FONT></b>)([^\]]*)([^<]*)|;

First I'd like to emphasize that Rob's suggestion (use a parser module, not a 
regex) is really the preferred way. Consequently, I should not have mentioned 
a regex...

I see at least the following problems with the above regex
(there are others, as well as in mine):

- It captures three pieces of the input, while on the left side there are
  only two variables to put the pieces in.
- When I run it, it captures too much into $event (everything from the event
  up to the Ref # value, without the trailing ']', which ends up in $3).
- ...

In short: A parser module will avoid a lot of trickiness, pitfalls, error
proneness, problems [pick the right English term(s), if any] :-)

Just forget the regex path.

Dani





Re: Complex regex help

2006-12-01 Thread Rob Dixon

Omega -1911 wrote:


Hi Rob & Dani,

Thanks for your help!!! I will try the suggestion you made Rob and as soon
as I finish typing this, I'll try Dani's code. I had someone by the name of
Chen Ken contact me off-list and provided me with the following regex that
appeared to work. Please let me know what you think:

my( $title,  $event) = $data_string =~
m|([^>]*)(?:</FONT></b>)([^\]]*)([^<]*)|;


Hello Dave

You will need help to use HTML::TreeBuilder as it's fairly complex, and to help
you we need fuller information on the HTML you're processing. Can you publish a
bigger chunk? Or, better still, the URL where it is coming from?

The regex doesn't look right at all. The (?: .. ) around the closing font and
bold tags has no effect, and the ] in the character class needn't be escaped.
Apart from that, it will grab everything from EVENT up to the end of the Ref #
value into $event, and the closing ] into $3, which is then discarded. Not good
at all.
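
A small sketch that makes those captures visible. It assumes the angle brackets lost from the archived copy of the regex were `[^>]`, `</FONT></b>` and `[^<]`, and that the input looks like the sample string from the question; the third variable, $rest, is added here only to show what lands in $3:

```perl
use strict;
use warnings;

# Assumed input, modelled on the sample string in the question.
my $data_string = '<p><li><b> DATE <FONT COLOR=#99>TITLE</FONT></b> EVENT '
                . '<a href="http://www.mysite.com">www.mysite.com</a> '
                . '[Ref #67579]</li>';

# The off-list regex as reconstructed above, with a third variable
# so that the third capture is not silently dropped.
my ($title, $event, $rest) =
    $data_string =~ m|([^>]*)(?:</FONT></b>)([^\]]*)([^<]*)|;

print "title: [$title]\n";
print "event: [$event]\n";   # runs on through the link and the Ref number
print "third: [$rest]\n";    # only the closing bracket
```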

Against my better judgement I could offer

 my @stuff = $data =~ />\s*([^<]+)\s*</g;

which will return all the text between the HTML tags, but this will fall down if
you have something like <i>...</i> in the middle of one of the fields, which
will result in the text being broken into multiple segments. Better all round to
use a proper parser.
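
A hedged sketch of that failure mode, assuming the one-liner reads `/>\s*([^<]+)\s*</g` once the stripped angle brackets are restored. The helper name text_between_tags and both sample strings are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the non-empty text runs found between
# one tag's '>' and the next tag's '<'.
sub text_between_tags {
    my ($html) = @_;
    my @chunks = $html =~ />\s*([^<]+)\s*</g;
    s/\s+\z// for @chunks;      # [^<]+ is greedy, so trim trailing space
    return grep { length } @chunks;
}

my @plain = text_between_tags(
    '<li><b> DATE <font>TITLE</font></b> EVENT <a href="x">link</a></li>');

# The same record with inline markup inside the event field.
my @nested = text_between_tags(
    '<li><b> DATE <font>TITLE</font></b> BIG <i>loud</i> EVENT <a href="x">link</a></li>');

print scalar(@plain),  " fields: @plain\n";
print scalar(@nested), " fields: @nested\n";
```

The second call returns six pieces instead of four, because the event text is split at the inner tags.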

HTH,

Rob






Re: Complex Regex.

2001-04-24 Thread Paul


--- [EMAIL PROTECTED] wrote:
 I thought I was improving at expressions but this one has me stumped:
 
 I have text interspersed with numbers.  The text can be anything,
 including all types of punctuation marks.
 
 Well let me give an example:  
 
 The Text has numbers in it apparently-1.0 at random but 12 actually
 not...
 $%  12.3   0.9  .333  33 and -33.909  I need to extract
 ut(y2335.09u see
 
 all the legiti5.33mate integers and floating point numbers and
 separate them
 with CRs.  All text and nun-numeric data should be chucked.  it needs
 to
 look for negative numbers but 2-1 and 4-66.7  are 2  1  4  66.7.  
 That is
 - is a delimiter unless there's a space in front in which case the
 number is
 negative.
 
 The output for the text above should be:
 
 -1.0  
 12
 12.3
 0.9
 .333
 33
 -33.909  
 2335.09
 5.33
 2
 1
 4
 66.7
 2
 1
 2
 66.7
 
 Any help on the above would be GREATLY appreciated.  

Okay, I got this to work. 
here's the output:

The text:
 
The Text has numbers in it apparently-1.0 at random but 12 actually
not...
$%  12.3   0.9  .333  33 and -33.909  I need to extract
ut(y2335.09u see
all the legiti5.33mate integers and floating point numbers and separate

them
with CRs.  All text and nun-numeric data should be chucked.  it needs
to
look for negative numbers but 2-1 and 4-66.7  are 2  1  4  66.7.  
That 
is
- is a delimiter unless there's a space in front in which case the
number 
is
negative.
 
The output:
1.0
12
12.3
0.9
.333
33
-33.909
2335.09
5.33
2
1
4
66.7
2
1
4
66.7

Here's the code:

 use strict;                        # good habit
 open (FILE,$ARGV[0]) or die $!;    # as a script, processes first arg
 my($data,$result);                 # some working vars
 { local $/ = undef;                # set up to
   $data = <FILE>;                  # slurp the infile into $data
 }                                  # close the scope of the local()
 close(FILE);                       # close the arg file
 $result = join "\r\n", $data =~ /((?: -)?\d*[.]?\d+)/sog;
 $result =~ s/^ //omg;              # polish out the leading spaces
 print "The text:\n $data \nThe output:\n$result\n";

and the regex elaborated:

 /((?: -)?\d*[.]?\d+)/sog;

First, I check for negatives, which you said would have a
space-and-then-a-minus. I wrap them with a (?:), which doesn't make
another variable, but lets me group them. That way the return should
still be a neat array, but I can say (?: -)?, which means zero or one
 ' -'s. The space-minus becomes part of the pattern, but
anything-else-minus is ignored. 

The rest of the pattern is simple enough: any number of digits
(including none), followed by an optional decimal point, followed by one
or more digits. That gets (using zero for any digit here) 0.0, .0, and 0,
so we've covered 0.0,  -0.0, .0,  -.0, 0, and -0.

The join stacks them with CRLF's between, accomplishing everything but
erasing the leading space on negatives.

 $result =~ s/^ //omg;# polish out the leading spaces

So, we treat $result as multiple lines so that ^ matches after every
newline, and knock off the spaces. Done!
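
The same pattern can be tried without a file. This is only a sketch: the helper name extract_numbers and the shortened sample text are assumptions, and it returns a list instead of a CRLF-joined string:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the numbers found in the text. A minus
# sign only counts when a space comes directly before it.
sub extract_numbers {
    my ($text) = @_;
    my @nums = $text =~ /((?: -)?\d*[.]?\d+)/g;
    s/^ // for @nums;           # drop the space captured along with ' -'
    return @nums;
}

my @found = extract_numbers(
    'apparently-1.0 but 12 and -33.909, 2-1 and 4-66.7, .333');
print join("\n", @found), "\n";
```

Here "apparently-1.0" gives 1.0 because no space precedes the minus, while " -33.909" keeps its sign.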

Hope that helps, and feel free to ask questions!

Paul




Re: Complex Regex.

2001-04-24 Thread Timothy Kimball


Slight correction to my last post:

: my $float_re = qr{
: 
: \d+\.\d+  # Matches 2.3
:   | \d+\. # Matches 2.
:   |\.=d+  # Matches .2
:   | \d+   # Matches 2
: 
: }x; # x means extended regex syntax

In the third line of the first regex, = should be \ (a backslash), so that the
alternative reads \.\d+
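
With that correction applied, the pattern can be tried as follows. This is a sketch only: the test string is made up, and the comments inside the regex are reworded:

```perl
use strict;
use warnings;

my $float_re = qr{
      \d+\.\d+   # digits on both sides of the point, e.g. 2.3
    | \d+\.      # trailing point, e.g. 2.
    | \.\d+      # leading point, e.g. .2
    | \d+        # plain integer, e.g. 2
}x;

my $text    = '2.3 then 2. then .2 then 2';
my @matches = $text =~ /($float_re)/g;

print join(' | ', @matches), "\n";
```

The order of the alternatives matters: putting \d+ first would make 2.3 match as 2 and then .3.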

-- tdk



Re: Complex Regex.

2001-04-24 Thread David H. Adler

On Tue, Apr 24, 2001 at 01:38:45PM -0400, Timothy Kimball wrote:
 
 I've only been on the list a couple of days, and I've already seen a
 couple of questions about regexes matching numbers.

...and I don't remember anyone mentioning Damian Conway's mind-boggling
Regexp::Common module:

   By default, this module exports a single hash (%RE) that stores
   or generates commonly needed regular expressions (see the section
   on List of available patterns).

Which might simplify this greatly for numbers of any complexity.

dha
-- 
David H. Adler - [EMAIL PROTECTED] - http://www.panix.com/~dha/
All right!  So I'm the daughter of poison gas!
- Sybil Crane, The Big Bus