Re: Complex regex help
On Friday, 1 December 2006 at 06:05, Omega -1911 wrote:
> Hello all,
>
> I am trying to parse calendar events for an RSS feed into variables. Can
> someone help with building the following regex or point me in the direction
> of some good examples? Thanks in advance.
>
> Here is what I have tried (I don't know much about complex regexes, as you
> can see):
>
>   $mystring =~ /.+(<p><li><b>)(\w+) (<FONT COLOR=\\#99\>)(\w+)(\[Ref \#(\d+\])(.+)$/);
>
> Here is a sample string:
>
>   <p><li><b> DATE <FONT COLOR=#99>TITLE</FONT></b> EVENT <a href=http://www.mysite.com target=_new;>www.mysite.com</a> [Ref #67579]</li>
>
> What I would like to pull out is the TITLE and EVENT information. The sample
> string is the format for each event. Any takers on this? Again, thanks for
> any help.

If you *really* want to do it with a regex, and not a parser (XML::LibXML,
XML::Simple, etc.), here is one possibility. However, note that a regex is
very fragile when the format changes or the input contains unexpected
characters.

In the regex below, I try to be flexible about whitespace in the input; one
could also be more specific in the part following the info to extract. There
are generally two somewhat contradictory aims:

- be as specific as possible, so as not to match unwanted content
- be liberal, so as to handle format changes

How did you develop the regex? It does not seem to match the way you wanted.
One way is to build it step by step: start by matching the strings between
<p> and </p>, check, make it more specific, check again, and so on.

Note that I escape the '#' in the regex because the /x modifier allows
comments.

BEWARE: I did not spend hours on this. It just extracts what you want from
the $input shown.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $input = '
    <p><li><b> DATE <FONT COLOR=#99>TITLE1</FONT></b> EVENT1 <a href=http://www.mysite.com target=_new;>www.mysite.com</a> [Ref #67579]</li></p>
    <p><li><b> DATE <FONT COLOR=#99>TITLE2</FONT></b> EVENT2 <a href=http://www.mysite.com target=_new;>www.mysite.com</a> [Ref #67579]</li></p>
    ';

    # Each match returns (title, event), so the list fills the hash directly.
    my %info = $input =~ m;
        <p>\s* <li>\s* <b>.*?
        <font\s*color\s*=\s*\#99[^>]*?>\s*(.*?)\s*</font>\s*
        </b>\s*(.*?)\s*<a.*?</a>\s*\[ref[^\]]+?\]\s*
        </li>\s* </p>
    ;mgxsi;

    print map { "$_ = $info{$_}\n" } sort keys %info;
    __END__

Dani

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response
Re: Complex regex help
Hi Dave

Regexes are notoriously poor at processing HTML, so rather than using them to
extract the information it would be better to use one of the bespoke HTML
parsing modules. My preference is HTML::TreeBuilder, which builds a structure
of HTML::Element objects to represent the original document. From that it is
easy to extract the parts you need according to their context.

Can you let us have a URL for the information so that we can help you a
little better? Or at least an example with several records that you need to
process.

Rob
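A minimal sketch of the parser route suggested above, assuming HTML::TreeBuilder is installed from CPAN, that each event is an <li> element whose title sits inside a <font> element, and that the event text is the first bare text directly inside the <li>. The sample markup and the name extract_events are illustrative assumptions, not the poster's real feed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;    # CPAN module, not part of core Perl

# Hypothetical sample input, modelled on the format described in the thread.
my $html = <<'HTML';
<p><li><b> DATE <FONT COLOR=#99>TITLE1</FONT></b> EVENT1 <a href="http://www.mysite.com">www.mysite.com</a> [Ref #67579]</li></p>
<p><li><b> DATE <FONT COLOR=#99>TITLE2</FONT></b> EVENT2 <a href="http://www.mysite.com">www.mysite.com</a> [Ref #67580]</li></p>
HTML

# Returns a list of [title, event] pairs, one per <li> element.
sub extract_events {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);
    my @events;
    for my $li ( $tree->look_down( _tag => 'li' ) ) {
        my $font = $li->look_down( _tag => 'font' ) or next;
        my $title = $font->as_text;

        # Text nodes are plain strings; child elements are references.
        my ($event) = grep { !ref($_) && /\S/ } $li->content_list;
        $event = '' unless defined $event;

        s/^\s+|\s+$//g for $title, $event;    # trim surrounding whitespace
        push @events, [ $title, $event ];
    }
    $tree->delete;    # free the tree explicitly
    return @events;
}

print "$_->[0] = $_->[1]\n" for extract_events($html);
```

Because the lookup is by element rather than by character position, extra attributes or whitespace changes in the markup should not break it, which is the main argument for a parser here.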
Re: Complex regex help
Hi Rob and Dani,

Thanks for your help!!! I will try the suggestion you made, Rob, and as soon
as I finish typing this I'll try Dani's code. Someone by the name of Chen Ken
contacted me off-list and provided the following regex, which appeared to
work. Please let me know what you think:

    my( $title, $event) = $data_string =~ m|([^>]*)(?:</FONT></b>)([^\]]*)([^<]*)|;
Re: Complex regex help
On Friday, 1 December 2006 at 19:01, Omega -1911 wrote:
> Hi Rob and Dani,

Hello Omega

> Thanks for your help!!! I will try the suggestion you made, Rob, and as soon
> as I finish typing this I'll try Dani's code. I had someone contact me
> off-list who provided me with the following regex, which appeared to work.
> Please let me know what you think:
>
>   my( $title, $event) = $data_string =~ m|([^>]*)(?:</FONT></b>)([^\]]*)([^<]*)|;

First I'd like to emphasize that Rob's suggestion (use a parser module, not a
regex) is really the preferred way. Consequently, I should not have mentioned
a regex...

I see at least the following problems with the above regex (there are others,
as there are in mine):

- It captures three pieces of the input, while there are only two variables
  on the left side to put the pieces in.
- When I run it, it matches too much into $event (the event up to the Ref #,
  without the trailing ']'; the latter is put in $3).
- ...

In short: a parser module will avoid a lot of trickiness, pitfalls,
error-proneness, problems [pick the right English term(s), if present] :-)

Just forget the regex path.

Dani
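To make the two objections above concrete, here is a small sketch that runs the off-list regex against one record and prints all three captures. Both the sample record and the exact form of the regex are reconstructions (the archive lost the angle brackets), so treat them as assumptions.

```perl
use strict;
use warnings;

# Assumed sample record in the format described earlier in the thread.
my $data_string = '<p><li><b> DATE <FONT COLOR=#99>TITLE</FONT></b> EVENT '
    . '<a href=http://www.mysite.com>www.mysite.com</a> [Ref #67579]</li>';

# Assumed reconstruction of the off-list regex: three capture groups.
my @captures = $data_string =~ m|([^>]*)(?:</FONT></b>)([^\]]*)([^<]*)|;

# List assignment to two variables silently drops the third capture.
my ( $title, $event ) = @captures;

printf "captures: %d\n", scalar @captures;
print "title = [$title]\n";
print "event = [$event]\n";
print "third = [$captures[2]]\n";
```

With this input the second capture runs from the event text through the link and up to the Ref number, and the third capture holds only the closing bracket, which matches the behaviour described in the message.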
Re: Complex regex help
Omega -1911 wrote:
> I had someone by the name of Chen Ken contact me off-list who provided me
> with the following regex, which appeared to work. Please let me know what
> you think:
>
>   my( $title, $event) = $data_string =~ m|([^>]*)(?:</FONT></b>)([^\]]*)([^<]*)|;

Hello Dave

You will need help to use HTML::TreeBuilder, as it's fairly complex, and to
help you we need fuller information on the HTML you're processing. Can you
publish a bigger chunk? Or, better still, the URL it is coming from?

The regex doesn't look right at all: the (?: ... ) around the closing font
and bold tags has no effect, and the ] in the character class needn't be
escaped. Apart from that, it will grab everything from EVENT up to the end of
the Ref # value into $event, and the closing ] into $3, which is then
discarded. Not good at all.

Against my better judgement I could offer

    my @stuff = $data =~ />\s*([^<]+)\s*</g;

which will return all the text between the HTML tags, but this will fall down
if you have something like <i>...</i> in the middle of one of the fields,
which will result in the text being broken into multiple segments. Better all
round to use a proper parser.

HTH,

Rob
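A short sketch of the failure mode described above, using the text-between-tags regex with the angle brackets restored (an assumed reconstruction). The two sample records and the helper name text_between_tags are made up for illustration; the second record has inline markup inside the event text.

```perl
use strict;
use warnings;

# Returns the trimmed text segments found between tags.
sub text_between_tags {
    my ($data) = @_;
    my @stuff = $data =~ />\s*([^<]+)\s*</g;
    s/\s+$// for @stuff;    # the greedy [^<]+ keeps trailing whitespace
    return @stuff;
}

# Assumed sample records; only the second has <i>...</i> inside the event.
my $plain  = '<li><b> DATE <FONT COLOR=#99>TITLE</FONT></b> EVENT <a href=x>site</a> [Ref #1]</li>';
my $nested = '<li><b> DATE <FONT COLOR=#99>TITLE</FONT></b> Big <i>annual</i> picnic <a href=x>site</a> [Ref #1]</li>';

print join( ' | ', text_between_tags($plain) ),  "\n";
print join( ' | ', text_between_tags($nested) ), "\n";
```

The first record yields five segments and the second seven, because the event text is split at the inline tags, so positions in the returned list can no longer be relied on.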
Re: Complex Regex.
--- [EMAIL PROTECTED] wrote:
> I thought I was improving at expressions but this one has me stumped:
>
> I have text interspersed with numbers. The text can be anything, including
> all types of punctuation marks. Well, let me give an example:
>
>   The Text has numbers in it apparently-1.0 at random but 12 actually not...
>   $% 12.3 0.9 .333 33 and -33.909 I need to extract ut(y2335.09u see all the
>   legiti5.33mate integers and floating point numbers and separate them with
>   CRs. All text and nun-numeric data should be chucked. it needs to look for
>   negative numbers but 2-1 and 4-66.7 are 2 1 4 66.7. That is - is a
>   delimiter unless there's a space in front in which case the number is
>   negative.
>
> The output for the text above should be:
>
>   -1.0
>   12
>   12.3
>   0.9
>   .333
>   33
>   -33.909
>   2335.09
>   5.33
>   2
>   1
>   4
>   66.7
>   2
>   1
>   2
>   66.7
>
> Any help on the above would be GREATLY appreciated.

Okay, I got this to work. Here's the output:

The text:
 The Text has numbers in it apparently-1.0 at random but 12 actually not...
 $% 12.3 0.9 .333 33 and -33.909 I need to extract ut(y2335.09u see all the
 legiti5.33mate integers and floating point numbers and separate them with
 CRs. All text and nun-numeric data should be chucked. it needs to look for
 negative numbers but 2-1 and 4-66.7 are 2 1 4 66.7. That is - is a
 delimiter unless there's a space in front in which case the number is
 negative.

The output:
1.0
12
12.3
0.9
.333
33
-33.909
2335.09
5.33
2
1
4
66.7
2
1
4
66.7

Here's the code:

    use strict;                        # good habit
    open (FILE,$ARGV[0]) or die $!;    # as a script, processes first arg
    my($data,$result);                 # some working vars
    { local $/ = undef;                # set up to
      $data = <FILE>;                  # slurp the infile into $data
    }                                  # close the scope of the local()
    close(FILE);                       # close the arg file
    $result = join "\r\n", $data =~ /((?: -)?\d*[.]?\d+)/sog;
    $result =~ s/^ //omg;              # polish out the leading spaces
    print "The text:\n $data \nThe output:\n$result\n";

and the regex elaborated:

    /((?: -)?\d*[.]?\d+)/sog;

First, I check for negatives, which you said would have a
space-and-then-a-minus. I wrap them with a (?:), which doesn't make another
variable, but lets me group them. That way the return should still be a neat
array, but I can say (?: -)?, which means zero or one ' -'s. The space-minus
becomes part of the pattern, but anything-else-minus is ignored.

The rest of the pattern is simple enough: any number of digits (including
zero) followed by one-or-no decimal point, followed by one or more digits.
That gets (using zero for any digit here) 0.0, .0, and 0, so we've covered
0.0, -0.0, .0, -.0, 0, and -0.

The join stacks them with CRLFs between, accomplishing everything but erasing
the leading space on negatives.

    $result =~ s/^ //omg;              # polish out the leading spaces

So, we treat $result as multiple lines so that ^ matches after every newline,
and knock off the spaces. Done!

Hope that helps, and feel free to ask questions!

Paul
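The pattern explained above can be tried on a shorter string without a file. This is only a sketch: the helper name extract_numbers and the sample sentence are assumptions made for illustration, while the regex itself is the one from the post.

```perl
use strict;
use warnings;

# Returns every number in $text, honouring the rule that '-' only marks a
# negative number when a space precedes it.
sub extract_numbers {
    my ($text) = @_;
    my @numbers = $text =~ /((?: -)?\d*[.]?\d+)/g;
    s/^ // for @numbers;    # drop the space captured along with ' -'
    return @numbers;
}

my $sample = 'apparently-1.0 but 12 and -33.909, then 2-1 and 4-66.7, also .333';
print "$_\n" for extract_numbers($sample);
```

Trimming each element of the list, rather than the joined string, avoids the need for the /m modifier on the clean-up substitution.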
Re: Complex Regex.
Slight correction to my last post:

: my $float_re = qr{
:
:     \d+\.\d+    # Matches 2.3
:   | \d+\.       # Matches 2.
:   | \.=d+       # Matches .2
:   | \d+         # Matches 2
:
: }x;             # x means extended regex syntax

In the third line of the first regex, "=" should be "\", so that the line
reads \.\d+ and matches .2 as the comment says.

-- tdk
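With the correction applied, the pattern can be checked against the four forms named in its comments. A minimal sketch; the test string is made up for illustration.

```perl
use strict;
use warnings;

my $float_re = qr{
      \d+\.\d+    # Matches 2.3
    | \d+\.       # Matches 2.
    | \.\d+       # Matches .2
    | \d+         # Matches 2
}x;

# Collect every match; alternatives are tried in the order written above.
my @found = '2.3 and 2. and .2 and 2' =~ /($float_re)/g;
print join( ', ', @found ), "\n";
```

The order of the alternatives matters: if the bare \d+ came first, "2.3" would be reported as "2" followed by ".3".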
Re: Complex Regex.
On Tue, Apr 24, 2001 at 01:38:45PM -0400, Timothy Kimball wrote:
> I've only been on the list a couple of days, and I've already seen a couple
> of questions about regexes matching numbers.

...and I don't remember anyone mentioning Damian Conway's mind-boggling
Regexp::Common module:

    "By default, this module exports a single hash (%RE) that stores or
    generates commonly needed regular expressions (see the section on
    "List of available patterns")."

Which might simplify this greatly for numbers of any complexity.

dha

--
David H. Adler - [EMAIL PROTECTED] - http://www.panix.com/~dha/
"All right! So I'm the daughter of poison gas!" - Sybil Crane, The Big Bus
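A rough sketch of what using that module might look like for this thread's problem, assuming Regexp::Common is installed from CPAN. The helper name real_numbers and the sample string are assumptions; $RE{num}{real} is the module's stock pattern for real numbers.

```perl
use strict;
use warnings;
use Regexp::Common qw(number);    # CPAN module, not part of core Perl

# Returns every real number that $RE{num}{real} finds in $text.
sub real_numbers {
    my ($text) = @_;
    return $text =~ /($RE{num}{real})/g;
}

print "$_\n" for real_numbers('costs 12.3 or -33.909 or .333');
```

Note that the stock pattern accepts a sign wherever it appears, so it does not implement the earlier poster's rule that "-" counts as a minus only after a space; that rule would still need handling around the match.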