Re: Yet Another Regex Problem

Jeff 'japhy' Pinyan Tue, 08 Jun 2004 06:17:41 -0700

On Jun 8, Francesco del Vecchio said:

>I have to find URLs in a text file (so, cannot use LWP or HTML parser)


I'm curious why you can't use a module to extract URLs, but I'll continue
anyway.

>/(http.:\/\/.*\s)/

That regex is broken in a few ways.  First, it does NOT match 'http:'.  The
. requires exactly one character between the p and the colon, so it only
matches something like 'https:'.  Second, the .* in it is greedy (it matches
as much as it can), so it runs on to the last whitespace on the line.
Third, it requires your URL to be followed by whitespace, which won't
always be the case (a URL at the very end of the string, for instance).
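
Here is a small sketch that shows those three failures.  The $secure and
$tail strings and the a.example / b.example hosts are made up for
illustration; only $string comes from the original question.

```perl
use strict;
use warnings;

my $string = "try to click here http://www.yahoo.com or there http://www.google.com";

# 1) 'http.:' needs one character between 'p' and ':', so plain http: fails.
my @broken = $string =~ /(http.:\/\/.*\s)/g;
print scalar(@broken), " match(es) in the original string\n";

# 2) With https URLs (hypothetical hosts) the greedy .* swallows everything
#    from the first URL up to the last whitespace on the line.
my $secure = "see https://a.example/ and https://b.example/ now";
my ($greedy) = $secure =~ /(http.:\/\/.*\s)/;
print "[$greedy]\n";

# 3) A URL at the very end of the string has no whitespace after it.
my $tail = "go to https://a.example/";
my @at_end = $tail =~ /(http.:\/\/.*\s)/g;
print scalar(@at_end), " match(es) at end of string\n";
```

The second print shows both URLs and the words between them captured as a
single "match", trailing space included.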

>"try to click here http://www.yahoo.com or there http://www.google.com";

I would suggest trying:

  @urls = $string =~ m{(https?://\S+)}g;

Using \S+ makes it match one or more non-whitespace characters.  The only
problem with this is that if there happens to be punctuation after the
URL, it'll get included.  An example is this:

  Go to http://www.yahoo.com, and you'll see what I mean.

That will match `http://www.yahoo.com,' (including the comma).
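
One possible workaround, offered as a sketch rather than a complete fix, is
to trim trailing punctuation after the match.  The set of characters to
trim is my own guess; a URL can legitimately end in some of them, so treat
this as a heuristic.

```perl
use strict;
use warnings;

my $string = "Go to http://www.yahoo.com, and you'll see what I mean.";

# Same match as suggested above: http or https, then non-whitespace.
my @urls = $string =~ m{(https?://\S+)}g;

# Assumption: these characters at the very end are sentence punctuation,
# not part of the URL.  Adjust the class to suit your data.
s/[.,;:!?)]+\z// for @urls;

print "$_\n" for @urls;
```

With the sample sentence this leaves the single URL without its comma.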

-- 
Jeff "japhy" Pinyan      [EMAIL PROTECTED]      http://www.pobox.com/~japhy/
RPI Acacia brother #734   http://www.perlmonks.org/   http://www.cpan.org/
CPAN ID: PINYAN    [Need a programmer?  If you like my work, let me know.]
<stu> what does y/// stand for?  <tenderpuss> why, yansliterate of course.

