On Mon, Aug 13, 2007 at 10:16:26AM -0700, Scott Haneda wrote:
> I need an automated way to deal with the following:
>
> I get an email (plain text) from a client, it has some long urls in it, many
> actually. So some hard wrap, they are all usually wrapped in < url > style
>
> I need to take lines that are urls and url encode them, any suggestions?
>
> So I may get a file like this
>
>
> This is a test title
> <http://www.example.com/da/erwq/dsa/index.html?foo=bar&id=12>
>
> This is a test title #2
> <http://www.example.com/da/erwq/dsa/index.html?foo=bar&id=12&blue=red&green=
> blue>
>
> So I just want to alter the urls and url encode them, as a final step, I
> want to find and replace on:
> http://www.example.com to <http://www.example.com/index.html?jump=
The angle brackets definitely help; this would be a pain to do without
them. Here's a way to rejoin the wrapped URLs with a grep find and replace
(assuming that no URL is wrapped onto 3 or more lines).
Find
<(https?:.*?)(?:\r(.*))?>
Replace
<\1\2>
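For anyone who wants to try that pattern outside BBEdit, here is a minimal Perl sketch of the same substitution. It assumes line breaks arrive as \n rather than BBEdit's \r, and the helper name rejoin_urls is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: rejoin URLs wrapped onto at most two lines.
# Same pattern as the Find/Replace pair, with \n in place of \r.
sub rejoin_urls {
    my ($text) = @_;
    # $2 is undefined when the URL was not wrapped, so default it to ''.
    $text =~ s{<(https?:.*?)(?:\n(.*))?>}
              {'<' . $1 . (defined $2 ? $2 : '') . '>'}ge;
    return $text;
}

my $sample = "Title one\n<http://www.example.com/a?x=1>\n"
           . "Title two\n<http://www.example.com/b?x=1&y=\n2>\n";

print rejoin_urls($sample);
```

The second URL comes out on a single line as <http://www.example.com/b?x=1&y=2>, and the first is left alone.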
For the second part, you'll want to use a script of some kind, because the
original URL needs to be URL-encoded if it's going to be used in the query
string.
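To see why the encoding matters, here is a small sketch that percent-encodes by hand. It is only an approximation of what an encoder such as uri_escape does, it assumes plain ASCII input, and the name percent_encode is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical hand-rolled encoder: escape everything except
# letters, digits, and - . _ ~ (ASCII input assumed).
sub percent_encode {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $s;
}

my $inner = 'da/index.html?foo=bar&id=12';

# Unencoded: "&id=12" reads as a second parameter of the outer URL.
print 'http://www.example.com/index.html?jump=' . $inner . "\n";

# Encoded: the whole inner string travels as the value of "jump".
print 'http://www.example.com/index.html?jump=' . percent_encode($inner) . "\n";
```

In the second line of output, the ?, = and & of the inner part become %3F, %3D and %26, so the receiving script sees a single jump parameter.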
Here's a Perl script that does both tasks. It fixes URLs wrapped over any
number of lines.
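#!perl
use warnings;
use strict;
use URI::Escape;    # provides uri_escape

# read the whole input at once, so URLs can be matched across lines
local $/;
$_ = <>;

# fix line-wrapped URLs by deleting the newlines inside <...>
s/<(https?:.*?)>/my $url = $1; $url =~ tr,\n,,d; "<$url>"/sge;

# add index.html?jump= to URLs that don't already have it,
# URL-encoding everything after the host name
s{<(https?://[^/]+/)(?!index\.html\?jump=)(.*?)>}
 {"<${1}index.html?jump=" . uri_escape($2) . '>'}ge;

print;

__END__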
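The negative lookahead in the second substitution is what makes the script safe to run twice on the same text. Here is a minimal sketch of that, assuming URI::Escape is installed from CPAN; the helper name add_jump is made up for illustration.

```perl
use strict;
use warnings;
use URI::Escape;    # CPAN module, assumed to be installed

# Hypothetical helper wrapping the jump substitution.
sub add_jump {
    my ($text) = @_;
    $text =~ s{<(https?://[^/]+/)(?!index\.html\?jump=)(.*?)>}
              {"<${1}index.html?jump=" . uri_escape($2) . '>'}ge;
    return $text;
}

my $once  = add_jump("<http://www.example.com/da/index.html?foo=bar&id=12>\n");
my $twice = add_jump($once);

print $once;
print $once eq $twice
    ? "second pass changed nothing\n"
    : "second pass changed the text\n";
```

Without the (?!index\.html\?jump=) guard, the second pass would encode the already-encoded part again and turn every % into %25.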
HTH,
Ronald