On Mon, Aug 13, 2007 at 10:16:26AM -0700, Scott Haneda wrote:
> I need an automated way to deal with the following:
> 
> I get an email (plain text) from a client, it has some long urls in it, many
> actually.  So some hard wrap, they are all usually wrapped in < url > style
> 
> I need to take lines that are urls and url encode them, any suggestions?
> 
> So I may get a file like this
> 
> 
> This is a test title
> <http://www.example.com/da/erwq/dsa/index.html?foo=bar&id=12>
> 
> This is a test title #2
> <http://www.example.com/da/erwq/dsa/index.html?foo=bar&id=12&blue=red&green=
> blue>
> 
> So I just want to alter the urls and url encode them, as a final step, I
> want to find and replace on:
> http://www.example.com to <http://www.example.com/index.html?jump=

The angle brackets definitely help; it would be a pain to do this
without them.  Here's a way to rejoin the wrapped URLs with grep (assuming
that no URL is wrapped onto 3 or more lines).

Find

<(https?:.*?)(?:\r(.*))?>

Replace

<\1\2>
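
To try the same idea outside the editor, here is a rough Perl sketch of
that find/replace.  It assumes the text uses \n line endings (so \n stands
in for the \r in the pattern above), and rewrap_two_line_urls is just a
name made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: rejoin URLs in <...> that were wrapped onto
# at most two lines.  \n is used where the editor pattern uses \r.
sub rewrap_two_line_urls {
    my ($text) = @_;
    # $2 is undefined when the URL was not wrapped, so fall back to ''.
    $text =~ s{<(https?:.*?)(?:\n(.*))?>}
              {"<$1" . (defined $2 ? $2 : '') . ">"}ge;
    return $text;
}

my $sample = "This is a test title #2\n"
           . "<http://www.example.com/a?foo=bar&green=\nblue>\n";

print rewrap_two_line_urls($sample);
```

A URL that already sits on one line passes through unchanged, since the
optional second-line group simply doesn't match.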


For the second part, you'll want to use a script of some kind, because the
original URL needs to be URL-encoded if it's going to be used in the query
string.
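
To show why the encoding matters: left as-is, the ? and & in the original
URL would be read as part of the outer URL's own query string.  Below is a
minimal, dependency-free sketch of percent-encoding.  percent_encode is a
made-up name, it assumes plain ASCII input, and it is only an illustration
of the idea, not a replacement for a proper URL-escaping module.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encode every character that is not an
# RFC 3986 "unreserved" character (letters, digits, - . _ ~).
# Assumes ASCII input.
sub percent_encode {
    my ($str) = @_;
    $str =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $str;
}

my $original = 'da/index.html?foo=bar&id=12';

print 'http://www.example.com/index.html?jump=',
      percent_encode($original), "\n";
```

After encoding, the / ? = and & of the original URL become %2F %3F %3D
and %26, so the whole thing travels as a single value for jump.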

Here's a Perl script that does both tasks.  It fixes URLs wrapped over any
number of lines.

#!perl

use warnings;
use strict;

use URI::Escape;

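# undefine the input record separator so <> reads everything at once
local $/;

# slurp the whole input (files named on the command line, or STDIN)
$_ = <>;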

# fix line-wrapped URLs
s/<(https?:.*?)>/my $url = $1; $url =~ tr,\n,,d; "<$url>"/sge;

# add index.html?jump= to URLs that don't already have it,
# URL-encoding everything after the host's slash
s{<(https?://[^/]+/)(?!index\.html\?jump=)(.*?)>}
 {"<$1index.html?jump=" . uri_escape($2) . '>'}ge;

print;

__END__
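
As a quick check of the "any number of lines" part, the first substitution
can be tried on its own against a URL split across three lines.  This is
only a sketch: unwrap_urls is a name invented for the example, and it
assumes \n line endings.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the script's first substitution.
sub unwrap_urls {
    my ($text) = @_;
    # /s lets . match newlines, so one match spans the whole wrapped
    # URL; tr then deletes the newlines inside it.
    $text =~ s/<(https?:.*?)>/my $url = $1; $url =~ tr,\n,,d; "<$url>"/sge;
    return $text;
}

my $wrapped = "<http://www.example.com/a?x=1&\ny=2&\nz=3>\n";

print unwrap_urls($wrapped);
```

Newlines outside the angle brackets are left alone, so the rest of the
message keeps its layout.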


HTH,
Ronald

-- 
------------------------------------------------------------------
Have a feature request? Not sure the software's working correctly?
If so, please send mail to <[EMAIL PROTECTED]>, not to the list.
List FAQ: <http://www.barebones.com/support/lists/bbedit_talk.shtml>
List archives: <http://www.listsearch.com/BBEditTalk.lasso>
To unsubscribe, send mail to:  <[EMAIL PROTECTED]>