At 09:53 2002-07-02 -0500, Kenny G. Dubuisson, Jr. wrote:
>[...]I don't understand what it means by NetLoc and Realm[...]

Try hitting http://www.unicode.org/mail-arch/unicode-ml and it'll say "enter your username and password for Unicode-MailList-Archives". That "Unicode-MailList-Archives" string is the realm name. NetLoc is the hostname plus colon plus the port number, by default ":80" -- in this case, "www.unicode.org:80".

Here's an extract from chapter 11 of my new book, /Perl and LWP/ (<http://www.amazon.com/exec/obidos/ASIN/0596001789>) which you might find useful and worth buying:


Authenticating via LWP

To add a username and password to a browser object's key ring, call the credentials method on a user agent object:

    $browser->credentials(
      'servername:portnumber',
      'realm-name',
      'username' => 'password'
    );

In most cases, the port number is 80, the default TCP/IP port for HTTP. For example:

    use LWP::UserAgent;
    my $browser = LWP::UserAgent->new;
    $browser->agent('ReportsBot/1.01');   # sets the User-Agent header
    $browser->credentials(
      'reports.mybazouki.com:80',
      'web_server_usage_reports',
      'plinky' => 'banjo123'
    );
    my $response = $browser->get(
      'http://reports.mybazouki.com/this_week/'
    );

One can call the credentials method any number of times, to add all the server-port-realm-username-password keys to the browser's key ring, regardless of whether they'll actually be needed. For example, you could read them all in from a datafile at startup:

    my $browser = LWP::UserAgent->new( );
    if(open(KEYS, "< keyring.dat")) {
      while(<KEYS>) {
        chomp;
        # Four tab-separated fields: netloc, realm, username, password
        my @info = split "\t", $_, -1;
        $browser->credentials(@info) if @info == 4;
      }
      close(KEYS);
    }


Security

Clearly, storing lots of passwords in a plain text file is not terribly good security practice, but the obvious alternative is not much better: storing the same data in plain text in a Perl file. One could make a point of prompting the user for the information every time,* instead of storing it anywhere at all, but clearly this is useful only for interactive programs (as opposed to a program run by crontab, for example).

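As a rough illustration of the prompting approach just mentioned, here is a minimal sketch using only core Perl. The helper name prompt_for_credentials and its argument order are my own invention for this example, not anything in LWP, and the password is echoed as typed, so treat it as a starting point rather than a finished solution.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LWP): print two prompts on $out,
# read a username and a password from $in, and return them as a list.
# Returns an empty list if the input runs out before both are read.
sub prompt_for_credentials {
    my ($in, $out, $netloc, $realm) = @_;
    print {$out} "Username for '$realm' at $netloc: ";
    my $user = <$in>;
    print {$out} "Password (will be echoed): ";
    my $pass = <$in>;
    return unless defined $user and defined $pass;
    chomp($user, $pass);
    return ($user, $pass);
}

# Possible use with a $browser object like the ones above:
#   my @where = ('reports.mybazouki.com:80', 'web_server_usage_reports');
#   if (my ($user, $pass) = prompt_for_credentials(\*STDIN, \*STDERR, @where)) {
#       $browser->credentials(@where, $user, $pass);
#   }
```

Taking the filehandles as arguments keeps the helper easy to try out without a terminal; in a real program you would most likely pass \*STDIN and \*STDERR as in the commented usage.
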
In any case, HTTP Basic Authentication is not the height of security: the username and password are normally sent unencrypted. This and other security shortcomings with HTTP Basic Authentication are explained in greater detail in RFC 2617. See the Preface for information on where to get a copy of RFC 2617.

  * In fact, Ave Wrigley wrote a module to do exactly that. It's not part of the LWP distribution, but it's available in CPAN as LWP::AuthenAgent. The author describes it as "a simple subclass of LWP::UserAgent to allow the user to type in username/password information if required for authentication."


An HTTP Authentication Example: The Unicode Mailing Archive

Most password-protected sites (whether protected via HTTP Basic Authentication or otherwise) are that way because the sites' owners don't want just anyone to look at the content. And it would be a bit odd if I gave away such a username and password by mentioning it in this book! However, there is one well-known site whose content is password protected without being secret: the mailing list archive of the Unicode mailing lists.

In an effort to keep email-harvesting bots from finding the Unicode mailing list archive while spidering the Web for fresh email addresses, the Unicode.org sysadmins have put a password on that part of their site. But to allow people (actual not-bot humans) to access the site, the site administrators publicly state the password, on an unprotected page, at http://www.unicode.org/mail-arch/, which links to the protected part, but also states the username and password you should use.

The main Unicode mailing list (called unicode) once in a while has a thread that is really very interesting and that you really must read, but it's buried in a thousand other messages that are not even worth downloading, even in digest form.

Luckily, this problem meets a tidy solution with LWP: I've written a short program that, on the first of every month, downloads the index of all the previous month's messages and reports the number of messages that have each topic as their subject.

The trick is that the web pages that list this information are password protected. Moreover, the URL for the index of last month's posts is different every month, but in a fairly obvious way. The URL for March 2002, for example, is:

    http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/

Deducing the URL for the month that has just ended is simple enough:

    # To be run on the first of every month...
    use POSIX ('strftime');
    my $last_month = strftime("y%Y-m%m", localtime(time - 24 * 60 * 60));
    # Since today is the first, one day ago (24*60*60 seconds) is in
    # last month.
    my $url = "http://www.unicode.org/mail-arch/unicode-ml/$last_month/";

But getting the contents of that URL involves first providing the username and password and realm name. The Unicode web site doesn't publicly declare the realm name, because it's an irrelevant detail for users with interactive browsers, but we need to know it for our call to the credentials method. To find out the realm name, try accessing the URL in an interactive browser. The realm will be shown in the authentication dialog box, as shown in Figure 11-1.

In this case, it's "Unicode-MailList-Archives," which is all we needed to make our request.

    my $browser = LWP::UserAgent->new;
    $browser->credentials(
      'www.unicode.org:80',  # Don't forget the ":80"!
      # This is no secret...
      'Unicode-MailList-Archives',
      'unicode-ml' => 'unicode'
    );
    print "Getting topics for last month, $last_month\n",
      " from $url\n";
    my $response = $browser->get($url);
    die "Error getting $url: ", $response->status_line
      if $response->is_error;

If this fails (if the Unicode site's admins have changed the username or password or even the realm name), that will die with this error message:

    Error getting http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/:
    401 Authorization Required at unicode_list001.pl line 21.

But assuming the authorization data is correct, the page is retrieved as if it were a normal, unprotected page. From there, counting the topics and noting the absolute URL of the first message of each thread is a matter of extracting data from the HTML source and reporting it concisely.

    my(%posts, %first_url);
    while( ${ $response->content_ref } =~
      m{<li><a href="(\d+\.html)"><strong>(.*?)</strong>}g
      # Like: <li><a href="0127.html"><strong>Klingon</strong>
    ) {
      my($url, $topic) = ($1,$2);
      # Strip any number of "Re:" prefixes.
      while( $topic =~ s/^Re:\s+//i ) {}
      ++$posts{$topic};
      use URI; # For absolutizing URLs...
      $first_url{$topic} ||= URI->new_abs($url, $response->base);
    }

    print "Topics:\n", reverse sort map # Most common first:
      sprintf("% 5s %s\n %s\n",
        $posts{$_}, $_, $first_url{$_}
      ), keys %posts;

Typical output starts out like this:

    Getting topics for last month, y2002-m02
     from http://www.unicode.org/mail-arch/unicode-ml/y2002-m02/
    Topics:
       86 Unicode and Security
     http://www.unicode.org/mail-arch/unicode-ml/y2002-m02/0021.html
       47 ISO 3166 (country codes) Maintenance Agency Web pages move
     http://www.unicode.org/mail-arch/unicode-ml/y2002-m02/0390.html
       41 Unicode and end users
     http://www.unicode.org/mail-arch/unicode-ml/y2002-m02/0260.html
       27 Unicode Search Engines
     http://www.unicode.org/mail-arch/unicode-ml/y2002-m02/0360.html
       22 Smiles, faces, etc
     http://www.unicode.org/mail-arch/unicode-ml/y2002-m02/0275.html
       18 This spoofing and security thread
     http://www.unicode.org/mail-arch/unicode-ml/y2002-m02/0216.html
       16 Standard Conventions and euro
     http://www.unicode.org/mail-arch/unicode-ml/y2002-m02/0418.html

This continues for a few pages.

[end extract]

--
Sean M. Burke    http://www.spinn.net/~sburke/
