RE: read source file of .html

2002-01-15 Thread Gary Hawkins

That works.

I tweaked it a little: the handler now stores the text in $page instead of
printing it, so that I can alter the result, and I added a trailing '/'
because a top-level URL without a file name and without a trailing forward
slash gets redirected on the server to the version with the trailing slash.
That makes it a little quicker.  In detail, I think that
http://www.someplace.com/~user would first look for a file called ~user, then
say, doh, that must be a directory, find the index or default page for
http://www.someplace.com/~user/, and then display the latter, with the
trailing slash.  OK, too much information.
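
The trailing-slash idea above can be sketched as a small helper.  This is
only a heuristic, assuming the URI module is installed: with_trailing_slash
is a hypothetical name, and the "no file extension means directory" rule is
a guess that a real server is free to contradict.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Hypothetical helper: append '/' when the URL looks like a directory,
# i.e. the path is empty or its last segment has no file extension.
sub with_trailing_slash {
    my ($url) = @_;
    my $uri  = URI->new($url);
    my $path = $uri->path;

    return $uri->as_string if $path =~ m{/\z};          # already ends in '/'
    return $uri->as_string if $path =~ m{\.[^./]+\z};   # looks like a file name

    $uri->path("$path/");
    return $uri->as_string;
}

print with_trailing_slash('http://www.someplace.com/~user'), "\n";
```

Passing the result to get() should then save the extra round trip whenever
the server would otherwise have answered with a redirect.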

Thank you very much!

Gary

#!perl
use strict;
use warnings;

use HTML::Parser 3;
use LWP::Simple;

my $html = get("http://www.mit.edu/") or die "Couldn't fetch the page";
my $page = '';
my $parser = HTML::Parser->new(
    unbroken_text   => 1,
    ignore_elements => [qw( script head )],
    # Append each text chunk; a plain assignment would keep only the last one.
    text_h          => [ sub { $page .= shift; }, 'dtext' ]
)->parse($html)->eof();
$page =~ s#\n\s*\n#\n#g;    # collapse runs of blank lines
print $page;

__END__
..


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




RE: read source file of .html

2002-01-15 Thread McCollum, Frank

Depending on what you are doing... I have found a lot of great ways to pull
tables out of HTML using HTML::TableExtract, LWP::UserAgent, and
HTML::TreeBuilder.  I really haven't delved into all of the libraries under
HTML, but these have been great.  See cpan.org or perldoc for more info.
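
As a rough illustration of the table-pulling approach mentioned above, here
is a minimal sketch using HTML::TableExtract.  The sample markup and the
extract_rows helper are made up for the example; in practice the HTML would
come from LWP.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Sample markup, invented for illustration only.
my $html = <<'HTML';
<table>
  <tr><th>Name</th><th>Port</th></tr>
  <tr><td>http</td><td>80</td></tr>
  <tr><td>ftp</td><td>21</td></tr>
</table>
HTML

# Hypothetical helper: return the data rows of every table whose
# header row contains all of the given column names.
sub extract_rows {
    my ($markup, @headers) = @_;
    my $te = HTML::TableExtract->new( headers => \@headers );
    $te->parse($markup);

    my @rows;
    for my $table ( $te->tables ) {
        push @rows, [ @$_ ] for $table->rows;
    }
    return @rows;
}

for my $row ( extract_rows($html, 'Name', 'Port') ) {
    print join("\t", @$row), "\n";
}
```

Selecting tables by their header text tends to hold up better than counting
tables on the page, since the layout around the table can change.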



-Original Message-
From: Gary Hawkins [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, January 15, 2002 2:26 AM
To: [EMAIL PROTECTED]
Subject: RE: read source file of .html


> Use LWP. It can be as simple as this:
> 
> 
> use LWP::Simple;
> print get("http://www.mit.edu");
> 
> Tor.
> 

Neat.  

Along that line, I would like to be able to wind up with pages after retrieval
as plain text without html tags, hopefully using a module. 

/g








Re: read source file of .html

2002-01-14 Thread Briac Pilpré

Gary Hawkins wrote:
> Along that line, I would like to be able to wind up with pages after retrieval
> as plain text without html tags, hopefully using a module. 

Here's a really quick way to do so using HTML::Parser; it can probably
use some tweaking.

Hope this helps,
Briac

#!/usr/bin/perl -w
use strict;
use HTML::Parser 3;
use LWP::Simple;

my $html = get("http://www.mit.edu") or die "Couldn't fetch the page";

my $parser = HTML::Parser->new(
unbroken_text   => 1,
ignore_elements => [qw( script head )],
text_h  => [ sub {print shift}, 'dtext']
)->parse($html)->eof();

__END__

-- 
briac
A flying lark. Five 
trout swim in the pond. Four foxes 
under a she-oak.





RE: read source file of .html

2002-01-14 Thread Gary Hawkins

> Use LWP. It can be as simple as this:
> 
> 
> use LWP::Simple;
> print get("http://www.mit.edu");
> 
> Tor.
> 

Neat.  

Along that line, I would like to be able to wind up with pages after retrieval
as plain text without html tags, hopefully using a module. 

/g







Re: read source file of .html

2002-01-14 Thread victor

Use LWP. It can be as simple as this:


use LWP::Simple;
print get("http://www.mit.edu");

Tor.
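
One thing the one-liner above hides is that get() simply returns undef when
the fetch fails.  A slightly longer sketch with LWP::UserAgent reports why.
fetch_page is a hypothetical name, and the 10-second timeout is an arbitrary
choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: return the page body, or die with the status
# line (for example "404 Not Found") when the request fails.
sub fetch_page {
    my ($url) = @_;
    my $ua  = LWP::UserAgent->new( timeout => 10 );
    my $res = $ua->get($url);

    die "Couldn't fetch $url: ", $res->status_line, "\n"
        unless $res->is_success;

    return $res->decoded_content;
}
```

Calling print fetch_page("http://www.mit.edu/"); would then behave like the
LWP::Simple version, except that a failure comes with an explanation.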

yun yun wrote:

> If I want to read the real HTML file from the web, such as
> http://www.mit.edu, should I use socket programming? If
> so, how would I use it, and where can I study this
> topic? Thanks!






read source file of .html

2002-01-14 Thread yun yun

If I want to read the real HTML file from the web, such as
http://www.mit.edu, should I use socket programming? If
so, how would I use it, and where can I study this
topic? Thanks!
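
For the socket question itself: it can be done with a plain TCP socket, and
it shows roughly what a module like LWP takes care of for you.  A minimal
sketch, assuming plain HTTP on port 80 and a server that closes the
connection after replying; build_request and fetch_raw are hypothetical
names used only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket::INET;

# Build a minimal HTTP/1.0 request for $path on $host.
sub build_request {
    my ($host, $path) = @_;
    return "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";
}

# Send the request over a plain TCP socket and return the raw reply:
# status line, headers, and body, all in one string.
sub fetch_raw {
    my ($host, $path, $port) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port || 80,
        Proto    => 'tcp',
        Timeout  => 10,
    ) or die "Cannot connect to $host: $@";

    print {$sock} build_request($host, $path);

    local $/;               # slurp mode: read until the server closes
    my $reply = <$sock>;
    close $sock;
    return $reply;
}
```

A call such as print fetch_raw('www.mit.edu', '/'); prints the headers along
with the HTML, and it does not follow redirects or speak HTTPS, which is a
good part of the reason to reach for a module instead.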




