Re: HTML::Entities chokes on XML::Parser strings

2002-05-07 Thread Gisle Aas

John Siracusa <[EMAIL PROTECTED]> writes:

> On 5/7/02 10:58 AM, Paul Lindner wrote:
> > The output from your example looks like UTF-8 data (Ã is a
> > commonly seen UTF-8 escape sequence).  XML::Parser converts all
> > incoming text into UTF-8.  You will need to convert it back to
> > iso-8859-1.
> > 
> > My favorite is Text::Iconv
> > 
> >use Text::Iconv;
> >$utf8tolatin1 = Text::Iconv->new("UTF-8", "ISO8859-1");
> > 
> >my $buffer_latin1 = $utf8tolatin1->convert($buffer);
> 
> So HTML::Entities only works with ISO8859-1 (or ASCII, presumably)?

Not true.  But the unicode support in perl-5.6.x has many bugs.  With
5.8 things will be better.  It is a bad idea for XML::Parser to give
out strings with the UTF8 flag set.
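
For what it's worth, on perl 5.8 and later the core Encode module can do
the conversion Paul suggests without any extra CPAN modules.  A minimal
sketch (the variable names are made up for illustration):

  use Encode qw(encode);

  # $string comes from XML::Parser with the UTF8 flag set;
  # encode() returns the equivalent iso-8859-1 byte string.
  my $latin1 = encode("iso-8859-1", $string);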

Regards,
Gisle



Re: HTTP::Cookies problem

2001-05-03 Thread Gisle Aas

Jonas Nordström <[EMAIL PROTECTED]> writes:

> How can I copy cookies from an incoming request to a LWP-request and also
> add a custom cookie? Can I use HTTP::Cookies?
> 
> I use:
> $request->header('Cookie' => $r->header_in("Cookie")); 
> and it works fine, but now I want to add a cookie that the client didn't
> send.
> Can I use $cookie_jar->set_cookie() and then
> $cookie_jar->add_cookie_header($request);? But what happens with the
> original cookies?

It goes away if any cookies from the $cookie_jar apply.
$cookie_jar->add_cookie_header() currently overwrites the Cookie header
by calling:

  $request->header(Cookie => join("; ", @cval)) if @cval;

If we change this to:

  $request->push_header(...)

then you get two header lines.  I don't know if most server apps can
deal with that.  Still another alternative would be to do something like:

  if (my $old_cookie = $request->header('Cookie')) {
      unshift(@cval, $old_cookie);
  }
  $request->header(Cookie => join("; ", @cval)) if @cval;

That should append to the current value if it was set.
Do you want this?
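
Putting the pieces together, a minimal sketch of the append-by-hand
approach (the cookie name 'mycookie' is made up; $r is the mod_perl
request object from the original question):

  my $request = HTTP::Request->new(GET => "http://www.example.com/");

  # copy the cookies the client sent us, then append our own
  my @cval = ($r->header_in("Cookie"), "mycookie=1");
  $request->header(Cookie => join("; ", grep { defined && length } @cval));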

Regards,
Gisle



Re: Getting data from external URL

2000-08-29 Thread Gisle Aas

Steve Reppucci <[EMAIL PROTECTED]> writes:

> Just a word of warning: LWP::Simple doesn't follow redirects (at least,
> last I checked; not sure if it's been changed in the 3 or 4
> months since I last used it...),

If it does not follow redirects then that is a bug.  Do you have a
test case?

Not much has changed in the last 3 or 4 months either.

Regards,
Gisle



Re: [OT] & in URLs (was: Re: Templating System)

2000-07-28 Thread Gisle Aas

Michael Hanisch <[EMAIL PROTECTED]> writes:

> On Fri, 28 Jul 2000, Dave Jenkins wrote:
> 
> > > > Then you are wrong. :) You need to have &amp; in there, so that the
> > > > browser can turn it back from &amp; to & before sending the URL back
> > > > up to your server (or whichever server comes along).
> > > 
> > > Are you really positive about this?
> > 
> > I had a problem with certain URLs on IE4 a while back: given a link like...
> > 
> > ... it was turning the '&sect' bit into a section symbol, causing the link not
> > to work!
> > 
> 
> Yuck.
> Anybody else with similar problems?

I've seen this with some older versions of Netscape too.  But that's a
browser bug.  The whole name after the '&' needs to be considered before
entity substitution is performed.

With current browsers you should see problems with URLs like this:

  <a href="...?chapter=1&copy=2">foo</a>

> I really believe my thoughts outlined in my original post are correct -
> but right now I am starting to worry...
> Personally I would attribute the described problem to a bug in IE4 - even
> if it parses the URI for entities, it shouldn't find a "§" since the
> trailing semi-colon is missing. (Aargh, feeling like a smart-ass again ;-)

The trailing semi-colon is optional when the entity is followed by a
non-word character like "=".  With Unicode we get many more names in
the entity name space so the risk of getting bitten by this increases.
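
For completeness, the safe way to emit such links from Perl is to
entity-encode the URL before embedding it in HTML; a minimal sketch using
HTML::Entities (the URL itself is made up):

  use HTML::Entities qw(encode_entities);

  my $url  = "/cgi-bin/view?chapter=1&sect=2";
  my $href = encode_entities($url);   # '&' becomes '&amp;'
  print qq(<a href="$href">section 2</a>\n);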

> To be honest, I have always used plain ampersands in URLs embedded in my
> pages, and thus far I have never encountered any problems.
> But maybe I've just been lucky... ;-)

Good for you :-)

Regards,
Gisle



Re: Templating System (preview Embperl 2.0)

2000-07-28 Thread Gisle Aas

"Paul J. Lucas" <[EMAIL PROTECTED]> writes:

> On Fri, 28 Jul 2000, Gerald Richter wrote:
> 
> > As far as I understand you you use mmap to read in the source file, is this
> > correct?
> 
>   Yes.
> 
> > If this is true, then it will not make much difference, because reading in
> > the source is only a very small piece of all the time that it takes to
> > generate the output from a dynamic page.
> 
>   I suggest you do some benchmarks.  I have, albeit many months
>   ago.  If I recall correctly, I took Yahoo's home page and ran
> it through my HTML Tree and that of Gisle Aas: HTML Tree was about
>   7-8 times faster.

That does not show that mmap is superior.

I have not been able to build your module on my system, so I have not
been able to set up any benchmarks myself.  My guess is that you
compared your module parsing speed with that of the HTML::TreeBuilder
module?  Could you please be specific about which versions of the
modules you compared?  Perhaps also post the benchmark code you used.
Also tell me what HTML::Parser you had installed.  Did you use
HTML-Parser-3?

If your module is that much faster than the basic HTML::Parser then I
must be doing something very wrong.

> > My point was, that the C implementation of parsering and DOM tree
> > storage/caching, is much faster (uses much less memory) then doing the same
> > in Perl.
> 
>   ...and faster still with mmap(2).

I don't believe mmap buys you anything significant.  And it has the
drawback that you can't parse from pipes or sockets.

I made this little test program.  I am not able to measure mmap to be
faster than fread on my system.  I am testing with Yahoo's home page.
Do you get different numbers?

Regards,
Gisle

--->8
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/mman.h>

void
with_mmap(char *file)
{
    struct stat sbuf;
    int fd;
    void* area;
    char* c;
    int size;
    unsigned int checksum;

    fd = open(file, O_RDONLY);
    fstat(fd, &sbuf);
    size = sbuf.st_size;
    area = mmap(0, size, PROT_READ, MAP_SHARED, fd, 0);

    /* read the mapped area */
    checksum = 0;
    c = area;
    while (size--) {
        checksum += *c++;
    }
    munmap(area, sbuf.st_size);
    close(fd);

    printf("%s: sum=%x\n", file, checksum);
}

void
with_fread(char *file)
{
    FILE* f = fopen(file, "r");
    char buf[32*1024];
    unsigned int checksum;
    size_t n;

    checksum = 0;
    while ((n = fread(buf, 1, sizeof(buf), f))) {
        char *c = buf;
        size_t n_orig = n;
        while (n--)
            checksum += *c++;
        if (n_orig < sizeof(buf))
            break;
    }
    fclose(f);

    printf("%s: sum=%x\n", file, checksum);
}


/* usage: prog mmap|fread file... */
int
main(int argc, char** argv)
{
    int i;
    void (*f)(char*);

    if (argc <= 1) {
        fprintf(stderr, "Missing type\n");
        return -1;
    }

    if (strcmp(argv[1], "mmap") == 0)
        f = with_mmap;
    else if (strcmp(argv[1], "fread") == 0)
        f = with_fread;
    else {
        fprintf(stderr, "Bad type '%s'\n", argv[1]);
        return -1;
    }

    for (i = 2; i < argc; i++) {
        f(argv[i]);
    }
    return 0;
}



Re: [performance/benchmark] printing techniques

2000-06-11 Thread Gisle Aas

Stas Bekman <[EMAIL PROTECTED]> writes:

> And the results are:
> 
>   single_print:  1 wallclock secs ( 1.74 usr +  0.05 sys =  1.79 CPU)
>   here_print:    3 wallclock secs ( 1.79 usr +  0.07 sys =  1.86 CPU)
>   list_print:    7 wallclock secs ( 6.57 usr +  0.01 sys =  6.58 CPU)
>   multi_print:  10 wallclock secs (10.72 usr +  0.03 sys = 10.75 CPU)
> 
> Numbers tell it all: 'single_print' is the fastest, 'here_print' is
> almost the same speed,

'single_print' and 'here_print' compile down to exactly the same code,
so there should not be any real difference between them.
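
This is easy to check with B::Deparse, which prints the code a snippet
compiled down to.  A quick sketch (the exact output may vary between
perl versions):

  $ perl -MO=Deparse -e 'print "foo\nbar\n"'
  print "foo\nbar\n";

  $ perl -MO=Deparse -e 'print <<EOT
  foo
  bar
  EOT
  '
  print "foo\nbar\n";

Both forms deparse to the same print of a single constant string.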

-- 
Gisle Aas



Re: Novel technique for dynamic web page generation

2000-01-30 Thread Gisle Aas

"Paul J. Lucas" <[EMAIL PROTECTED]> writes:

> On 28 Jan 2000, Randal L. Schwartz wrote:
> 
> > Have you looked at the new XS version of HTML::Parser?
> 
>   Not previously, but I just did.
> 
> > It's a speedy little beasty.  I dare say probably faster than even
> > expat-based XML::Parser because it doesn't do quite as much.
> 
>   But still an order of magnitude slower than mine.  For a test,
>   I downloaded Yahoo!'s home page for a test HTML file and wrote
>   the following code:
> 
> - test code -
> #! /usr/local/bin/perl
> 
> use Benchmark;
> use HTML::Parser;
> use HTML::Tree;
> 
> @t = timethese( 1000, {
>'Parser' => '$p = HTML::Parser->new(); $p->parse_file( "/tmp/test.html" );',
>'Tree'   => '$html = HTML::Tree->new( "/tmp/test.html" );',
> } );
> -
> 
>   The results are:
> 
> - results -
> Benchmark: timing 1000 iterations of Parser, Tree...
> Parser: 37 secs (36.22 usr  0.15 sys = 36.37 cpu)
>   Tree:  7 secs ( 7.40 usr  0.22 sys =  7.62 cpu)
> ---
> 
>   One really can't compete against mmap(2), pointer arithmetic,
>   and dereferencing.

That's because you fall back to version 2 compatibility when you don't
provide any arguments to the HTML::Parser constructor.  The parser
will then make useless method calls for everything it finds, and method
calls in perl are not as cheap as I would wish.

- test code -
use Benchmark;
use HTML::Parser;

timethese( 1000, {
    'Parser'  => '$p = HTML::Parser->new(); $p->parse_file( "./index.html" );',
    'Parser3' => 'HTML::Parser->new(api_version => 3)->parse_file( "./index.html" );',
} );
-

$ lwp-download http://yahoo.com
Saving to 'index.html'...
11.6 KB received in 2 seconds (5.8 KB/sec)

$ perl test.pl
Benchmark: timing 1000 iterations of Parser, Parser3...
    Parser: 30 wallclock secs (29.31 usr +  0.20 sys = 29.51 CPU)
   Parser3:  2 wallclock secs ( 1.39 usr +  0.17 sys =  1.56 CPU)

...but this is kind of a useless benchmark, as it does not do anything.
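
To make the version-3 run do comparable work, one could register a real
handler; a minimal sketch using the documented start_h handler form
(counting start tags is just an arbitrary cheap task):

  my $count = 0;
  HTML::Parser->new(api_version => 3,
                    start_h => [sub { $count++ }, "tagname"],
                   )->parse_file("./index.html");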

Regards,
Gisle



HTML-Parser-XS-2.99_08 performance numbers

1999-11-11 Thread Gisle Aas

Dave Hodgkinson <[EMAIL PROTECTED]> writes:

> Do you have any numbers on speed?

These are some examples:

---
#!/usr/bin/perl
$file = "/local/doc/html-spec/html4.0.1/interact/forms.html";
print "Parsing ", -s $file, " bytes\n";

$doc = `cat $file`;

use HTML::Parser ();
use Time::HiRes qw(time);

$before = time;
for (1..10) {
    HTML::Parser->new->parse_file($file);
}
printf "parse_file: %.1f seconds\n", time - $before;

$before = time;
for (1..10) {
    HTML::Parser->new->parse($doc)->eof;
}
printf "parse: %.1f seconds\n", time - $before;
__END__


Prints:

   Parsing 138204 bytes
   parse_file: 7.1 seconds
   parse: 87.3 seconds

when using HTML-Parser-2.25, and

   Parsing 138204 bytes
   parse_file: 2.3 seconds
   parse: 2.1 seconds

when using HTML-Parser-XS-2.99_08.  We get a speedup of 41(!) times when
parsing from an in-memory string and 3 times when using the parse_file
method.  This also shows that the old parser was very bad at breaking
up large chunks; the 'parse_file' method feeds the document in small
chunks.
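
In other words, what 'parse_file' does is roughly this (a sketch; the
actual chunk size is an internal detail of the module):

  open(FILE, $file) or die "Can't open $file: $!";
  my $p = HTML::Parser->new;
  my $chunk;
  while (read(FILE, $chunk, 512)) {
      $p->parse($chunk);
  }
  $p->eof;
  close(FILE);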


---
#!/usr/bin/perl
$file = "/local/doc/html-spec/html4.0.1/interact/forms.html";
print "Parsing ", -s $file, " bytes\n";

use HTML::LinkExtor ();
use Time::HiRes qw(time);

$count = 0;
$before = time;
for (1..10) {
    HTML::LinkExtor->new(sub {$count++})->parse_file($file);
}
printf "Found $count links in %.1f seconds\n", time - $before;
__END__

Prints:

   Parsing 138204 bytes
   Found 8770 links in 8.3 seconds

when using HTML-Parser-2.25, and

   Parsing 138204 bytes
   Found 8770 links in 2.0 seconds

when using HTML-Parser-XS-2.99_08.  That is a 4 times speedup.


---
#!/usr/bin/perl
$file = "/local/doc/html-spec/html4.0.1/interact/forms.html";
print "Parsing ", -s $file, " bytes\n";

use HTML::TokeParser ();
use Time::HiRes qw(time);

$count = 0;
$before = time;
for (1..10) {
    my $p = HTML::TokeParser->new($file);
    while (my $t = $p->get_token) {
        $count++;
    }
}

printf "Processed $count tokens in %.1f seconds\n", time - $before;
__END__

Prints:

   Parsing 138204 bytes
   Processed 80140 tokens in 11.5 seconds

when using HTML-Parser-2.25, and

   Parsing 138204 bytes
   Processed 80140 tokens in 3.3 seconds

when using HTML-Parser-XS-2.99_08.  That is a 3.5 times speedup.


---
#!/usr/bin/perl
$file = "/local/doc/html-spec/html4.0.1/interact/forms.html";
print "Parsing ", -s $file, " bytes\n";

use HTML::Parser ();
use Time::HiRes qw(time);

$count = 0;
$before = time;

if ($HTML::Parser::VERSION < 2.9) {
    {
        package MyParser;
        require HTML::Entities;
        @ISA = qw(HTML::Parser);
        sub text { my $t = HTML::Entities::decode($_[1]); $main::count++ }
    }

    for (1..10) {
        my $p = MyParser->new;
        $p->parse_file($file);
    }
}
else {
    for (1..10) {
        my $p = HTML::Parser->new(text => sub {$count++},
                                  decode_text_entities => 1,
                                 );
        $p->parse_file($file);
    }
}

printf "Processed $count decoded text segments in %.2f seconds\n",
    time - $before;
__END__

Prints:

   Parsing 138204 bytes
   Processed 32430 decoded text segments in 9.32 seconds

when using HTML-Parser-2.25, and

   Parsing 138204 bytes
   Processed 32430 decoded text segments in 0.76 seconds

when using HTML-Parser-XS-2.99_08.  This shows that the new library
also provides some new ways to do things that are good for speed.  Here
we get a 12 times speedup.


---
But the new library loads slightly slower (it has a dynamic C library
to link):

  $ time for i in $(range 50); do \
      perl -MHTML::Parser -le 'HTML::Parser->new; print $HTML::Parser::VERSION'; \
    done

Takes 3.6 seconds with version 2.25 and 4.2 seconds with 2.99_08.
That is 17% slower startup.  This shouldn't matter much when you use
it under mod_perl though :-)

['range' is a little tool I have that will print the numbers 1..50 in
this case]

These tests were made on a SuSE Linux box with enough memory and a
350 MHz Pentium II processor.

I expect the new parser to become a bit faster when I get to the point
where I try to optimize it.  Currently I am just trying to get all new
features implemented.

Regards,
Gisle



Re: Trying not to re-invent the wheel

1999-11-10 Thread Gisle Aas

"Christian Gilmore" <[EMAIL PROTECTED]> writes:

> I found that writing my own parser to fit my specific need was far
> and away the fastest thing I could do. It really depends upon your
> specific application. HTML::Parser is nice if you want to see the
> structure of the document you're parsing but is just too slow to use
> for wresting particular tags from a document...

True. This was the main reason I started work on a new XS based
HTML::Parser a week ago.  It should make much of the performance
argument go away.  Still, most of the HTML that I have ever needed to
parse or manipulate is regular enough to make perl REs good enough.
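
For instance, pulling the links out of well-behaved HTML can be a
one-liner; a sketch that assumes double-quoted href attributes and no
pathological markup:

  my @links = $html =~ /<a\s[^>]*href="([^"]+)"/gi;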

Since HTML::Parser is XS based now, I'm also able to offer many more
features without sacrificing performance.  I have attached a message I
sent to the <[EMAIL PROTECTED]> mailing list today describing what's
new.

Regards,
Gisle





I am now up to version 2.99_08 of the new HTML::Parser and I think it
is coming along nicely.  As you might guess from the version number, I am
aiming for version 3.00 when I think it is ready for general use.

I still encourage people to download it and test it out on various
platforms (at least check that 'make test' says everything is ok).
You can get it from:

   $CPAN/authors/id/GAAS/HTML-Parser-XS-2.99_08.tar.gz

Compatibility with HTML-Parser-2.2x is now perfect as far as I can tell.
I still reserve the right to change the interfaces to all new features
until 3.00-time.  There is still no documentation of the new things,
but the following text attempts to explain most of them:

The main new feature is that instead of making a subclass you can just
provide callbacks to be invoked when various elements are recognised.
When one or more direct callbacks are provided, then no methods will
be called.

There is a new 'default' callback that is invoked with the text of
everything that there is no other callback registered for.  This might
for instance be used to implement a simple comment stripper by code
like this:

  HTML::Parser->new(comment => sub {},  # ignore comments
                    default => sub { print $_[0] },
                   )->parse_file(shift);

(I actually thought I was very clever when I realized how handy this
would be, but later found out that XML::Parser already had exactly
this feature. :-)

Text handlers get an extra argument that is true if entities are
already expanded in the text string passed.  This was needed to handle

Re: MD5 risks?

1999-11-09 Thread Gisle Aas

Ben Bell <[EMAIL PROTECTED]> writes:

> On Tue, Nov 09, 1999 at 10:24:39AM +0800, Trevor Phillips wrote:
> > Another alternative is to get the MD5 base64 key to the URI. My query is, what
> > is the chance of two URI's giving the same MD5? Is there any risk in it, or is
> > MD5 guranteed to give unique ID's? (I know the risk would be SLIM, but how
> > slim?) Is MD5 used regularly for this kind of thing?
> Very slim :) something like 1 in 1000 billion billion billion billion
> for a match to a particular key, though about 1 in a billion billion for
> getting a collision in general.  (In the region of 1/(2^128) and 1/(2^64)
> respectively.)
>
> If you tack on the length of the string you're MD5ing as well then you're
> pretty much safe.

I don't think this buys you much extra safety.  The lengths of URIs
don't vary significantly compared to MD5.
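
For reference, computing such a key is a one-liner with Digest::MD5; a
minimal sketch (md5_base64 returns the base64 form of the 128-bit
digest, without padding):

  use Digest::MD5 qw(md5_base64);

  my $key = md5_base64($uri);   # 22 base64 characters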

-- 
Gisle Aas