Re: HTML::Entities chokes on XML::Parser strings
John Siracusa <[EMAIL PROTECTED]> writes:

> On 5/7/02 10:58 AM, Paul Lindner wrote:
> > The output from your example looks like UTF-8 data (Ã is a
> > commonly seen UTF-8 escape sequence).  XML::Parser converts all
> > incoming text into UTF-8.  You will need to convert it back to
> > iso-8859-1.
> >
> > My favorite is Text::Iconv
> >
> >   use Text::Iconv;
> >   $utf8tolatin1 = Text::Iconv->new("UTF-8", "ISO8859-1");
> >
> >   my $buffer_latin1 = $utf8tolatin1->convert($buffer);
>
> So HTML::Entities only works with ISO8859-1 (or ASCII, presumably)?

Not true.  But the Unicode support in perl-5.6.x has many bugs.  With
5.8 things will be better.

It is a bad idea for XML::Parser to give out strings with the UTF8
flag set.

Regards,
Gisle
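A minimal sketch of the suggested workaround (untested; the handler
setup and the file name are just placeholders):

use XML::Parser;
use Text::Iconv;
use HTML::Entities qw(encode_entities);

my $utf8tolatin1 = Text::Iconv->new("UTF-8", "ISO8859-1");

my $p = XML::Parser->new(Handlers => {
    Char => sub {
        my ($expat, $text) = @_;
        # XML::Parser hands us UTF-8; convert back to iso-8859-1
        # before letting HTML::Entities loose on it.
        print encode_entities($utf8tolatin1->convert($text));
    },
});
$p->parsefile("doc.xml");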
Re: HTTP::Cookies problem
Jonas Nordström <[EMAIL PROTECTED]> writes:

> How can I copy cookies from an incoming request to a LWP-request and
> also add a custom cookie?  Can I use HTTP::Cookies?
>
> I use:
>   $request->header('Cookie' => $r->header_in("Cookie"));
> and it works fine, but now I want to add a cookie that the client
> didn't send.
> Can I use $cookie_jar->set_cookie() and then
> $cookie_jar->add_cookie_header($request); ?  But what happens with
> the original cookies?

They go away if any cookies from the $cookie_jar apply.
$cookie_jar->add_cookie_header() currently overrides the Cookie header
by calling:

  $request->header(Cookie => join("; ", @cval)) if @cval;

If we change this to:

  $request->push_header(...)

then you get two header lines.  I don't know if most server apps can
deal with that.

Still another alternative would be to do something like:

  if (my $old_cookie = $request->header('Cookie')) {
      unshift(@cval, $old_cookie);
  }
  $request->header(Cookie => join("; ", @cval)) if @cval;

That should append to the current value if it was set.  Do you want
this?

Regards,
Gisle
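In the meantime, a sketch of the workaround from the caller's side
(untested; the cookie names and URL are made up for illustration, and
under mod_perl the incoming header would come from
$r->header_in("Cookie")):

use HTTP::Request;
use HTTP::Cookies;

my $incoming = "sessionid=abc123";   # stand-in for the client's header

my $request = HTTP::Request->new(GET => "http://www.example.com/");

my $jar = HTTP::Cookies->new;
$jar->set_cookie(0, "extra", "42", "/", "www.example.com");
$jar->add_cookie_header($request);   # may overwrite the Cookie header

# Re-append the client's original cookies afterwards.
if (my $jar_cookies = $request->header('Cookie')) {
    $request->header(Cookie => join("; ", $incoming, $jar_cookies));
}
else {
    $request->header(Cookie => $incoming);
}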
Re: Getting data from external URL
Steve Reppucci <[EMAIL PROTECTED]> writes:

> Just a word of warning: LWP::Simple doesn't follow redirects (at
> least, the last I checked, not sure if it's been changed in the 3 or
> 4 months since I've last used it...),

If it does not follow redirects then that is a bug.  Do you have a
test case?

Not much has changed in the last 3 or 4 months either.

Regards,
Gisle
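A test case would be as simple as this (untested; the URL is a
stand-in for any resource that answers with a redirect):

use LWP::Simple qw(get);

# get() returns the content on success or undef on failure, so if
# redirects are followed this prints the content of the final page.
my $content = get("http://www.example.com/redirecting-page");
print defined $content ? $content : "get() failed\n";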
Re: [OT] & in URLs (was: Re: Templating System)
Michael Hanisch <[EMAIL PROTECTED]> writes:

> On Fri, 28 Jul 2000, Dave Jenkins wrote:
> > > > Then you are wrong. :)  You need to have &amp; in there, so
> > > > that the browser can turn it back from &amp; to & before
> > > > sending the URL back up to your server (or whichever server
> > > > comes along).
> > >
> > > Are you really positive about this?
> >
> > I had a problem with certain URLs on IE4 a while back: given a
> > link like...
> >
> > ... it was turning the '&sect' bit into a section symbol, causing
> > the link not to work!
>
> Yuck.
> Anybody else with similar problems?

I've seen this with some older versions of Netscape too.  But that's
a browser bug.  The whole name after the & needs to be considered
before entity substitution is performed.  With current browsers you
should see problems with URLs like this:

  foo

> I really believe my thoughts outlined in my original post are
> correct - but right now I am starting to worry...
> Personally I would attribute the described problem to a bug in IE4 -
> even if it parses the URI for entities, it shouldn't find a "&sect;"
> since the trailing semi-colon is missing.  (Aargh, feeling like a
> smart-ass again ;-)

The trailing semi-colon is optional when the entity is followed by a
non-word character like "=".  With Unicode we get many more names in
the entity name space, so the risk of getting bitten by this
increases.

> To be honest, I have always used plain ampersands in URLs embedded
> in my pages, and thus far I have never encountered any problems.
> But maybe I've just been lucky... ;-)

Good for you :-)

Regards,
Gisle
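The safe approach discussed in this thread is to entity-escape URLs
before embedding them in HTML.  A small sketch (the URL is made up,
but note the "&sect" parameter from the IE4 report above):

use HTML::Entities qw(encode_entities);

my $url = "http://www.example.com/page.cgi?chapter=1&sect=2";

# encode_entities() turns the "&" into "&amp;", so a browser can no
# longer mistake "&sect" for the section-sign entity.
printf qq{<a href="%s">link</a>\n}, encode_entities($url);
# prints: <a href="http://www.example.com/page.cgi?chapter=1&amp;sect=2">link</a>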
Re: Templating System (preview Embperl 2.0)
"Paul J. Lucas" <[EMAIL PROTECTED]> writes: > On Fri, 28 Jul 2000, Gerald Richter wrote: > > > As far as I understand you you use mmap to read in the source file, is this > > correct? > > Yes. > > > If this is true, then it will not make much difference, because reading in > > the source is only a very small piece of all the time that it takes to > > generate the output from a dynamic page. > > I suggest you do some benchmarks. I have, albeit many months > ago. If I recall correctly, I took Yahoo's home page and ran > it through my HTML Tree and that of Gisle Aas: HTML was about > 7-8 times faster. That does not show that mmap is superior. I have not been able to build your module on my system, so I have not been able to set up any benchmarks myself. My guess is that you compared your module parsing speed with that of the HTML::TreeBuilder module? Could you please be specific with what versions of the modules you compared? Perhaps also post the benchmark code you used. Also tell me what HTML::Parser you had installed. Did you use HTML-Parser-3? If your module is that much faster than the basic HTML::Parser then I must be doing something very wrong. > > My point was, that the C implementation of parsering and DOM tree > > storage/caching, is much faster (uses much less memory) then doing the same > > in Perl. > > ...and faster still with mmap(2). I don't believe mmap buys you any significant. And it has the drawback that you can't parse from pipes or sockets. I made this little test program. I am not able to measure mmap to be faster than fread on my system. I am testing with Yahoo's home page. Do you get different numbers? Regards, Gisle --->8 #include #include #include #include #include #include void with_mmap(char *file) { struct stat sbuf; int fd; void* area; char* c; int size; unsigned int checksum; fd = open(file, O_RDONLY); fstat(fd, &sbuf); size = sbuf.st_size; area = mmap(0, size, PROT_READ, MAP_SHARED, fd, 0); /* read the mapped area */ checksum = 0; c = area; while (size--) { checksum += *c++; } munmap(area, sbuf.st_size); printf("%s: sum=%x\n", file, checksum); } void with_fread(char *file) { FILE* f = fopen(file, "r"); char buf[32*1024]; unsigned int checksum; size_t n; checksum = 0; while ( (n = fread(buf, 1, sizeof(buf), f))) { char *c = buf; int n_orig = n; while (n--) checksum += *c++; if (n_orig < sizeof(buf)) break; } fclose(f); printf("%s: sum=%x\n", file, checksum); } int main(int argc, char** argv) { int i; void (*f)(char*); if (argc <= 1) { fprintf(stderr, "Missing type\n"); return -1; } if (strcmp(argv[1], "mmap") == 0) f = with_mmap; else if (strcmp(argv[1], "fread") == 0) f = with_fread; else { fprintf(stderr, "Bad type '%s'\n", argv[1]); return -1; } for (i = 2; i < argc; i++) { f(argv[i]); } return 0; }
Re: [performance/benchmark] printing techniques
Stas Bekman <[EMAIL PROTECTED]> writes:

> And the results are:
>
>   single_print:  1 wallclock secs ( 1.74 usr +  0.05 sys =  1.79 CPU)
>   here_print:    3 wallclock secs ( 1.79 usr +  0.07 sys =  1.86 CPU)
>   list_print:    7 wallclock secs ( 6.57 usr +  0.01 sys =  6.58 CPU)
>   multi_print:  10 wallclock secs (10.72 usr +  0.03 sys = 10.75 CPU)
>
> Numbers tell it all, I<'single_print'> is the fastest, 'here_print'
> is almost of the same speed,

'single_print' and 'here_print' compile down to exactly the same
code, so there should not be any real difference between them.

-- 
Gisle Aas
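One way to verify that claim (my reconstruction; the benchmarked subs
themselves were not quoted in this post):

#!/usr/bin/perl
sub single_print { print "line one\nline two\n" }

sub here_print {
    print <<"END";
line one
line two
END
}

Running this through the deparser:

  perl -MO=Deparse test.pl

shows both subs reduced to a single print of one constant string; a
here-doc is just another way to write a string literal, so the
compiled code is identical.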
Re: Novel technique for dynamic web page generation
"Paul J. Lucas" <[EMAIL PROTECTED]> writes: > On 28 Jan 2000, Randal L. Schwartz wrote: > > > Have you looked at the new XS version of HTML::Parser? > > Not previously, but I just did. > > > It's a speedy little beasty. I dare say probably faster than even > > expat-based XML::Parser because it doesn't do quite as much. > > But still an order of magnitude slower than mine. For a test, > I downloaded Yahoo!'s home page for a test HTML file and wrote > the following code: > > - test code - > #! /usr/local/bin/perl > > use Benchmark; > use HTML::Parser; > use HTML::Tree; > > @t = timethese( 1000, { >'Parser' => '$p = HTML::Parser->new(); $p->parse_file( "/tmp/test.html" );', >'Tree' => '$html = HTML::Tree->new( "/tmp/test.html" );', > } ); > - > > The results are: > > - results - > Benchmark: timing 1000 iterations of Parser, Tree... > Parser: 37 secs (36.22 usr 0.15 sys = 36.37 cpu) > Tree: 7 secs ( 7.40 usr 0.22 sys = 7.62 cpu) > --- > > One really can't compete against mmap(2), pointer arithmetic, > and dereferencing. That's because you fall back to version 2 compatibility when you don't provide any arguments to the HTML::Parser constructor. The parser will then make useless method calls for all stuff it finds, and method calls with perl are not as cheap as I would wish. - test code - use Benchmark; use HTML::Parser; timethese( 1000, { 'Parser' => '$p = HTML::Parser->new(); $p->parse_file( "./index.html" );', 'Parser3' => 'HTML::Parser->new(api_version => 3)->parse_file( "./index.html" );' } ); - $ lwp-download http://yahoo.com Saving to 'index.html'... 11.6 KB received in 2 seconds (5.8 KB/sec) $ perl test.pl Benchmark: timing 1000 iterations of Parser, Parser3... Parser: 30 wallclock secs (29.31 usr + 0.20 sys = 29.51 CPU) Parser3: 2 wallclock secs ( 1.39 usr + 0.17 sys = 1.56 CPU) ...but this is kind of a useless benchmark, as it does not do anything. Regards, Gisle
HTML-Parser-XS-2.99_08 performance numbers
Dave Hodgkinson <[EMAIL PROTECTED]> writes:

> Do you have any numbers on speed?

These are some examples:

---
#!/usr/bin/perl

$file = "/local/doc/html-spec/html4.0.1/interact/forms.html";
print "Parsing ", -s $file, " bytes\n";
$doc = `cat $file`;

use HTML::Parser ();
use Time::HiRes qw(time);

$before = time;
for (1..10) {
    HTML::Parser->new->parse_file($file);
}
printf "parse_file: %.1f seconds\n", time - $before;

$before = time;
for (1..10) {
    HTML::Parser->new->parse($doc)->eof;
}
printf "parse: %.1f seconds\n", time - $before;
__END__

Prints:

  Parsing 138204 bytes
  parse_file: 7.1 seconds
  parse: 87.3 seconds

when using HTML-Parser-2.25, and

  Parsing 138204 bytes
  parse_file: 2.3 seconds
  parse: 2.1 seconds

when using HTML-Parser-XS-2.99_08.  We get a speedup of 41(!) times
when parsing from an in-memory string and 3 times when using the
parse_file method.  This also shows that the old parser was very bad
at breaking up large chunks.  The 'parse_file' method feeds the
document in small chunks.

---
#!/usr/bin/perl

$file = "/local/doc/html-spec/html4.0.1/interact/forms.html";
print "Parsing ", -s $file, " bytes\n";

use HTML::LinkExtor ();
use Time::HiRes qw(time);

$count = 0;
$before = time;
for (1..10) {
    HTML::LinkExtor->new(sub {$count++})->parse_file($file);
}
printf "Found $count links in %.1f seconds\n", time - $before;
__END__

Prints:

  Parsing 138204 bytes
  Found 8770 links in 8.3 seconds

when using HTML-Parser-2.25, and

  Parsing 138204 bytes
  Found 8770 links in 2.0 seconds

when using HTML-Parser-XS-2.99_08.  That is 4 times speedup.

---
#!/usr/bin/perl

$file = "/local/doc/html-spec/html4.0.1/interact/forms.html";
print "Parsing ", -s $file, " bytes\n";

use HTML::TokeParser ();
use Time::HiRes qw(time);

$count = 0;
$before = time;
for (1..10) {
    my $p = HTML::TokeParser->new($file);
    while (my $t = $p->get_token) {
        $count++;
    }
}
printf "Processed $count tokens in %.1f seconds\n", time - $before;
__END__

Prints:

  Parsing 138204 bytes
  Processed 80140 tokens in 11.5 seconds

when using HTML-Parser-2.25, and

  Parsing 138204 bytes
  Processed 80140 tokens in 3.3 seconds

when using HTML-Parser-XS-2.99_08.  That is 3.5 times speedup.

---
#!/usr/bin/perl

$file = "/local/doc/html-spec/html4.0.1/interact/forms.html";
print "Parsing ", -s $file, " bytes\n";

use HTML::Parser ();
use Time::HiRes qw(time);

$count = 0;
$before = time;

if ($HTML::Parser::VERSION < 2.9) {
    {
        package MyParser;
        require HTML::Entities;
        @ISA = qw(HTML::Parser);
        sub text { my $t = HTML::Entities::decode($_[1]); $main::count++ }
    }
    for (1..10) {
        my $p = MyParser->new;
        $p->parse_file($file);
    }
}
else {
    for (1..10) {
        my $p = HTML::Parser->new(text => sub {$count++},
                                  decode_text_entities => 1,
                                 );
        $p->parse_file($file);
    }
}
printf "Processed $count decoded text segments in %.2f seconds\n", time - $before;
__END__

Prints:

  Parsing 138204 bytes
  Processed 32430 decoded text segments in 9.32 seconds

when using HTML-Parser-2.25, and

  Parsing 138204 bytes
  Processed 32430 decoded text segments in 0.76 seconds

when using HTML-Parser-XS-2.99_08.  This shows that the new library
also provides some new ways to do things that are good for speed.
Here we get 12 times speedup.

---
But the new library loads slightly slower (it has a dynamic C library
to link):

  $ time for i in $(range 50); do perl -MHTML::Parser -le 'HTML::Parser->new; print $HTML::Parser::VERSION'; done

This takes 3.6 seconds with version 2.25 and 4.2 seconds with
2.99_08.  That is 17% slower startup.  This shouldn't matter much
when you use it under mod_perl though :-)

['range' is a little tool I have that will print the numbers 1..50 in
this case]

These tests were made on a SuSE Linux box with enough memory and a
350 MHz Pentium II processor.

I expect the new parser to become a bit faster when I get to the
point where I try to optimize it.  Currently I am just trying to get
all the new features implemented.

Regards,
Gisle
Re: Trying not to re-invent the wheel
"Christian Gilmore" <[EMAIL PROTECTED]> writes: > I found that writing my own parser to fit my specific need was far > and away the fastest thing I could do. It really depends upon your > specific application. HTML::Parser is nice if you want to see the > structure of the document your parsing but is just too slow to use > for wresting particular tags from a document... True. This was the main reason I started work on a new XS based HTML::Parser a week ago. It should make much of the performance argument go away. Still, most of the HTML that I have ever needed to parse or manipulate is regular enough to make perl REs good enough. Since HTML::Parser is XS based now I'm also able to offer many more features without suffering performance. I have attached a message I sent to the <[EMAIL PROTECTED]> mailing list today describing what's new. Regards, Gisle I am now up to version 2.99_08 of the new HTML::Parser and I think it comes along nicely. As you might guess from the version number I am aiming for version 3.00 when I think it is ready for general use. I still encourage people to download it and test it out on various platforms (at least check that 'make test' says everything is ok). You can get it from: $CPAN/authors/id/GAAS/HTML-Parser-XS-2.99_08.tar.gz Compatibility with HTML-Parser-2.2x is now perfect as far as I can tell. The interfaces to all new features I still reserve the right to change until 3.00-time. There is still no documentation on the new things, but the following text attempts explain most of them: The main new feature is that instead of making a subclass you can just provide callbacks to be invoked when various elements are recognised. When one or more direct callbacks are provided, then no methods will be called. There is a new 'default' callback that is invoked with the text of everything that there is no other callback registered for. This might for instance be used to implement a simple comment stripper by code like this: HTML::Parser->new(comment => sub {}, # ignore default => sub { print $_[0] }, )->parse_file(shift); (I actually thought I was very clever when I realized how handy this would be, but later found out that XML::Parser already had exactly this feature. :-) Text handlers get an extra argument that is true if entities are already expanded in the text string passed. This was needed to handle
Re: MD5 risks?
Ben Bell <[EMAIL PROTECTED]> writes:

> On Tue, Nov 09, 1999 at 10:24:39AM +0800, Trevor Phillips wrote:
> > Another alternative is to get the MD5 base64 key to the URI.  My
> > query is, what is the chance of two URI's giving the same MD5?  Is
> > there any risk in it, or is MD5 guaranteed to give unique IDs?  (I
> > know the risk would be SLIM, but how slim?)  Is MD5 used regularly
> > for this kind of thing?
>
> Very slim :)  Something like 1/1000 billion, billion, billion,
> billion for a match to a particular key, though about 1/1 billion
> billion for getting a collision in general.  (In the region of
> 1 / (2^128) and 1 / (2^64) respectively.)
>
> If you tack on the length of the string you're MD5ing as well then
> you're pretty much safe.

I don't think this buys you much extra safety.  The length of URIs
doesn't vary significantly compared to MD5.

-- 
Gisle Aas
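For reference, the scheme under discussion looks something like this
(the URI is made up for illustration):

use Digest::MD5 qw(md5_base64);

my $uri = "http://www.example.com/some/page?x=1";

# 22 base64 characters encoding the 128-bit digest, usable as a
# fixed-length cache or database key for the URI.
my $key = md5_base64($uri);
print "$key\n";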