Re: make test failed for Perl module Crypt-SSLeay-0.65_02
* Jillapelli, Ramakrishna wrote:
> I am getting the following error while doing make test for Perl module Crypt-SSLeay-0.65_02
>
>   # make test
>   PERL_DL_NONLAZY=1 /usr/bin/perl -MExtUtils::Command::MM -e "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
>   t/00-basic.t .... ok
>   t/01-connect.t .. ok
>   t/02-live.t ..... Can't locate Try/Tiny.pm in @INC (@INC contains: /home/rj46/Crypt-SSLeay-0.65_02/blib/lib /home/rj46/Crypt-SSLeay-0.65_02/blib/arch /usr/opt/perl5/lib/5.10.1/aix-thread-multi /usr/opt/perl5/lib/5.10.1 /usr/opt/perl5/lib/site_perl/5.10.1/aix-thread-multi /usr/opt/perl5/lib/site_perl/5.10.1 .) at t/02-live.t line 4.
>   BEGIN failed--compilation aborted at t/02-live.t line 4.
>   t/02-live.t ..... Dubious, test returned 2 (wstat 512, 0x200)
>   No subtests run

You need http://search.cpan.org/dist/Try-Tiny/ but do note that the underscore in the name `Crypt-SSLeay-0.65_02` indicates that it is an experimental developer release for testing and not for deployment.

--
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Re: Freeing HTTP::Message from HTML::Parser dependency
* Christopher J. Madsen wrote:
> My repo is https://github.com/madsen/io-html but since it's built with dzil, I also made a gist of the processed module to make it easier to read the docs: https://gist.github.com/1623654
>
> I took a quick look at HTTP::Message, and I think you'd just need to do
>
>   elsif ($self->content_is_html) {
>       require IO::HTML;
>       my $charset = IO::HTML::find_charset_in($$cref);
>       return $charset if $charset;
>   }
>
> You're already doing the BOM and valid-UTF8 checks; all you need is the meta check, which is what find_charset_in does.

It is not clear to me that the combination would actually conform to the HTML5 proposal; for instance, HTTP::Message seems to recognize UTF-32 BOMs, but as I recall the HTML5 proposal does not allow that.

Your UTF-8 validation code seems wrong to me: you consider the sequence F0 80 to be incomplete, but it's actually invalid, same for ED 80; see the chart in http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#design.

Anyway, if people think this is the way to go, maybe HTTP::Message can adopt the Content-Type header charset extraction tests in HTML::Encoding so they don't get lost as my module becomes redundant?
Re: Freeing HTTP::Message from HTML::Parser dependency
* Christopher J. Madsen wrote:
> Dropping support for UTF-32 from HTTP::Message is a separate issue from removing HTML::Parser. I've got no comment on that.

(It's not quite as black and white as that; HTML5 could be exempted in the algorithm, for instance.)

>> Your UTF-8 validation code seems wrong to me: you consider the sequence F0 80 to be incomplete, but it's actually invalid, same for ED 80; see the chart in http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#design.
>
> I guess the RE could be improved, but I'm not sure it's worth the effort and added complication to catch a tiny fraction of false positives.

Why make the check at all if you don't care whether it's right?

>> Anyway, if people think this is the way to go, maybe HTTP::Message can adopt the Content-Type header charset extraction tests in HTML::Encoding so they don't get lost as my module becomes redundant?
>
> I thought it already did that?

Not as far as I can tell; links welcome though.
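To make the invalid-versus-incomplete distinction concrete: an incomplete sequence can still become valid once more bytes arrive, while a sequence like F0 80 admits no completion at all under strict UTF-8 (every continuation yields an overlong form). The following is a brute-force sketch using core Encode; the helper name `utf8_prefix_ok` is mine, not from any module discussed here.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Return 1 if $bytes is (or can be extended into) a valid UTF-8
# sequence, 0 if it is already irreparably invalid. Brute force over
# up to three continuation bytes is enough for a demonstration.
sub utf8_prefix_ok {
    my ($bytes) = @_;
    return 1 if eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    for my $more (1 .. 3) {
        for my $b (0x80 .. 0xBF) {
            my $cand = $bytes . chr($b) . ("\x80" x ($more - 1));
            return 1 if eval { decode('UTF-8', $cand, FB_CROAK | LEAVE_SRC); 1 };
        }
    }
    return 0;
}

printf "%s: %s\n", $_->[0], utf8_prefix_ok($_->[1]) ? "incomplete" : "invalid"
    for ['F0 80', "\xF0\x80"], ['F0 90', "\xF0\x90"], ['C3', "\xC3"];
# F0 80: invalid
# F0 90: incomplete
# C3: incomplete
```

A validating regular expression that treats F0 80 as incomplete therefore reports a different error class than a decoder following the strict definition would.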
Re: libwww-6 versus libwww-5
* H.Merijn Brand wrote:
> When we (try to) communicate using SOAP with libwww-6.03, we get this error:
>
>   error:14094410:SSL routines:SSL3_READ_BYTES:sslv3 alert handshake failure
>   ...

This is the relevant error code. My guess would be there is a problem with the certificates involved, but a quick search does not support that. You could try running `openssl s_client -connect example.com:443` (and possibly manually typing some HTTP request after that) and see if that provides any clues. There are also environment variables you can set to control whether certificates are verified, to rule that out as a problem.
Re: Where is the mailing list archive
* Peng Yu wrote:
> I'm not able to find an archive for recent messages on the libwww-perl mailing list. Does anybody know where the archive is?

http://www.nntp.perl.org/group/perl.libwww/ is one archive.
Re: use constant, Perl question
* Gary Yang wrote:
> I have a hard time understanding "use constant". I do not understand why the "+" is placed in front of the constant. See the code below. In the for loop, it adds "+" (+kAWSAccessKeyId,). In the new() call, it adds the "+" (+RequestSignatureHelper::kAWSAccessKeyId => myAWSId). I read the perldoc of "use constant", but no clue. Can someone explain what it means, or point me to any books or URLs I can read? I got the code below from some samples. If this mailing list is not the proper place to ask this question, please tell me which mailing list is best for it. Thanks.

This mailing list is dedicated to libwww-perl related issues; general questions about Perl programming would be better placed on forums like http://www.perlmonks.org/ or the comp.lang.perl.* newsgroups on Usenet.

As for your question, many of the + signs in the script are redundant, so they are likely there for reasons of consistency. Sometimes you can use the operator to force a particular interpretation; for instance,

  print +('xxx'), ('yyy');

will print 'xxxyyy', but without the + you get 'xxx'. Similarly, in

  $self->{+kRequestMethod}

the `constant.pm` module installs subroutines for the constants you register, so this is read as a function call

  $self->{ kRequestMethod() }

while without the + it would be read as a string literal like

  $self->{ 'kRequestMethod' }

I hope that helps,
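The difference is easy to see with a small self-contained script; the constant name `kRequestMethod` below echoes the sample discussed above, and the hash values are made up for the demonstration:

```perl
use strict;
use warnings;
use constant kRequestMethod => 'POST';

my %args = (kRequestMethod, 'GET',        # key is the constant's value, 'POST'
            'kRequestMethod', 'literal'); # key is the literal string

# Inside {...}, a bareword-like key is auto-quoted:
my $literal = $args{kRequestMethod};   # looks up the string 'kRequestMethod'
# With the unary plus, Perl instead parses a call to the constant sub:
my $bypost  = $args{+kRequestMethod};  # looks up 'POST'

print "$literal / $bypost\n";  # literal / GET
```

Note that outside the subscript, the first `kRequestMethod,` in the list is already parsed as a function call, which is why some of the pluses in such sample code are redundant.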
Re: How to ignore a named character reference in TreeBuilder?
* Webley Silvernail wrote:
> I have some XHTML in utf-8 that includes the named character reference &#160; for non-breaking spaces.

(That is a numeric, not a named, character reference.)

> The output is replacing the &#160; with a non-printable character that is rendered in various agents as boxes or question marks enclosed in diamonds.

(That likely means you've not specified the character encoding properly.)

> I've tried adding an explicit decode/encode step and using HTML::Entities, but I've had no luck. Basically, I just want TreeBuilder to ignore the &#160; references and pass them through.

I believe you are looking for the $element->as_HTML($entities) parameter (see `perldoc HTML::Element` for details; set the parameter to a value that includes all the characters you want to be escaped in the output).
Re: HTTP::Parser problem
* Gerhard Rieger wrote:
> I experience a reproducible problem with the HTTP::Parser module. I run the packaged Perl 5.10.1 on Ubuntu 10.04 Lucid. The problem appears with the packaged version 3.64-1 of libhtml-parser-perl as well as with the manually installed current version 3.68.

You seem to be talking about HTML::Parser, not HTTP::Parser.

> Problem: Under some circumstances the parser breaks short continuous text into two parts on a white space and invokes the text_handler callback for each of these parts. Especially while parsing tables this breaks the structure of the result.

That would seem to work as intended; if you want everything as one string, then you could buffer it, or perhaps use unbroken_text().
Re: Content-Disposition and utf8 filenames
* Bill Moseley wrote:
> I would like to use a user-supplied filename when returning a download (e.g. a PDF). For example, it might be filename=$title.pdf, but $title can include any character. It seems like support for this in browsers is spotty; see http://greenbytes.de/tech/tc2231/. Is anyone aware of a way to set this header to allow utf8 filenames that is supported across browsers?

No, as you can tell from the results there is no one way supported by all major browsers. The IETF HTTPbis Working Group is currently revising the specification for it, and the recommended way to do this will be through the RFC 5987 style notation, where you can also specify a fallback value by using both the filename* and filename parameters. But no silver bullet there.

> Also, my assumption is HTTP::Headers expects encoded values -- that is, the values are octets, not characters, and so one should always encode('US-ASCII') the value.

(That is generally correct, yes.)

> I just tried with Google Apps and it seems they turn any non A-Za-z into an underscore. Not sure if that means they just didn't try hard or if they felt it was not possible to use non-ASCII characters in suggested filenames.

As I understand it, Google is currently investigating switching to RFC 5987 style encoding in some of their applications.
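Building such a double-barrelled header value can be done with core Perl alone; the following is a sketch of the RFC 5987 style (the helper name `content_disposition` and the sample filenames are mine), percent-encoding everything outside the attr-char set after encoding the name as UTF-8 octets:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Build a Content-Disposition value carrying both a plain ASCII
# fallback (filename) and an RFC 5987 encoded UTF-8 name (filename*).
sub content_disposition {
    my ($utf8_name, $ascii_fallback) = @_;
    my $octets = encode('UTF-8', $utf8_name);
    # attr-char per RFC 5987; everything else gets percent-encoded
    $octets =~ s/([^A-Za-z0-9!#\$&+.^_`|~-])/sprintf('%%%02X', ord $1)/ge;
    return sprintf q{attachment; filename="%s"; filename*=UTF-8''%s},
                   $ascii_fallback, $octets;
}

print content_disposition("b\x{fc}cher.pdf", "bucher.pdf"), "\n";
# attachment; filename="bucher.pdf"; filename*=UTF-8''b%C3%BCcher.pdf
```

Old browsers that do not understand filename* fall back to the plain filename parameter, which is the point of sending both.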
Re: Downloading TSV files
* Meir Guttman wrote:
> I have also to warn you that this site is notoriously finicky. It is probably overloaded (and overwhelmed) by users. Lately I experience many types of false downloads and other problems that require repeated download attempts. (I'll supply the file integrity check.)

Well, that might be the only problem. I had no trouble downloading the file using `wget --save-cookies=cookies --load-cookies=cookies --keep-session-cookies ...`, except that you apparently first have to open the HTML version of the table and then immediately thereafter download the TSV version (that might take a few tries, whether I use wget or the web browser).
Re: 'POST' method leads to 411 Length Required
* Aaron Naiman wrote:
>   $request = new HTTP::Request 'POST', 'http://www.google.com';
>   $response = $ua->request($request);

You are using POST but are not posting any data, and apparently the server

  ... An Error Occurred 411 Length Required ...

expects you to post some data when using POST.
Re: LWP content encode
* stefano tacconi wrote:
> I'm writing a simple script to download some web pages from the net. Using LWP it works fine, but how can I get an HTML page with strange characters?

You are probably looking for HTML::Encoding; the script in the synopsis shows how to decode the content. HTTP::Response::Encoding seems to be a rather crude module that is unaware of HTML semantics.

--
Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
Re: libwww-perl-5.810 (HTTP::Message content not bytes)
* Shaun wrote:
> "Don't allow HTTP::Message content to be set to Unicode strings." I'm going to assume this change above is what's breaking my scripts. My script is using NicToolServerAPI.pm which in turn uses LWP::UserAgent. The script dies with the error above when I run it. I'm not sure what has to be done; my guess is that the script is POSTing the URL in plain text and it needs to be in some other format. I've snipped out a few pieces of code in hopes that somebody here might be able to give me a quick fix.

Presumably you have to clear the UTF-8 flag on the string you are passing in, e.g. by using Encode::encode or perhaps Encode::_utf8_off.

--
Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
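A minimal sketch of the Encode::encode route (the sample string is mine): encoding a character string to UTF-8 yields a plain byte string without the internal UTF-8 flag, which is what HTTP::Message expects as content.

```perl
use strict;
use warnings;
use Encode qw(encode is_utf8);

my $text   = "Bj\x{f6}rn";           # character string (5 characters)
my $octets = encode('UTF-8', $text); # byte string, safe to use as content

printf "flagged: %s, bytes: %d\n",
       is_utf8($octets) ? "yes" : "no",
       length $octets;               # flagged: no, bytes: 6
```

Encode::encode is the cleaner fix; Encode::_utf8_off merely flips the flag without re-encoding and only gives the same result if the string's internal representation already happens to be the bytes you want on the wire.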
Re: Doing character encoding/decoding within libwww?
* David Nesting wrote:
> The most complete implementation imaginable would start with at least these:
>
>   text/html        (html-specific rules)
>   text/xml         (xml-specific rules)
>   text/*           (general-purpose text rules)
>   application/*+xml (xml-specific rules)

HTML::Encoding does all of these, except text/* (for which there are no rules beyond checking the charset parameter, though you might also try to check for a Unicode signature at the beginning, which almost always indicates the Unicode encoding form; HTML::Encoding can do both but is not designed to do that for arbitrary types).

> On the other hand, I'm less convinced now that dipping into the HTML or XML content to figure out the proper encoding is necessarily the proper thing to do here. My complaint about LWP::Simple was that the HTTP Content-Type (charset) information is lost by the time it gets to the caller.

Well, that is necessarily so to keep the interface simple. Going from LWP::Simple::get to LWP::UserAgent->new->get(...) is easy enough not to warrant adding functionality to LWP::Simple.

> I could see a case then for dealing with text/* only and returning octets for everything else, since text/* is the only media type that has character encoding details in the HTTP headers.

Actually that is not the case; there are plenty of, say, application/* formats, like the XML types, that carry encoding information in the header without replicating it in the content (likewise, information in the content may not be replicated in the header, and the two may contradict each other).

> Yes, it's still their fault for not coding a robust application, but helping them do that is I think still a valid goal, if we can do it safely.

Well, automagic decoding of content cannot be added to LWP::Simple without some opt-in switch, as that would break a lot of programs, and if you require some opt-in, you might as well require switching the module.
Re: Doing character encoding/decoding within libwww?
* David Nesting wrote:
> For most uses of libwww, developers do little with character encoding. Indeed, for general-case use of LWP::Simple, they can't, because that information isn't even exposed. Has any thought gone into doing this internally within libwww, so that when I fetch content, I get back text instead of octets?

Generally speaking, this is rather difficult, as some content may not be textual at all, and textual formats vary in how applications are to detect the encoding (e.g., XML has different rules than HTML, text/plain has no rules beyond looking at the charset parameter, and so on). If you want a general-purpose solution, a good start would be a module taking an HTTP::Response object and detecting the encoding, possibly decoding it on request.

> I'd be happy to help work on some of this, but the fact that I see no use of character encodings within libwww makes me wonder if this is more of a policy decision not to do it.

There was a bit of a discussion about somehow using HTML::Encoding for some parts of it, which pretty much solves the problem for HTML and XML; cf. the list archives. Help on improving HTML::Encoding would be welcome, I have little time to work on it at the moment.
Re: Doing character encoding/decoding within libwww?
* Bill Moseley wrote: On Fri, Sep 21, 2007 at 12:49:26PM -0700, David Nesting wrote:
>> For most uses of libwww, developers do little with character encoding. Indeed, for general-case use of LWP::Simple, they can't, because that information isn't even exposed. Has any thought gone into doing this internally within libwww, so that when I fetch content, I get back text instead of octets?
>
> If you have the response object: $response->decoded_content;

That removes content encodings like gzip and deflate, but David is asking about character encodings like utf-8 and iso-8859-1. Content encodings are applied after character encodings.
Re: Doing character encoding/decoding within libwww?
* Bill Moseley wrote:
>   sub decoded_content {
>       $content_ref = \Encode::decode($charset, $$content_ref,
>           Encode::FB_CROAK() | Encode::LEAVE_SRC());
>
> The documentation I re-read earlier even says that...

This is still a far cry from being generally useful though; it only works for text/* and only if the encoding is specified in the header, or the format does not use some kind of inline label that is inconsistent with the default. Most of the time this is not the case, however.
Re: Parsing q-values
* Bill Moseley wrote:
> Is there any existing code for parsing q-values? For example, to get a list of Accept-Encoding values ordered by their q values?

HTTP::Negotiate?
Base64 data: URLs vs url encoding
Hi,

  print URI->new('data:;base64,QmpvZXJu')->data;
  print URI->new('data:;base64,%51%6D%70%76%5A%58%4A%75')->data;

I think these should both print Bjoern, but in the second case the module returns garbage (URI 1.35).

regards,
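The expected behavior can be checked with core modules only: a data: URI's path is percent-encoded like any other URI component, so the escapes have to be resolved before base64 decoding. This sketch (independent of the URI module under discussion) shows that both forms carry the same payload:

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

my $raw     = 'QmpvZXJu';
my $escaped = '%51%6D%70%76%5A%58%4A%75';

# Resolve percent-escapes first, then base64-decode.
(my $unescaped = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;

print decode_base64($raw), " ", decode_base64($unescaped), "\n";
# Bjoern Bjoern
```

In other words, the bug report boils down to the module apparently base64-decoding the still-escaped form.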
Re: HTML-Parser: storing into a DB words with special chars
* thomas Armstrong wrote:
> The document contains special characters ('música española' for instance), and after storing it into the DB, I get this word: música española

How do you get it? You need to ensure that the database software supports Unicode, that you properly store the data, and that you properly retrieve and view the data. The string above is UTF-8 but interpreted as some other encoding (Windows-1252 or ISO-8859-1, for example). That's not an HTML::Parser issue.
Re: Mailing list archives 2001-2005?
* John J Lee wrote:
> I'm sure somebody has one on the web. Can anybody point me to it? There are lots of links to old sites that stop in 2001, and GMANE seems to start in 2005, but I can't find anything between 2001 and 2005.

http://www.nntp.perl.org/group/perl.libwww
Re: libwww throws warning when used with HTML::Parser 3.44
* Karl DeBisschop wrote:
> I just plugged in HTML::Parser 3.44 on my FC2 servers in order to handle utf-8 encoded content. (Boy was I glad to see that was available.) But when running a robot, LWP::Protocol emits a warning as it works because the content stream is not decoded into perl's native character set.

See http://www.nntp.perl.org/group/perl.libwww/6017 and the relevant thread for a recent discussion on this.

Your patch has a number of problems: parsing the encoding out of the charset parameter is a bit more difficult than your regular expression (e.g., the encoding name might be a quoted-string, as in charset="utf-8"), the routine would now croak in common cases such as an unsupported character encoding, and it fails to deal with encodings such as ISO-2022-JP that maintain a state (see Encode::PerlIO) or where characters might be longer than one octet, such as UTF-8 (consider that one chunk has Bj\xC3 and the other chunk has \xB6rn; you need to know about the \xC3 when decoding the \xB6).
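For the split-character case, one approach that works for UTF-8 (though not for stateful encodings like ISO-2022-JP, as noted above) is to decode with Encode's FB_QUIET check: decode() then consumes only complete sequences and leaves the trailing partial sequence in the buffer for the next round. A sketch using the exact Bj\xC3 / \xB6rn split:

```perl
use strict;
use warnings;
use Encode qw(decode FB_QUIET);

# Decode a byte stream that arrives in arbitrary chunks.
my @chunks = ("Bj\xC3", "\xB6rn");
my ($buffer, $text) = ('', '');
for my $chunk (@chunks) {
    $buffer .= $chunk;
    # FB_QUIET removes the decoded part from $buffer in place and
    # leaves any incomplete trailing sequence behind.
    $text .= decode('UTF-8', $buffer, FB_QUIET);
}
print $text eq "Bj\x{f6}rn" ? "ok\n" : "broken\n";  # ok
```

Note this only helps with incomplete multi-byte sequences; genuinely invalid bytes would stall the buffer and need separate error handling.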
Re: Downloading a page compressed
* Octavian Rasnita wrote:
> Please tell me how I can use $request->header() in order to request a page in compressed format (with gzip).

HTTP/1.1 uses the TE and Accept-Encoding headers to specify that the client supports gzip compression. LWP should take care of the TE header automatically if the relevant modules are installed; in order to specify the Accept-Encoding header you can use

  $request->header(Accept_Encoding => 'gzip')

Note that LWP does not automatically remove the gzip compression in this case and that there is no guarantee that the resource will indeed be gzip compressed by the server. If it is, this should be indicated in the Content-Encoding header. For some resources it is however common that it is not indicated in the response that the resource is compressed; see http://www.w3.org/mid/[EMAIL PROTECTED] for some details.
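Since LWP leaves the compression in place here, the body has to be unpacked by hand. A round-trip sketch with the core IO::Compress modules (the sample body is mine; in practice the compressed bytes would come from the response content):

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Simulate a gzip-compressed response body, then unpack it the way a
# client must when the Content-Encoding is not removed automatically.
my $body = "Hello, world\n";
gzip(\$body => \my $compressed)     or die "gzip failed: $GzipError";
gunzip(\$compressed => \my $restored) or die "gunzip failed: $GunzipError";

print $restored;  # Hello, world
```

Checking the response's Content-Encoding header first tells you whether this step is needed at all, subject to the caveat above about servers that compress without saying so.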
Re: decoded_content
* Gisle Aas wrote:
> The current $mess->decoded_content implementation is quite naïve in its mapping of charsets. It needs to either start using Björn's HTML::Encoding module (http://search.cpan.org/dist/HTML-Encoding/) or start doing similar sniffing to better guess the charset when the Content-Type header does not provide any. I very much welcome ideas and patches that would help here.

The module is currently just good enough to replace the custom detection code in the W3C Markup Validator check script (which has been the basic motivation of the module ever since) and is pretty much ad hoc in that. I do indeed think that the libwww-perl modules would be a better place for much of the functionality.

> I also plan to expose a $mess->charset method that would just return the guessed charset, i.e. something similar to encoding_from_http_message() provided by HTML::Encoding. A $mess->header_charset might be a good start here, which just gives the charset parameter in the content-type header.

This would be what HTML::Encoding::encoding_from_content_type($mess->header('Content-Type')) does. HTTP::Message would be a better place for that code, as the charset parameter is far more common than just HTML/XML (all text/* types have one, for example). The same probably goes for other things as well, such as the BOM detection code in HTML::Encoding.

> Another problem is that I have no idea how well the charset names found in HTTP/HTML map to the encoding names that the perl Encode module supports. Anybody know what the state here is?

Things might work out in common cases, but it's not quite where I think it should be; I've recently started a thread on perl-unicode about it, http://www.nntp.perl.org/group/perl.unicode/2648. I found that using I18N::Charset is needed in addition to Encode, and that I18N::Charset (still) lacks quite a number of mappings (see the comments in the source of the module).

> When this works, the next step is to figure out the best way to do streamed decoding. This is needed for the HeadParser that LWP invokes.

One problem here are stateful encodings such as UTF-7 or the ISO-2022 family of encodings, as Encode::PerlIO notes (and attempts to work around for many encodings). For example, the code you posted to perl-unicode (re incomplete sequences) would fail for the UTF-7 string Bj+APY-rn if it happens to split the string after Bj+APY, which would be a complete sequence; but the meaning of the following -rn depends on the current state of the decoder, which decode() does not maintain, so it might sometimes decode to Bjö-rn and sometimes to Björn, which is not desirable (it might have security implications, for example). I am not sure whether there is an easy way to use the PerlIO workarounds without using PerlIO. I've tried using PerlIO::scalar in HTML::Encoding, but (http://www.nntp.perl.org/group/perl.unicode/2675) it modifies the scalar on some encoding errors and I did not investigate this further. Maybe Encode should provide a simpler means for decoding possibly incomplete sequences...

Also, HTML::Parser might be the best place to deal at least with the case where the (or an) encoding is already known, so it would decode the bytes passed to it itself. I would then probably replace my poor custom HTML::Encoding::encoding_from_meta_element with HTML::HeadParser looping through possible encodings (probably giving up once that worked out; it would currently decode with UTF-8 and ISO-8859-1 for most cases, which is quite unlikely to return different results...)

--
Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Re: How can I become universal utf/unicode
* J and T wrote:
> Sometimes when fetching a document you have no idea of the encoding, and sometimes you do. What I want to know is: how do I convert the incoming Web page, regardless of encoding, to UTF-8, as well as encode entities to something like &Aacute; (for keyword matching)?

You need to determine the character encoding of the document and then transcode the byte stream from the determined encoding to UTF-8. There are a number of rules for how to determine the character encoding of text/html resources; these are unfortunately underspecified and contradict each other. Worse, most documents do not have any encoding information, which means you would have to guess an encoding, or are encoded using a different encoding than what they declare, in which cases you would need to either reject the document or attempt to recover from such problems.

There is an HTML::Encoding module on CPAN that can help you to determine the encoding, but there are probably some bugs, and the interface will most certainly change once I get around to looking at it again (I haven't done so for years). It should however give a good starting point.

If that module (or similar code) does not yield encoding information, there is Encode::Guess, which helps a bit to determine the encoding. More sophisticated solutions than Encode::Guess are, AFAICT, not available on CPAN. You could try to interface with or reuse code from some web browsers; MSHTML, for example, would perform byte pattern analysis to determine an encoding. A simpler approach would be to fall back to e.g. Windows-1252; what you would do depends on how good you would like the results to be. Over at the W3C Markup Validator we currently attempt to use information as HTML::Encoding would report it and, if that fails, fall back to UTF-8, and if the document is not decodable as UTF-8, the document is rejected. Which means that lots of documents are rejected.

Once the input is UTF-8 encoded, you can use HTML::Parser as usual. I am not sure whether it sets the UTF-8 flag, but either way, it should report the data in the same encoding, so you could set the flag later.
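To see both the usefulness and the limits of Encode::Guess mentioned above, consider a short byte string that is valid UTF-8 (the sample bytes are mine): since every byte string is also "valid" Latin-1, widening the suspect list makes the guess ambiguous.

```perl
use strict;
use warnings;
use Encode::Guess;

my $octets = "Bj\xC3\xB6rn";   # valid UTF-8, but also valid Latin-1

# Adding latin1 as a suspect makes two encodings match, so
# guess_encoding returns an error string instead of an object.
my $guess = Encode::Guess::guess_encoding($octets, 'latin1');
print ref $guess ? "guessed " . $guess->name : "no single match", "\n";

# With only the default suspects (ascii, utf8, UTF-16/32 with BOM)
# the bytes decode unambiguously and an Encode object is returned.
my $utf8 = Encode::Guess::guess_encoding($octets);
print ref $utf8 ? "guessed " . $utf8->name : "no single match", "\n";
```

This is why guessing only "helps a bit": it works when the candidate set is narrow, and degrades to coin-flipping exactly in the common Latin-1-versus-UTF-8 case.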
Re: URI::file not RFC 1738 compliant?
* Gisle Aas wrote:
>> As far as I can tell, RFC 1738, section 3.10, as well as the BNF in section 5, explicitly say that a file: URI must have two forward slashes before the optional hostname, followed by another forward slash, and then the path.
>
> RFC 1738 is becoming a bit stale. I do believe that the intent is for 'file' URIs to also follow the RFC 2396 syntax for hierarchical namespaces, which clearly states that the 'authority' is optional.

http://www.ietf.org/internet-drafts/draft-hoffman-file-uri-00.txt has been published a few weeks ago, and there is an (ongoing?) discussion http://lists.w3.org/Archives/Public/uri/2004Aug/thread.html#58 on the future of the draft.

> Sure. Especially if I'm told about more apps that can't interoperate with authority-less file URIs. I might want to make it an option.

A C# program with

  Console.WriteLine((new Uri("file:///x")).LocalPath);
  Console.WriteLine((new Uri("file:///x:/y")).LocalPath);
  Console.WriteLine((new Uri("file:///x|/y")).LocalPath);
  Console.WriteLine((new Uri("file://x/y")).LocalPath);
  Console.WriteLine((new Uri("file:/x")).LocalPath);

running on Microsoft .NET 1.1 would print

  \\x
  x:\y
  x:\y
  \\x\y

and then

  Unhandled Exception: System.UriFormatException: Invalid URI: The format of the URI could not be determined.

for the file:/x case.
Re: RSS for this list
* Sean M. Burke wrote:
> As you probably know, this mailing list is archived at http://nntp.x.perl.org/group/perl.libwww and news://nntp.perl.org/perl.libwww. But it might interest you to know that there's now an RDF feed for this mailing list: http://nntp.x.perl.org/rss/perl.libwww.rdf. It's still a bit experimental, so let me know if you run into any trouble with it.

Are these U+00ACs intentional?
Re: RSS for this list
* Thurn, Martin wrote:
> What application do I use to read/process an RDF file?

http://search.cpan.org/search?query=rss
http://search.cpan.org/search?query=rdf
Re: Fw: LWP does not do JavaScript! (was Re: Fw: Can't navigate to URL after login)
* Sean M. Burke wrote: At 20:20 2002-08-09 -0500, Tom wrote:
>> [...] Three of them are already available on CPAN, so I'm currently focusing on implementing the HTML::DOM module. Any comments?
>
> Yes: I'm happy someone's actually doing this! For ages, this JavaScript thing has been one of those projects where people keep saying "Boy, it'd be nice if SOMEONE (and not me!) went and did that."

Guilty...

Tom, are you going to implement HTML DOM Level 1 or Level 2? The latter is currently a working draft and has only minor changes, but those changes are necessary in order to support XHTML, and I would like to use an HTML::DOM module for both HTML and XHTML files.

> I'd love to change my HTML::DOMbo module to build HTML::DOM trees instead of XML::DOM trees, since I've heard that the latter class isn't well supported.

Please be sure to provide a PerlSAX2 Reader and a PerlSAX2 Writer in order to enable XML modules to use your tree.
Re: Doesn lynx support SSL
* Siva Namburi wrote:
> Does anybody know if lynx supports SSL handshakes? I mean, can you browse sites which have SSL enabled using lynx?

Take a look at http://lynx.browser.org; there you can find how to enable SSL support in Lynx.

PS: This mailing list is dedicated to the libwww-perl library; Lynx is not part of that library and hence off-topic.
Re: Error 404 but file exists
* regis rampnoux wrote: Something strange for me: %HEAD http://e.neilsen.home.att.net 404 Not found Connection: close Date: Mon, 08 Oct 2001 19:18:09 GMT Server: Netscape-Enterprise/3.6 SP2 Content-Type: text/html Client-Date: Mon, 08 Oct 2001 19:18:09 GMT Client-Peer: 204.127.135.37:80 But if I use a GET it works! Why? The server is broken; it's not an LWP problem. -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
RFC: HTML::Encoding
Hi, I've written up a module that collects encoding information for (X)HTML files. (X)HTML files may carry encoding information in

  1. the higher-level protocol (e.g. the Content-Type header's charset
     parameter in HTTP and MIME)
  2. the XML declaration (for XHTML documents)
  3. the byte order mark at the beginning of the file
  4. meta elements like
     <meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1'>

At user option it tries to extract the explicitly given encoding information from these sources. After that it sorts the list according to the order above; in list context it returns the list, in scalar context it returns the first encoding in the list (i.e. the encoding the user agent must use to parse the document). This looks like

  #!perl -w
  use strict;
  use warnings;
  use LWP::UserAgent;
  use HTML::Encoding '';

  my $r = LWP::UserAgent->new->request(
    HTTP::Request->new(GET => 'http://www.w3.org/'));

  print scalar HTML::Encoding::get_encoding(
    check_bom     => 1,
    check_xmldecl => 1,
    check_meta    => 1,
    headers       => $r->headers,
    string        => $r->content);

This would currently print out 'us-ascii' as http://www.w3.org/ returns Content-Type: text/html;charset=us-ascii; in list context this would return

  [
    { source => 4, encoding => 'us-ascii' },
    { source => 1, encoding => 'us-ascii' },
  ]

since the page also has a meta element

  <meta http-equiv='Content-Type' content='text/html;charset=us-ascii' />

The POD says: [...] The source value is mapped to one of the constants FROM_META, FROM_BOM, FROM_XMLDECL and FROM_HEADER. You can import these constants individually into your namespace or all at once using the :constants tag, e.g.

  use HTML::Encoding ':constants';

[...] This is useful if you want to check for mismatches between the declared encodings. Some issues that came to my mind while writing this module: * HTTP::Headers should provide some information on whether LWP::UserAgent already parsed the header section of the HTML file, so I wouldn't need to do the same thing again. 
currently one cannot distinguish whether there were multiple Content-Type: headers in the original response or whether they come from meta elements * HTML::Encoding currently uses HTML::Parser to extract the meta element if version 3.21 or later is available (maybe I'll switch to HTML::HeadParser ...). The problem is that HTML::Parser is AFAIK currently unable to process documents encoded in some encoding that is not compatible with US-ASCII (UTF-16BE for example). I think it is out of scope for HTML::Encoding to recode the given string to some US-ASCII-compatible encoding (that'd be UTF-8) in order to parse the document; this should be done by HTML::Parser using some encoding parameter. Personally I'd say that HTML::Parser should only output UTF-8 encoded characters as XML::Parser does, but this will certainly clash with current users who expect to get ISO-8859-1 or something like that out of it... Is it likely that HTML::Parser incorporates such a feature using the Unicode::* modules or Text::Iconv or whatever is currently available? * Is there currently really no module that does what HTML::Encoding is supposed to do? In general you have to use such a module every time you try to do anything with an HTML document; hmm, maybe western people got too used to ISO-8859-1... The current version can be found at http://www.websitedev.de/perl/HTML-Encoding-0.01.tar.gz You'll currently need Perl 5.6.0 to use it. The file currently lacks a proper README and test files... Is the module name appropriate? Any other comments or suggestions? I greatly appreciate them :-) Thanks for your time, -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
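The source-collecting idea described above can be sketched language-neutrally. The following toy version (Python used for illustration; `sniff_encoding` is a hypothetical name, not the module's real API) gathers (source, encoding) candidates from the meta element, the BOM, and the Content-Type header, ordered as in the www.w3.org example above:

```python
import re

def sniff_encoding(headers, body):
    """Toy sketch of HTML::Encoding's idea: collect (source, encoding)
    candidates from the meta element, the byte order mark, and the
    Content-Type header. Hypothetical helper, not the module's API."""
    found = []
    # 4. <meta http-equiv='Content-Type' ...charset=...>
    m = re.search(rb"<meta[^>]+charset=['\"]?([A-Za-z0-9_-]+)", body, re.I)
    if m:
        found.append(("meta", m.group(1).decode("ascii")))
    # 3. byte order mark at the beginning of the file
    for bom, enc in ((b"\xef\xbb\xbf", "utf-8"),
                     (b"\xff\xfe", "utf-16le"), (b"\xfe\xff", "utf-16be")):
        if body.startswith(bom):
            found.append(("bom", enc))
    # 1. the higher-level protocol (Content-Type header)
    m = re.search(r"charset=([A-Za-z0-9_-]+)", headers.get("Content-Type", ""))
    if m:
        found.append(("header", m.group(1)))
    return found
```

For the www.w3.org example it would yield both a meta and a header candidate, both 'us-ascii'; a mismatch between the two is exactly what the :constants check described above is meant to detect.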
Re: Bug in LWP::UserAgent?
* Gisle Aas wrote: I believe that this behavior is due to the UserAgent because using telnet I do not get multiple 'Content-Type' definitions in the response from the server. The link at the top of this message has more information on the matter. I am working out a workaround in the proxy server, but I wonder if this is not something that should be addressed in the libwww-perl codebase. What do you think it should do? HTTP::Headers should have some method to determine whether the body was parsed or not. That would be useful beyond this case, too. -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
HTML::Parser::Tidy namespace
Hi Gisle, Hi list, maybe you have heard of HTML Tidy, a free utility by Dave Raggett to repair, clean up and pretty-print HTML and XHTML documents. Please refer to http://www.w3.org/People/Raggett/tidy/ for more information on HTML Tidy. HTML Tidy is currently maintained by a group of developers including myself at SourceForge. One of our goals is to create a free-standing C library out of Tidy to ease its reuse in other applications; see http://sourceforge.net/projects/tidy/ for more information on this project. I'm going to write a Perl XS interface to this library [1], and the best name I could think of was HTML::Parser::Tidy. HTML::Tidy is already taken, and something like HTML::PerlTidy implies that it cleans up Perl code. I'll try to provide an interface compatible [2] with HTML::Parser 3.x so that applications built upon HTML::Parser will be able to use Tidy as an alternative. My current module provides a simple (XML::Parser::Perl)SAX interface so that I can use the module to build up a DOM tree for e.g. XML::DOM or XML::XPath. I'm currently considering whether it's worth exposing the whole set of DOM-like functions of HTML Tidy... However, I'd like to ask if the module name is OK for you, Gisle, and for others, or if one of you has a better suggestion. The module will be maintained at http://sourceforge.net/projects/ptidy (where you'll currently find nothing but empty pages :-) If some people are interested in this project, feel free to subscribe to the [EMAIL PROTECTED], but be aware that I'll bug you with questions about the interface/documentation/etc. pp. ;-) [1] I already did something like that in April this year, but I came to the conclusion that HTML Tidy should be a real library so that all its features can easily be used in other applications, so we started that project. [2] To some extent; some features aren't possible. 
Thanks for your comments, -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
Re: HTML::Parser::Tidy namespace
* Gisle Aas wrote: What is HTML::Tidy? I could not find anything that uses that name. There is currently no such module on CPAN. Weird. I was _very_ sure I'd seen such a module at version 0.0x one day in some CPAN directory. Hm, maybe this was just some private project. However, that's great, I'll take HTML::Tidy then :-) Thanks. -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
Re: HTML::Parser::Tidy namespace
* [EMAIL PROTECTED] wrote: Bjoern, Very interesting project -- I've been lurking around watching the libTidy development over the last few weeks, and have wished for something like this. HTML::Clean just doesn't provide the capabilities I need. HTML::Clean even does evil things like replacing <strong> with <b> because it's shorter :-( Have you considered using Inline::C rather than native XS? That seems to be the way to go these days. Is it? Can you briefly explain what the advantages are? -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
Re: HTML::Parser::Tidy namespace
* Sean M. Burke wrote: So the C-Tidy-library builds a document tree for some HTML file, and then SAX walks the tree so that you can, via Perl, build a new (in-Perl) tree for it, using the tree library of your choice (XML::DOM, XML::Element, or even some crazy thing called HTML::Element)? Yes. Something similar goes for HTML::Parser-style events. This is one reason why HTML::Tidy could not support all HTML::Parser events; e.g. information like 'offset', 'line', 'column' or 'tokens' gets lost in the parsing/cleanup process. OK, it would be possible by making a lot of changes in Tidy, but I don't think it's worth the effort; Tidy's power _is_ the clean-tree generation. I dimly (mis?)remember looking at Tidy's internals months ago, and I think I remember that it stored everything as double-byte Unicode strings -- so I presume that those get UTF8ified (and tagged as such) when passed to Perl, right? Tidy stores all character data as UTF-8 encoded char*s. They will be passed as UTF-8 to Perl (tagged as such via SvUTF8_on()) or, for the pretty-printer, in your desired encoding (if supported). Does it deal nicely with non-UTF8 non-Latin-1 input encodings? Currently Tidy supports

  * us-ascii            [*]
  * iso-8859-1          [*]
  * windows-1252
  * mac-roman
  * the iso-2022 family [*]
  * utf-8               [*]

where [*] denotes supported output encodings. I'm not sure about UTF-16; if it doesn't support it, I'll add support for it. We have feature requests for

  * ShiftJIS
  * BIG5

You have, however, to declare what encoding you are using. One might use Unicode::Map8 or Text::Iconv to convert strings to UTF-8 before passing them to Tidy. -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
Re: IRI Implementation
* John Stracke wrote: Gisle Aas wrote: But I don't really understand how IRIs solve any problem if you still can't use non-ASCII characters in host names. Don't you want to be http://björn.höhrmann.de? :-) ... and björn@höhrmann.de, yes. I must have these addresses, that's what i18n is all about, isn't it? ;-) It's coming; see http://www.ietf.org/html.charters/idn-charter.html. (There are people trying to jump the gun, but it turns out to be a hard problem. For example, should björn.höhrmann.de be equivalent to bjorn.hohrmann.de? Or consider Hebrew, where vowels are not letters and often omitted; should two domain names that differ only in the vowels be equivalent?) In the meantime, IRIs let you have non-ASCII in the file names, at least. Better: in the path, query and fragment components. IRIs solve a _very_ common problem. Consider a search engine with a method='get' form. If you use this search engine to search for e.g. 'Björn Höhrmann', how is the URI to be encoded? And if some CGI script receives the data, how is it to be decoded if one doesn't know which character encoding was used to encode the characters? There is currently no definition of how to handle this case. Most user agents use the chosen document encoding (i.e. in most cases the encoding of the current (X)HTML document) to encode the characters, but since HTML doesn't define any default encoding, you still have no clue what happens to your queries. -- :: no signature today :: in memoriam Douglas Adams ::
RFC 2732 implementation in URI.pm
Hi, RFC 2732 extends RFC 2396 to accept literal IPv6 addresses, does URI.pm implement this RFC? -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
IRI Implementation
Hi Gisle, Hi List, at [1] you can currently find the latest Internet Draft for IRIs (Internationalized Resource Identifiers). IRIs are similar to URIs, but the IRI syntax allows far more ISO/IEC 10646 (Unicode) characters in the grammar, while URIs are limited to US-ASCII characters. Each URI is supposed to be a valid IRI. Is there any ongoing development to support IRIs in Perl? I don't think so, so I hereby suggest adding support. I'm curious to know how an implementation could look. One could say that URIs are a subset of IRIs, so it would be reasonable to consider moving the current URI processing code to a generic IRI parser and making the URI stuff a derived class of some IRI module. What about that? [1] http://www.ietf.org/internet-drafts/draft-masinter-url-i18n-07.txt -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
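One half of what an IRI module has to do is the IRI-to-URI mapping: encode the non-ASCII characters as UTF-8 and percent-escape the resulting bytes. A rough sketch (Python used for illustration; `iri_to_uri` is a made-up name, and a real implementation would work per URI component as the draft describes):

```python
from urllib.parse import quote

def iri_to_uri(iri):
    # Percent-escape the UTF-8 bytes of non-ASCII characters while
    # leaving RFC 2396 reserved and unreserved characters alone.
    # Simplified: applied to the whole string, not per component.
    return quote(iri, safe=":/?#[]@!$&'()*+,;=%-._~")

print(iri_to_uri("http://example.de/Björn"))
# http://example.de/Bj%C3%B6rn
```

The reverse mapping (URI to IRI) is the lossy direction, since not every percent-escaped byte sequence is valid UTF-8.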
Re: RFC 2732 implementation in URI.pm
* Gisle Aas wrote: Bjoern Hoehrmann [EMAIL PROTECTED] writes: RFC 2732 extends RFC 2396 to accept literal IPv6 addresses, does URI.pm implement this RFC? No.

  $ perl -MURI -le 'print URI->new("http://[::192.9.5.5]/ipng")'
  http://%5B::192.9.5.5%5D/ipng

Is anybody using URIs of this form yet? Operating systems have just started to handle IPv6 (latest FreeBSD, experimental update for Windows 2000, etc.), so using them isn't easy yet, but several applications like Apache 2.0 or Apache Tomcat can already handle such URIs. Is there agreement that following this proposal is a good thing? Other specifications make use of (or at least reference) this RFC, like XML 1.0, XML Base, XML Schema, XML DSig and the upcoming IRI Standards Track, so I think yes, they agree that this is a good thing [tm]. -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
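For reference, a parser with RFC 2732 support treats the bracketed part as the host rather than escaping the brackets, as the URI.pm example above does; sketched in Python for illustration:

```python
from urllib.parse import urlsplit

# RFC 2732: '[' and ']' delimit an IPv6 literal in the authority
# component, so they must not be percent-escaped there.
u = urlsplit("http://[::192.9.5.5]/ipng")
print(u.hostname)  # the IPv6 literal, brackets stripped
print(u.path)
```

Here `hostname` comes back as `::192.9.5.5` and `path` as `/ipng`, which is the behaviour the patch to URI.pm's reserved-character set is after.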
Re: RFC 2732 implementation in URI.pm
* Gisle Aas wrote: As far as I can tell all that is needed for RFC 2732 support is to add [ and ] to reserved characters, i.e. this patch: Do you think there should be more? I'll shout when it isn't sufficient :-) BTW, how does XML 1.0 reference RFC 2732? You'll see it twice in section 4.2.2. and as part of appendix 2, see http://www.w3.org/TR/REC-xml -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
Re: Requests with Transfer-encoding: chunked
* Gisle Aas wrote: Does anybody know about live servers that allow POSTing with 'Transfer-encoding: chunked' in the request? I want to test the support I added to LWP. Anybody know about servers that implement 'Transfer-Encoding: deflate' (or gzip) and will apply it to the response if I send the appropriate TE: header in the request? Jigsaw, http://jigsaw.w3.org/ / http://www.w3.org/Jigsaw/ Apache2, http://httpd.apache.org (maybe with mod_gzip) Jigsaw test case for the second one: http://jigsaw.w3.org/HTTP/TE/foo.txt (see http://jigsaw.w3.org/HTTP/ for an overview) -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
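The chunked request body being tested is simple to produce by hand. A minimal sketch of RFC 2616's chunked transfer coding (Python for illustration; `chunk_body` and the chunk size are made up for the example):

```python
def chunk_body(data, size=8):
    """Encode a byte string with HTTP/1.1 chunked transfer coding:
    each chunk is hex-length CRLF data CRLF, terminated by a
    zero-length chunk."""
    out = b""
    for i in range(0, len(data), size):
        part = data[i:i + size]
        out += b"%x\r\n" % len(part) + part + b"\r\n"
    return out + b"0\r\n\r\n"

print(chunk_body(b"hello world"))
```

Sending such a body lets the client start transmitting before the total length is known, which is the whole point of testing it against live servers.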
Re: Bug Report
* Laurent Simonneau wrote: Why is the character '|' converted to '%7C' in URLs? RFC 2396 (http://www.ietf.org/rfc/rfc2396.txt) defines it as an unwise character; such characters must be escaped in URIs. Example: libwww-perl sends: GET http://www.lycos.fr/cgi-bin/nph-bounce?LIA14%7C/service/sms/ HTTP/1.0 And the server replies with a 404 Not Found error. That's the server's bug; maybe the Lycos people should get better software. libwww-perl behaves correctly. -- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981028 ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote [!]e -- listen, learn, contribute -- David J. Marcus
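The escaping in question is easy to reproduce (Python's `quote` shown for illustration; in Perl, URI::Escape's `uri_escape` does the same job):

```python
from urllib.parse import quote

# '|' is in RFC 2396's "unwise" set, so it gets percent-escaped;
# '/' is left alone because it is reserved, not unwise.
print(quote("LIA14|/service/sms/"))
# LIA14%7C/service/sms/
```

A server that cannot match `%7C` against a literal `|` in its own URLs is failing to decode percent-escapes, which is the bug described above.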
Re: HTML::TreeBuilder method/madness. Was: HTML::Tagset: p_closure_barriers
* Sean M. Burke wrote: As far as I understand SGML, A. SGML is a family of markup languages where, when (start|end)-tag omission is enabled, nothing can be parsed without a DTD. SGML is a language to define markup languages. If start and/or end tags are omitted, you must know the content model of the element. The content model is defined in the DTD. If your application stores the content model in a different format, you don't need a DTD. HTML::Tagset does it this way. It does this in a manner I don't like, i.e. not conforming to the HTML 4.01 DTDs; e.g. %HTML::Tagset::optionalEndTag only stores those elements where the end tag is 'safely' omittable. B. For basically every SGML document, a complete (!) DTD exists, and must exist (whether externally or internally). Not really; for example, XML documents actually _are_ SGML documents. I strongly agree with your design rationale, but I still don't understand why the manual implies that the given construct is legal. -- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981028 ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote [!]e -- listen, learn, contribute -- David J. Marcus
HTML::Parser: report implicit events
Hi, I wonder why HTML::Parser does not report implicit events. A conforming parser should report them in order to ensure that a correct parse tree can be built. An example:

  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
  <title></title>
  <p>some text<img alt='' src=''> more text
  <h1>heading</h1>

should report (omitting text and possibly default events)

  declaration
  start (html)
  start (head)
  start (title)
  end (title)
  end (head)
  start (body)
  start (p)
  start (img)
  end (img)
  end (p)
  start (h1)
  end (h1)
  end (body)
  end (html)

I request an option to get those events. -- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981028 ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote [!]e -- listen, learn, contribute -- David J. Marcus
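What "reporting implicit events" means can be sketched with a toy event driver (Python for illustration; the BLOCK set and the single rule below are a drastic simplification of the real HTML 4.01 content models):

```python
# Toy sketch: inject the implied events a parser can infer from the
# content model. Only one rule is modelled here: a start tag of a
# block-level element implicitly ends an open <p>.
BLOCK = {"p", "h1", "table", "div"}

def events(tags):
    open_p, out = False, []
    for kind, name in tags:
        if kind == "start" and name in BLOCK and open_p:
            out.append(("implied end", "p"))
            open_p = False
        out.append((kind, name))
        if name == "p":
            open_p = (kind == "start")
    return out

print(events([("start", "p"), ("start", "img"),
              ("end", "img"), ("start", "h1"), ("end", "h1")]))
```

A real implementation would derive these rules from the DTD's content models, as discussed in the neighbouring HTML::Tagset posts, rather than hard-coding one rule.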
HTML::Tagset: p_closure_barriers
Hi, the HTML::Tagset manual defines the array @HTML::Tagset::p_closure_barriers. I don't understand the rationale behind this. The given example:

  <html>
   <head>
    <title>foo</title>
   </head>
   <body>
    <p>foo
    <table>
     <tr>
      <td>
       foo
       <p>bar
      </td>
     </tr>
    </table>
    </p>
   </body>
  </html>

_isn't_ legal. In SGML, elements that have optional end-tags are implicitly closed if an element occurs that cannot be contained inside the element (i.e. that is not part of the content model). Try to validate the example at [1] and you'll get

  Line 17, character 8:
    </p>
       ^ Error: end tag for element P which is not open; try removing
         the end tag or check for improper nesting of elements

The parse tree of the document is something like

  <html>
   <head>
    <title>foo</title>
   </head>
   <body>
    <p>foo</p>
    <table>
     <tbody>
      <tr>
       <td>
        foo
        <p>bar</p>
       </td>
      </tr>
     </tbody>
    </table>
    </p>    <- error, the element was already closed
   </body>
  </html>

[1] http://www.htmlhelp.com/tools/validator/direct.html -- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981028 ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote [!]e -- listen, learn, contribute -- David J. Marcus
Re: Calling IO::Socket::INET getting an error.
* Hansen, Keith wrote: I'm getting the error: Undefined subroutine IO::Socket::INET called at C:\Perl\gnx\lgetr.pl line 7. my $fh = IO::Socket::INET($server); IO::Socket::INET is a package, not a subroutine; you have to call the constructor method:

  my $handle = IO::Socket::INET->new( ... );
  # or
  my $handle = new IO::Socket::INET( ... );

-- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981028 ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote [!]e -- listen, learn, contribute -- David J. Marcus
Re: LWP::UserAgent -- Bug in $response-message
* Boris 'pi' Piwinger wrote: After some discussion in de.comp.lang.perl.misc ([EMAIL PROTECTED] ff) I assume there is a bug in LWP::UserAgent. I have a while loop reading URLs from a file. Those are fetched; if unsuccessful, the response code and message are printed. The message becomes, e.g.: Can't connect to www.polybytes.com:80 (No route to host), <FILE> chunk 1. Obviously everything from the comma on is not correct at this place. Someone suggested that in LWP::Protocol::http::_new_socket the message does not end in \n. Yes, that was me. die() is used there and `perldoc -f die` reads

  | If the value of EXPR does not end in a newline, the current
  | script line number and input line number (if any) are also
  | printed, and a newline is supplied. Note that the "input line
  | number" (also known as "chunk") is subject to whatever notion of
  | "line" happens to be currently in effect, and is also available
  | as the special variable `$.'. See the section on "$/" in the
  | perlvar manpage and the section on "$." in the perlvar manpage.

In LWP::UserAgent this is caught:

  | if ($use_eval) {
  |     # we eval, and turn dies into responses below
  |     eval {
  |         $response = $protocol->request($request, $proxy,
  |                                        $arg, $size, $timeout);
  |     };
  |     if ($@) {
  |         $@ =~ s/\s+at\s+\S+\s+line\s+\d+\.?\s*//;
  |         $response =
  |           HTTP::Response->new(HTTP::Status::RC_INTERNAL_SERVER_ERROR,
  |                               $@);
  |     }

The regular expression should be extended to also remove /,\s+<HANDLE>\s+(?:line|chunk)\s+\d+\.?\s*/ -- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981028 ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote [!]e -- If something is worth writing it is worth keeping --
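The suggested extra pattern can be tried out directly (Python used for illustration; `<\w+>` stands in for the filehandle name that die() inserts):

```python
import re

# Strip the ", <HANDLE> line/chunk N." suffix that Perl's die()
# appends when the message lacks a trailing newline.
msg = ("Can't connect to www.polybytes.com:80 (No route to host)"
       ", <FILE> chunk 1.")
clean = re.sub(r",\s+<\w+>\s+(?:line|chunk)\s+\d+\.?\s*$", "", msg)
print(clean)
```

After the substitution only the actual error text remains, which is what the caller wants to show.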
HTML::Parser: document events
Hi, I suggest introducing two new events for HTML::Parser: * init: raised when parse() is called the first time * eof: raised when eof() was called or parse_file() finishes. This improves compatibility with other APIs like SAX or SAC. -- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981028 ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote [!]e -- If something is worth writing it is worth keeping --
Re: [Bug #131388] joining Location header results into wrong URLs
* [EMAIL PROTECTED] wrote: Some servers return two Location: headers (e.g. http://service.bfast.com/bfast/click?bfmid=20911217siteid=37451739bfpage=hplink after the 2nd redirect - it's where the code bailed out). push_header() will join URLs with ', ', and this is kinda wrong =) Quoting from RFC 2616 section 4.2:

  | Multiple message-header fields with the same field-name MAY be
  | present in a message if and only if the entire field-value for that
  | header field is defined as a comma-separated list [i.e., #(values)].
  | It MUST be possible to combine the multiple header fields into one
  | "field-name: field-value" pair, without changing the semantics of the
  | message, by appending each subsequent field-value to the first, each
  | separated by a comma. The order in which header fields with the same
  | field-name are received is therefore significant to the
  | interpretation of the combined field value, and thus a proxy MUST NOT
  | change the order of these field values when a message is forwarded.

It's an invalid response; LWP's treatment is 100% conforming, given that LWP does not validate the HTTP messages. Better go and repair those servers. -- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981028 ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote [!]e ~~ will code for food. ~~
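The combining rule quoted above is mechanical, which is exactly why it mangles Location: that field is not defined as a comma-separated list, so joining two values yields a string that is no longer a valid URL. A sketch (Python for illustration; `join_headers` is a hypothetical helper, and the example URLs are made up):

```python
def join_headers(pairs, name):
    """RFC 2616 section 4.2 combining: concatenate same-named field
    values in received order, separated by a comma."""
    return ", ".join(v for n, v in pairs if n.lower() == name.lower())

pairs = [("Location", "http://a.example/"),
         ("Location", "http://b.example/")]
print(join_headers(pairs, "Location"))
```

The result, `http://a.example/, http://b.example/`, is what push_header() produces and what the client then fails to follow.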
Re: possible bug in HTML::Parser comment handler
At 15:28 11.01.01 -0500, you wrote: It seems that the parser is not properly detecting multi-line HTML comments. I was trying to print out the dtext of an html document and noticed that comments kept showing up in the output. Upon further examination, the single-line comments were being ignored but ones like this:

  <!--
  td {font-family: Arial,Geneva,Helvetica,sans-serif; color: #00;}
  -->

Well, the content model of the style element is CDATA; your "comments" may look like comments, but they are not comments in HTML and SGML terms. That's not a bug. -- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981028 ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://www.websitedev.de/
Re: HTTP redirects
* "Jarrett Carver" [EMAIL PROTECTED] wrote: | Is there a way to tell if your request has been redirected? If there is a '$response->previous', the request has been redirected. regards, -- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981ASK ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote +{i} --- All I want for Christmas is well-formedness -- Evan Lenz ---
Re: LWP in C++ ?
* "Axel R." [EMAIL PROTECTED] wrote: | I would like to know if a lib which has the same features as LWP exists in C++ | or other languages? | I'm looking for something like the TreeBuilder and the HTML::Element module... | Thanks for all http://www.w3.org/People/Raggett/tidy/ http://www.w3.org/Library/User/Guide/#HTML regards, -- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981ASK ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote +{i} --- Alles eine Frage der wissenschaftlichen Marsstäbchen ---
Call to HTTP::Request->url() should be ->uri()
uri() is the real function; url() just exists for compatibility (I think). UserAgent.pm,v 1.73 2000/04/07 diff:

  275c275
  <     $referral->url($referral_uri);
  ---
  >     $referral->uri($referral_uri);

regards, -- Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://www.bjoernsworld.de am Badedeich 7 · Telefon: +49(0)4667/981ASK · http://bjoern.hoehrmann.de 25899 Dagebüll · PGP KeyID: 0xA4357E78 · http://learn.to/quote ---
HTTP::Response->base() should return an absolute URI
Hi, HTTP::Response->base() is determined this way: Content-Base or Content-Location or Base or Request URL.

* Content-Base is defined in RFC 2068 as an absolute URI
* Request URL is defined in all HTTP RFCs as an absolute URI
* Base is not defined in RFC 1945, as the comment 'backwards compatibility HTTP/1.0' implies, so I assume it refers to the BASE element in RFC 1866, which defines it as an absolute URI
* Content-Location is defined in RFC 2068 and RFC 2616 as an absolute or relative URI

For relative URIs RFC 2616 says: "[...] the relative URI is interpreted relative to the Request-URI." Sect. 5.1 of RFC 2396 tells us the same for base URIs in general, and people expect an absolute URI as base, so the Content-Location should be transformed into an absolute URI before it's returned. regards, -- Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://www.bjoernsworld.de am Badedeich 7 · Telefon: +49(0)4667/981ASK · http://bjoern.hoehrmann.de 25899 Dagebüll · PGP KeyID: 0xA4357E78 · http://learn.to/quote ---
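The proposed fix amounts to ordinary RFC 2396 section 5.1 reference resolution; for illustration (Python's urljoin shown; in Perl, URI->new_abs does the same), with made-up example URIs:

```python
from urllib.parse import urljoin

# Resolve a relative Content-Location against the Request-URI,
# per RFC 2396 section 5.1 (hypothetical example values).
request_uri = "http://example.org/dir/page"
content_location = "../other/doc"
print(urljoin(request_uri, content_location))
# http://example.org/other/doc
```

Absolute Content-Location values pass through unchanged, so base() could apply this resolution unconditionally.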