Re: make test failed for Perl module Crypt-SSLeay-0.65_02
* Jillapelli, Ramakrishna wrote:
> I am getting the following error while doing make test for Perl module Crypt-SSLeay-0.65_02
>
>   # make test
>   PERL_DL_NONLAZY=1 /usr/bin/perl -MExtUtils::Command::MM -e "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
>   t/00-basic.t .... ok
>   t/01-connect.t .. ok
>   t/02-live.t ..... Can't locate Try/Tiny.pm in @INC (@INC contains: /home/rj46/Crypt-SSLeay-0.65_02/blib/lib /home/rj46/Crypt-SSLeay-0.65_02/blib/arch /usr/opt/perl5/lib/5.10.1/aix-thread-multi /usr/opt/perl5/lib/5.10.1 /usr/opt/perl5/lib/site_perl/5.10.1/aix-thread-multi /usr/opt/perl5/lib/site_perl/5.10.1 .) at t/02-live.t line 4.
>   BEGIN failed--compilation aborted at t/02-live.t line 4.
>   t/02-live.t ..... Dubious, test returned 2 (wstat 512, 0x200)
>   No subtests run

You need http://search.cpan.org/dist/Try-Tiny/ but do note that the underscore in the name `Crypt-SSLeay-0.65_02` indicates that it is an experimental developer release for testing and not for deployment.

--
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Re: Freeing HTTP::Message from HTML::Parser dependency
* Christopher J. Madsen wrote:
> My repo is https://github.com/madsen/io-html but since it's built with dzil, I also made a gist of the processed module to make it easier to read the docs: https://gist.github.com/1623654
>
> I took a quick look at HTTP::Message, and I think you'd just need to do
>
>   elsif ($self->content_is_html) {
>       require IO::HTML;
>       my $charset = IO::HTML::find_charset_in($$cref);
>       return $charset if $charset;
>   }
>
> You're already doing the BOM and valid-UTF8 checks; all you need is the meta check, which is what find_charset_in does.

It is not clear to me that the combination would actually conform to the HTML5 proposal; for instance, HTTP::Message seems to recognize UTF-32 BOMs, but as I recall the HTML5 proposal does not allow that.

Your UTF-8 validation code seems wrong to me: you consider the sequence F0 80 to be incomplete, but it's actually invalid, same for ED 80; see the chart in http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#design.

Anyway, if people think this is the way to go, maybe HTTP::Message can adopt the Content-Type header charset extraction tests in HTML::Encoding so they don't get lost as my module becomes redundant?
Re: Freeing HTTP::Message from HTML::Parser dependency
* Christopher J. Madsen wrote:
> Dropping support for UTF-32 from HTTP::Message is a separate issue from removing HTML::Parser. I've got no comment on that.

(It's not quite as black and white as that; HTML5 could be exempted in the algorithm, for instance.)

>> Your UTF-8 validation code seems wrong to me: you consider the sequence F0 80 to be incomplete, but it's actually invalid, same for ED 80; see the chart in http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#design.
>
> I guess the RE could be improved, but I'm not sure it's worth the effort and added complication to catch a tiny fraction of false positives.

Why make the check at all if you don't care whether it's right?

>> Anyway, if people think this is the way to go, maybe HTTP::Message can adopt the Content-Type header charset extraction tests in HTML::Encoding so they don't get lost as my module becomes redundant?
>
> I thought it already did that?

Not as far as I can tell; links welcome though.
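To make the invalid-versus-incomplete distinction concrete: an incomplete sequence can still become valid once more bytes arrive, while a sequence like F0 80 admits no completion at all under strict UTF-8 (every continuation yields an overlong form). The following is a brute-force sketch using core Encode; the helper name `utf8_prefix_ok` is mine, not from any module discussed here.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Return 1 if $bytes is (or can be extended into) a valid UTF-8
# sequence, 0 if it is already irreparably invalid. Brute force over
# up to three continuation bytes is enough for a demonstration.
sub utf8_prefix_ok {
    my ($bytes) = @_;
    return 1 if eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    for my $more (1 .. 3) {
        for my $b (0x80 .. 0xBF) {
            my $cand = $bytes . chr($b) . ("\x80" x ($more - 1));
            return 1 if eval { decode('UTF-8', $cand, FB_CROAK | LEAVE_SRC); 1 };
        }
    }
    return 0;
}

printf "%s: %s\n", $_->[0], utf8_prefix_ok($_->[1]) ? "incomplete" : "invalid"
    for ['F0 80', "\xF0\x80"], ['F0 90', "\xF0\x90"], ['C3', "\xC3"];
# F0 80: invalid
# F0 90: incomplete
# C3: incomplete
```

A validating regular expression that treats F0 80 as incomplete therefore reports a different error class than a decoder following the strict definition would.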
Re: libwww-6 versus libwww-5
* H.Merijn Brand wrote:
> When we (try to) communicate using SOAP with libwww-6.03, we get this error:
>
>   error:14094410:SSL routines:SSL3_READ_BYTES:sslv3 alert handshake failure
>   ...

This is the relevant error code. My guess would be there is a problem with the certificates involved, but a quick search does not support that. You could try running `openssl s_client -connect example.com:443` (and possibly manually typing some HTTP request after that) and see if that provides any clues. There are also environment variables you can set to control whether certificates are verified, to rule that out as a problem.
Re: Where is the mailing list archive
* Peng Yu wrote:
> I'm not able to find an archive for recent messages on the libwww-perl mailing list. Does anybody know where the archive is?

http://www.nntp.perl.org/group/perl.libwww/ is one archive.
Re: use constant, Perl question
* Gary Yang wrote:
> I have a hard time understanding "use constant". I do not understand why the "+" is placed in front of the constant. See the code below. In the for loop, it adds "+" (+kAWSAccessKeyId,). In the new() call, it adds the "+" (+RequestSignatureHelper::kAWSAccessKeyId => myAWSId). I read the perldoc of "use constant", but no clue. Can someone explain what it means, or point me to any books or URLs I can read? I got the code below from some samples. If this mailing list is not the proper place to ask this question, please tell me which mailing list is best for it. Thanks.

This mailing list is dedicated to libwww-perl related issues; general questions about Perl programming would be better placed on forums like http://www.perlmonks.org/ or the comp.lang.perl.* newsgroups on Usenet.

As for your question, many of the + signs in the script are redundant, so they are likely there for reasons of consistency. Sometimes you can use the operator to force a particular interpretation; for instance,

  print +('xxx'), ('yyy');

will print 'xxxyyy', but without the + you get 'xxx'. Similarly, in

  $self->{+kRequestMethod}

the `constant.pm` module installs subroutines for the constants you register, so this is read as a function call

  $self->{ kRequestMethod() }

while without the + it would be read as a string literal like

  $self->{ 'kRequestMethod' }

I hope that helps,
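The difference is easy to see with a small self-contained script; the constant name `kRequestMethod` below echoes the sample discussed above, and the hash values are made up for the demonstration:

```perl
use strict;
use warnings;
use constant kRequestMethod => 'POST';

my %args = (kRequestMethod, 'GET',        # key is the constant's value, 'POST'
            'kRequestMethod', 'literal'); # key is the literal string

# Inside {...}, a bareword-like key is auto-quoted:
my $literal = $args{kRequestMethod};   # looks up the string 'kRequestMethod'
# With the unary plus, Perl instead parses a call to the constant sub:
my $bypost  = $args{+kRequestMethod};  # looks up 'POST'

print "$literal / $bypost\n";  # literal / GET
```

Note that outside the subscript, the first `kRequestMethod,` in the list is already parsed as a function call, which is why some of the pluses in such sample code are redundant.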
Re: How to ignore a named character reference in TreeBuilder?
* Webley Silvernail wrote:
> I have some XHTML in utf-8 that includes the named character reference &#160; for non-breaking spaces.

(That is a numeric, not a named, character reference.)

> The output is replacing the &#160; with a non-printable character that is rendered in various agents as boxes or question marks enclosed in diamonds.

(That likely means you've not specified the character encoding properly.)

> I've tried adding an explicit decode/encode step and using HTML::Entities, but I've had no luck. Basically, I just want TreeBuilder to ignore the &#160; references and pass them through.

I believe you are looking for the $element->as_HTML($entities) parameter (see `perldoc HTML::Element` for details; set the parameter to a value that includes all the characters you want to be escaped in the output).
Re: HTTP::Parser problem
* Gerhard Rieger wrote:
> I experience a reproducible problem with the HTTP::Parser module. I run the packaged Perl 5.10.1 on Ubuntu 10.04 Lucid. The problem appears with the packaged version 3.64-1 of libhtml-parser-perl as well as with the manually installed current version 3.68.

You seem to be talking about HTML::Parser, not HTTP::Parser.

> Problem: Under some circumstances the parser breaks short continuous text into two parts on a white space and invokes the text_handler callback for each of these parts. Especially while parsing tables this breaks the structure of the result.

That would seem to work as intended; if you want everything as one string, then you could buffer it, or perhaps use unbroken_text().
Re: Content-Disposition and utf8 filenames
* Bill Moseley wrote:
> I would like to use a user-supplied filename when returning a download (e.g. a PDF). For example, it might be filename=$title.pdf, but $title can include any character. It seems like support for this in browsers is spotty; see http://greenbytes.de/tech/tc2231/. Is anyone aware of a way to set this header to allow utf8 filenames that is supported across browsers?

No, as you can tell from the results there is no one way supported by all major browsers. The IETF HTTPbis Working Group is currently revising the specification for it, and the recommended way to do this will be through the RFC 5987 style notation, where you can also specify a fallback value by using both the filename* and filename parameters. But no silver bullet there.

> Also, my assumption is HTTP::Headers expects encoded values -- that is, the values are octets, not characters, and so one should always encode('US-ASCII') the value.

(That is generally correct, yes.)

> I just tried with Google Apps and it seems they turn any non A-Za-z into an underscore. Not sure if that means they just didn't try hard or if they felt it was not possible to use non-ASCII characters in suggested filenames.

As I understand it, Google is currently investigating switching to RFC 5987 style encoding in some of their applications.
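Building such a double-barrelled header value can be done with core Perl alone; the following is a sketch of the RFC 5987 style (the helper name `content_disposition` and the sample filenames are mine), percent-encoding everything outside the attr-char set after encoding the name as UTF-8 octets:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Build a Content-Disposition value carrying both a plain ASCII
# fallback (filename) and an RFC 5987 encoded UTF-8 name (filename*).
sub content_disposition {
    my ($utf8_name, $ascii_fallback) = @_;
    my $octets = encode('UTF-8', $utf8_name);
    # attr-char per RFC 5987; everything else gets percent-encoded
    $octets =~ s/([^A-Za-z0-9!#\$&+.^_`|~-])/sprintf('%%%02X', ord $1)/ge;
    return sprintf q{attachment; filename="%s"; filename*=UTF-8''%s},
                   $ascii_fallback, $octets;
}

print content_disposition("b\x{fc}cher.pdf", "bucher.pdf"), "\n";
# attachment; filename="bucher.pdf"; filename*=UTF-8''b%C3%BCcher.pdf
```

Old browsers that do not understand filename* fall back to the plain filename parameter, which is the point of sending both.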
Re: Downloading TSV files
* Meir Guttman wrote:
> I have also to warn you that this site is notoriously finicky. It is probably overloaded (and overwhelmed) by users. Lately I experience many types of false downloads and other problems that require repeated download attempts. (I'll supply the file integrity check.)

Well, that might be the only problem. I had no trouble downloading the file using `wget --save-cookies=cookies --load-cookies=cookies --keep-session-cookies ...`, except that you apparently first have to open the HTML version of the table and then immediately thereafter download the TSV version (that might take a few tries, whether I use wget or the web browser).
Re: 'POST' method leads to 411 Length Required
* Aaron Naiman wrote:
>   $request = new HTTP::Request 'POST', 'http://www.google.com';
>   $response = $ua->request($request);

You are using POST but are not posting any data, and apparently the server

  ... An Error Occurred 411 Length Required ...

expects you to post some data when using POST.
Re: LWP content encode
* stefano tacconi wrote:
> I'm writing a simple script to download some web pages from the net. Using LWP it works fine, but how can I get an HTML page with strange characters?

You are probably looking for HTML::Encoding; the script in the synopsis shows how to decode the content. HTTP::Response::Encoding seems to be a rather crude module that is unaware of HTML semantics.

--
Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
Re: libwww-perl-5.810 (HTTP::Message content not bytes)
* Shaun wrote:
> "Don't allow HTTP::Message content to be set to Unicode strings." I'm going to assume this change above is what's breaking my scripts. My script is using NicToolServerAPI.pm which in turn uses LWP::UserAgent. The script dies with the error above when I run it. I'm not sure what has to be done; my guess is that the script is POSTing the URL in plain text and it needs to be in some other format. I've snipped out a few pieces of code in hopes that somebody here might be able to give me a quick fix.

Presumably you have to clear the UTF-8 flag on the string you are passing in, e.g. by using Encode::encode or perhaps Encode::_utf8_off.

--
Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
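A minimal sketch of the Encode::encode route (the sample string is mine): encoding a character string to UTF-8 yields a plain byte string without the internal UTF-8 flag, which is what HTTP::Message expects as content.

```perl
use strict;
use warnings;
use Encode qw(encode is_utf8);

my $text   = "Bj\x{f6}rn";           # character string (5 characters)
my $octets = encode('UTF-8', $text); # byte string, safe to use as content

printf "flagged: %s, bytes: %d\n",
       is_utf8($octets) ? "yes" : "no",
       length $octets;               # flagged: no, bytes: 6
```

Encode::encode is the cleaner fix; Encode::_utf8_off merely flips the flag without re-encoding and only gives the same result if the string's internal representation already happens to be the bytes you want on the wire.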
Re: Doing character encoding/decoding within libwww?
* David Nesting wrote:
> The most complete implementation imaginable would start with at least these:
>
>   text/html        (html-specific rules)
>   text/xml         (xml-specific rules)
>   text/*           (general-purpose text rules)
>   application/*+xml (xml-specific rules)

HTML::Encoding does all of these, except text/* (for which there are no rules beyond checking the charset parameter, though you might also try to check for a Unicode signature at the beginning, which almost always indicates the Unicode encoding form; HTML::Encoding can do both but is not designed to do that for arbitrary types).

> On the other hand, I'm less convinced now that dipping into the HTML or XML content to figure out the proper encoding is necessarily the proper thing to do here. My complaint about LWP::Simple was that the HTTP Content-Type (charset) information is lost by the time it gets to the caller.

Well, that is necessarily so to keep the interface simple. Going from LWP::Simple::get to LWP::UserAgent->new->get(...) is easy enough not to warrant adding functionality to LWP::Simple.

> I could see a case then for dealing with text/* only and returning octets for everything else, since text/* is the only media type that has character encoding details in the HTTP headers.

Actually that is not the case; there are plenty of, say, application/* formats, like the XML types, that carry encoding information in the header without replicating it in the content (likewise, information in the content may not be replicated in the header, and the two may contradict each other).

> Yes, it's still their fault for not coding a robust application, but helping them do that is I think still a valid goal, if we can do it safely.

Well, automagic decoding of content cannot be added to LWP::Simple without some opt-in switch, as that would break a lot of programs, and if you require some opt-in, you might as well require switching the module.
Re: Doing character encoding/decoding within libwww?
* David Nesting wrote:
> For most uses of libwww, developers do little with character encoding. Indeed, for general-case use of LWP::Simple, they can't, because that information isn't even exposed. Has any thought gone into doing this internally within libwww, so that when I fetch content, I get back text instead of octets?

Generally speaking, this is rather difficult, as some content may not be textual at all, and textual formats vary in how applications are to detect the encoding (e.g., XML has different rules than HTML, text/plain has no rules beyond looking at the charset parameter, and so on). If you want a general-purpose solution, a good start would be a module taking an HTTP::Response object and detecting the encoding, possibly decoding it on request.

> I'd be happy to help work on some of this, but the fact that I see no use of character encodings within libwww makes me wonder if this is more of a policy decision not to do it.

There was a bit of a discussion about somehow using HTML::Encoding for some parts of it, which pretty much solves the problem for HTML and XML; cf. the list archives. Help on improving HTML::Encoding would be welcome, I have little time to work on it at the moment.
Re: Doing character encoding/decoding within libwww?
* Bill Moseley wrote: On Fri, Sep 21, 2007 at 12:49:26PM -0700, David Nesting wrote:
>> For most uses of libwww, developers do little with character encoding. Indeed, for general-case use of LWP::Simple, they can't, because that information isn't even exposed. Has any thought gone into doing this internally within libwww, so that when I fetch content, I get back text instead of octets?
>
> If you have the response object: $response->decoded_content;

That removes content encodings like gzip and deflate, but David is asking about character encodings like utf-8 and iso-8859-1. Content encodings are applied after character encodings.
Re: Doing character encoding/decoding within libwww?
* Bill Moseley wrote:
>   sub decoded_content {
>       $content_ref = \Encode::decode($charset, $$content_ref,
>           Encode::FB_CROAK() | Encode::LEAVE_SRC());
>
> The documentation I re-read earlier even says that...

This is still a far cry from being generally useful though; it only works for text/* and only if the encoding is specified in the header, or the format does not use some kind of inline label that is inconsistent with the default. Most of the time this is not the case, however.
Re: Parsing q-values
* Bill Moseley wrote:
> Is there any existing code for parsing q-values? For example, to get a list of Accept-Encoding values ordered by their q values?

HTTP::Negotiate?
Base64 data: URLs vs url encoding
Hi,

  print URI->new('data:;base64,QmpvZXJu')->data;
  print URI->new('data:;base64,%51%6D%70%76%5A%58%4A%75')->data;

I think these should both print Bjoern, but in the second case the module returns garbage (URI 1.35).

regards,
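The expected behavior can be checked with core modules only: a data: URI's path is percent-encoded like any other URI component, so the escapes have to be resolved before base64 decoding. This sketch (independent of the URI module under discussion) shows that both forms carry the same payload:

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

my $raw     = 'QmpvZXJu';
my $escaped = '%51%6D%70%76%5A%58%4A%75';

# Resolve percent-escapes first, then base64-decode.
(my $unescaped = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;

print decode_base64($raw), " ", decode_base64($unescaped), "\n";
# Bjoern Bjoern
```

In other words, the bug report boils down to the module apparently base64-decoding the still-escaped form.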
Re: HTML-Parser: storing into a DB words with special chars
* thomas Armstrong wrote:
> The document contains special characters ('música española' for instance), and after storing it into the DB, I get this word: música española

How do you get it? You need to ensure that the database software supports Unicode, that you properly store the data, and that you properly retrieve and view the data. The string above is UTF-8 but interpreted as some other encoding (Windows-1252 or ISO-8859-1, for example). That's not an HTML::Parser issue.
Re: Mailing list archives 2001-2005?
* John J Lee wrote:
> I'm sure somebody has one on the web. Can anybody point me to it? There are lots of links to old sites that stop in 2001, and GMANE seems to start in 2005, but I can't find anything between 2001 and 2005.

http://www.nntp.perl.org/group/perl.libwww
Re: libwww throws warning when used with HTML::Parser 3.44
* Karl DeBisschop wrote:
> I just plugged in HTML::Parser 3.44 on my FC2 servers in order to handle utf-8 encoded content. (Boy was I glad to see that was available.) But when running a robot, LWP::Protocol emits a warning as it works because the content stream is not decoded into perl's native character set.

See http://www.nntp.perl.org/group/perl.libwww/6017 and the relevant thread for a recent discussion on this.

Your patch has a number of problems: parsing the encoding out of the charset parameter is a bit more difficult than your regular expression (e.g., the encoding name might be a quoted-string, as in charset="utf-8"), the routine would now croak in common cases such as an unsupported character encoding, and it fails to deal with encodings such as ISO-2022-JP that maintain a state (see Encode::PerlIO) or where characters might be longer than one octet, such as UTF-8 (consider that one chunk has Bj\xC3 and the other chunk has \xB6rn; you need to know about the \xC3 when decoding the \xB6).
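For the split-character case, one approach that works for UTF-8 (though not for stateful encodings like ISO-2022-JP, as noted above) is to decode with Encode's FB_QUIET check: decode() then consumes only complete sequences and leaves the trailing partial sequence in the buffer for the next round. A sketch using the exact Bj\xC3 / \xB6rn split:

```perl
use strict;
use warnings;
use Encode qw(decode FB_QUIET);

# Decode a byte stream that arrives in arbitrary chunks.
my @chunks = ("Bj\xC3", "\xB6rn");
my ($buffer, $text) = ('', '');
for my $chunk (@chunks) {
    $buffer .= $chunk;
    # FB_QUIET removes the decoded part from $buffer in place and
    # leaves any incomplete trailing sequence behind.
    $text .= decode('UTF-8', $buffer, FB_QUIET);
}
print $text eq "Bj\x{f6}rn" ? "ok\n" : "broken\n";  # ok
```

Note this only helps with incomplete multi-byte sequences; genuinely invalid bytes would stall the buffer and need separate error handling.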
Re: Downloading a page compressed
* Octavian Rasnita wrote:
> Please tell me how I can use $request->header() in order to request a page in compressed format (with gzip).

HTTP/1.1 uses the TE and Accept-Encoding headers to specify that the client supports gzip compression. LWP should take care of the TE header automatically if the relevant modules are installed; in order to specify the Accept-Encoding header you can use

  $request->header(Accept_Encoding => 'gzip')

Note that LWP does not automatically remove the gzip compression in this case and that there is no guarantee that the resource will indeed be gzip compressed by the server. If it is, this should be indicated in the Content-Encoding header. For some resources it is however common that it is not indicated in the response that the resource is compressed; see http://www.w3.org/mid/[EMAIL PROTECTED] for some details.
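Since LWP leaves the compression in place here, the body has to be unpacked by hand. A round-trip sketch with the core IO::Compress modules (the sample body is mine; in practice the compressed bytes would come from the response content):

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Simulate a gzip-compressed response body, then unpack it the way a
# client must when the Content-Encoding is not removed automatically.
my $body = "Hello, world\n";
gzip(\$body => \my $compressed)     or die "gzip failed: $GzipError";
gunzip(\$compressed => \my $restored) or die "gunzip failed: $GunzipError";

print $restored;  # Hello, world
```

Checking the response's Content-Encoding header first tells you whether this step is needed at all, subject to the caveat above about servers that compress without saying so.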
Re: decoded_content
* Gisle Aas wrote:
> The current $mess->decoded_content implementation is quite naïve in its mapping of charsets. It needs to either start using Björn's HTML::Encoding module (http://search.cpan.org/dist/HTML-Encoding/) or start doing similar sniffing to better guess the charset when the Content-Type header does not provide any. I very much welcome ideas and patches that would help here.

The module is currently just good enough to replace the custom detection code in the W3C Markup Validator check script (which has been the basic motivation of the module ever since) and is pretty much ad hoc in that. I do indeed think that the libwww-perl modules would be a better place for much of the functionality.

> I also plan to expose a $mess->charset method that would just return the guessed charset, i.e. something similar to encoding_from_http_message() provided by HTML::Encoding. A $mess->header_charset might be a good start here, which just gives the charset parameter in the content-type header.

This would be what HTML::Encoding::encoding_from_content_type($mess->header('Content-Type')) does. HTTP::Message would be a better place for that code, as the charset parameter is far more common than just HTML/XML (all text/* types have one, for example). The same probably goes for other things as well, such as the BOM detection code in HTML::Encoding.

> Another problem is that I have no idea how well the charset names found in HTTP/HTML map to the encoding names that the perl Encode module supports. Anybody know what the state here is?

Things might work out in common cases, but it's not quite where I think it should be; I've recently started a thread on perl-unicode about it, http://www.nntp.perl.org/group/perl.unicode/2648. I found that using I18N::Charset is needed in addition to Encode, and that I18N::Charset (still) lacks quite a number of mappings (see the comments in the source of the module).

> When this works, the next step is to figure out the best way to do streamed decoding. This is needed for the HeadParser that LWP invokes.

One problem here are stateful encodings such as UTF-7 or the ISO-2022 family of encodings, as Encode::PerlIO notes (and attempts to work around for many encodings). For example, the code you posted to perl-unicode (re incomplete sequences) would fail for the UTF-7 string Bj+APY-rn if it happens to split the string after Bj+APY, which would be a complete sequence; but the meaning of the following -rn depends on the current state of the decoder, which decode() does not maintain, so it might sometimes decode to Bjö-rn and sometimes to Björn, which is not desirable (it might have security implications, for example). I am not sure whether there is an easy way to use the PerlIO workarounds without using PerlIO. I've tried using PerlIO::scalar in HTML::Encoding, but (http://www.nntp.perl.org/group/perl.unicode/2675) it modifies the scalar on some encoding errors and I did not investigate this further. Maybe Encode should provide a simpler means for decoding possibly incomplete sequences...

Also, HTML::Parser might be the best place to deal at least with the case where the (or an) encoding is already known, so it would decode the bytes passed to it itself. I would then probably replace my poor custom HTML::Encoding::encoding_from_meta_element with HTML::HeadParser looping through possible encodings (probably giving up once that worked out; it would currently decode with UTF-8 and ISO-8859-1 for most cases, which is quite unlikely to return different results...)

--
Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Re: How can I become universal utf/unicode
* J and T wrote:
> Sometimes when fetching a document you have no idea of the encoding, and sometimes you do. What I want to know is: how do I convert the incoming Web page, regardless of encoding, to UTF-8, as well as encode entities to something like &Aacute; (for keyword matching)?

You need to determine the character encoding of the document and then transcode the byte stream from the determined encoding to UTF-8. There are a number of rules for how to determine the character encoding of text/html resources; these are unfortunately underspecified and contradict each other. Worse, most documents do not have any encoding information, which means you would have to guess an encoding, or are encoded using a different encoding than what they declare, in which cases you would need to either reject the document or attempt to recover from such problems.

There is an HTML::Encoding module on CPAN that can help you to determine the encoding, but there are probably some bugs, and the interface will most certainly change once I get around to looking at it again (I haven't done so for years). It should however give a good starting point.

If that module (or similar code) does not yield encoding information, there is Encode::Guess, which helps a bit to determine the encoding. More sophisticated solutions than Encode::Guess are, AFAICT, not available on CPAN. You could try to interface with or reuse code from some web browsers; MSHTML, for example, would perform byte pattern analysis to determine an encoding. A simpler approach would be to fall back to e.g. Windows-1252; what you would do depends on how good you would like the results to be. Over at the W3C Markup Validator we currently attempt to use information as HTML::Encoding would report it and, if that fails, fall back to UTF-8, and if the document is not decodable as UTF-8, the document is rejected. Which means that lots of documents are rejected.

Once the input is UTF-8 encoded, you can use HTML::Parser as usual. I am not sure whether it sets the UTF-8 flag, but either way, it should report the data in the same encoding, so you could set the flag later.
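To see both the usefulness and the limits of Encode::Guess mentioned above, consider a short byte string that is valid UTF-8 (the sample bytes are mine): since every byte string is also "valid" Latin-1, widening the suspect list makes the guess ambiguous.

```perl
use strict;
use warnings;
use Encode::Guess;

my $octets = "Bj\xC3\xB6rn";   # valid UTF-8, but also valid Latin-1

# Adding latin1 as a suspect makes two encodings match, so
# guess_encoding returns an error string instead of an object.
my $guess = Encode::Guess::guess_encoding($octets, 'latin1');
print ref $guess ? "guessed " . $guess->name : "no single match", "\n";

# With only the default suspects (ascii, utf8, UTF-16/32 with BOM)
# the bytes decode unambiguously and an Encode object is returned.
my $utf8 = Encode::Guess::guess_encoding($octets);
print ref $utf8 ? "guessed " . $utf8->name : "no single match", "\n";
```

This is why guessing only "helps a bit": it works when the candidate set is narrow, and degrades to coin-flipping exactly in the common Latin-1-versus-UTF-8 case.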
Re: URI::file not RFC 1738 compliant?
* Gisle Aas wrote:
>> As far as I can tell, RFC 1738, section 3.10, as well as the BNF in section 5, explicitly say that a file: URI must have two forward slashes before the optional hostname, followed by another forward slash, and then the path.
>
> RFC 1738 is becoming a bit stale. I do believe that the intent is for 'file' URIs to also follow the RFC 2396 syntax for hierarchical namespaces, which clearly states that the 'authority' is optional.

http://www.ietf.org/internet-drafts/draft-hoffman-file-uri-00.txt has been published a few weeks ago, and there is an (ongoing?) discussion http://lists.w3.org/Archives/Public/uri/2004Aug/thread.html#58 on the future of the draft.

> Sure. Especially if I'm told about more apps that can't interoperate with authority-less file URIs. I might want to make it an option.

A C# program with

  Console.WriteLine((new Uri("file:///x")).LocalPath);
  Console.WriteLine((new Uri("file:///x:/y")).LocalPath);
  Console.WriteLine((new Uri("file:///x|/y")).LocalPath);
  Console.WriteLine((new Uri("file://x/y")).LocalPath);
  Console.WriteLine((new Uri("file:/x")).LocalPath);

running on Microsoft .NET 1.1 would print

  \\x
  x:\y
  x:\y
  \\x\y

and then

  Unhandled Exception: System.UriFormatException: Invalid URI: The format of the URI could not be determined.

for the file:/x case.
Re: RSS for this list
* Sean M. Burke wrote:
> As you probably know, this mailing list is archived at http://nntp.x.perl.org/group/perl.libwww and news://nntp.perl.org/perl.libwww. But it might interest you to know that there's now an RDF feed for this mailing list: http://nntp.x.perl.org/rss/perl.libwww.rdf. It's still a bit experimental, so let me know if you run into any trouble with it.

Are these U+00ACs intentional?
Re: RSS for this list
* Thurn, Martin wrote:
> What application do I use to read/process an RDF file?

http://search.cpan.org/search?query=rss
http://search.cpan.org/search?query=rdf
Re: Fw: LWP does not do JavaScript! (was Re: Fw: Can't navigate to URL after login)
* Sean M. Burke wrote: At 20:20 2002-08-09 -0500, Tom wrote:
>> [...] Three of them are already available on CPAN, so I'm currently focusing on implementing the HTML::DOM module. Any comments?
>
> Yes: I'm happy someone's actually doing this! For ages, this JavaScript thing has been one of those projects where people keep saying "Boy, it'd be nice if SOMEONE (and not me!) went and did that."

Guilty...

Tom, are you going to implement HTML DOM Level 1 or Level 2? The latter is currently a working draft and has only minor changes, but those changes are necessary in order to support XHTML, and I would like to use an HTML::DOM module for both HTML and XHTML files.

> I'd love to change my HTML::DOMbo module to build HTML::DOM trees instead of XML::DOM trees, since I've heard that the latter class isn't well supported.

Please be sure to provide a PerlSAX2 Reader and a PerlSAX2 Writer in order to enable XML modules to use your tree.
Re: Doesn lynx support SSL
* Siva Namburi wrote:
> Does anybody know if lynx supports SSL handshakes? I mean, can you browse sites which have SSL enabled using lynx?

Take a look at http://lynx.browser.org; there you can find how to enable SSL support in Lynx.

PS: This mailing list is dedicated to the libwww-perl library; Lynx is not part of that library and hence off-topic.
Re: Error 404 but file exists
* regis rampnoux wrote: Something strange for me: %HEAD http://e.neilsen.home.att.net 404 Not found Connection: close Date: Mon, 08 Oct 2001 19:18:09 GMT Server: Netscape-Enterprise/3.6 SP2 Content-Type: text/html Client-Date: Mon, 08 Oct 2001 19:18:09 GMT Client-Peer: 204.127.135.37:80 But if I use a GET it works! Why? The server is broken; it's not an LWP problem. -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
RFC: HTML::Encoding
Hi, I've written up a module that collects encoding information for (X)HTML files. (X)HTML files may carry encoding information in

  1. the higher-level protocol (e.g. the Content-Type header's charset
     parameter in HTTP and MIME)
  2. the XML declaration (for XHTML documents)
  3. the byte order mark at the beginning of the file
  4. meta elements like
     <meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1'>

At user option it tries to extract the explicitly given encoding information from these sources. After that it sorts the list according to the order above; in list context it returns the list, in scalar context it returns the first encoding in the list (i.e. the encoding the user agent must use to parse the document). This looks like

  #!perl -w
  use strict;
  use warnings;
  use LWP::UserAgent;
  use HTML::Encoding '';

  my $r = LWP::UserAgent->new->request(
    HTTP::Request->new(GET => 'http://www.w3.org/'));

  print scalar HTML::Encoding::get_encoding(
    check_bom     => 1,
    check_xmldecl => 1,
    check_meta    => 1,
    headers       => $r->headers,
    string        => $r->content);

This would currently print out 'us-ascii' as http://www.w3.org/ returns Content-Type: text/html;charset=us-ascii; in list context this would return

  [
    { source => 4, encoding => 'us-ascii' },
    { source => 1, encoding => 'us-ascii' },
  ]

since the page also has a meta element

  <meta http-equiv='Content-Type' content='text/html;charset=us-ascii' />

The POD says: [...] The source value is mapped to one of the constants FROM_META, FROM_BOM, FROM_XMLDECL and FROM_HEADER. You can import these constants individually into your namespace or all at once using the :constants tag, e.g.

  use HTML::Encoding ':constants';

[...] This is useful if you want to check for mismatches between the declared encodings. Some issues that came to my mind while writing this module: * HTTP::Headers should provide some information on whether LWP::UserAgent already parsed the header section of the HTML file, so I wouldn't need to do the same thing again. 
currently one cannot distinguish whether there were multiple Content-Type: headers in the original response or whether they come from meta elements * HTML::Encoding currently uses HTML::Parser to extract the meta element if version 3.21 or later is available (maybe I'll switch to HTML::HeadParser ...). The problem is that HTML::Parser is AFAIK currently unable to process documents encoded in some encoding that is not compatible with US-ASCII (UTF-16BE for example). I think it is out of scope for HTML::Encoding to recode the given string to some US-ASCII-compatible encoding (that'd be UTF-8) in order to parse the document; this should be done by HTML::Parser using some encoding parameter. Personally I'd say that HTML::Parser should only output UTF-8 encoded characters as XML::Parser does, but this will certainly clash with current users who expect to get ISO-8859-1 or something like that out of it... Is it likely that HTML::Parser incorporates such a feature using the Unicode::* modules or Text::Iconv or whatever is currently available? * Is there currently really no module that does what HTML::Encoding is supposed to do? In general you have to use such a module every time you try to do anything with an HTML document; hmm, maybe western people got too used to ISO-8859-1... The current version can be found at http://www.websitedev.de/perl/HTML-Encoding-0.01.tar.gz You'll currently need Perl 5.6.0 to use it. The file currently lacks a proper README and test files... Is the module name appropriate? Any other comments or suggestions? I greatly appreciate them :-) Thanks for your time, -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
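The source-collecting idea described above can be sketched language-neutrally. The following toy version (Python used for illustration; `sniff_encoding` is a hypothetical name, not the module's real API) gathers (source, encoding) candidates from the meta element, the BOM, and the Content-Type header, ordered as in the www.w3.org example above:

```python
import re

def sniff_encoding(headers, body):
    """Toy sketch of HTML::Encoding's idea: collect (source, encoding)
    candidates from the meta element, the byte order mark, and the
    Content-Type header. Hypothetical helper, not the module's API."""
    found = []
    # 4. <meta http-equiv='Content-Type' ...charset=...>
    m = re.search(rb"<meta[^>]+charset=['\"]?([A-Za-z0-9_-]+)", body, re.I)
    if m:
        found.append(("meta", m.group(1).decode("ascii")))
    # 3. byte order mark at the beginning of the file
    for bom, enc in ((b"\xef\xbb\xbf", "utf-8"),
                     (b"\xff\xfe", "utf-16le"), (b"\xfe\xff", "utf-16be")):
        if body.startswith(bom):
            found.append(("bom", enc))
    # 1. the higher-level protocol (Content-Type header)
    m = re.search(r"charset=([A-Za-z0-9_-]+)", headers.get("Content-Type", ""))
    if m:
        found.append(("header", m.group(1)))
    return found
```

For the www.w3.org example it would yield both a meta and a header candidate, both 'us-ascii'; a mismatch between the two is exactly what the :constants check described above is meant to detect.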
Re: Bug in LWP::UserAgent?
* Gisle Aas wrote: I believe that this behavior is due to the UserAgent because using telnet I do not get multiple 'Content-Type' definitions in the response from the server. The link at the top of this message has more information on the matter. I am working out a workaround in the proxy server, but I wonder if this is not something that should be addressed in the libwww-perl codebase. What do you think it should do? HTTP::Headers should have some method to determine whether the body was parsed or not. That would be useful beyond this case, too. -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
HTML::Parser::Tidy namespace
Hi Gisle, Hi list, maybe you have heard of HTML Tidy, a free utility by Dave Raggett to repair, clean up and pretty-print HTML and XHTML documents. Please refer to http://www.w3.org/People/Raggett/tidy/ for more information on HTML Tidy. HTML Tidy is currently maintained by a group of developers including myself at SourceForge. One of our goals is to create a free-standing C library out of Tidy to ease its reuse in other applications; see http://sourceforge.net/projects/tidy/ for more information on this project. I'm going to write a Perl XS interface to this library [1], and the best name I could think of was HTML::Parser::Tidy. HTML::Tidy is already taken, and something like HTML::PerlTidy implies that it cleans up Perl code. I'll try to provide an interface compatible [2] with HTML::Parser 3.x so that applications built upon HTML::Parser will be able to use Tidy as an alternative. My current module provides a simple (XML::Parser::Perl)SAX interface so that I can use the module to build up a DOM tree for e.g. XML::DOM or XML::XPath. I'm currently considering whether it's worth exposing the whole set of DOM-like functions of HTML Tidy... However, I'd like to ask if the module name is OK for you, Gisle, and for others, or if one of you has a better suggestion. The module will be maintained at http://sourceforge.net/projects/ptidy (where you'll currently find nothing but empty pages :-) If some people are interested in this project, feel free to subscribe to the [EMAIL PROTECTED], but be aware that I'll bug you with questions about the interface/documentation/etc. pp. ;-) [1] I already did something like that in April this year, but I came to the conclusion that HTML Tidy should be a real library so that all its features can easily be used in other applications, so we started that project. [2] To some extent; some features aren't possible. 
Thanks for your comments, -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
Re: HTML::Parser::Tidy namespace
* Gisle Aas wrote: What is HTML::Tidy? I could not find anything that uses that name. There is currently no such module on CPAN. Weird. I was _very_ sure I'd seen such a module at version 0.0x one day in some CPAN directory. Hm, maybe this was just some private project. However, that's great, I'll take HTML::Tidy then :-) Thanks. -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
Re: HTML::Parser::Tidy namespace
* [EMAIL PROTECTED] wrote: Bjoern, Very interesting project -- I've been lurking around watching the libTidy development over the last few weeks, and have wished for something like this. HTML::Clean just doesn't provide the capabilities I need. HTML::Clean even does evil things like replacing <strong> with <b> because it's shorter :-( Have you considered using Inline::C rather than native XS? That seems to be the way to go these days. Is it? Can you briefly explain what the advantages are? -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
Re: HTML::Parser::Tidy namespace
* Sean M. Burke wrote: So the C-Tidy-library builds a document tree for some HTML file, and then SAX walks the tree so that you can, via Perl, build a new (in-Perl) tree for it, using the tree library of your choice (XML::DOM, XML::Element, or even some crazy thing called HTML::Element)? Yes. Something similar goes for HTML::Parser-style events. This is one reason why HTML::Tidy could not support all HTML::Parser events; e.g. information like 'offset', 'line', 'column' or 'tokens' gets lost in the parsing/cleanup process. OK, it would be possible by making a lot of changes in Tidy, but I don't think it's worth the effort; Tidy's power _is_ the clean-tree generation. I dimly (mis?)remember looking at Tidy's internals months ago, and I think I remember that it stored everything as double-byte Unicode strings -- so I presume that those get UTF8ified (and tagged as such) when passed to Perl, right? Tidy stores all character data as UTF-8 encoded char*s. They will be passed as UTF-8 to Perl (tagged as such via SvUTF8_on()) or, for the pretty-printer, in your desired encoding (if supported). Does it deal nicely with non-UTF8 non-Latin-1 input encodings? Currently Tidy supports

  * us-ascii            [*]
  * iso-8859-1          [*]
  * windows-1252
  * mac-roman
  * the iso-2022 family [*]
  * utf-8               [*]

where [*] denotes supported output encodings. I'm not sure about UTF-16; if it doesn't support it, I'll add support for it. We have feature requests for

  * ShiftJIS
  * BIG5

You have, however, to declare what encoding you are using. One might use Unicode::Map8 or Text::Iconv to convert strings to UTF-8 before passing them to Tidy. -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
Re: IRI Implementation
* John Stracke wrote: Gisle Aas wrote: But I don't really understand how IRIs solve any problem if you still can't use non-ASCII characters in host names. Don't you want to be http://björn.höhrmann.de? :-) ... and björn@höhrmann.de, yes. I must have these addresses, that's what i18n is all about, isn't it? ;-) It's coming; see http://www.ietf.org/html.charters/idn-charter.html. (There are people trying to jump the gun, but it turns out to be a hard problem. For example, should björn.höhrmann.de be equivalent to bjorn.hohrmann.de? Or consider Hebrew, where vowels are not letters and often omitted; should two domain names that differ only in the vowels be equivalent?) In the meantime, IRIs let you have non-ASCII in the file names, at least. Better: in the path, query and fragment components. IRIs solve a _very_ common problem. Consider a search engine with a method='get' form. If you use this search engine to search for e.g. 'Björn Höhrmann', how is the URI to be encoded? And if some CGI script receives the data, how is it to be decoded if one doesn't know which character encoding was used to encode the characters? There is currently no definition of how to handle this case. Most user agents use the chosen document encoding (i.e. in most cases the encoding of the current (X)HTML document) to encode the characters, but since HTML doesn't define any default encoding, you still have no clue what happens to your queries. -- :: no signature today :: in memoriam Douglas Adams ::
RFC 2732 implementation in URI.pm
Hi, RFC 2732 extends RFC 2396 to accept literal IPv6 addresses, does URI.pm implement this RFC? -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
IRI Implementation
Hi Gisle, Hi List, at [1] you can currently find the latest Internet Draft for IRIs (Internationalized Resource Identifiers). IRIs are similar to URIs, but the IRI syntax allows far more ISO/IEC 10646 (Unicode) characters in the grammar, while URIs are limited to US-ASCII characters. Each URI is supposed to be a valid IRI. Is there any ongoing development to support IRIs in Perl? I don't think so, so I hereby suggest adding support. I'm curious to know how an implementation could look. One could say that URIs are a subset of IRIs, so it would be reasonable to consider moving the current URI processing code to a generic IRI parser and making the URI stuff a derived class of some IRI module. What about that? [1] http://www.ietf.org/internet-drafts/draft-masinter-url-i18n-07.txt -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
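One half of what an IRI module has to do is the IRI-to-URI mapping: encode the non-ASCII characters as UTF-8 and percent-escape the resulting bytes. A rough sketch (Python used for illustration; `iri_to_uri` is a made-up name, and a real implementation would work per URI component as the draft describes):

```python
from urllib.parse import quote

def iri_to_uri(iri):
    # Percent-escape the UTF-8 bytes of non-ASCII characters while
    # leaving RFC 2396 reserved and unreserved characters alone.
    # Simplified: applied to the whole string, not per component.
    return quote(iri, safe=":/?#[]@!$&'()*+,;=%-._~")

print(iri_to_uri("http://example.de/Björn"))
# http://example.de/Bj%C3%B6rn
```

The reverse mapping (URI to IRI) is the lossy direction, since not every percent-escaped byte sequence is valid UTF-8.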
Re: RFC 2732 implementation in URI.pm
* Gisle Aas wrote: Bjoern Hoehrmann [EMAIL PROTECTED] writes: RFC 2732 extends RFC 2396 to accept literal IPv6 addresses, does URI.pm implement this RFC? No.

  $ perl -MURI -le 'print URI->new("http://[::192.9.5.5]/ipng")'
  http://%5B::192.9.5.5%5D/ipng

Is anybody using URIs of this form yet? Operating systems have just started to handle IPv6 (latest FreeBSD, experimental update for Windows 2000, etc.), so using them isn't easy yet, but several applications like Apache 2.0 or Apache Tomcat can already handle such URIs. Is there agreement that following this proposal is a good thing? Other specifications make use of (or at least reference) this RFC, like XML 1.0, XML Base, XML Schema, XML DSig and the upcoming IRI Standards Track, so I think yes, they agree that this is a good thing [tm]. -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
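For reference, a parser with RFC 2732 support treats the bracketed part as the host rather than escaping the brackets, as the URI.pm example above does; sketched in Python for illustration:

```python
from urllib.parse import urlsplit

# RFC 2732: '[' and ']' delimit an IPv6 literal in the authority
# component, so they must not be percent-escaped there.
u = urlsplit("http://[::192.9.5.5]/ipng")
print(u.hostname)  # the IPv6 literal, brackets stripped
print(u.path)
```

Here `hostname` comes back as `::192.9.5.5` and `path` as `/ipng`, which is the behaviour the patch to URI.pm's reserved-character set is after.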
Re: RFC 2732 implementation in URI.pm
* Gisle Aas wrote: As far as I can tell all that is needed for RFC 2732 support is to add [ and ] to reserved characters, i.e. this patch: Do you think there should be more? I'll shout when it isn't sufficient :-) BTW, how does XML 1.0 reference RFC 2732? You'll see it twice in section 4.2.2. and as part of appendix 2, see http://www.w3.org/TR/REC-xml -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
Re: Requests with Transfer-encoding: chunked
* Gisle Aas wrote: Does anybody know about live servers that allow POSTing with 'Transfer-encoding: chunked' in the request? I want to test the support I added to LWP. Anybody know about servers that implement 'Transfer-Encoding: deflate' (or gzip) and will apply it to the response if I send the appropriate TE: header in the request? Jigsaw, http://jigsaw.w3.org/ / http://www.w3.org/Jigsaw/ Apache2, http://httpd.apache.org (maybe with mod_gzip) Jigsaw test case for the second one: http://jigsaw.w3.org/HTTP/TE/foo.txt (see http://jigsaw.w3.org/HTTP/ for an overview) -- Björn Höhrmann { mailto:[EMAIL PROTECTED] } http://www.bjoernsworld.de am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de 25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
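The chunked request body being tested is simple to produce by hand. A minimal sketch of RFC 2616's chunked transfer coding (Python for illustration; `chunk_body` and the chunk size are made up for the example):

```python
def chunk_body(data, size=8):
    """Encode a byte string with HTTP/1.1 chunked transfer coding:
    each chunk is hex-length CRLF data CRLF, terminated by a
    zero-length chunk."""
    out = b""
    for i in range(0, len(data), size):
        part = data[i:i + size]
        out += b"%x\r\n" % len(part) + part + b"\r\n"
    return out + b"0\r\n\r\n"

print(chunk_body(b"hello world"))
```

Sending such a body lets the client start transmitting before the total length is known, which is the whole point of testing it against live servers.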
Re: Bug Report
* Laurent Simonneau wrote: Why is the character '|' converted to '%7C' in URLs? RFC 2396 (http://www.ietf.org/rfc/rfc2396.txt) defines it as an unwise character; such characters must be escaped in URIs. Example: libwww-perl sends: GET http://www.lycos.fr/cgi-bin/nph-bounce?LIA14%7C/service/sms/ HTTP/1.0 And the server replies with a 404 Not Found error. That's the server's bug; maybe the Lycos people should get better software. libwww-perl behaves correctly. -- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981028 ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote [!]e -- listen, learn, contribute -- David J. Marcus
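The escaping in question is easy to reproduce (Python's `quote` shown for illustration; in Perl, URI::Escape's `uri_escape` does the same job):

```python
from urllib.parse import quote

# '|' is in RFC 2396's "unwise" set, so it gets percent-escaped;
# '/' is left alone because it is reserved, not unwise.
print(quote("LIA14|/service/sms/"))
# LIA14%7C/service/sms/
```

A server that cannot match `%7C` against a literal `|` in its own URLs is failing to decode percent-escapes, which is the bug described above.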
Re: HTML::TreeBuilder method/madness. Was: HTML::Tagset: p_closure_barriers
* Sean M. Burke wrote: As far as I understand SGML, A. SGML is a family of markup languages where, when (start|end)-tag omission is enabled, nothing can be parsed without a DTD. SGML is a language to define markup languages. If start and/or end tags are omitted, you must know the content model of the element. The content model is defined in the DTD. If your application stores the content model in a different format, you don't need a DTD. HTML::Tagset does it this way. It does this in a manner I don't like, i.e. not conforming to the HTML 4.01 DTDs; e.g. %HTML::Tagset::optionalEndTag only stores those elements where the end tag is 'safely' omittable. B. For basically every SGML document, a complete (!) DTD exists, and must exist (whether externally or internally). Not really; for example, XML documents actually _are_ SGML documents. I strongly agree with your design rationale, but I still don't understand why the manual implies that the given construct is legal. -- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981028 ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote [!]e -- listen, learn, contribute -- David J. Marcus
HTML::Parser: report implicit events
Hi, I wonder why HTML::Parser does not report implicit events. A conforming parser should report them in order to ensure that a correct parse tree can be built. An example:

  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
  <title></title>
  <p>some text<img alt='' src=''> more text
  <h1>heading</h1>

should report (omitting text and possibly default events)

  declaration
  start (html)
  start (head)
  start (title)
  end (title)
  end (head)
  start (body)
  start (p)
  start (img)
  end (img)
  end (p)
  start (h1)
  end (h1)
  end (body)
  end (html)

I request an option to get those events. -- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981028 ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote [!]e -- listen, learn, contribute -- David J. Marcus
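What "reporting implicit events" means can be sketched with a toy event driver (Python for illustration; the BLOCK set and the single rule below are a drastic simplification of the real HTML 4.01 content models):

```python
# Toy sketch: inject the implied events a parser can infer from the
# content model. Only one rule is modelled here: a start tag of a
# block-level element implicitly ends an open <p>.
BLOCK = {"p", "h1", "table", "div"}

def events(tags):
    open_p, out = False, []
    for kind, name in tags:
        if kind == "start" and name in BLOCK and open_p:
            out.append(("implied end", "p"))
            open_p = False
        out.append((kind, name))
        if name == "p":
            open_p = (kind == "start")
    return out

print(events([("start", "p"), ("start", "img"),
              ("end", "img"), ("start", "h1"), ("end", "h1")]))
```

A real implementation would derive these rules from the DTD's content models, as discussed in the neighbouring HTML::Tagset posts, rather than hard-coding one rule.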
HTML::Tagset: p_closure_barriers
Hi, the HTML::Tagset manual defines the array @HTML::Tagset::p_closure_barriers. I don't understand the rationale behind this. The given example:

  <html>
   <head>
    <title>foo</title>
   </head>
   <body>
    <p>foo
    <table>
     <tr>
      <td>
       foo
       <p>bar
      </td>
     </tr>
    </table>
    </p>
   </body>
  </html>

_isn't_ legal. In SGML, elements that have optional end-tags are implicitly closed if an element occurs that cannot be contained inside the element (i.e. that is not part of the content model). Try to validate the example at [1] and you'll get

  Line 17, character 8:
    </p>
       ^ Error: end tag for element P which is not open; try removing
         the end tag or check for improper nesting of elements

The parse tree of the document is something like

  <html>
   <head>
    <title>foo</title>
   </head>
   <body>
    <p>foo</p>
    <table>
     <tbody>
      <tr>
       <td>
        foo
        <p>bar</p>
       </td>
      </tr>
     </tbody>
    </table>
    </p>    <- error, the element was already closed
   </body>
  </html>

[1] http://www.htmlhelp.com/tools/validator/direct.html -- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981028 ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote [!]e -- listen, learn, contribute -- David J. Marcus
Re: Calling IO::Socket::INET getting an error.
* Hansen, Keith wrote: I'm getting the error: Undefined subroutine IO::Socket::INET called at C:\Perl\gnx\lgetr.pl line 7. my $fh = IO::Socket::INET($server); IO::Socket::INET is a package, not a subroutine; you have to call the constructor method:

  my $handle = IO::Socket::INET->new( ... );
  # or
  my $handle = new IO::Socket::INET( ... );

-- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981028 ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote [!]e -- listen, learn, contribute -- David J. Marcus
Re: LWP::UserAgent -- Bug in $response-message
* Boris 'pi' Piwinger wrote: After some discussion in de.comp.lang.perl.misc ([EMAIL PROTECTED] ff) I assume there is a bug in LWP::UserAgent. I have a while loop reading URLs from a file. Those are fetched; if unsuccessful, the response code and message are printed. The message becomes, e.g.: Can't connect to www.polybytes.com:80 (No route to host), <FILE> chunk 1. Obviously everything from the comma on is not correct at this place. Someone suggested that in LWP::Protocol::http::_new_socket the message does not end in \n. Yes, that was me. die() is used there and `perldoc -f die` reads

  | If the value of EXPR does not end in a newline, the current
  | script line number and input line number (if any) are also
  | printed, and a newline is supplied. Note that the "input line
  | number" (also known as "chunk") is subject to whatever notion of
  | "line" happens to be currently in effect, and is also available
  | as the special variable `$.'. See the section on "$/" in the
  | perlvar manpage and the section on "$." in the perlvar manpage.

In LWP::UserAgent this is caught:

  | if ($use_eval) {
  |     # we eval, and turn dies into responses below
  |     eval {
  |         $response = $protocol->request($request, $proxy,
  |                                        $arg, $size, $timeout);
  |     };
  |     if ($@) {
  |         $@ =~ s/\s+at\s+\S+\s+line\s+\d+\.?\s*//;
  |         $response =
  |           HTTP::Response->new(HTTP::Status::RC_INTERNAL_SERVER_ERROR,
  |                               $@);
  |     }

The regular expression should be extended to also remove /,\s+<HANDLE>\s+(?:line|chunk)\s+\d+\.?\s*/ -- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981028 ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote [!]e -- If something is worth writing it is worth keeping --
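The suggested extra pattern can be tried out directly (Python used for illustration; `<\w+>` stands in for the filehandle name that die() inserts):

```python
import re

# Strip the ", <HANDLE> line/chunk N." suffix that Perl's die()
# appends when the message lacks a trailing newline.
msg = ("Can't connect to www.polybytes.com:80 (No route to host)"
       ", <FILE> chunk 1.")
clean = re.sub(r",\s+<\w+>\s+(?:line|chunk)\s+\d+\.?\s*$", "", msg)
print(clean)
```

After the substitution only the actual error text remains, which is what the caller wants to show.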
HTML::Parser: document events
Hi, I suggest introducing two new events for HTML::Parser: * init: raised when parse() is called the first time * eof: raised when eof() was called or parse_file() finishes. This improves compatibility with other APIs like SAX or SAC. -- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981028 ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote [!]e -- If something is worth writing it is worth keeping --
Re: [Bug #131388] joining Location header results into wrong URLs
* [EMAIL PROTECTED] wrote: Some servers return two Location: headers (e.g. http://service.bfast.com/bfast/click?bfmid=20911217siteid=37451739bfpage=hplink after the 2nd redirect - it's where the code bailed out). push_header() will join URLs with ', ', and this is kinda wrong =) Quoting from RFC 2616 section 4.2:

  | Multiple message-header fields with the same field-name MAY be
  | present in a message if and only if the entire field-value for that
  | header field is defined as a comma-separated list [i.e., #(values)].
  | It MUST be possible to combine the multiple header fields into one
  | "field-name: field-value" pair, without changing the semantics of the
  | message, by appending each subsequent field-value to the first, each
  | separated by a comma. The order in which header fields with the same
  | field-name are received is therefore significant to the
  | interpretation of the combined field value, and thus a proxy MUST NOT
  | change the order of these field values when a message is forwarded.

It's an invalid response; LWP's treatment is 100% conforming, given that LWP does not validate the HTTP messages. Better go and repair those servers. -- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981028 ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote [!]e ~~ will code for food. ~~
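The combining rule quoted above is mechanical, which is exactly why it mangles Location: that field is not defined as a comma-separated list, so joining two values yields a string that is no longer a valid URL. A sketch (Python for illustration; `join_headers` is a hypothetical helper, and the example URLs are made up):

```python
def join_headers(pairs, name):
    """RFC 2616 section 4.2 combining: concatenate same-named field
    values in received order, separated by a comma."""
    return ", ".join(v for n, v in pairs if n.lower() == name.lower())

pairs = [("Location", "http://a.example/"),
         ("Location", "http://b.example/")]
print(join_headers(pairs, "Location"))
```

The result, `http://a.example/, http://b.example/`, is what push_header() produces and what the client then fails to follow.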
Re: possible bug in HTML::Parser comment handler
At 15:28 11.01.01 -0500, you wrote: It seems that the parser is not properly detecting multi-line HTML comments. I was trying to print out the dtext of an html document and noticed that comments kept showing up in the output. Upon further examination, the single-line comments were being ignored but ones like this:

  <!--
  td {font-family: Arial,Geneva,Helvetica,sans-serif; color: #00;}
  -->

Well, the content model of the style element is CDATA; your "comments" may look like comments, but they are not comments in HTML and SGML terms. That's not a bug. -- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981028 ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://www.websitedev.de/
Re: HTTP redirects
* "Jarrett Carver" [EMAIL PROTECTED] wrote: | Is there a way to tell if your request has been redirected? If there is a '$response->previous', the request has been redirected. regards, -- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981ASK ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote +{i} --- All I want for Christmas is well-formedness -- Evan Lenz ---
Re: LWP in C++ ?
* "Axel R." [EMAIL PROTECTED] wrote: | I would like to know if a lib which has the same features as LWP exists in C++ | or other languages? | I'm looking for something like the TreeBuilder and the HTML::Element module... | Thanks for all http://www.w3.org/People/Raggett/tidy/ http://www.w3.org/Library/User/Guide/#HTML regards, -- Björn Höhrmann ^ mailto:[EMAIL PROTECTED] ^ http://www.bjoernsworld.de am Badedeich 7 ° Telefon: +49(0)4667/981ASK ° http://bjoern.hoehrmann.de 25899 Dagebüll # PGP Pub. KeyID: 0xA4357E78 # http://learn.to/quote +{i} --- Alles eine Frage der wissenschaftlichen Marsstäbchen ---
Call to HTTP::Request->url() should be ->uri()
uri() is the real function; url() just exists for compatibility (I think). UserAgent.pm,v 1.73 2000/04/07 diff:

  275c275
  <     $referral->url($referral_uri);
  ---
  >     $referral->uri($referral_uri);

regards, -- Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://www.bjoernsworld.de am Badedeich 7 · Telefon: +49(0)4667/981ASK · http://bjoern.hoehrmann.de 25899 Dagebüll · PGP KeyID: 0xA4357E78 · http://learn.to/quote ---
HTTP::Response->base() should return an absolute URI
Hi, HTTP::Response->base() is determined this way: Content-Base or Content-Location or Base or Request URL.

* Content-Base is defined in RFC 2068 as an absolute URI
* Request URL is defined in all HTTP RFCs as an absolute URI
* Base is not defined in RFC 1945, as the comment 'backwards compatibility HTTP/1.0' implies, so I assume it refers to the BASE element in RFC 1866, which defines it as an absolute URI
* Content-Location is defined in RFC 2068 and RFC 2616 as an absolute or relative URI

For relative URIs RFC 2616 says: "[...] the relative URI is interpreted relative to the Request-URI." Sect. 5.1 of RFC 2396 tells us the same for base URIs in general, and people expect an absolute URI as base, so the Content-Location should be transformed into an absolute URI before it's returned. regards, -- Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://www.bjoernsworld.de am Badedeich 7 · Telefon: +49(0)4667/981ASK · http://bjoern.hoehrmann.de 25899 Dagebüll · PGP KeyID: 0xA4357E78 · http://learn.to/quote ---
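The proposed fix amounts to ordinary RFC 2396 section 5.1 reference resolution; for illustration (Python's urljoin shown; in Perl, URI->new_abs does the same), with made-up example URIs:

```python
from urllib.parse import urljoin

# Resolve a relative Content-Location against the Request-URI,
# per RFC 2396 section 5.1 (hypothetical example values).
request_uri = "http://example.org/dir/page"
content_location = "../other/doc"
print(urljoin(request_uri, content_location))
# http://example.org/other/doc
```

Absolute Content-Location values pass through unchanged, so base() could apply this resolution unconditionally.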