Re: HTML-Tree gets weakref support!
HTML-Tree 5.00 is now available on CPAN: https://metacpan.org/release/HTML-Tree

Release announcement: http://blogs.perl.org/users/christopher_j_madsen/2012/06/html-tree-5.html

-- Chris Madsen  p...@cjmweb.net  http://www.cjmweb.net
HTML-Tree gets weakref support!
At long last, HTML-Tree can now use weakrefs. This means you no longer need to call $tree->delete to prevent memory leaks.

Right now, you can get trial release 4.900 from CPAN: https://metacpan.org/release/CJM/HTML-Tree-4.900-TRIAL/

I plan to release 5.00 before YAPC::NA, probably on June 12. (This would probably be a good subject for a lightning talk, if someone wants to give it. I'd do it myself, but I'm not going to be able to make it to YAPC this year.)

If you're using HTML-Tree, please test your code with the new version now. While most programs should continue to work fine with this change, it could break your code. The one real-world example I've found so far is pQuery's dom.t. In pQuery 0.08, it does:

  my @elems = pQuery::DOM->fromHTML('<div>xxx<!-- yyy -->zzz</div>')->childNodes;
  my $comment = $elems[1];
  is $comment->parentNode->tagName, 'DIV', 'Comment has parentNode';

Notice that it's not saving the result of the fromHTML call; only the child nodes. Since children now have only a weak reference to their parent, the root node is deleted immediately, and $comment->parentNode is undef. This can be fixed by saving a reference to the root node:

  my @elems = (my $r = pQuery::DOM->fromHTML('<div>xxx<!-- yyy -->zzz</div>'))->childNodes;

As a quick fix for broken code (and to determine whether it's the weak references that are causing the breakage), you can say:

  use HTML::Element -noweak;

This (globally) disables HTML-Tree's use of weak references. But this is just a temporary measure. You need to fix your code, because this feature will be going away eventually.

-- Chris Madsen  p...@cjmweb.net  http://www.cjmweb.net
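For readers unfamiliar with weak references, here is a minimal self-contained sketch of why saving only the children lets the root node be destroyed. It uses Scalar::Util's weaken on plain hashrefs, not HTML-Tree's actual internals:

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

# Parent holds strong refs to its children; each child holds only a
# weak ref back to the parent (the arrangement HTML-Tree 5.x uses).
my $parent = { name => 'root', children => [] };
my $child  = { name => 'child' };
$child->{parent} = $parent;
weaken($child->{parent});
push @{ $parent->{children} }, $child;

# Save only the children, like pQuery's dom.t did...
my @kids = @{ $parent->{children} };
undef $parent;    # ...so no strong reference to the root remains

print defined $kids[0]{parent} ? "parent alive\n" : "parent gone\n";
# prints "parent gone" -- the weak ref was cleared when the root died
```

This is exactly the failure mode in the dom.t example: the weak parent link is cleared as soon as the last strong reference to the root goes away.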
Freeing HTTP::Message from HTML::Parser dependency
I stumbled across this bug: https://rt.cpan.org/Ticket/Display.html?id=66313 and a discussion here about removing HTTP::Message's dependency on HTML::Parser (which needs a C compiler) for charset sniffing.

As it happens, I'm about to release a new dist that implements the HTML5 encoding sniffing algorithm in pure Perl with no non-core dependencies for 5.8+. While its primary function is to make it dead simple to open an HTML file and get the right encoding layer applied automatically, it also exposes the underlying mechanism.

My repo is https://github.com/madsen/io-html but since it's built with dzil, I also made a gist of the processed module to make it easier to read the docs: https://gist.github.com/1623654

I took a quick look at HTTP::Message, and I think you'd just need to do

  elsif ($self->content_is_html) {
      require IO::HTML;
      my $charset = IO::HTML::find_charset_in($$cref);
      return $charset if $charset;
  }

You're already doing the BOM and valid-UTF8 checks; all you need is the <meta> check, which is what find_charset_in does.

One possible issue is that find_charset_in returns Perl's canonical name for the encoding, which is not necessarily the same as the MIME or IANA charset name. You could do

  return Encode::find_encoding($charset)->mime_name if $charset;

if you want.

I'm planning to release this in a week or so, after I see if any more bugs pop up or I think of any API changes I should make.

-- Chris Madsen  p...@cjmweb.net  http://www.cjmweb.net
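The canonical-name versus MIME-name distinction can be demonstrated with core Encode alone. The 'utf-8-strict' value below is just an assumed example of what a sniffer could hand back; it happens to be Encode's canonical name for strict UTF-8:

```perl
use strict;
use warnings;
use Encode ();

# Assume a sniffer returned Perl's canonical encoding name:
my $charset = 'utf-8-strict';

# Map it to the IANA/MIME charset name, which is what belongs
# in an HTTP Content-Type header:
my $mime = Encode::find_encoding($charset)->mime_name;
print "$mime\n";    # prints "UTF-8"
```

So a caller that needs the wire-format name would apply mime_name, while one that just wants to decode can pass the canonical name straight to Encode.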
Re: Freeing HTTP::Message from HTML::Parser dependency
On 1/16/2012 6:53 PM, Bjoern Hoehrmann wrote:
> * Christopher J. Madsen wrote:
>> My repo is https://github.com/madsen/io-html but since it's built with
>> dzil, I also made a gist of the processed module to make it easier to
>> read the docs: https://gist.github.com/1623654
>
> It is not clear to me that the combination would actually conform to the
> HTML5 proposal, for instance, HTTP::Message seems to recognize UTF-32
> BOMs, but as I recall the HTML5 proposal does not allow that.

Dropping support for UTF-32 from HTTP::Message is a separate issue from removing HTML::Parser. I've got no comment on that.

> Your UTF-8 validation code seems wrong to me, you consider the sequence
> F0 80 to be incomplete, but it's actually invalid, same for ED 80, see
> the chart in http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#design.

I guess the RE could be improved, but I'm not sure it's worth the effort and added complication to catch a tiny fraction of false positives.

> Anyway, if people think this is the way to go, maybe HTTP::Message can
> adopt the Content-Type header charset extraction tests in HTML::Encoding
> so they don't get lost as my module becomes redundant?

I thought it already did that?

-- Chris Madsen  p...@cjmweb.net  http://www.cjmweb.net
Re: Freeing HTTP::Message from HTML::Parser dependency
On 1/16/2012 9:52 PM, Bjoern Hoehrmann wrote:
> * Christopher J. Madsen wrote:
>>> Your UTF-8 validation code seems wrong to me, you consider the sequence
>>> F0 80 to be incomplete, but it's actually invalid, same for ED 80, see
>>> the chart in http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#design.
>>
>> I guess the RE could be improved, but I'm not sure it's worth the effort
>> and added complication to catch a tiny fraction of false positives.
>
> Why make the check at all if you don't care if it's right?

I can't use a simple utf8::decode check, because I read a fixed number of bytes, and that might have cut a multi-byte character in half. So I use Encode::FB_QUIET, and then check the leftovers to make sure that it's a single, plausible, partial UTF-8 character. I have to check the leftovers, or the whole test would be meaningless. I just make sure it's a start byte followed by an appropriate number of continuation bytes.

As you say, certain start bytes can't validly be followed by certain continuation bytes, but writing an RE for those rules is more complexity than I think the problem warrants. What are the odds that I had 1021 bytes of valid UTF-8 (including at least 1 multi-byte character) followed by bytes that match my current RE but a strict test could have rejected? I'm already just assuming that the next bytes would be additional continuation bytes.

>>> Anyway, if people think this is the way to go, maybe HTTP::Message can
>>> adopt the Content-Type header charset extraction tests in HTML::Encoding
>>> so they don't get lost as my module becomes redundant?
>>
>> I thought it already did that?
>
> Not as far as I can tell; links welcome though.

At the beginning of content_charset, it calls content_type_charset (which is actually a HTTP::Headers method). Or were you talking about t/01http.t and its associated input files?

-- Chris Madsen  p...@cjmweb.net  http://www.cjmweb.net
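The FB_QUIET-plus-leftovers technique described above can be sketched like this. The regex is my own simplified illustration of a "start byte plus continuation bytes" check, not the module's exact RE:

```perl
use strict;
use warnings;
use Encode qw(decode FB_QUIET);

# Simulate a fixed-size read that split a multi-byte character:
# "e-acute" is 0xC3 0xA9 in UTF-8, but we only got the 0xC3.
my $bytes = "caf\xC3";

# FB_QUIET decodes what it can and leaves the unprocessed tail
# in $bytes (Encode modifies the source string in place).
my $text = decode('utf-8', $bytes, FB_QUIET);

# Accept the leftovers only if they look like a single partial
# UTF-8 character: one start byte, up to three continuation bytes.
my $plausible = $bytes =~ /\A[\xC2-\xF4][\x80-\xBF]{0,3}\z/;
print $plausible ? "looks like truncated UTF-8\n" : "invalid\n";
# prints "looks like truncated UTF-8"
```

As the thread notes, this deliberately skips the per-start-byte rules (e.g. that F0 80 is invalid rather than incomplete); a strict check would need a more elaborate regex.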
[PATCH] Updated support for --full-time in File::Listing
Attached is a patch against LWP 5.800 (or 5.79, the files didn't change) to allow File::Listing to interpret the output of GNU ls's --full-time option (tested with GNU ls 4.1 and 4.5, which unfortunately have completely different formats). This allows you to get timestamps accurate to the second, instead of the minute-based ones you get with a normal ls -l.

It also handles BSD ls's -T option (which is similar to GNU ls 4.1's --full-time option). (Thanks to Ville Skyttä, who sent an example of a BSD listing, confirming that it does look like the manpage said.)

The new time formats are recognized automatically; you just call parse_dir like you normally would.

This time I made sure to test the patch file; it should apply cleanly.

Please CC me on responses; I'm not subscribed to the list.

-- Chris Madsen  [EMAIL PROTECTED]  -- http://www.pobox.com/~cjm --

--- c:/tmp/lib/File/Listing.pm	Sun Oct 26 08:24:22 2003
+++ lib/File/Listing.pm	Tue Jun 15 15:33:10 2004
@@ -144,7 +144,9 @@
    .*                                       # Graps
    \D(\d+)                                  # File size
    \s+                                      # Some space
-   (\w{3}\s+\d+\s+(?:\d{1,2}:\d{2}|\d{4}))  # Date
+   (\w{3}\s+\d+\s+(?:\d{1,2}:\d{2}|\d{4}) |                              # Date
+    \w{3}\s+(?:\w{3}\s+)?\d+\s+\d{1,2}:\d{2}(?::\d{2})?\s+\d{4} |        # or Full date
+    \d{4}-\d\d-\d\d\s+\d{1,2}:\d\d(?::\d\d(?:\.\d+)?)?(?:\s+[-+]\d{4})?) # or ISO date
    \s+                                      # Some more space
    (.*)$                                    # File name
   /x )
@@ -371,10 +373,12 @@
 =head1 DESCRIPTION
 
 This module exports a single function called parse_dir(), which can be
-used to parse directory listings. Currently it only understand Unix
-C<'ls -l'> and C<'ls -lR'> format. It should eventually be able to
-most things you might get back from a ftp server file listing (LIST
-command), i.e. VMS listings, NT listings, DOS listings,...
+used to parse directory listings. Currently it only understands Unix
+C<'ls -l'> and C<'ls -lR'> format. It also understands the
+C<--full-time> option of GNU B<ls> and the C<-T> option of BSD B<ls>. It
+should eventually be able to parse most things you might get back from
+a ftp server file listing (LIST command), i.e. VMS listings, NT
+listings, DOS listings,...
 
 The first parameter to parse_dir() is the directory listing to parse.
 It can be a scalar, a reference to an array of directory lines or a
--- c:/tmp/t/base/listing.t	Thu Nov 14 07:07:44 1996
+++ t/base/listing.t	Thu Jun 17 10:55:10 2004
@@ -1,4 +1,4 @@
-print "1..6\n";
+print "1..26\n";
 
 use File::Listing;
 
@@ -84,3 +84,167 @@
 
 $mode == 0100644 || print "not ";
 print "ok 6\n";
+
+
+# Test GNU ls version 4.1 -l --full-time format:
+$full_time_dir = <<'EOL';
+total 68
+drwxr-xr-x    4 aas      users        1024 Tue Mar 16 15:47:02 2004 .
+drwxr-xr-x   11 aas      users        1024 Mon Mar 15 19:22:31 2004 ..
+drwxr-xr-x    2 aas      users        1024 Tue Mar 16 15:47:40 2004 CVS
+-rw-r--r--    1 aas      users        2384 Thu Feb 26 21:14:22 2004 Debug.pm
+-rw-r--r--    1 aas      users        2145 Thu Feb 26 20:09:12 2004 IO.pm
+-rw-r--r--    1 aas      users        3960 Mon Mar 15 18:05:50 2004 MediaTypes.pm
+-rw-r--r--    1 aas      users         792 Thu Feb 26 20:12:44 2004 MemberMixin.pm
+drwxr-xr-x    3 aas      users        1024 Mon Mar 15 18:05:33 2004 Protocol
+-rw-r--r--    1 aas      users        5613 Thu Feb 26 20:16:22 2004 Protocol.pm
+-rw-r--r--    1 aas      users        5963 Wed Feb 26 21:27:11 2003 RobotUA.pm
+-rw-r--r--    1 aas      users        5071 Tue Mar 16 12:25:58 2004 Simple.pm
+-rw-r--r--    1 aas      users        8817 Sat Mar 15 18:05:59 2003 Socket.pm
+-rw-r--r--    1 aas      users        2121 Tue Feb  5 14:22:00 2002 TkIO.pm
+-rw-r--r--    1 aas      users       19628 Mon Mar 15 18:05:20 2004 UserAgent.pm
+-rw-r--r--    1 aas      users        2841 Thu Feb  5 19:06:30 2004 media.types
+EOL
+@dir = parse_dir($full_time_dir, undef, 'unix');
+
+# Pick out the Socket.pm line as the sample we check carefully
+($name, $type, $size, $mtime, $mode) = @{$dir[9]};
+
+$name eq "Socket.pm" || print "not ";
+print "ok 7\n";
+
+$type eq "f" || print "not ";
+print "ok 8\n";
+
+$size == 8817 || print "not ";
+print "ok 9\n";
+
+scalar(localtime($mtime)) eq 'Sat Mar 15 18:05:59 2003' or print "not ";
+print "ok 10\n";
+
+$mode == 0100644 || print "not ";
+print "ok 11\n";
+
+
+# Test GNU ls version 4.5 -l --full-time format (aka --time-style=full-iso):
+$iso_time_dir = <<'EOL';
+total 68
+drwxr-xr-x    4 aas      users        1024 2004-03-16 15:47:02.0 -0600 .
+drwxr-xr-x   11 aas      users        1024 2004-03-15 19:22:31.0 -0600 ..
+drwxr-xr-x    2 aas      users        1024 2004-03-16 15:47:40.0 -0600 CVS
+-rw-r--r--    1 aas      users        2384 2004-02-26 21:14:22.0 -0600
Re: Patch to support --full-time in File::Listing
Gisle Aas writes:
> The patch did not apply here. Are you patching from a pristine 5.79?

Yes. The problem is in the patch file. There needs to be a blank line before the line:

  --- t/base/listing.t	Thu Nov 14 07:07:44 1996

Otherwise, it thinks the whole patch applies to lib/File/Listing.pm, which it doesn't. I'm not sure how that happened. Next time, I'll make sure I try applying the patch before submitting.

> Anyway, this is how --full-time comes out here (Redhat 9). It does not
> appear to be the same format you try to parse.

Yeah, I was working with an older system. The patch is for GNU ls 4.1. They changed the format somewhere between 4.1 and 4.5. (I haven't checked the change logs to find out when.) I encountered this myself, and have a new change that also handles the 4.5 format. I haven't yet assembled a patch to submit.

I'm trying to decide if it should also support the --time-style=long-iso option, which looks like this:

  drwxr-xr-x  2 cjm cjm 4096 2004-06-14 22:13 CD1
  drwxr-xr-x  2 cjm cjm 4096 2004-06-14 22:20 CD2

-- Chris Madsen  [EMAIL PROTECTED]  -- http://www.pobox.com/~cjm --
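As a standalone sketch, here is the "or ISO date" alternative from the proposed File::Listing patch matched against a GNU ls --time-style=full-iso style line. The sample line and its field widths are my own illustration, not data from the patch:

```perl
use strict;
use warnings;

# A full-iso style line: ISO date, time with fractional seconds,
# and a numeric timezone offset before the file name.
my $line = 'drwxr-xr-x 2 cjm cjm 4096 2004-03-16 15:47:02.000000000 -0600 CVS';

# The ISO-date branch of the patch's regex: seconds, fractional
# seconds, and the timezone offset are all optional, so the same
# pattern also matches the shorter --time-style=long-iso output.
my ($date, $name);
if ($line =~ /(\d{4}-\d\d-\d\d\s+\d{1,2}:\d\d(?::\d\d(?:\.\d+)?)?(?:\s+[-+]\d{4})?)\s+(.*)$/x) {
    ($date, $name) = ($1, $2);
}
print "date: $date\nname: $name\n";
# prints:
#   date: 2004-03-16 15:47:02.000000000 -0600
#   name: CVS
```

Because every post-date component is optional, the long-iso question in the post above may already be covered by this branch; that would need checking against real long-iso output.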