Re: HTML-Tree gets weakref support!

2012-06-12 Thread Christopher J. Madsen
HTML-Tree 5.00 is now available on CPAN:
https://metacpan.org/release/HTML-Tree

Release announcement:
http://blogs.perl.org/users/christopher_j_madsen/2012/06/html-tree-5.html

-- 
Chris Madsen  p...@cjmweb.net
    http://www.cjmweb.net  



HTML-Tree gets weakref support!

2012-06-01 Thread Christopher J. Madsen
At long last, HTML-Tree can now use weakrefs.  This means you no longer
need to call $tree->delete to prevent memory leaks.

Right now, you can get trial release 4.900 from CPAN:
https://metacpan.org/release/CJM/HTML-Tree-4.900-TRIAL/

I plan to release 5.00 before YAPC::NA, probably on June 12.  (This
would probably be a good subject for a lightning talk, if someone wants
to give it.  I'd do it myself, but I'm not going to be able to make it
to YAPC this year.)

If you're using HTML-Tree, please test your code with the new version now.

While most programs should continue to work fine with this change, it
could break your code.  The one real-world example I've found so far is
pQuery's dom.t.  In pQuery 0.08, it does:

  my @elems = pQuery::DOM->fromHTML('<div>xxx<!-- yyy -->zzz</div>')
                         ->childNodes;
  my $comment = $elems[1];
  is $comment->parentNode->tagName, 'DIV', 'Comment has parentNode';

Notice that it's not saving the result of the fromHTML call; only the
child nodes.  Since children now have only a weak reference to their
parent, the root node is deleted immediately, and $comment->parentNode
is undef.

This can be fixed by saving a reference to the root node:

  my @elems = (my $r = pQuery::DOM
                   ->fromHTML('<div>xxx<!-- yyy -->zzz</div>'))
                   ->childNodes;
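
The effect is easy to reproduce without HTML-Tree.  Here is a minimal
sketch using Scalar::Util's weaken; new_node and build_tree are
hypothetical helpers, and the hash layout is only a stand-in for a tree
node, not HTML::Element's real structure:

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

# Hypothetical stand-in for a tree node; not HTML::Element's internals.
sub new_node {
    my ($name, $parent) = @_;
    my $node = { name => $name, parent => $parent, children => [] };
    if ($parent) {
        weaken($node->{parent});    # child holds only a weak ref to its parent
        push @{ $parent->{children} }, $node;
    }
    return $node;
}

sub build_tree {
    my $root = new_node('div');
    new_node($_, $root) for qw(text comment);
    return $root;
}

# Only the children are kept: the temporary root is freed at the end of
# the statement, so the weak parent link becomes undef.
my @lost        = @{ build_tree()->{children} };
my $lost_parent = $lost[1]{parent};

# Saving the root in $r keeps the parent link alive.
my @kept        = @{ (my $r = build_tree())->{children} };
my $kept_parent = $kept[1]{parent};

print defined $lost_parent ? "parent alive\n" : "parent gone\n";
print defined $kept_parent ? "parent alive\n" : "parent gone\n";
```

The first print reports the parent as gone and the second as alive,
which is the same difference as between the two pQuery snippets.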

As a quick fix for broken code (and to determine whether it's the weak
references that are causing the breakage), you can say:

  use HTML::Element -noweak;

This (globally) disables HTML-Tree's use of weak references.  But this
is just a temporary measure.  You need to fix your code, because this
feature will be going away eventually.

-- 
Chris Madsen  p...@cjmweb.net
    http://www.cjmweb.net  



Freeing HTTP::Message from HTML::Parser dependency

2012-01-16 Thread Christopher J. Madsen
I stumbled across this bug:

  https://rt.cpan.org/Ticket/Display.html?id=66313

and a discussion here about removing HTTP::Message's dependency on
HTML::Parser (which needs a C compiler) for charset sniffing.

As it happens, I'm about to release a new dist that implements the HTML5
encoding sniffing algorithm in pure-Perl with no non-core dependencies
for 5.8+.  While its primary function is to make it dead simple to open
an HTML file and get the right encoding layer applied automatically, it
also exposes the underlying mechanism.

My repo is https://github.com/madsen/io-html but since it's built with
dzil, I also made a gist of the processed module to make it easier to
read the docs: https://gist.github.com/1623654

I took a quick look at HTTP::Message, and I think you'd just need to do

  elsif ($self->content_is_html) {
      require IO::HTML;
      my $charset = IO::HTML::find_charset_in($$cref);
      return $charset if $charset;
  }

You're already doing the BOM and valid-UTF8 checks; all you need is the
meta check, which is what find_charset_in does.

One possible issue is that find_charset_in returns Perl's canonical name
for the encoding, which is not necessarily the same as the MIME or IANA
charset name.  You could do

  return Encode::find_encoding($charset)->mime_name if $charset;

if you want.
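
To see the difference between the two kinds of names, here is a small
sketch that uses only the core Encode module.  mime_charset is a
hypothetical helper, and the fallback to the input name is my own
assumption about what a caller might want:

```perl
use strict;
use warnings;
use Encode ();

# Map an encoding name to its MIME charset name.  Falls back to the
# input when Encode does not know the encoding or has no MIME name
# for it (that fallback is an assumption, not HTTP::Message behavior).
sub mime_charset {
    my ($name) = @_;
    my $enc = Encode::find_encoding($name) or return $name;
    return $enc->mime_name || $name;
}

# Print Perl's canonical name next to the MIME name for a few encodings.
for my $name (qw(utf-8-strict iso-8859-1 cp1252)) {
    printf "%-14s => %s\n", $name, mime_charset($name);
}
```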

I'm planning to release this in a week or so, after I see if any more
bugs pop up or I think of any API changes I should make.

-- 
Chris Madsen  p...@cjmweb.net
    http://www.cjmweb.net  



Re: Freeing HTTP::Message from HTML::Parser dependency

2012-01-16 Thread Christopher J. Madsen
On 1/16/2012 6:53 PM, Bjoern Hoehrmann wrote:
> * Christopher J. Madsen wrote:
>> My repo is https://github.com/madsen/io-html but since it's built with
>> dzil, I also made a gist of the processed module to make it easier to
>> read the docs: https://gist.github.com/1623654
>
> It is not clear to me that the combination would actually conform to the
> HTML5 proposal, for instance, HTTP::Message seems to recognize UTF-32
> BOMs, but as I recall the HTML5 proposal does not allow that.

Dropping support for UTF-32 from HTTP::Message is a separate issue from
removing HTML::Parser.  I've got no comment on that.

> Your UTF-8 validation code seems wrong to me, you consider the sequence
> F0 80 to be incomplete, but it's actually invalid, same for ED 80, see
> the chart in http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#design.

I guess the RE could be improved, but I'm not sure it's worth the effort
and added complication to catch a tiny fraction of false positives.

> Anyway, if people think this is the way to go, maybe HTTP::Message can
> adopt the Content-Type header charset extraction tests in HTML::Encoding
> so they don't get lost as my module becomes redundant?

I thought it already did that?

-- 
Chris Madsen  p...@cjmweb.net
    http://www.cjmweb.net  



Re: Freeing HTTP::Message from HTML::Parser dependency

2012-01-16 Thread Christopher J. Madsen
On 1/16/2012 9:52 PM, Bjoern Hoehrmann wrote:
> * Christopher J. Madsen wrote:
>>> Your UTF-8 validation code seems wrong to me, you consider the sequence
>>> F0 80 to be incomplete, but it's actually invalid, same for ED 80, see
>>> the chart in http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#design.
>>
>> I guess the RE could be improved, but I'm not sure it's worth the effort
>> and added complication to catch a tiny fraction of false positives.
>
> Why make the check at all if you don't care if it's right?

I can't use a simple utf8::decode check, because I read a fixed number
of bytes, and that might have cut a multi-byte character in half.  So I
use Encode::FB_QUIET, and then check the leftovers to make sure that
it's a single, plausible, partial UTF-8 character.  I have to check the
leftovers, or the whole test would be meaningless.  I just make sure
it's a start byte followed by an appropriate number of continuation bytes.

As you say, certain start bytes can't validly be followed by certain
continuation bytes, but writing an RE for those rules is more complexity
than I think the problem warrants.  What are the odds that I had 1021
bytes of valid UTF-8 (including at least 1 multi-byte character)
followed by bytes that match my current RE but a strict test could have
rejected?  I'm already just assuming that the next bytes would be
additional continuation bytes.
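
For anyone following along, here is a rough sketch of that approach
using only core Encode.  It is not the code from IO::HTML;
looks_like_utf8 is a hypothetical name, and the leftover pattern is a
loose one written for illustration:

```perl
use strict;
use warnings;
use Encode ();

# Return true if $bytes (a possibly truncated buffer) looks like UTF-8.
# A sketch of the approach described above, not IO::HTML's actual code.
sub looks_like_utf8 {
    my ($bytes) = @_;    # a copy: decode() leaves the unprocessed part in it
    my $text = Encode::decode('UTF-8', $bytes, Encode::FB_QUIET);

    # Require at least one multi-byte character; otherwise it is just ASCII.
    return 0 unless $text =~ /[^\x00-\x7F]/;

    # Whatever FB_QUIET left behind must be empty or one plausible partial
    # character: a start byte followed by fewer continuation bytes than it
    # needs.  Deliberately loose: F0 80 and ED 80 are accepted here.
    return $bytes =~ /^(?: [\xC2-\xDF]
                         | [\xE0-\xEF] [\x80-\xBF]?
                         | [\xF0-\xF4] [\x80-\xBF]{0,2} )?\z/x ? 1 : 0;
}

# "caf\xC3\xA9" is UTF-8 for "cafe" with an acute accent; the trailing
# "\xE2\x82" is a euro sign with its last byte cut off by the buffer.
print looks_like_utf8("caf\xC3\xA9 \xE2\x82") ? "utf-8\n" : "not utf-8\n";
print looks_like_utf8("caf\xC3\xA9 \xFF")     ? "utf-8\n" : "not utf-8\n";
```

The first buffer is accepted because the leftover is a single cut-off
character; the second is rejected because FF can never start one.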

>>> Anyway, if people think this is the way to go, maybe HTTP::Message can
>>> adopt the Content-Type header charset extraction tests in HTML::Encoding
>>> so they don't get lost as my module becomes redundant?
>>
>> I thought it already did that?
>
> Not as far as I can tell; links welcome though.

At the beginning of content_charset, it calls content_type_charset
(which is actually a HTTP::Headers method).

Or were you talking about t/01http.t and its associated input files?

-- 
Chris Madsen  p...@cjmweb.net
    http://www.cjmweb.net  



[PATCH] Updated support for --full-time in File::Listing

2004-06-18 Thread Christopher J. Madsen
Attached is a patch against LWP 5.800 (or 5.79, the files didn't change)
to allow File::Listing to interpret the output of GNU ls's --full-time
option (tested with GNU ls 4.1 and 4.5, which unfortunately have
completely different formats).  This allows you to get timestamps
accurate to the second, instead of the minute-based ones you get with a
normal ls -l.

It also handles BSD ls's -T option (which is similar to GNU ls 4.1's
--full-time option).  (Thanks to Ville Skyttä, who sent an example of
a BSD listing, confirming that it does look like the manpage said.)

The new time formats are recognized automatically; you just call
parse_dir like you normally would.
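
To show how one pattern can pick out all of these date styles, here is
a standalone sketch built around the date alternation from the patch.
split_ls_line is a hypothetical helper with a much simplified prefix;
the real pattern in File::Listing also parses the permissions, owner
and so on, and the sample lines are made up:

```perl
use strict;
use warnings;

# Simplified stand-in for the pattern in File::Listing (hypothetical helper).
# Returns (size, date, name) for a matching line, or an empty list.
sub split_ls_line {
    my ($line) = @_;
    return $line =~ /
        ^ .* \D (\d+)                     # file size
        \s+
        ( \w{3}\s+\d+\s+(?:\d{1,2}:\d{2}|\d{4})                                # ls -l date
        | \w{3}\s+(?:\w{3}\s+)?\d+\s+\d{1,2}:\d{2}(?::\d{2})?\s+\d{4}          # full date
        | \d{4}-\d\d-\d\d\s+\d{1,2}:\d\d(?::\d\d(?:\.\d+)?)?(?:\s+[-+]\d{4})?  # ISO date
        )
        \s+
        (.*) \z                           # file name
    /x;
}

# One made-up sample line per date style.
my @samples = (
    '-rw-r--r--    1 aas      users    8817 Mar 15 18:05 Socket.pm',
    '-rw-r--r--    1 aas      users    8817 Sat Mar 15 18:05:59 2003 Socket.pm',
    '-rw-r--r--    1 aas      users    8817 Mar 15 18:05:59 2003 Socket.pm',
    '-rw-r--r--    1 aas      users    8817 2003-03-15 18:05:59.000000000 -0600 Socket.pm',
);

for my $line (@samples) {
    my ($size, $date, $name) = split_ls_line($line) or next;
    print "$name: $size bytes, date '$date'\n";
}
```

The alternatives are tried in order, so a plain ls -l date is still
matched by the first branch and the longer formats only kick in when it
fails.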

This time I made sure to test the patch file; it should apply cleanly.

Please CC me on responses; I'm not subscribed to the list.

-- 
Chris Madsen[EMAIL PROTECTED]
  --  http://www.pobox.com/~cjm  --
--- c:/tmp/lib/File/Listing.pm  Sun Oct 26 08:24:22 2003
+++ lib/File/Listing.pm Tue Jun 15 15:33:10 2004
@@ -144,7 +144,9 @@
 .*   # Graps
 \D(\d+)  # File size
 \s+  # Some space
-(\w{3}\s+\d+\s+(?:\d{1,2}:\d{2}|\d{4}))  # Date
+(\w{3}\s+\d+\s+(?:\d{1,2}:\d{2}|\d{4}) | # Date
+ \w{3}\s+(?:\w{3}\s+)?\d+\s+\d{1,2}:\d{2}(?::\d{2})?\s+\d{4} | # or Full date
+ \d{4}-\d\d-\d\d\s+\d{1,2}:\d\d(?::\d\d(?:\.\d+)?)?(?:\s+[-+]\d{4})?) # or ISO date
 \s+  # Some more space
 (.*)$# File name
/x )
@@ -371,10 +373,12 @@
 =head1 DESCRIPTION

 This module exports a single function called parse_dir(), which can be
-used to parse directory listings. Currently it only understand Unix
-C<'ls -l'> and C<'ls -lR'> format.  It should eventually be able to
-most things you might get back from a ftp server file listing (LIST
-command), i.e. VMS listings, NT listings, DOS listings,...
+used to parse directory listings. Currently it only understands Unix
+C<'ls -l'> and C<'ls -lR'> format.  It also understands the
+C<--full-time> option of GNU B<ls> and the C<-T> option of BSD B<ls>.  It
+should eventually be able to parse most things you might get back from
+a ftp server file listing (LIST command), i.e. VMS listings, NT
+listings, DOS listings,...

 The first parameter to parse_dir() is the directory listing to parse.
 It can be a scalar, a reference to an array of directory lines or a

--- c:/tmp/t/base/listing.t Thu Nov 14 07:07:44 1996
+++ t/base/listing.t Thu Jun 17 10:55:10 2004
@@ -1,4 +1,4 @@
-print "1..6\n";
+print "1..26\n";


 use File::Listing;
@@ -84,3 +84,167 @@

 $mode == 0100644 || print "not ";
 print "ok 6\n";
+
+
+# Test GNU ls version 4.1 -l --full-time format:
+$full_time_dir = <<'EOL';
+total 68
+drwxr-xr-x    4 aas      users    1024 Tue Mar 16 15:47:02 2004 .
+drwxr-xr-x   11 aas      users    1024 Mon Mar 15 19:22:31 2004 ..
+drwxr-xr-x    2 aas      users    1024 Tue Mar 16 15:47:40 2004 CVS
+-rw-r--r--    1 aas      users    2384 Thu Feb 26 21:14:22 2004 Debug.pm
+-rw-r--r--    1 aas      users    2145 Thu Feb 26 20:09:12 2004 IO.pm
+-rw-r--r--    1 aas      users    3960 Mon Mar 15 18:05:50 2004 MediaTypes.pm
+-rw-r--r--    1 aas      users     792 Thu Feb 26 20:12:44 2004 MemberMixin.pm
+drwxr-xr-x    3 aas      users    1024 Mon Mar 15 18:05:33 2004 Protocol
+-rw-r--r--    1 aas      users    5613 Thu Feb 26 20:16:22 2004 Protocol.pm
+-rw-r--r--    1 aas      users    5963 Wed Feb 26 21:27:11 2003 RobotUA.pm
+-rw-r--r--    1 aas      users    5071 Tue Mar 16 12:25:58 2004 Simple.pm
+-rw-r--r--    1 aas      users    8817 Sat Mar 15 18:05:59 2003 Socket.pm
+-rw-r--r--    1 aas      users    2121 Tue Feb  5 14:22:00 2002 TkIO.pm
+-rw-r--r--    1 aas      users   19628 Mon Mar 15 18:05:20 2004 UserAgent.pm
+-rw-r--r--    1 aas      users    2841 Thu Feb  5 19:06:30 2004 media.types
+EOL
+
+@dir = parse_dir($full_time_dir, undef, 'unix');
+
+# Pick out the Socket.pm line as the sample we check carefully
+($name, $type, $size, $mtime, $mode) = @{$dir[9]};
+
+$name eq "Socket.pm" || print "not ";
+print "ok 7\n";
+
+$type eq "f" || print "not ";
+print "ok 8\n";
+
+$size == 8817 || print "not ";
+print "ok 9\n";
+
+scalar(localtime($mtime)) eq 'Sat Mar 15 18:05:59 2003' or print "not ";
+print "ok 10\n";
+
+$mode == 0100644 || print "not ";
+print "ok 11\n";
+
+
+# Test GNU ls version 4.5 -l --full-time format (aka --time-style=full-iso):
+$iso_time_dir = <<'EOL';
+total 68
+drwxr-xr-x    4 aas      users    1024 2004-03-16 15:47:02.0 -0600 .
+drwxr-xr-x   11 aas      users    1024 2004-03-15 19:22:31.0 -0600 ..
+drwxr-xr-x    2 aas      users    1024 2004-03-16 15:47:40.0 -0600 CVS
+-rw-r--r--    1 aas      users    2384 2004-02-26 21:14:22.0 -0600 

Re: Patch to support --full-time in File::Listing

2004-06-17 Thread Christopher J. Madsen
Gisle Aas writes:
> The patch did not apply here.  Are you patching from a pristine 5.79?

Yes.  The problem is in the patch file.  There needs to be a blank
line before the line:
--- t/base/listing.t Thu Nov 14 07:07:44 1996

Otherwise, it thinks the whole patch applies to lib/File/Listing.pm,
which it doesn't.  I'm not sure how that happened.  Next time, I'll
make sure I try applying the patch before submitting.

> Anyway, this is how --full-time comes out here (Redhat 9).  It does
> not appear to be the same format you try to parse.

Yeah, I was working with an older system.  The patch is for GNU ls
4.1.  They changed the format somewhere between 4.1 and 4.5. (I
haven't checked the change logs to find out when.)

I encountered this myself, and have a new change that also handles the
4.5 format.  I haven't yet assembled a patch to submit.  I'm trying to
decide if it should also support the --time-style=long-iso option,
which looks like this:

drwxr-xr-x  2 cjm cjm 4096 2004-06-14 22:13 CD1
drwxr-xr-x  2 cjm cjm 4096 2004-06-14 22:20 CD2

-- 
Chris Madsen[EMAIL PROTECTED]
  --  http://www.pobox.com/~cjm  --