Re: Project Gutenberg

David A. Desrosiers Sat, 24 Jul 2004 01:28:54 -0700

> > The only problem is I find I have to tweak them for different books.


> All right, then...what about trying some of the heuristic algorithms other
> people have been cooking up?

        Here's something I cooked up (with a small bit of help from another
Perl person I dragged into this to pick his brain for some ideas). Maybe
this will give others here some ideas to work with.

        This takes a text file as an argument (./foo.pl /tmp/8584.txt),
splits the chapters found in it into separate .html files, and wraps each
chapter's content in pseudo-HTML. It adds a clickable "Next chapter" link
at the bottom of each page created. It definitely needs to be "smarter", but
this was only an hour or so of work and testing. The important part here is
the ability to split the text file up by chapter. With that in place, a lot
more can be done to make the system "intelligent" enough to deal with
multiple different kinds of input formats, add a clickable Table of
Contents for each converted book, and so on.
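
        To see the chapter split in isolation, here is a minimal sketch
that runs the same record-separator trick against an in-memory string
instead of a real ebook. The sample text and variable names are made up
for illustration, and it assumes every chapter heading starts a line with
"CHAPTER ", which will not hold for every Gutenberg text.

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for the body of a Gutenberg ebook.
my $sample = join "\n",
    "Some front matter.",
    "CHAPTER I.",
    "First chapter text.",
    "CHAPTER II.",
    "Second chapter text.",
    "";

# Read from an in-memory filehandle so the sketch needs no input file.
open( my $in, '<', \$sample ) or die "can't open in-memory file: $!";

my ( $preamble, @chapters );
{
    # Localize the record separator so the change doesn't leak out.
    local $/ = "\nCHAPTER ";
    $preamble = <$in>;    # everything before the first chapter heading
    @chapters = <$in>;    # one record per chapter
    chomp $preamble;      # chomp strips the current $/, i.e. "\nCHAPTER "
    chomp @chapters;
}
close $in;

printf "%d chapters found\n", scalar @chapters;
print "chapter ", $_ + 1, " starts with: ",
    ( split /\n/, $chapters[$_] )[0], "\n" for 0 .. $#chapters;
```

        Localizing $/ inside a block keeps the rest of a larger script
reading line by line as usual.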

        It's a start. If I get some more time, I'll embellish it some more
and add new features. I have a few more ideas I'd like to drop into this one,
based on some of my other production spidering/conversion code running in
other places.


#####################################################################
# A perl script to convert Project Gutenberg ebooks to HTML format
# suitable for Plucking and reading on a Palm handheld device.
#
# Copyright 1994-2004 David A. Desrosiers. All Rights Reserved.
#
# Permission to use, copy, modify, and distribute this software
# is hereby granted without fee, provided that the copyright
# notice and permission notice are not removed.
#
# Last modified: 7/24/2004
#
# TODO:
# ------------------------
# - Add the heuristics for the various PG editor styles
#
# Tested with "Roughing It, Part 3., by Mark Twain (Samuel Clemens)"
# from ftp://ftp.archive.org/pub/etext/8/5/8/8584/8584.txt

use strict;
use CGI;

my $cgi = CGI->new();
my $title;

my @special_cases = (
        # replace 5 consecutive blank lines with a horizontal rule
        [ "\r\n\r\n\r\n\r\n\r" => "\n<hr />\n", ],

        # replace 3 blank lines with a wide spacer.
        [ "\r\n\r\n\r\n" => "\n<p><br /></p>\n", ],

        # replace 2 consecutive ^Ms with a <p>
        [ "\cM\cM" => "<p>\n\n", ],

        # Turn paragraphing on, this needs work
        [ "\cM\n\cM\n" => "</p>\n\n<p>", ],

        # Unwrap paragraphs into their own lines
        [ "\cM\n" => " ", ],
);

# get past the Gutenberg boilerplate stuff, reading a line at a time:
my @begin_boiler;
while (<>) {
        push @begin_boiler, $_;
        last if ( /^\*\*\* START OF THIS PROJECT GUTENBERG EBOOK/ );
}
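my $begin_boilertext = join '', @begin_boiler;

# The title line lives in the header boilerplate, so look for it here,
# before any page is written. $title stays undef if the line is absent.
($title) = $begin_boilertext =~ m/Project Gutenberg's (\w[^\r\n]*)/;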

$/ = "\nCHAPTER ";  # from now on we'll read a chapter at a time

my $preamble = <>;  # everything before "CHAPTER I" (in the case of 86.txt)
my @chapters = <>;

# We now have all the data in $begin_boilertext, $preamble and @chapters
# Note that $preamble and @chapters (except the last one) all end with
# "\nCHAPTER " -- don't need that anymore, and since that is what $/
# is set to, just chomp 'em:
chomp $preamble;
chomp @chapters;  # no impact on last chapter, of course.

# Find the Gutenberg end-of-file boilerplate stuff:
my $lastch = $#chapters;
my $end_boiler = rindex( $chapters[$lastch],
        "*** END OF THIS PROJECT GUTENBERG EBOOK" );

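# rindex returns -1 when the marker is missing; leave the chapter alone then.
my $end_boilertext = '';
if ( $end_boiler >= 0 ) {
        $end_boilertext = substr( $chapters[$lastch], $end_boiler );
        $chapters[$lastch] = substr( $chapters[$lastch], 0, $end_boiler );
}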

my $chapter_id = 0;
for ($preamble, @chapters) {
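        # $chapter_id 0 is the preamble, which has no "CHAPTER " heading
        # to restore; every later record lost its heading to chomp above.
        $_ = ( $chapter_id ? "<p>CHAPTER " : "<p>" ) . $_;
        $_ = $cgi->start_html(-title=>"$title") . $_;

        my $fname = sprintf( "%03d.html", $chapter_id );
        # $prev_chapter is not used yet (and is invalid for the preamble).
        my $prev_chapter = sprintf( "%03d.html", $chapter_id - 1 );
        my $next_chapter = sprintf( "%03d.html", $chapter_id + 1 );
        $chapter_id++;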

        # This next bit needs some work, so it rolls around from
        # the last chapter back to the first, etc.
        $_ = $_ . "<a href=\"$next_chapter\">Next chapter</a>";
        $_ = $_ . $cgi->hr();
        $_ = $_ . $cgi->end_html;

        open( my $ch, '>', $fname ) or die "can't write to $fname: $!";
        print {$ch} add_html($_);
        close $ch or die "can't close $fname: $!";
}

#######################################
#
# Wrap the plain text in html tags
#
#######################################
sub add_html {
        my $text = shift;

        foreach my $case (@special_cases){
                $text =~ s,\Q$case->[0]\E,$case->[1],g;
                $text =~ s,(\w)\s+(\w),$1 $2,g;
                $text =~
                   s,Produced by (\w.*),<b>Produced by</b>: <i>$1</i>,;
                # Fall back to the page text if no title has been found yet.
                $title = $1 if !defined $title
                        && $text =~ m/Project Gutenberg's (\w[^\r\n]*)/;

                # Replace "spoken text" with <i>"spoken text"</i>
                #
                # This will break horribly, if someone forgot to balance
                # their quotes properly. For now, it is disabled, unless
                # you REALLY REALLY know your text is properly-balanced.
                #
                # $text =~ s,"(.*?)",<i>"$1"</i>,gs;
        }
        return $text;
}
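
The wrap-around that the comment in the main loop asks for could look
something like the sketch below. chapter_links is a made-up helper name,
not part of the script above, and it assumes the total number of pages is
known before any page is written.

```perl
use strict;
use warnings;

# Hypothetical helper: file names for the current, previous and next
# chapter pages, wrapping around at both ends of the book.
sub chapter_links {
    my ( $id, $total ) = @_;
    my $name = sub { sprintf "%03d.html", $_[0] };
    return (
        $name->($id),
        # Perl's % takes the sign of the right operand (without
        # "use integer"), so -1 % $total wraps to the last page.
        $name->( ( $id - 1 ) % $total ),
        $name->( ( $id + 1 ) % $total ),
    );
}

my ( $fname, $prev, $next ) = chapter_links( 0, 12 );
print "$fname: prev=$prev next=$next\n";
```

In the main loop, $total would be the preamble plus the chapters, i.e.
1 + @chapters.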


d.

_______________________________________________
plucker-list mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list
