Re: A Bit of a Strange Situation

Bob Proulx Thu, 25 Aug 2011 15:46:21 -0700

RiverWind wrote:
> You see, the files have a bit of an unconventional extension, to
> wit "cookbook3.html#SEC1 or cookbook14.html#SEC2" and so on. You
> see, the first number before the ".html" I believe designates the
> part, and the number following the "#SEC" indicates the different
> sections in the respective parts of the book.


I see your problem now.  You are being confused by the types of links
used in the document.  This is simply a misunderstanding.  I think I
can clear this up for you.  When I look at this next URL:

  http://www.dsl.org/cookbook/cookbook_toc.html

I see a table of contents in 45 parts.  But each link has a fragment
identifier on the end to jump into the middle of the part.  Each link
jumps to a sub-section of the chapter.  This makes it convenient for
readers to jump to a sub-part of the document.  But you should ignore
those.  They are not separate files.  They are anchors in the middle
of each part.

Let me dive into a little detail of the anchors.  But do keep reading
because after this I will show you how to solve your problem.

Let me repeat the html of the very first link on the page.  This might
confuse your screen reader and if so give me a hint on how I should
represent verbatim html text and I will be happy to do so.

  <A NAME="TOC1" HREF="cookbook_1.html#SEC1">Preface</A>

That generates a link to cookbook_1.html#SEC1 as you already know.
But that "#SEC1" part is simply a reference to a named anchor, used to
jump into the middle of a page.  The file itself is cookbook_1.html.
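
If it helps to see that in action, here is a small shell sketch.  The
function name strip_fragment is something I made up for illustration
and is not part of any tool.  It removes everything from the "#"
onward, which leaves just the name of the file on the server.

```shell
# strip_fragment LINK
# Print LINK with the fragment (the first '#' and everything after it) removed.
# strip_fragment is a hypothetical helper name used only for this example.
strip_fragment() {
  printf '%s\n' "${1%%#*}"
}

strip_fragment "cookbook_1.html#SEC1"     # prints cookbook_1.html
strip_fragment "cookbook_toc.html#TOC1"   # prints cookbook_toc.html
```

A link with no "#" in it passes through unchanged.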

Here let me repeat the html of the part it jumps to:

  <H1><A NAME="SEC1" HREF="cookbook_toc.html#TOC1">Preface</A></H1>

Each sub-section is referenced in this way.

You can read and learn more about these at this URL, which points to
the World Wide Web Consortium reference documentation.  It itself uses
an anchor to jump to the particular part of the document that
describes them.

  http://www.w3.org/TR/html4/struct/links.html#h-12.2.3

> This would tend to make the use of wild cards a bit ticklish.

Actually, no.  Even if those were the filenames you could simply match
them with a wildcard with no problem.  But let's not talk about that
for a moment since it isn't important.  Let's help get you going in
the direction of solving your actual problem and not the side tracking
problem.
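
For the curious, here is a rough sketch of that wildcard point.  It
assumes, purely for the sake of the demonstration, that files with
"#SEC" in their names really existed.  The function name
list_sec_files and the scratch directory are my own invention.

```shell
# list_sec_files DIR
# Print the name of every file in DIR matching cookbook_*.html#SEC*.
# A '#' only starts a comment at the beginning of a word, so inside the
# pattern it is an ordinary character and the * wildcard works as usual.
list_sec_files() {
  for f in "$1"/cookbook_*.html#SEC*; do
    if [ -e "$f" ]; then
      printf '%s\n' "${f##*/}"
    fi
  done
}

# Demonstration with made-up files in a scratch directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/cookbook_3.html#SEC1" "$demo_dir/cookbook_14.html#SEC2"
list_sec_files "$demo_dir"
rm -r "$demo_dir"
```

Both made-up names are printed, so the wildcard copes with them fine.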

> If I could just figure around this problem however, I would be in
> business, because html2txt conversions would be easy, and the
> concatenation even easier.

There are 45 parts linked from the page.  They are named and numbered
very regularly.  You can simply write a for-loop to walk over all 45
of them.  Here is a three line shell script snippet that will do this
for you.

  for chapternum in $(seq 1 45); do
    wget http://www.dsl.org/cookbook/cookbook_$chapternum.html
  done

Let me describe it with some verbosity hoping that it will make it
easier for your reader.  The 'seq' command generates a sequence of
numbers.  Here I am calling "seq 1 45" to generate the numbers from 1
through 45 inclusive.  That command is run within a dollar-parenthesis
command substitution to place those 45 numbers on the command line for
the for-loop to iterate over.  Then the for-loop walks through each in
turn setting the variable named "chapternum" to the current index
value.  Then the wget command uses that dollar chapter num variable to
create the URL to pull each chapter in turn.  The "#SEC" parts are not
really in the filename nor should they be in the filename.

Running that three line shell script snippet should produce 45 files
called cookbook_1.html through cookbook_45.html in the current
directory.  I think at that point you should be okay to convert each
in turn to plain text.
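
One thing to watch when you concatenate: a wildcard such as
cookbook_*.html sorts by character, so part 10 lands before part 2.
Here is a sketch that walks the parts in numeric order instead.  The
function name concat_chapters and its calling convention are made up
for this example.  It assumes your converter takes a file name as its
last argument and writes plain text to standard output.

```shell
# concat_chapters LAST CONVERTER [ARGS...]
# Run CONVERTER on cookbook_1.html through cookbook_LAST.html in numeric
# order and write the combined text to standard output.
# Stops with a non-zero status if the converter fails on any part.
concat_chapters() {
  last=$1
  shift
  for chapternum in $(seq 1 "$last"); do
    "$@" "cookbook_$chapternum.html" || return 1
  done
}

# Example, assuming lynx is installed (-dump prints the rendered page as
# text and -nolist leaves out the list of link targets):
# concat_chapters 45 lynx -dump -nolist > cookbook.txt
```

Any other converter that behaves the same way can be substituted for
lynx in that last line.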

Hope that helps,
Bob
