RiverWind wrote:
> You see, the files have a bit of an unconventional extension, to
> wit "cookbook3.html#SEC1 or cookbook14.html#SEC2" and so on. You
> see, the first number before the ".html" I believe designates the
> part, and the number following the "#SEC" indicates the different
> sections in the respective parts of the book.

I see your problem now. You are being confused by the types of links
used in the document. This is simply a misunderstanding, and I think I
can clear it up for you.

When I look at this URL:

  http://www.dsl.org/cookbook/cookbook_toc.html

I see a table of contents in 45 parts. Each link also carries an
anchor name, the part after the "#", which jumps into the middle of
the part, to the sub-section of the chapter. That makes it convenient
for readers to jump straight to a sub-part of the document. But you
should ignore those. They are not separate files. They are anchor tags
in the middle of the section.

Let me dive into a little detail of the anchors. But do keep reading,
because after this I will show you how to solve your problem.

Let me repeat the html of the very first link on the page. This might
confuse your screen reader, and if so give me a hint on how I should
represent verbatim html text and I will be happy to do so.

  <A NAME="TOC1" HREF="cookbook_1.html#SEC1">Preface</A>

That generates a link to cookbook_1.html#SEC1, as you already know.
But the "#SEC1" part simply refers to an anchor, declared here with a
NAME attribute, used to jump into the middle of a page. Here is the
html of the part it jumps to:

  <H1><A NAME="SEC1" HREF="cookbook_toc.html#TOC1">Preface</A></H1>

Each sub-section is referenced in this way. You can read and learn
more about these in the World Wide Web Consortium reference
documentation at the URL below. That URL itself uses an anchor to jump
to the particular part of the document that describes them.

  http://www.w3.org/TR/html4/struct/links.html#h-12.2.3

> This would tend to make the use of wild cards a bit ticklish.

Actually, no. Even if those were the filenames, you could simply match
them with a wildcard with no problem. But let's not talk about that
for the moment since it isn't important. Let's get you going in the
direction of solving your actual problem and not the side-tracking
problem.

> If I could just figure around this problem however, I would be in
> business, because html2txt conversions would be easy, and the
> concatenation even easier.

There are 45 links on the page. They are named and numbered very
regularly. You can simply write a for-loop to walk over all 45 of
them. Here is a three line shell script snippet that will do this for
you.

  for chapternum in $(seq 1 45); do
    wget http://www.dsl.org/cookbook/cookbook_$chapternum.html
  done

Let me describe it with some verbosity, hoping that it will make it
easier for your reader. The 'seq' command generates a sequence of
numbers. Here I am calling "seq 1 45" to generate the numbers from 1
through 45 inclusive. That is called within a dollar-parenthesis
command substitution to place those 45 numbers on the command line for
the for-loop to iterate over. Then the for-loop walks through each in
turn, setting the variable named "chapternum" to the current index
value. Then the wget command uses that dollar chapternum variable to
create the URL to pull each chapter in turn.

The "#SEC" parts are not really in the filename, nor should they be in
the filename. Running that three line shell script snippet should
produce 45 files called cookbook_1.html through cookbook_45.html in
the current directory. I think at that point you should be okay to
convert each in turn to plain text.

Hope that helps,
Bob
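
P.S. In case it is useful, here is a small sketch of the two ideas
above as shell functions: dropping the "#SEC" anchor from a link, and
generating the list of part URLs. The function names strip_fragment
and chapter_urls are my own invention for illustration, and the
default of 45 parts is just the count from the table of contents. It
does not touch the network by itself.

```shell
#!/bin/sh
# Base URL of the book, taken from the table of contents link.
base="http://www.dsl.org/cookbook"

# strip_fragment (hypothetical helper): remove everything from the
# first "#" onward, since the anchor is never part of the file name.
strip_fragment() {
    printf '%s\n' "${1%%#*}"
}

# chapter_urls (hypothetical helper): print one URL per part.
# The part count defaults to 45 when no argument is given.
chapter_urls() {
    n=1
    while [ "$n" -le "${1:-45}" ]; do
        printf '%s/cookbook_%s.html\n' "$base" "$n"
        n=$((n + 1))
    done
}
```

With those defined, "strip_fragment cookbook_1.html#SEC1" prints
cookbook_1.html, and a loop such as
'for url in $(chapter_urls); do wget "$url"; done' would fetch each
part in turn.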