[scribus] open source meanderings in creating a large or multiple large reference books.

Gregory Pittman Sat, 17 Aug 2013 11:55:50 -0400

I thought that perhaps others would have some interest, and maybe even 
some advice to me and others about the process (I hesitate to call it a 
workflow, since it was a make-it-up-as-I-went process) in turning a 
large amount of information into books.

What this is/was.
CMS (Centers for Medicare and Medicaid Services) has mandated that,
beginning in October 2014, health providers must use the ICD-10, the 10th
version of the International Classification of Diseases. You can get a
flavor of this at the WHO web site:

http://apps.who.int/classifications/icd10/browse/2010/en

Internationally, this is something used simply to keep track of the
occurrences of various diseases and conditions. In the US, it's the
basis of managing the submission of information, so that one can be
paid, not only by CMS, but also by private insurance companies. ICD-10 is a
dramatic change from ICD-9, with a massive expansion of codes, not just
with new diseases, but the addition of a lot of circumstances -- check
out the V codes, for example. CMS, being the CMS, wasn't completely
satisfied with the WHO system, so they've modified it, in some areas
quite substantially. Even though the implementation is coming more than
a year from now, I wanted some way to begin to try to absorb these
changes to avoid chaos.

CMS doesn't (yet) have a convenient set of web pages like WHO, but they
do have a downloadable zip file. Here's how this process went.

1. I downloaded the zip file and unzipped it: 13MB of pure text.

2. Next, I wanted to create a database from this file, which looks like
a table. I created a Postgresql table with the appropriate column names,
and this was when I found that this text file is just a text file:
each line holds fields separated by spaces. Most fields are single
alphanumeric "words", but the last two are multi-word, variable-length
fields.

3. So I wrote a Python script to turn this text file into fields
separated by tabs, so that I could load this into Postgresql.
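
A minimal sketch of what such a conversion might look like, not the
script actually used. It assumes three leading single-word fields and
that the two free-text fields are separated from each other by a run of
two or more spaces; both are assumptions about the file layout, and the
name line_to_tsv is made up for illustration.

```python
import re

def line_to_tsv(line, n_leading=3):
    """Turn one space-separated line into one tab-separated line.

    Assumption: the first n_leading fields are single "words" and the
    remaining text holds two descriptions split by 2+ spaces.
    """
    parts = line.rstrip("\n").split(None, n_leading)
    if len(parts) <= n_leading:
        # Short line: nothing left over for the free-text fields.
        return "\t".join(parts)
    leading, rest = parts[:n_leading], parts[n_leading]
    # Split the remainder once, at the first run of two or more spaces.
    tail = re.split(r"\s{2,}", rest.strip(), maxsplit=1)
    return "\t".join(leading + tail)
```

Applied line by line to the unzipped file, the output could then be
loaded with Postgresql's COPY, which reads tab-delimited text by default.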

4. One of the columns of this table is something I labeled "valid".
These have a value of 0 or 1; if 0, this is a heading, subheading, or
sub-subheading. If I were submitting a code to some payor, I could not use
these (they are not valid for billing), yet they do give a structure to
the overall listing to help find things.

5. The first thing I tried was to export from Postgresql as HTML output.
This becomes a 27MB file (!), very hard to efficiently use with a
browser, and there are no links. The headings alone are 6MB in HTML.

6. Then I thought perhaps creating an ePub with Sigil would be useful.
Take my advice and don't try to import 27MB into Sigil. Even going
category by category pretty quickly became unworkable. Just the S codes
are >38,000 rows. I did actually manage to do it, with a lot of waiting
between operations, sometimes 30 minutes or more. More tedious was the
creation of headings (editing headings and subheadings into h1, h2, h3
and so on). I was hoping that Sigil might have some scripting capability
for this, but it doesn't, so this was a HIGHLY manual operation. There
is the shortcut of simply clicking on a GUI button to convert a <p></p>
line to <h2></h2>, or the keyboard shortcuts of Ctrl+1, Ctrl+2, etc.
This is how I spent most of a weekend.
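
For what it's worth, the heading markup could in principle be generated
before the import rather than clicked in afterwards. The following is
only a sketch under assumptions: rows arrive as (code, valid,
description) values, rows with valid = 0 become headings, and the
heading level is picked from the code length (3 characters gives h3,
anything longer gives h4). That level mapping and the name row_to_xhtml
are hypothetical, not what was actually done.

```python
from html import escape

def row_to_xhtml(code, valid, description):
    """Return one XHTML line for a table row.

    Assumption: valid == 1 marks a billable code (plain paragraph);
    anything else is a heading whose level depends on the code length.
    """
    text = escape(f"{code} {description}")  # escapes &, <, >
    if valid == 1:
        return f"<p>{text}</p>"
    level = 3 if len(code) == 3 else 4
    return f"<h{level}>{text}</h{level}>"

print(row_to_xhtml("G00", 0, "Bacterial meningitis, not elsewhere classified"))
```

The h1 and h2 levels would still have to be added by hand or from
another source, since they are not in the CMS file.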

7. Sigil will generate TOCs for you. Unfortunately, these automatically
generated listings were MASSIVE, and therefore unusable. I really needed
a TOC for the TOC. So I went to the WHO web page, and using this as a
key, created new headings, setting these as h1 and h2 headers, and
everything else below that, h3 and h4. This, too, was a manual operation
(typing). NOW I had a TOC that seemed useful when I told Sigil to only
use h1 and h2.

8. This was unusable on a tablet. Not a memory issue, but the app simply
choked on such a big ePub. Worked Ok on calibre on my desktop. The next
step was to break the massive ePub into 3 ePubs. This was usable but not
necessarily user-friendly, and even these smaller ePubs are a bit clunky.

9. I tried using calibre to convert the ePubs to PDFs, but this is
simply ugly, with very fuzzy text. Unacceptable.

10. This is the point where I brought in Scribus. Rather than begin with
plain text, I wanted to make use of the tedious work I'd already done in
Sigil. The starting point is to make a copy of the ePubs, change the
extension to .zip, then unzip. In a subfolder you have your xhtml files.
These imported nicely into Scribus, where I modified the created styles.
In Scribus, I went with a US Letter, 2-column format to get more
information on a page, which was an issue with the ePubs.
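
The copy, rename and unzip steps can also be scripted, since an ePub is
an ordinary zip archive. A minimal sketch using Python's zipfile module
follows; the function name extract_xhtml and the paths passed to it are
placeholders, and the name of the subfolder inside the ePub varies from
book to book.

```python
import zipfile
from pathlib import Path

def extract_xhtml(epub_path, out_dir):
    """Extract only the .xhtml/.html members of an ePub into out_dir.

    zipfile reads the .epub directly, so no renaming to .zip is needed.
    Returns the sorted list of extracted file paths.
    """
    out_dir = Path(out_dir)
    extracted = []
    with zipfile.ZipFile(epub_path) as book:
        for name in book.namelist():
            if name.lower().endswith((".xhtml", ".html")):
                book.extract(name, out_dir)
                extracted.append(out_dir / name)
    return sorted(extracted)
```

The returned paths are the files to import into Scribus; stylesheets and
other members of the archive are left alone.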

11. The next thing was to improve searchability -- you can't simply be
leafing through such a massive amount of information. For this, I went
back to Postgresql, and generated text files consisting only of the
headings and subheadings. I already knew that using all of these for a
TOC or index wasn't good. I decided to filter out the headings that were
only 3 characters, for example, A00, A01, A02, etc. This sounds like a
job for regexp, so I wrote a Perl script to do that. This creates a list
of no more than 100 items, most of the time less, since there are gaps.
G00 Bacterial meningitis, not elsewhere classified
G03 Meningitis due to other and unspecified causes
G04 Encephalitis, myelitis and encephalomyelitis
G05 Encphlts, myelitis & encephalomyelitis in dis classd elswhr
and so on (these headings come from the "short" table item, an
abbreviated version of the "long" version used in the body of the PDF)
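
The filter itself was a Perl script; the same idea in Python, as a
sketch only, might look like the following. It assumes each line of the
headings file starts with the code followed by whitespace, and that a
3-character code is a letter, a digit, then a digit or letter. The
pattern and the name top_level_headings are illustrative assumptions.

```python
import re

# Assumed shape of a 3-character code: letter, digit, digit-or-letter,
# followed by whitespace so that longer codes do not match.
THREE_CHAR = re.compile(r"^[A-Z][0-9][0-9A-Z]\s")

def top_level_headings(lines):
    """Keep only the lines whose code is exactly 3 characters long."""
    return [ln.rstrip("\n") for ln in lines if THREE_CHAR.match(ln)]

sample = [
    "G00 Bacterial meningitis, not elsewhere classified",
    "G03 Meningitis due to other and unspecified causes",
]
for heading in top_level_headings(sample):
    print(heading)
```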

12. I was hoping I could use an automated TOC creation in Scribus, but
it doesn't work that way, so I went with the manual process of making
overlying linking frames. Another decision was to break up the ePub into
20-some separate PDFs, to have the A codes in one, then B codes in
another. Even so, the S codes were just too big for Scribus to manage,
so this became S1 and S2 (PDFs of 9.8 MB, 562 pages, and 12 MB, 661
pages, respectively). The body of each PDF has a link back to its
index on each page.

13. Individually, these work pretty well, but what was lacking was an
efficient way to jump from one section to the next. Once again, taking a
cue from the WHO web page, I made a 2-page PDF to link to the individual
sections, and of course links from each section to the main index.
A00-A99 Certain infectious and parasitic diseases
B00-B99 Certain infectious and parasitic diseases, continued
C00-C99 Neoplasms
D00-D48 Neoplasms
and so on

14. At this point, I'm pretty satisfied with the results. As I stated in
a prior post, external links don't work on my tablet. I sent an email to
Adobe, and they say this has not been implemented, but they have put it
on the list for the future.

Greg