I thought others might be interested in, and maybe even have some advice for me and others about, the process (I hesitate to call it a workflow, since I made it up as I went) of turning a large amount of information into books.
What this is/was. CMS (Centers for Medicare and Medicaid Services) has mandated that, starting in October 2014, health providers must begin using ICD-10, the 10th revision of the International Classification of Diseases. You can get a flavor of this at the WHO web site:

http://apps.who.int/classifications/icd10/browse/2010/en

Internationally, this is used simply to keep track of the occurrences of various diseases and conditions. In the US, it's the basis for submitting the information needed to get paid, not only by CMS but also by private insurance companies. ICD-10 is a dramatic change from ICD-9, with a massive expansion of codes, not just for new diseases but also for a lot of circumstances -- check out the V codes, for example. CMS, being CMS, wasn't completely satisfied with the WHO system, so they've modified it, in some areas quite substantially. Even though the implementation is still more than a year away, I wanted some way to begin absorbing these changes to avoid chaos. CMS doesn't (yet) have a convenient set of web pages like WHO, but they do have a downloadable zip file.

Here's how this process went.

1. I downloaded the zip file and unzipped it: 13 MB of pure text.

2. Next, I wanted to create a database from this file, which looks like a table. I created a PostgreSQL table with the appropriate column names, and this was when I found that the text file is just that: lines of text, with fields separated by spaces. Most fields are single alphanumeric "words", but the last two are multiple-word, variable-length fields.

3. So I wrote a Python script to turn this text file into fields separated by tabs, so that I could load it into PostgreSQL.

4. One of the columns of this table is something I labeled "valid". It has a value of 0 or 1; if 0, the row is a heading, subheading, or sub-subheading. If I were submitting a code to a payor, I could not use these (they are not valid for billing), yet they do give a structure to the overall listing that helps in finding things.

5. The first thing I tried was to export from PostgreSQL as HTML. This becomes a 27 MB file (!), very hard to use efficiently in a browser, and there are no links. The headings alone are 6 MB in HTML.

6. Then I thought perhaps creating an ePub with Sigil would be useful. Take my advice and don't try to import 27 MB into Sigil. Even going category by category pretty quickly became unworkable; the S codes alone are >38,000 rows. I did actually manage to do it, with a lot of waiting between operations, sometimes 30 minutes or more. More tedious was the creation of headings (editing headings and subheadings into h1, h2, h3 and so on). I was hoping that Sigil might have some scripting capability for this, but it doesn't, so this was a HIGHLY manual operation. There is the shortcut of simply clicking on a GUI button to convert a <p></p> line to <h2></h2>, or the keyboard shortcuts of Ctrl+1, Ctrl+2, etc. This is how I spent most of a weekend.

7. Sigil will generate TOCs for you. Unfortunately, these automatically generated listings were MASSIVE, and therefore unusable. I really needed a TOC for the TOC. So I went to the WHO web page and, using this as a key, created new headings, setting these as h1 and h2 headers, and everything else below that as h3 and h4. This, too, was a manual operation (typing). NOW I had a TOC that seemed useful when I told Sigil to use only h1 and h2.

8. The result was unusable on a tablet. It wasn't a memory issue; the app simply choked on such a big ePub. It worked OK in calibre on my desktop. The next step was to break the massive ePub into 3 ePubs. This was usable but not necessarily user-friendly, and even these smaller ePubs are a bit clunky.

9. I tried using calibre to convert the ePubs to PDFs, but the result is simply ugly, with very fuzzy text. Unacceptable.

10. This is the point where I brought in Scribus. Rather than begin with plain text, I wanted to make use of the tedious work I'd already done in Sigil. The starting point is to make a copy of the ePubs, change the extension to .zip, then unzip. In a subfolder you have your xhtml files. These imported nicely into Scribus, where I modified the created styles. In Scribus, I went with a US Letter, 2-column format to get more information on a page, which was an issue with the ePubs.

11. The next thing was to improve searchability -- you can't simply leaf through such a massive amount of information. For this, I went back to PostgreSQL and generated text files consisting only of the headings and subheadings. I already knew that using all of these for a TOC or index wasn't good. I decided to keep only the headings whose codes are just 3 characters, for example, A00, A01, A02, etc. This sounds like a job for regexp, so I wrote a Perl script to do that. For each letter, this creates a list of no more than 100 items, most of the time fewer, since there are gaps.

G00 Bacterial meningitis, not elsewhere classified
G03 Meningitis due to other and unspecified causes
G04 Encephalitis, myelitis and encephalomyelitis
G05 Encphlts, myelitis & encephalomyelitis in dis classd elswhr

and so on (these headings come from the "short" table item, an abbreviated version of the "long" version used in the body of the PDF).

12. I was hoping I could use automated TOC creation in Scribus, but it doesn't work that way, so I went with the manual process of making overlying linking frames. Another decision was to break up the ePub into 20-some separate PDFs, to have the A codes in one, then the B codes in another. Even so, the S codes were just too big for Scribus to manage, so this became S1 and S2 (PDFs of 9.8 MB or 562 pages and 12 MB or 661 pages, respectively). The body of each PDF has a link back to its index on each page.

13. Individually, these work pretty well, but what was lacking was an efficient way to jump from one section to the next. Once again taking a cue from the WHO web page, I made a 2-page PDF to link to the individual sections, and of course links from each section to the main index.

A00-A99 Certain infectious and parasitic diseases
B00-B99 Certain infectious and parasitic diseases, continued
C00-C99 Neoplasms
D00-D48 Neoplasms

and so on.

14. At this point, I'm pretty satisfied with the results. As I stated in a prior post, external links don't work on my tablet. I sent an email to Adobe, and they say that has not been implemented, but they have put it on the list for the future.

Greg
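
For anyone wanting to try something like step 3, here is a minimal Python sketch of the space-separated-to-tab conversion. It is not the script used above. The post doesn't give the column layout, so this assumes, purely for illustration, three leading single-word fields followed by a short description padded to a fixed width of 60 characters and then a long description. The names SHORT_WIDTH, split_line and to_tsv are made up for this example; check the offsets against the real file before relying on it.

```python
import sys

# ASSUMED layout (hypothetical, verify against the real file):
# three single-word fields, then a short description padded with
# spaces to SHORT_WIDTH characters, then a long description.
SHORT_WIDTH = 60


def split_line(line):
    """Split one space-separated line into five fields.

    Raises ValueError if the line has fewer than four tokens.
    """
    line = line.rstrip("\n")
    # Peel off the three single-word fields; runs of spaces are
    # treated as one separator, and `rest` keeps its inner spacing.
    order, code, valid, rest = line.split(None, 3)
    short_desc = rest[:SHORT_WIDTH].strip()
    long_desc = rest[SHORT_WIDTH:].strip()
    return [order, code, valid, short_desc, long_desc]


def to_tsv(lines):
    """Yield each non-blank input line as one tab-separated row."""
    for line in lines:
        if line.strip():
            yield "\t".join(split_line(line))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python to_tsv.py input.txt > output.tsv
    with open(sys.argv[1], encoding="utf-8") as src:
        for row in to_tsv(src):
            print(row)
```

A tab-separated file like this can then be loaded with PostgreSQL's COPY (or psql's \copy), whose default text format is tab-delimited. The sketch does not escape backslashes or embedded tabs, so that would need handling if the data contains any.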

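
The filter in step 11 was done in Perl; the same idea can be sketched in Python as below. This is an illustration only, not the original script. It assumes each input line is a code followed by whitespace and a description, and that a 3-character code is a capital letter, a digit, and then a digit or letter. The names CATEGORY and three_char_headings are invented for the example.

```python
import re

# ASSUMED line shape: "<code> <description>". A 3-character code is
# taken to be a capital letter, a digit, then a digit or letter.
# Requiring whitespace right after it rejects longer codes.
CATEGORY = re.compile(r"^([A-Z][0-9][0-9A-Z])\s+(\S.*)$")


def three_char_headings(lines):
    """Return (code, description) pairs for 3-character codes only."""
    headings = []
    for line in lines:
        match = CATEGORY.match(line.strip())
        if match:
            headings.append((match.group(1), match.group(2)))
    return headings
```

Fed the headings file one line at a time, this keeps entries such as G00 and G03 and drops anything with a longer code, which gives the short per-letter list used for the index.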