Re: [sword-devel] EMTV text source URL is now unrelated

2011-10-12 Thread David Haslam
If Troy started with the HTML as the files to preprocess, what was the main difficulty that prevented tagging the words in italics, such that in the OSIS XML files these would be marked as transChange elements? If the HTML does contain the italics, then the conversion of tags should be amenable to

[sword-devel] DevTools:ICU & Normalization?

2011-10-12 Thread David Haslam
According to http://crosswire.org/wiki/DevTools:ICU - Sword makes use of ICU for casing (used in search), normalization, and script transliteration. *Which version of Unicode do we employ for Normalization to NFC?* Some composite glyphs that use two combining characters in the *Myanmar* block ar

Re: [sword-devel] DevTools:ICU & Normalization?

2011-10-12 Thread Greg Hellings
David, SWORD can link against many different versions of the ICU library. It will detect the version that is installed on the system and leverage its internal libraries. I know it supports back at least as far as ICU 4.0, which was Unicode 5.1. It also compiles against ICU 4.8 - which supports Unic
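
As an illustration of the question above (which Unicode version governs NFC normalization), here is a minimal sketch using Python's standard `unicodedata` module. It is not SWORD or ICU code; it only shows that the normalization result depends on the Unicode data version bundled with whatever library is in use. The helper name `to_nfc` is hypothetical.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text normalized to Unicode Normalization Form C (NFC)."""
    return unicodedata.normalize("NFC", text)


# The Unicode data version bundled with this Python build. ICU reports its
# own version separately; this value says nothing about ICU or SWORD.
print("Unicode data version:", unicodedata.unidata_version)

# "e" followed by COMBINING ACUTE ACCENT composes to a single code point.
decomposed = "e\u0301"
composed = to_nfc(decomposed)
print(len(decomposed), "code points ->", len(composed), "code point")
```

Running the same check against each library version in use is one way to see whether a particular combining sequence is treated differently between Unicode versions.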

Re: [sword-devel] EMTV text source URL is now unrelated

2011-10-12 Thread troypulk
As far as I know there is no tool that converts an HTML or .Doc file to an OSIS Bible. Converting the HTML to a .txt file caused the italics to disappear, but I'm putting them all back in. As I was fixing the RED letters I noticed the lack of tags as well, but these are all fixed now. Troy
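
For readers following the italics discussion in this thread, a minimal sketch of the tag conversion being discussed: mapping HTML italics to OSIS `transChange` elements. It assumes the source HTML uses simple, non-nested `<i>` or `<em>` tags and that every italic span marks supplied words (hence `type="added"`). Neither assumption is confirmed in the thread, and the function name `italics_to_transchange` is hypothetical.

```python
import re

# Matches <i>...</i> or <em>...</em>: case-insensitive, non-greedy, and
# spanning line breaks. Assumes the tags are well formed and not nested.
ITALIC_RE = re.compile(r"<(i|em)\b[^>]*>(.*?)</\1\s*>", re.IGNORECASE | re.DOTALL)


def italics_to_transchange(html_fragment: str) -> str:
    """Replace HTML italic spans with OSIS transChange elements."""
    return ITALIC_RE.sub(r'<transChange type="added">\2</transChange>', html_fragment)


# Made-up sample text, not taken from the EMTV source files.
sample = "This <i>is</i> a sample verse."
print(italics_to_transchange(sample))
```

A regular expression is only adequate for flat markup like this; nested or malformed tags would call for a real HTML parser.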

Re: [sword-devel] EMTV text source URL is now unrelated

2011-10-12 Thread David Haslam
Hi Troy, Yes - you're probably right about lack of a readily available tool for direct conversion. Had I been tackling the task, I might have considered these steps: 1. Open each HTML file using MS Word, save each file as RTF. 2. Open each RTF file using WordPad, save again as RTF (smaller and s

Re: [sword-devel] EMTV text source URL is now unrelated

2011-10-12 Thread Greg Hellings
On Wed, Oct 12, 2011 at 2:18 PM, David Haslam wrote: > Hi Troy, > > Yes - you're probably right about lack of a readily available tool for > direct conversion. > > Had I been tackling the task, I might have considered these steps: > > 1. Open each HTML file using MS Word, save each file as RTF. >