The core of it is just a two-line Python function:


from itertools import groupby

def tokenize(stream, chr_classes):
    """
    tokenize the given stream based on the given character classes.

    chr_classes should be a dictionary mapping character class label to a
    string of member unicode characters::

        {
            "numbers": u"0123456789",
            "whitespace": u" \n",
        }
    """
    # build reverse index from character to character class
    idx = dict((ch, chr_class)
               for chr_class, chrs in chr_classes.items() for ch in chrs)
    # tokenize text: group consecutive characters by their character class
    # (a character that is in no class is grouped under None)
    return groupby((ch for line in stream for ch in line.decode("utf-8")),
                   idx.get)

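As a rough sketch only: on Python 3, where text streams already yield str and no decoding is needed, the same idea might look like the following. The name tokenize_py3 and the sample classes are made up for illustration, and grouping unclassified characters under None is an assumption rather than something the function above is known to do.

```python
from itertools import groupby
import io


def tokenize_py3(stream, chr_classes):
    """Yield (class label, token string) pairs from an iterable of str lines."""
    # build reverse index from character to character class
    idx = {ch: label for label, chrs in chr_classes.items() for ch in chrs}
    chars = (ch for line in stream for ch in line)
    # assumption: a character that is in no class is grouped under None
    for label, group in groupby(chars, key=idx.get):
        yield label, "".join(group)


# hypothetical sample data, only to show the shape of the result
sample_classes = {"numbers": "0123456789", "whitespace": " \n", "letters": "abc"}
tokens = list(tokenize_py3(io.StringIO("ab 12\ncab"), sample_classes))
```

Joining each group inside the loop matters here, because groupby hands out iterators that are invalidated as soon as the next group is requested.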

So, for example, you set up your character classes (yes, I could have just 
defined ranges but, in my code last night, I was being explicit about the 
characters appearing in the particular text I was tokenizing):


def u(s):
    """
    convert utf-8 encoded string to unicode.
    """
    return s.decode("utf-8")

CHR_CLASSES = {
    "editorial": u("[]‹›()"),
    "letters": u(
        "βγδζθκλμνξπρσςτφχψ" "αεηιουω" "ῥῤ"
        "ἀἁάὰᾶἄἅἂᾳᾷἃἆᾴᾄ" "ἐἑέὲἔἕἓ" "ἡήὴῆἢἤῃῄῇἥἦἧᾐἠᾖἣᾗ"
        "ἰἱίὶῖἶἷἴἵϊἳΐῒ" "ὀὁόὸὅὃὄὂ" "ὐὑύὺῦὔὖὕὓ"
    ),
    "whitespace": u(" \n"),
    "numbers": u("1234567890"),
    "punctuation": u(".,·;“”"),
    "temp": u("†-"),
}

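The "defined ranges" alternative mentioned in the parenthetical could be sketched roughly as below, in Python 3. The helper chars_in_ranges and the name RANGE_CLASSES are hypothetical, and the ranges cover only ASCII digits and the basic lowercase Greek letters, not the accented forms listed explicitly above.

```python
import unicodedata


def chars_in_ranges(*ranges):
    """Return the assigned characters in the given inclusive code point ranges."""
    return "".join(
        chr(cp)
        for start, end in ranges
        for cp in range(start, end + 1)
        # skip code points that Unicode leaves unassigned
        if unicodedata.category(chr(cp)) != "Cn"
    )


# hypothetical classes built from ranges instead of explicit strings
RANGE_CLASSES = {
    "numbers": chars_in_ranges((0x30, 0x39)),
    # U+03B1 (alpha) through U+03C9 (omega), which includes final sigma
    "letters": chars_in_ranges((0x3B1, 0x3C9)),
}
```

The trade-off is that a range accepts every character in the block whether or not it occurs in the text, whereas the explicit strings double as a record of exactly which characters were seen.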

and then you're good to go...


import sys
FILENAME = sys.argv[1]

for chr_class, token in tokenize(open(FILENAME), CHR_CLASSES):
    print "".join(token).encode("utf-8"), chr_class



On Apr 5, 2010, at 1:44 PM, Weston Ruter wrote:

> DM:
> But what we really need is not a parser but a tokenizer. I'm thinking about 
> writing one (my degree work was in compiler writing). Basically, we repeat 
> the same tokenization code in several places. It should be trivial to write a 
> complete, accurate one.
> I've also been wanting to work on a tokenizer. At Open Scriptures, the text 
> of a work is currently represented by two models (database tables): Token and 
> Structure. Tokens are the smallest divisible units of text, such as words, 
> punctuation, and whitespace; and structures are the spans of tokens that form 
> logical units, such as verses, paragraphs, quotes, etc. The structures are 
> standoff-markup for the tokens. With the underlying data stored in this way, 
> it can then be serialized in whichever hierarchy desired 
> (book-section-paragraph, book-chapter-verse, all-milestoned, etc) or 
> whichever data format is needed (OSIS, SWORD Module, XHTML, etc.)
> So what I'm currently ruminating on is the process of importing the raw data 
> into the Token and Structure models. I wrote an importer for the Tischendorf 
> GNT data which does everything, both tokenizing and parsing, but obviously 
> there is going to be a lot of code in common with other importers that are 
> written. So I too am thinking about how these importers can be reduced to the 
> bare minimum to handle the unique aspects of the raw data (i.e. normalize 
> it), and then stream the tokens back to a central importer that parses the 
> input and stores it into the Token and Structure models. This central 
> importer facility could be a web service.
> I'd love to collaborate with you on this. We could come up with a common 
> tokenizer that can be used by both SWORD and Open Scriptures. The importer 
> web service could take tokens as input and as output generate a SWORD module 
> and also populate the Open Scriptures models at the same time.
> Thoughts?
> Weston
> On Mon, Apr 5, 2010 at 10:24 AM, Daniel Owens <> wrote:
> Yes, I agree, and if there were a feedback mechanism for the module creator 
> to let them know how to start fixing an OSIS file or conf file, it would save 
> Chris (or whoever else approves modules) time on the basic stuff.
> Daniel
> On 4/5/2010 11:09 AM, DM Smith wrote:
> This is a great idea. Rather than emailing source to modules at crosswire dot 
> org, one could upload it via a web service. We could have stages of 
> validation (xmllint) and construction (osis2mod). Such a service could 
> evaluate the quality of the submission.
> In Him,
>    DM
> On 04/05/2010 12:01 PM, Weston Ruter wrote:
> Why not turn osis2mod into a web service? Then it wouldn't matter how it is 
> implemented since it would be abstracted away by the web service interface. 
> It could use the best XML libraries available today and written in the 
> programming language of choice, both of which would make maintenance and the 
> addition of new features much easier.
> Weston
> On Mon, Apr 5, 2010 at 9:05 AM, DM Smith <> wrote:
> On 04/05/2010 09:03 AM, Dmitrijs Ledkovs wrote:
> On 5 April 2010 13:55, Manfred Bergmann<>  wrote:
> Hi DM.
> Am 05.04.2010 um 13:21 schrieb DM Smith:
> Regarding using a "real" parser, it is a good idea. But we don't want SWORD 
> to be dependent on an external parser.
> What's the reason for that?
> I could understand if it would mean for the user to install certain libraries 
> manually but when the sources can be integrated into the project and have the 
> appropriate licence then why not?
> Manfred
> IMHO there is no harm in bringing in libxml or a much more lightweight
> parser like GMarkup. The build system just needs to be adjusted to
> link e.g. libxml for the osis2mod binary and not shared sword library.
> It could even be a new tool, called osisxml2mod for example, that is
> built optionally, so that you can still have a full sword dev
> environment without libxml.
> Tools for creating modules do not have to be linked with sword or even
> live in the sword tarball / svn, although that does help consistent
> distribution of tools.
> I don't remember all of Troy's reasoning when I argued for a true parser.
> From what I recall:
> o To maintain freedom to re-license SWORD (e.g. for some other Bible society) 
> we need to be able to keep 3-rd party library dependencies well managed. The 
> license needs to be compatible with the GPL but cannot be GPL.
> o The parser that we have is minimal and simple, sacrificing accuracy and 
> completeness for speed. Regarding accuracy, e.g. the parser allows for spaces 
> around = in attribute declarations. Regarding completeness, e.g. it does not 
> handle namespaces, cdata, dtds/schemas, .... Significantly, it does not 
> require a well-formed document, allowing for fragments. Rather than an error, 
> it continues when an xml parser is required to stop.
> o This parser has better error reporting in that it is based upon knowledge 
> of the input. E.g. it reports the verse having the problem.
> o By SWORD having the parser, we are not dependent on finding an 
> implementation for every platform (e.g. Windows).
> There may be other reasons. I'm willing to live with it.
> But what we really need is not a parser but a tokenizer. I'm thinking about 
> writing one (my degree work was in compiler writing). Basically, we repeat 
> the same tokenization code in several places. It should be trivial to write a 
> complete, accurate one.
> In His Service,
>    DM
