Hi DM:

> This is a lot lower than what an xml tokenizer needs. This would be a
> tokenizer for the text between tags. Having a single tokenizer that does
> both would be more efficient when both are wanted and slower when only xml
> tokens are needed.
>
> I think a model could be constructed that could do both
> and allow one to ask for the depth of tokenization that is needed.

Good idea. So while a tokenizer should support atomic granularity, for many
purposes this is overkill. What would the levels of specificity be?
Verse-level, word-level, and then atomic-level? Atoms would include
whitespace and punctuation marks that aren't currently marked up with <w>.
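
To make the question concrete, here is a rough Python sketch of a tokenizer
that takes a requested depth. The names (Depth, tokenize) and the regex are my
own assumptions for illustration, not anything that exists in SWORD today, and
the word rule only works for whitespace-delimited scripts:

```python
import re
from enum import IntEnum


class Depth(IntEnum):
    """Hypothetical levels of tokenization depth."""
    VERSE = 1  # the whole verse as one token
    WORD = 2   # words only
    ATOM = 3   # words, whitespace runs, and single punctuation marks


# Naive atom rule (an assumption): a run of word characters, a run of
# whitespace, or one other character. Not suitable for Thai and similar.
_ATOM_RE = re.compile(r"\w+|\s+|[^\w\s]")


def tokenize(verse_text, depth):
    """Return the tokens of one verse at the requested depth."""
    if depth == Depth.VERSE:
        return [verse_text]
    atoms = _ATOM_RE.findall(verse_text)
    if depth == Depth.ATOM:
        return atoms
    # Depth.WORD: keep only the atoms that start with a word character.
    return [a for a in atoms if re.match(r"\w", a)]


text = "In the beginning, God"
print(tokenize(text, Depth.WORD))
print(tokenize(text, Depth.ATOM))
```

Joining the atom-level tokens gives back the original text, which is what
would let the coarser levels be derived from the finest one.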

> There is a big complication with the parsing of text: it is language
> dependent. For example, Thai has words but not word breaks. Basically, the
> task will require a Unicode and somewhat language aware word-break
> algorithm. The best I've seen is in ICU.

Yes, the text tokenizers would need to be language dependent, but the parser
would not be, correct? While osis2mod does try to make it so that there is
only one tokenizer & parser needed for any text, it currently doesn't
support all of OSIS as has been discussed; and the amount of logic needed in
that single osis2mod package is apparently getting overwhelming. And it also
requires that authors convert into OSIS if they haven't already. Instead of
requiring authors to convert their raw data formats into OSIS, which then gets
converted into a SWORD Module, what if authors could ‘just’ write a
script that parses the raw data for tokens and then streams these directly
to a text-independent parser, which can then generate the SWORD Module, etc.?
This standard common parser could be available as a web service or
downloaded as a local library. This would eliminate the need for
osis2mod to account for every possible permutation of an OSIS document,
because the author's tokenizer would normalize the input into a consistent
stream of tokens, e.g. start_verse, start_paragraph, word, punctuation,
space, line_start, end_quote, etc. And there would be a separation of
concerns to make the import process more modular.
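
Roughly what I have in mind, as a Python sketch. Everything here is
hypothetical: the raw format (one "REF text" line per verse), the Token type,
and the function names are made up for illustration, and a plain dict stands in
for the real SWORD Module output:

```python
import re
from typing import Dict, Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    kind: str        # e.g. "start_verse", "word", "punctuation", "space"
    value: str = ""


def tokenize_raw(lines: Iterable[str]) -> Iterator[Token]:
    """Author-written tokenizer for a made-up format: 'REF text' per line."""
    for line in lines:
        ref, _, text = line.rstrip("\n").partition(" ")
        yield Token("start_verse", ref)
        for atom in re.findall(r"\w+|\s+|[^\w\s]", text):
            if atom.isspace():
                yield Token("space", atom)
            elif re.match(r"\w", atom):
                yield Token("word", atom)
            else:
                yield Token("punctuation", atom)


def parse(tokens: Iterable[Token]) -> Dict[str, str]:
    """Common parser: knows only token kinds, nothing about the source format."""
    verses: Dict[str, str] = {}
    current = None
    for tok in tokens:
        if tok.kind == "start_verse":
            current = tok.value
            verses[current] = ""
        elif tok.kind in ("word", "punctuation", "space"):
            if current is None:
                raise ValueError("text token before any start_verse")
            verses[current] += tok.value
        else:
            raise ValueError("unknown token kind: %r" % (tok.kind,))
    return verses


raw = ["Gen.1.1 In the beginning.", "Gen.1.2 And the earth"]
module = parse(tokenize_raw(raw))
print(module)
```

The parser rejects token kinds it doesn't know, so a broken tokenizer fails
loudly instead of producing a subtly wrong module.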

So to summarize, the idea is to break the text import process into two
steps: tokenizing and parsing. As much import logic as possible would be
moved to the common standard parser, and the tokenizers would only have to
deal with the unique aspects of the text; there could be a standard library
of tokenizer helpers too. Furthermore, there could be standard ready-made
OSIS tokenizers made available which could handle the various permutations
of OSIS (e.g. BSP, BCV), or they could be customized if the OSIS data isn't
normalized enough. The interface between the tokenizer and the parser would
be the token stream that the tokenizer would feed the parser. Breaking down
the text import process into smaller special-purpose scripts which respect
the separation of concerns should make the import task more manageable and
would reduce the need for a single monolithic importer.
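
As an illustration of the normalization point, here is a Python sketch of a
single OSIS tokenizer that emits the same tokens whether verses are containers
(BCV style) or sID/eID milestones (BSP style). The snippets are simplified (no
OSIS namespace, no header), and the end_verse token and the function names are
my own assumptions:

```python
import re
import xml.etree.ElementTree as ET

# Simplified snippets of the same verse in two OSIS styles.
BCV = '<p><verse osisID="Gen.1.1">In the beginning</verse></p>'
BSP = ('<p><verse sID="Gen.1.1" osisID="Gen.1.1"/>In the beginning'
       '<verse eID="Gen.1.1"/></p>')


def text_tokens(text):
    """Split text into word and space tokens (naive, whitespace-delimited)."""
    for atom in re.findall(r"\S+|\s+", text):
        yield ("space" if atom.isspace() else "word", atom)


def walk(elem):
    """Yield tokens for one element, its children, and its trailing text."""
    is_verse = elem.tag == "verse"
    is_milestone = is_verse and ("sID" in elem.attrib or "eID" in elem.attrib)
    if is_verse:
        if "eID" in elem.attrib:
            yield ("end_verse", elem.get("eID"))
        else:
            yield ("start_verse", elem.get("sID") or elem.get("osisID"))
    if elem.text:
        yield from text_tokens(elem.text)
    for child in elem:
        yield from walk(child)
    if is_verse and not is_milestone:
        # A container verse ends where its element closes.
        yield ("end_verse", elem.get("osisID"))
    if elem.tail:
        yield from text_tokens(elem.tail)


def tokenize_osis(xml_text):
    return walk(ET.fromstring(xml_text))


for tok in tokenize_osis(BSP):
    print(tok)
```

Both snippets produce an identical token stream, so the parser downstream never
needs to know which flavour of OSIS it was given.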



