I've read the thread and I'd like to add my thoughts:
I don't think the discussion regarding whether - is a letter is constructive. 
We have a problem to solve. Right now - is a meta-character indicating a range.

I think we should extend the book name parser to work with Bible book names as 
they occur in other languages and may be input into our front-ends. This 
includes -, digits other than 0-9 (I think 0-9 are what's called Arabic 
digits?) and characters like ' that, if I understand correctly, represent 
things like clicks, whistles or glottal stops.

In the case of JSword, it is a tough problem. We split the input into a token 
stream. The splitting is relatively naive and does split on -.

I've thought about how I'd fix it and I have not found a good solution. One 
edge case that is allowed is Gen-Exo, which is everything from the beginning of 
Genesis to the end of Exodus.

My thought is to take the book names (also abbreviations and alternates) for 
English, the user's locale and the language of the module and build a trie. 
Then a given input is analyzed against the trie for the longest matching 
prefix. As long as the next char is found in the trie we keep going. If the 
next "char" is not in the trie and is a letter then we have an error. If it is 
not a letter we take the match and using the trie find all the matches with 
that prefix. Disambiguation is handled in the usual way.
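
The trie idea above could be sketched roughly as follows. This is an 
illustration in Python, not code from SWORD or JSword; the class name 
BookNameTrie, its methods and the book name "Aa-bb" are all made up for the 
example. It assumes case-insensitive matching and does no backtracking.

```python
class BookNameTrie:
    """Hypothetical trie over book names, abbreviations and alternates."""

    def __init__(self):
        self._root = {}

    def add(self, name, book):
        # Each character is a dict key; the key None marks "a name ends here".
        node = self._root
        for ch in name.casefold():
            node = node.setdefault(ch, {})
        node.setdefault(None, set()).add(book)

    def match(self, text):
        """Return (chars consumed, candidate books) for the longest prefix."""
        node, i = self._root, 0
        while i < len(text):
            nxt = node
            for c in text[i].casefold():
                nxt = nxt.get(c)
                if nxt is None:
                    break
            if nxt is None:
                break  # next char is not in the trie: stop here
            node = nxt
            i += 1
        if i == 0:
            raise ValueError("no book name at start of input")
        if i < len(text) and text[i].isalpha():
            raise ValueError("unknown book name: " + text)
        return i, self._collect(node)

    @staticmethod
    def _collect(node):
        # Gather every book reachable below this node, i.e. all names
        # that share the matched prefix.
        found, stack = set(), [node]
        while stack:
            cur = stack.pop()
            for key, child in cur.items():
                if key is None:
                    found |= child
                else:
                    stack.append(child)
        return found


trie = BookNameTrie()
for name, book in [("Genesis", "Gen"), ("Gen", "Gen"), ("Exodus", "Exod"),
                   ("Exo", "Exod"), ("John", "John"), ("Jude", "Jude"),
                   ("Aa-bb", "Hypo")]:  # "Aa-bb" is a made-up hyphenated name
    trie.add(name, book)

print(trie.match("Gen-Exo"))     # the - is not in the trie after "Gen"
print(trie.match("Aa-bb 3:16"))  # the - is part of the name here
```

With this, the - in Gen-Exo is left for the range parser, while the - in the 
made-up name is consumed as part of the book name. An input like "J 1" comes 
back with both John and Jude, which is where the usual disambiguation would 
take over. A real version would probably need to back up to the last complete 
name when a longer match fails partway.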

In SWORD, it'd be easy to knit this kind of recognizer into the parser.

As to numbers, I'd suggest using an ICU number shaper to map all numeric values 
in an input into 0-9. We do this in JSword for Arabic and Farsi and it works 
quite well.
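
To illustrate the digit mapping, here is a small sketch in Python. It does not 
use ICU; it leans on the standard unicodedata module as a stand-in, and the 
function name to_ascii_digits is made up for the example.

```python
import unicodedata


def to_ascii_digits(text):
    """Replace every Unicode decimal digit with its 0-9 equivalent."""
    out = []
    for ch in text:
        # decimal() returns the digit value, or the default (None) when
        # the character is not a decimal digit.
        value = unicodedata.decimal(ch, None)
        out.append(ch if value is None else str(value))
    return "".join(out)


# Arabic-Indic digits (U+0660..) and Extended Arabic-Indic digits (U+06F0..)
print(to_ascii_digits("\u0663:\u0661\u0666"))  # 3:16
print(to_ascii_digits("\u06f3:\u06f1\u06f6"))  # 3:16
```

Running this before the reference is tokenized would let the rest of the 
parser deal with 0-9 only.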

Regarding OSIS, it is a fixed dictionary of internal names for all books. They 
are not meant to be shown to users, even though many would have no problem 
understanding them.

Also regarding OSIS, we subject osisRefs and osisIDs to the same parser as 
user input. I think there should be a separate parser, which would be very 
simple, that would parse them into our internal form. If the reference comes 
out of an OSIS-encoded Bible, then we could have a great gain.
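
A separate OSIS parser might look something like the sketch below. It is in 
Python and only handles the plain Book.Chapter.Verse form with an optional 
range; work prefixes, sub-identifiers and grains are ignored. The Verse tuple 
is a made-up stand-in for the internal form, and the sketch assumes that OSIS 
book identifiers never contain a - themselves.

```python
from typing import NamedTuple, Optional


class Verse(NamedTuple):
    """Hypothetical internal form: chapter/verse are None when omitted."""
    book: str
    chapter: Optional[int]
    verse: Optional[int]


def parse_osis_id(text):
    # "Gen", "Gen.1" or "Gen.1.1"; the book name is taken as-is because
    # OSIS uses a fixed dictionary of names.
    parts = text.split(".")
    if not 1 <= len(parts) <= 3 or not parts[0]:
        raise ValueError("not an OSIS ID: " + repr(text))
    numbers = [int(p) for p in parts[1:]]  # int() rejects empty or bad parts
    chapter = numbers[0] if len(numbers) > 0 else None
    verse = numbers[1] if len(numbers) > 1 else None
    return Verse(parts[0], chapter, verse)


def parse_osis_ref(text):
    # A - can only mean a range here, so a plain split is enough.
    start, dash, end = text.partition("-")
    first = parse_osis_id(start)
    last = parse_osis_id(end) if dash else first
    return first, last


print(parse_osis_ref("Gen.1.1-Gen.1.5"))
print(parse_osis_ref("Gen-Exod"))
```

Because there are no localized names, abbreviations or ambiguity to deal with, 
none of the trie machinery is needed on this path.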

Here's the rub: someone has to step up and tackle it. The code for SWORD is all 
tucked into a single method. In JSword, it is spread out into a finite state 
automaton that is hard to change. It will just have to be replaced.

In Him,
        DM

On Sep 29, 2010, at 4:55 PM, Robert Hunt wrote:

> New Zealand.
> 
> Hello all,
> 
>    I am spending today studying the documentation on the Crosswire Sword wiki 
> so I'm likely to have a few questions. Please let me know if this is not the 
> right forum to ask questions.
> 
>    I see in http://www.crosswire.org/wiki/DevTools:SWORD that localised book 
> names are not allowed hyphens in them (because the hyphen is used for verse 
> ranges). In the Philippine language that we worked with as Bible translators, 
> the hyphen is a letter in the alphabet and appears in several book names!
> 
>    Is this still a current limitation? If so, what is the suggested 
> work-around?
> 
> Thanks,
> Robert.
> 
> 
> _______________________________________________
> sword-devel mailing list: sword-devel@crosswire.org
> http://www.crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page

