On 11 Mar 2014, at 19:26, Richmond <richmondmathew...@gmail.com> wrote:
> 
> Well; in theory that looks good until you start to think about languages 
> which are
> written (such as Sanskrit) with no obvious word boundaries and both vowel 
> mutation (Sandhi)
> at what would be word boundaries, and consonant fusion.

The library that we use for low-level Unicode work (ICU) provides a facility 
called "break iterators": functions that break up text according to a set of 
rules, with variants for graphemes, words, sentences, etc. ICU has a (very 
large) database of rules and (for some languages) dictionaries in order to 
break words properly even in complex languages. Not all languages are 
supported, but a large number are.

> 
>> sentence (breaks on unicode sentence boundaries)
> 
> That looks a bit fishy.
> 
> How are you going to work out what marks a sentence boundary in every 
> language that one can write
> with Unicode? And there are languages where the idea of a 'sentence' is 
> absent.

Again, ICU does the hard work. In a language without sentences, the text will 
simply be treated as a single sentence.

There is also enough intelligence in ICU that it can tell the difference 
between a decimal point and a full-stop/period. Some languages use different 
marks as sentence separators and ICU also knows about them.
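
To illustrate the decimal-point case, here is a toy sketch in Python. It is 
not ICU and not how LiveCode does it: it is a plain regular-expression 
heuristic, and the function name `sentences` is made up for the example. It 
only treats '.', '!' or '?' as the end of a sentence when whitespace follows, 
which is enough to leave "3.14" alone. ICU's real rules are far more thorough.

```python
import re

# Toy heuristic (an assumption for illustration, not ICU's rule set):
# a sentence ends at '.', '!' or '?' only when whitespace follows it.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences(text):
    """Split text into sentences using the toy boundary rule above."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]


print(sentences("Pi is roughly 3.14. That is all."))
print(sentences("no terminator anywhere in this text"))
```

The first call gives two sentences and keeps "3.14" intact; the second gives 
back the whole text as one sentence, since no boundary was found.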

> 
> I'm sorry to be such a "pill", but word and sentence boundaries are such 
> culture-bound concepts
> that they will only be any good for languages that mark word and sentence 
> boundaries.
> 
> This is about the same as stating dogmatically that "all bananas are yellow", 
> when they are not.

Paragraphs are defined in the Unicode standard. They are runs of text 
terminated by the Paragraph Separator character or (optionally) any other 
newline character. While it may not make sense linguistically, this is how we 
delimit paragraphs in LiveCode fields.
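
As a rough sketch of that rule in Python (not LiveCode's actual code): 
U+2029 PARAGRAPH SEPARATOR always ends a paragraph, and newline characters do 
so optionally. The function name `paragraphs`, the `newlines_too` flag and 
the exact set of newlines handled (CR LF, LF, CR) are assumptions made for 
the example.

```python
import re

PARAGRAPH_SEPARATOR = "\u2029"

# Assumed set of "other newline characters" for this sketch: CR LF, LF, CR.
_NEWLINES = re.compile(r"\r\n|[\n\r]")


def paragraphs(text, newlines_too=True):
    """Return the runs of text terminated by U+2029 (and, optionally, newlines)."""
    if newlines_too:
        # Normalise every newline to the paragraph separator first.
        text = _NEWLINES.sub(PARAGRAPH_SEPARATOR, text)
    parts = text.split(PARAGRAPH_SEPARATOR)
    # A trailing terminator ends the last paragraph; it does not start a new one.
    if parts and parts[-1] == "":
        parts.pop()
    return parts


print(paragraphs("one\u2029two\nthree\n"))
print(paragraphs("one\u2029two\nthree", newlines_too=False))
```

With `newlines_too` left on, the first call yields three paragraphs; with it 
off, "two\nthree" stays together as a single paragraph.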


Regards,
Fraser
