On 24 Sep 2014, at 12:23, Roland King <r...@rols.org> wrote:

> 
>> On 24 Sep 2014, at 1:02 pm, Gerriet M. Denkmann <gerr...@mdenkmann.de> wrote:
>> 
>> 
>> On 24 Sep 2014, at 11:46, Roland King <r...@rols.org> wrote:
>> 
>>> 
>>>> On 24 Sep 2014, at 12:31 pm, Gerriet M. Denkmann <gerr...@mdenkmann.de> 
>>>> wrote:
>>>> 
>>>> I have a problem with NSLinguisticTagger / CFStringTokenizer on iOS 8.0
>>>> 
>>>> OS X 10.9.5 (and iOS 7 and earlier) parses "สีเหลือง" quite rightly as two 
>>>> words: "สี" = colour and "เหลือง" = yellow.
>>>> 
>>>> No dictionary will ever contain "yellow colour". Every dictionary will 
>>>> contain "yellow" and "colour".
>>>> There are hundreds, if not thousands of these expressions, which are 
>>>> wrongly classified as one word.
>>>> Might have something to do with the new predictive keyboard.
>>>> 
>>>> But I am not writing this to complain, but to ask for a favour: could 
>>>> anybody on 10.10 just click anywhere in: "สีเหลือง" and tell me whether 
>>>> all gets highlighted, or just a part (as in 10.9.5)?
>>> 
>>> 
>>> If I double click anywhere on the right of that I get the second part (all 
>>> bar the first character) highlighted. Clicking on the first character I get 
>>> just that character. So 10.10 (beta 8) splits that sequence into two 
>>> ‘words’. 
>> This is a big relief. Thanks a lot.
>> 
>>> 
>>> Why do you suspect the predictive keyboard? Certainly wouldn’t be the first 
>>> thing I thought of seeing that issue. I would probably instead assume I’d 
>>> written myself a bug.
>> 
>> Well, here is the code; maybe you can find a bug:
>> 
>> let text = "สีเหลือง"
>> let opts: Int = 0
>> let schemes = [ NSLinguisticTagSchemeTokenType, 
>> NSLinguisticTagSchemeNameTypeOrLexicalClass ]
>> let tagger = NSLinguisticTagger(tagSchemes: schemes, options: opts )
>> 
>> let nsText = text as NSString
>> let length = nsText.length
>> tagger.string = nsText
>> let range = NSMakeRange(0,length)
>> let theScheme = NSLinguisticTagSchemeTokenType
>> let ops = NSLinguisticTaggerOptions(0)
>> tagger.enumerateTagsInRange (        
>>      range, 
>>      scheme:         theScheme, 
>>      options:        ops,
>>      usingBlock: 
>>      {       (       tag:                    String!, 
>>                      tokenRange:     NSRange, 
>>                      sentenceRange:  NSRange, 
>>                      stop:                   UnsafeMutablePointer<ObjCBool>
>>              ) -> Void in
>>              
>>              let word = nsText.substringWithRange(tokenRange) 
>>              println("\(tag) = \(word) " )
>>      }
>> )
>> 
>> Gerriet.
>> 
> 
> 
> 
> Here’s my version I was just writing - I ran it in an iOS playground AND in 
> an OSX playground and get the same ‘single word’ result either time. So I’m 
> not entirely sure that the click test on OSX proved anything. If you comment 
> out the Thai string and uncomment Chinese one, it works better and splits 
> stuff up although the last two words are wrong there as well, they should be 
> “去” and “健身房”. It’s the same in an OSX playground and an iOS one but then 
> again iOS playgrounds are emulated so .. 
> 
> I also compiled it as an OSX command line tool and it does the same thing for 
> my phrase AND yours. So whatever is doing the highlighting when you ‘click’ 
> isn’t the same thing NSLinguisticTagger is doing. 
> 
> The click test works on my Chinese phrase too; it gets the last two words 
> correct. Something sure ain’t right. 
> 
> Should write the objc version to eliminate any possibility it’s swift. 

I have an app in ObjC using NSLinguisticTagger, which on 10.9.5 prints:
        "我"     = Word
        "今天"    = Word
        "还"     = Word
        "没有"    = Word
        "去健"    = Word  <-- wrong
        "身房"    = Word  <-- wrong
But when I click on "去" I just get "to go",
and when I click on "健身房" I get "gym".

So, you are right: the clicking algorithm seems NOT to be using 
NSLinguisticTagger. And I didn't go to the gym either.

Further investigation (again ObjC on 10.9.5):
CFStringTokenizer is as wrong as NSLinguisticTagger.

ICU 51.1 is correct:
token[1] {0, 1} = "我"   -- UnKnown Word --
token[2] {1, 2} = "今天"  -- UnKnown Word --
token[3] {3, 1} = "还"   -- UnKnown Word --
token[4] {4, 2} = "没有"  -- UnKnown Word --
token[5] {6, 1} = "去"   -- UnKnown Word --
token[6] {7, 3} = "健身房" -- UnKnown Word --

NSTextView (selectionRangeForProposedRange:granularity: with NSSelectByWord) and
NSAttributedString (doubleClickAtIndex:) are as correct as ICU.

I thought that all of these were based on ICU, but this proves me wrong.

Probably I should use doubleClickAtIndex, now that iOS has AttributedStrings.
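
One more route that might be worth comparing against the results above is Foundation's word-level substring enumeration. The following is only a sketch, written in current Swift rather than the Swift 1.0 used elsewhere in this thread. The helper name wordTokens(in:) is made up for illustration, and I am assuming nothing about which word breaker sits behind the .byWords option, so whether it agrees with the double-click behaviour or with NSLinguisticTagger would have to be checked on each OS version.

```swift
import Foundation

// Hypothetical helper (not from the thread): collect the word-level
// substrings that Foundation reports for `text`.
func wordTokens(in text: String) -> [String] {
    var tokens: [String] = []
    text.enumerateSubstrings(in: text.startIndex..<text.endIndex,
                             options: .byWords) { substring, _, _, _ in
        // `substring` is nil only when .substringNotRequired is passed.
        if let word = substring {
            tokens.append(word)
        }
    }
    return tokens
}

// The two test strings from this thread. The segmentation may differ
// between OS releases, so no expected output is given here.
for sample in ["สีเหลือง", "我今天还没有去健身房"] {
    print(sample, "->", wordTokens(in: sample))
}
```

Running this next to the NSLinguisticTagger code on the same machine would show directly whether the two disagree for a given string.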


> let str = "สีเหลือง"
> //let str = "我今天还没有去健身房"
> let str2 = str as NSString
> 
> let tagger = NSLinguisticTagger(tagSchemes:  
> [NSLinguisticTagSchemeTokenType], options: 0 )
> 
> 
> let range = NSMakeRange( 0, str2.length )
> 
> tagger.string = str2
> 
> var ranges : NSArray?
> let things = tagger.tagsInRange( range, scheme: 
> NSLinguisticTagSchemeTokenType, options: NSLinguisticTaggerOptions.allZeros, 
> tokenRanges: &ranges )
> things.count
> 
> ranges
> 
> for ( index, type ) in enumerate( things )
> {
>       let type_range : NSValue? = ranges?[ index ] as NSValue?
>       print( "Type: '\(type)' at \(type_range!) ")
>       println( str2.substringWithRange(type_range! ) )
> 
> }
> 
> 

