Re: [lingu-dev] Assistance on Enconding different

Németh László Fri, 13 Feb 2009 10:02:03 -0800

Hi,

2009/2/6 Sunday Bolaji <[email protected]>:
> Hi,
> Thanks for your response.
> I am having a problem understanding why Hunspell does not give me suggestions
> such as ẹ̀kọ́ and ẹ̀kọ (which are present in my dictionary file) for eko,
> which is a likely misspelling of the two words.


The problem is that Hunspell's TRY and MAP (and also the affix condition
and n-gram similarity algorithms) do not yet support Unicode combining
diacritical marks.
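
A minimal sketch of why this matters, in Python purely for illustration (it
is not part of Hunspell, and the helper name code_points is hypothetical).
Unicode has no precomposed code point for ẹ̀, so even after NFC normalization
it stays as two code points (ẹ plus a combining grave accent), whereas è is a
single code point. Tables that work character by character, such as TRY and
MAP, therefore see ẹ̀ as two separate characters:

```python
import unicodedata


def code_points(text):
    """NFC-normalize text and list each code point with its Unicode name."""
    nfc = unicodedata.normalize("NFC", text)
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in nfc]


# "e" + combining grave composes to the single precomposed letter U+00E8.
print(code_points("e\u0300"))

# "ẹ" (U+1EB9) + combining grave has no precomposed form, so NFC keeps
# two code points: the base letter and the combining mark.
print(code_points("\u1EB9\u0300"))

# Counted in code points, the misspelling "eko" and the dictionary word
# "ẹ̀kọ́" differ in length, not just in which letters are used.
print(len("eko"), len(unicodedata.normalize("NFC", "\u1EB9\u0300k\u1ECD\u0301")))
```

So a single-character substitution on "eko" cannot reach ẹ̀kọ́, which is
consistent with the suggestions you are seeing.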


>      In general, what changes can I make in the TRY, REP table, MAP table,
> PHONE table and KEY so that Hunspell will also suggest the words in brackets,
> as shown below? (I am including my TRY, REP, MAP, PHONE and KEY
> settings.)

Thanks for the detailed bug report.

>
> My boss, Dr. Adegbola, said he has written you a letter inviting you to a
> spell checker meeting for African languages to be held here in Nigeria. I hope
> you will be able to make it so that we can meet you.

Thanks for the kind invitation. I'm afraid I cannot participate, but I
would like to implement combining diacritical mark support in the near
future. Your data will be a big help in checking the result.

Best regards,
László

>
>
>
> Hunspell 1.2.7
> & eko 7 3: èkó, ko, oko, epo, ìko, eku, e ko    (  ẹ̀kọ́, ẹ̀kọ )
>
> & èko 8 0: èkó, ko, èwo, oko, ìko, èso, èké, è ko  (ẹ̀kọ́, ẹ̀kọ )
>
> & ekó 6 0: èkó, kó, okó, ìkó, eku, e kó       (ẹ̀kọ́, ẹ̀kọ )
>
> & ẹ̀kọ̀ 7 0: ẹ̀kọ́, ẹ̀kọ, ẹ̀rọ̀, ẹ̀dọ̀, ẹ̀yọ̀, ẹ̀ kọ̀, ẹ̀-kọ̀
>
> & ẹkọ 9 0: èkó, kọ, ẹ̀kọ, ẹbọ, akọ, ọkọ, ẹyọ, ẹkẹ, ẹ kọ   (ẹ̀kọ́ )
>
> & ẹkọ̀ 5 0: kọ̀, ẹ̀kọ, àkọ̀, ọkọ̀, ẹ kọ̀               ( ẹ̀kọ́  )
>
> & ẹkọ́ 7 0: ẹjọ́, kọ́, ẹ̀kọ́, ẹmọ́, ọkọ́, ìkọ́, ẹ kọ́     (  ẹ̀kọ  )
>
> & ẹ́kọ́ 4 0: ẹ̀kọ́, ńkọ́, ẹ́ kọ́, ẹ́-kọ́        ( ẹ̀kọ́, ẹ̀kọ )
>
> & ekọ 6 0: èkó, kọ, akọ, ọkọ, eku, e kọ  ( ẹ̀kọ́, ẹ̀kọ )
>
> & ẹko 6 0: èkó, ko, oko, ìko, ẹkẹ, ẹ ko       ( ẹ̀kọ́, ẹ̀kọ )
>
> & èkọ 6 0: èkó, kọ, akọ, ọkọ, èké, è kọ      ( ẹ̀kọ́, ẹ̀kọ )
>
> & ẹkó 6 0: èkó, kó, okó, ìkó, ẹkẹ, ẹ kó      ( ẹ̀kọ́, ẹ̀kọ )
>
> & ọrọ 12 0: òro, orò, oró, rọ, tọrọ, ọrọ̀, ọmọ, ọkọ, ọlọ, àrọ, arọ, ọrẹ   ( ọ̀rọ̀ )
>
> & oro 9 0: orò, òro, oró, ro, oko, orí, ore, orù, o ro       (ọ̀rọ̀, ọrọ̀ )
>
> & ọro 5 0: òro, orò, oró, ro, ọrẹ           (ọ̀rọ̀, ọrọ̀ )
>
> & orọ 10 0: òro, orò, oró, rọ, àrọ, arọ, orí, ore, orù, o rọ  (ọ̀rọ̀, ọrọ̀ )
>
> & ọ̀ro 2 0: òòró, ọ̀rá      ( ọ̀rọ̀, ọrọ̀ )
>
> & ọrò 6 0: òro, orò, oró, rò, ọrẹ, èrò   ( ọ̀rọ̀, ọrọ̀ )
>
> & ọ́rọ 3 0: òòró, ọ́ rọ, ọ́-rọ      ( ọ̀rọ̀, ọrọ̀ )
>
> & ọ̀rọ 6 0: òòró, ọ̀rọ̀, ọrọ̀, ọ̀bọ, ọ̀rá, ẹ̀rọ
>
> *
>
> & ọrọ́ 5 0: ọrọ̀, rọ́, ọkọ́, ọwọ́, ọrẹ́    (  ọrọ̀  )
>
> *
>
> & ọ́rọ́ 7 0: òórọ̀, ọ̀rọ̀, tọ́rọ́, pọ́rọ́, rọ́rọ́, ọ́ rọ́, ọ́-rọ́
>
> & ọ́rọ̀ 7 0: òórọ̀, ọ̀rọ̀, ọrọ̀, tọ́rọ̀, lọ́rọ̀, ọ́ rọ̀, ọ́-rọ̀
>
> & ọ̀rọ́ 7 0: òórọ̀, ọ̀rọ̀, ọ̀wọ́, ọ̀dọ́, ọ̀yọ́, ọ̀ṣọ́, ọ̀rẹ́   ( ọrọ̀ ).
> NOTE: The characters ọ̀, ọ́, ẹ̀ and ẹ́ are each a combination of two
> characters: the base letter (ọ or ẹ) and a tone mark.
>
> ̀        ́
>
> A sample of our affix file is shown below:
>
> SET  UTF-8
>
> KEY  ọwertyuiop|asdfghjkl|ṣẹbnm
>
> TRY    tmnkwlbàaáóoòprọ̀ọ́ọíìfdyṣsgẹ̀ẹ́ẹéèeùúuTMNṢLBÀÁAÒÓOPRÒÓỌÌÍIGÈÉẸÈÉE
>
>
>
> REP  94
>
> REP  a  à
>
> REP  à  á
>
> REP  a  á
>
> REP  á  à
>
> REP  a  àà
>
> REP  à  àà
>
> REP  a  àá
>
> REP  à  àá
>
> REP  á  àá
>
> REP  a  áà
>
> REP  à  áà
>
> REP  á  áà
>
> REP  a  aa
>
> REP  a  aá
>
> REP  ai  àì
>
> REP  ai  a
>
> REP  ài  à
>
> REP  ái  á
>
> REP  e  è
>
> REP  è  é
>
> REP  e  é
>
> REP  é  è
>
> REP  e  ẹ̀
>
> REP  e  ẹ́
>
> REP  ẹ  ẹ̀
>
> REP  ẹ̀  ẹ́
>
> REP  ẹ  ẹ́
>
> REP  ẹ́  ẹ̀
>
> REP  e  ẹ
>
> REP  è  ẹ̀
>
> REP  é  ẹ́
>
> REP  e  èè
>
> REP  è  èè
>
> REP  e  éè
>
> REP  e  éé
>
> REP  é  éé
>
> REP  e  èé
>
> REP  e  eé
>
> REP  e  ee
>
> REP  ẹ́  ẹ́ẹ̀
>
> REP  e  ẹ́ẹ̀
>
> REP  ẹ  ẹ́ẹ̀
>
> REP  e  ẹ̀ẹ̀
>
> REP  ẹ  ẹ̀ẹ̀
>
> REP  ẹ  ẹ̀ẹ́
>
> REP  e  ẹ̀ẹ́
>
> REP  e  ẹẹ
>
> REP  ẹ  ẹẹ
>
> REP  i  ì
>
> REP  ì  í
>
> REP  i  í
>
> REP  í  ì
>
> REP  i  íì
>
> REP  i  in
>
> REP  n  ǹ
>
> REP  n  ń
>
> REP  o  ọ̀
>
> REP  o  ọ́
>
> REP  o  ò
>
> REP  ò  ó
>
> REP  o  ó
>
> REP  ó  ò
>
> REP  ọ  ọ̀
>
> REP  ọ̀  ọ́
>
> REP  ọ  ọ́
>
> REP  ọ́  ọ̀
>
> REP  o  ọ
>
> REP  ò  ọ̀
>
> REP  ó  ọ́
>
> REP  o  òò
>
> REP  ò  òò
>
> REP  o  oo
>
> REP  o  oó
>
> REP  o  òó
>
> REP  o  ọ̀ọ̀
>
> REP  ọ  ọ̀ọ̀
>
> REP  ọ̀  ọ̀ọ̀
>
> REP  ọ̀  ọ̀ọ́
>
> REP  ọ  ọ̀ọ́
>
> REP  o  ọ̀ọ́
>
> REP  ọ́  ọ̀ọ́
>
> REP  s  ṣ
>
> REP  ṣ  s
>
> REP  u  ù
>
> REP  u  ú
>
> REP  u  ùú
>
> REP  ù  ùú
>
> REP  ú  ùú
>
> REP  ù  ùù
>
> REP  u  ùù
>
> REP  h y
>
> REP  E Ẹ
>
> REP  S Ṣ
>
> REP  O Ọ
>
>
>
> MAP 12
>
> MAP àaá
>
> MAP ọ̀ọọ́óoò
>
> MAP ìíi
>
> MAP ṣs
>
> MAP ẹ̀ẹ́ẹèée
>
> MAP ǹńn
>
> MAP ùúu
>
> MAP SṢ
>
> MAP ÀÁA
>
> MAP Ọ̀Ọ́ỌÒÓO
>
> MAP ÌÍI
>
> MAP ẸÈÉE
>
>
>
> PHONE 37
>
> PHONE à a
>
> PHONE á a
>
> PHONE aa a
>
> PHONE ó o
>
> PHONE ò o
>
> PHONE ọ̀ o
>
> PHONE ọ o
>
> PHONE ọ́ o
>
> PHONE ọ̀ ọ
>
> PHONE oo o
>
> PHONE í i
>
> PHONE ì i
>
> PHONE ṣ s
>
> PHONE ẹ̀ e
>
> PHONE ẹ́ e
>
> PHONE ẹ e
>
> PHONE è e
>
> PHONE é e
>
> PHONE ee e
>
> PHONE ǹ n
>
> PHONE ń n
>
> PHONE ù u
>
> PHONE ú u
>
> PHONE uu u
>
> PHONE Ṣ S
>
> PHONE À A
>
> PHONE Á A
>
> PHONE Ò O
>
> PHONE Ọ̀ O
>
> PHONE Ó O
>
> PHONE Ọ́ O
>
> PHONE Ọ O
>
> PHONE Ì I
>
> PHONE Í I
>
> PHONE È E
>
> PHONE É E
>
> PHONE E Ẹ
>
>
>
> ICONV 7
>
> ICONV ọ  ọ
>
> ICONV ọ̀  ọ̀
>
> ICONV ọ́  ọ́
>
> ICONV ṣ  ṣ
>
> ICONV ẹ̀  ẹ̀
>
> ICONV ẹ́  ẹ́
>
> ICONV ẹ  ẹ
>
> Best regards,
>
> Jeje
>
>
>
>
>
>
>
>
>
> --- On Wed, 2/4/09, Németh László <[email protected]> wrote:
>
> From: Németh László <[email protected]>
> Subject: Re: [lingu-dev] Assistance on Enconding different
> To: [email protected]
> Cc: [email protected]
> Date: Wednesday, February 4, 2009, 1:22 AM
>
> Hi,
>
> The second method could be better for suggestions. When multiple
> dictionaries are used for the same locale, the spell checker component of
> OpenOffice.org 3.x lists suggestions in the following format:
>
> suggestion_from_the_first_dictionary1
> suggestion_from_the_first_dictionary2
> suggestion_from_the_first_dictionary3
> suggestion_from_the_second_dictionary1
> suggestion_from_the_second_dictionary2
> suggestion_from_the_second_dictionary3
> etc.
>
> So the suggestions with different encodings are in different blocks. This is
> the preferred method, if you want suggestions with multiple encodings.
>
> Best regards,
> László
>
>
>
> 2009/2/2 Sunday Bolaji <[email protected]>
>>
>> Hi,
>>     For the redundant dictionary, are we putting all the words with
>> different encodings in one dictionary file, or creating a separate dictionary
>> file for the words of each encoding?
>>
>> Best regards
>> Jeje
>>
>> --- On Mon, 2/2/09, Németh László <[email protected]> wrote:
>>
>> From: Németh László <[email protected]>
>> Subject: Re: [lingu-dev] Assistance on Enconding different
>> To: [email protected], [email protected]
>> Date: Monday, February 2, 2009, 4:41 AM
>>
>> Hi,
>>
>> 2009/2/2 Sunday Bolaji <[email protected]>
>>>
>>> Hi,
>>>     I have tried your suggestion for a temporary solution to Unicode
>>> normalisation and it worked, but one thing is not clear to me: are we
>>> going to have a separate dictionary for each of the different encodings,
>>> or are we putting them all in one dictionary file?
>>> Another thing I observed with Hunspell is that if the correct word in the
>>> dictionary file has more characters than the wrongly typed word,
>>> Hunspell will suggest different words of the same length as the wrong word.
>>> Examples are given below:
>>> (1)
>>> "jókòó" is the correct word in the dictionary, but Hunspell will not
>>> suggest it if I type "joke", even though the REP table specifies replacing
>>> "o" with "òó". It will only suggest "jókòó" if the wrongly typed word is
>>> "jokoo".
>>>
>>> (2) "ọ̀rọ̀" is the correct word in the dictionary, but Hunspell will not
>>> suggest it if "ọrọ" is typed, even though the REP table specifies replacing
>>> "ọ" with "ọ̀". This is because "ọ" is a precomposed single character,
>>> while "ọ̀" is a combination of "ọ" and a tone mark. The REP table for
>>> similar characters is shown below. Is there anything I can do to solve
>>> this problem?
>>
>> REP and MAP suggestions are not combined with similarity algorithms,
>> unlike the PHONE and ph: phonetic suggestions.
>>
>> Check the following suggestion parameters:
>>
>> -- affix file ---
>> PHONE 4
>> PHONE ó o
>> PHONE ò o
>> PHONE ọ̀ o
>> PHONE ọ o
>>
>> Hunspell will convert "jókòó" to "jokoo" before comparing it with the input
>> word "joke".
>> You can use PHONE for normalization, too. Unfortunately, there was a
>> potential problem with PHONE and diacritics under Windows, so it is better to
>> use ph: fields (separated by tabs) for OpenOffice.org 3.0. Also, ph:
>> can work better for bigger word differences.
>>
>> --- dic file ----
>> jókòó ph:joko
>>
>> ọ̀rọ̀ ph:oro
>>
>>
>> Regards,
>> László
>>
>>
>>
>>
>>>
>>>
>>>
>>> REP  94
>>>
>>> REP  a  à
>>>
>>> REP  à  á
>>>
>>> REP  a  á
>>>
>>> REP  á  à
>>>
>>> REP  a  àà
>>>
>>> REP  à  àà
>>>
>>> REP  a  àá
>>>
>>> REP  à  àá
>>>
>>> REP  á  àá
>>>
>>> REP  a  áà
>>>
>>> REP  à  áà
>>>
>>> REP  á  áà
>>>
>>> REP  a  aa
>>>
>>> REP  a  aá
>>>
>>> REP  ai  àì
>>>
>>> REP  ai  a
>>>
>>> REP  ài  à
>>>
>>> REP  ái  á
>>>
>>> REP  e  è
>>>
>>> REP  è  é
>>>
>>> REP  e  é
>>>
>>> REP  é  è
>>>
>>> REP  e  ẹ̀
>>>
>>> REP  e  ẹ́
>>>
>>> REP  ẹ  ẹ̀
>>>
>>> REP  ẹ̀  ẹ́
>>>
>>> REP  ẹ  ẹ́
>>>
>>> REP  ẹ́  ẹ̀
>>>
>>> REP  e  ẹ
>>>
>>> REP  è  ẹ̀
>>>
>>> REP  é  ẹ́
>>>
>>> REP  e  èè
>>>
>>> REP  è  èè
>>>
>>> REP  e  éè
>>>
>>> REP  e  éé
>>>
>>> REP  é  éé
>>>
>>> REP  e  èé
>>>
>>> REP  e  eé
>>>
>>> REP  e  ee
>>>
>>> REP  ẹ́  ẹ́ẹ̀
>>>
>>> REP  e  ẹ́ẹ̀
>>>
>>> REP  ẹ  ẹ́ẹ̀
>>>
>>> REP  e  ẹ̀ẹ̀
>>>
>>> REP  ẹ  ẹ̀ẹ̀
>>>
>>> REP  ẹ  ẹ̀ẹ́
>>>
>>> REP  e  ẹ̀ẹ́
>>>
>>> REP  e  ẹẹ
>>>
>>> REP  ẹ  ẹẹ
>>>
>>> REP  i  ì
>>>
>>> REP  ì  í
>>>
>>> REP  i  í
>>>
>>> REP  í  ì
>>>
>>> REP  i  íì
>>>
>>> REP  i  in
>>>
>>> REP  n  ǹ
>>>
>>> REP  n  ń
>>>
>>> REP  o  ọ̀
>>>
>>> REP  o  ọ́
>>>
>>> REP  o  ò
>>>
>>> REP  ò  ó
>>>
>>> REP  o  ó
>>>
>>> REP  ó  ò
>>>
>>> REP  ọ  ọ̀
>>>
>>> REP  ọ̀  ọ́
>>>
>>> REP  ọ  ọ́
>>>
>>> REP  ọ́  ọ̀
>>>
>>> REP  o  ọ
>>>
>>> REP  ò  ọ̀
>>>
>>> REP  ó  ọ́
>>>
>>> REP  o  òò
>>>
>>> REP  ò  òò
>>>
>>> REP  o  oo
>>>
>>> REP  o  oó
>>>
>>> REP  o  òó
>>>
>>> REP  o  ọ̀ọ̀
>>>
>>> REP  ọ  ọ̀ọ̀
>>>
>>> REP  ọ̀  ọ̀ọ̀
>>>
>>> REP  ọ̀  ọ̀ọ́
>>>
>>> REP  ọ  ọ̀ọ́
>>>
>>> REP  o  ọ̀ọ́
>>>
>>> REP  ọ́  ọ̀ọ́
>>>
>>> REP  s  ṣ
>>>
>>> REP  ṣ  s
>>>
>>> REP  u  ù
>>>
>>> REP  u  ú
>>>
>>> REP  u  ùú
>>>
>>> REP  ù  ùú
>>>
>>> REP  ú  ùú
>>>
>>> REP  ù  ùù
>>>
>>> REP  u  ùù
>>>
>>> REP  h y
>>>
>>> REP  E Ẹ
>>>
>>> REP  S Ṣ
>>>
>>> REP  O Ọ
>>>
>>> Best regards,
>>> Jeje
>>>
>>
>>
>
>
>
