Re: [Apertium-stuff] [libvoikko] Lttoolbox (Apertium) morphology backend

2011-09-02 Thread Francis Tyers
On Fri 02 Sep 2011 at 13:24 +, Francis Tyers wrote:
> Looks good to me, and to Jim. We suggest commit and close. I'm going to
> do one final test, running a corpus through lt-proc -b before and after the
> patch to see whether there are any differences. I'll report back soon.

$ wc -l /tmp/ca-BILTRANS.*
   376857 /tmp/ca-BILTRANS.new
   376857 /tmp/ca-BILTRANS.old
   753714 total

$ cmp /tmp/ca-BILTRANS.old /tmp/ca-BILTRANS.new

No changes in ca->en over 376857 lines of the Catalan Wikipedia.

Fran
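
The before/after check above (same line count, and cmp printing nothing
because the files are byte-identical) can also be sketched in Python. This is
only an illustration of the comparison step, not part of the patch; the
function name compare_outputs is made up here, and the two paths are whatever
files the old and new lt-proc -b runs were redirected to.

```python
from itertools import zip_longest


def compare_outputs(old_path, new_path):
    """Compare two text files line by line.

    Returns (line_count, differing_line_numbers), where line_count is the
    length of the longer file and the line numbers are 1-based.
    """
    differing = []
    count = 0
    with open(old_path, encoding="utf-8") as old, \
         open(new_path, encoding="utf-8") as new:
        # zip_longest pads the shorter file with None, so a missing line
        # is reported as a difference instead of being silently ignored.
        for count, (old_line, new_line) in enumerate(zip_longest(old, new), start=1):
            if old_line != new_line:
                differing.append(count)
    return count, differing
```

Calling compare_outputs on the .old and .new files and getting an empty list
of differing lines corresponds to cmp producing no output. Unlike cmp, the
sketch also reports which lines changed, which helps when a patch is expected
to alter a few analyses.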


___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff


Re: [Apertium-stuff] [libvoikko] Lttoolbox (Apertium) morphology backend

2011-09-02 Thread Francis Tyers
On Fri 02 Sep 2011 at 11:13 +0200, Kevin Brubeck Unhammer wrote:
> I put a patch up at
> http://bugs.apertium.org/cgi-bin/bugzilla/show_bug.cgi?id=131 which
> solves this for both lt-proc -b and biltransWithQueue. Please
> test.
> 
> I haven't tried with the other biltrans* functions (I can't see that
> they're actually used in the rest of Apertium, so I'm not sure what
> they're there for).
> 
> It also fixes a problem where superfluous characters after tags would
> pass as matches in lt-proc -b (this bug was not present in
> biltransWithQueue). It's still possible to carry over _tags_ after the
> analysis of course.
> 
> 
> I guess it's not strange that this bug was here, since normally you
> never have words without tags in bidix, but when using these functions
> on a monodix it of course becomes a problem. (And, although it's not
> recommended, if people really do want to have non-tagged lemmas in
> bidix, lttoolbox should at least not give analyses for lemmas that are
> _not_ in the bidix.)
> 
> 
> best regards,
> Kevin Brubeck Unhammer

Looks good to me, and to Jim. We suggest commit and close. I'm going to
do one final test, running a corpus through lt-proc -b before and after the
patch to see whether there are any differences. I'll report back soon.

Fran




Re: [Apertium-stuff] [libvoikko] Lttoolbox (Apertium) morphology backend

2011-09-02 Thread Kevin Brubeck Unhammer
Kevin Brubeck Unhammer  writes:

> Seems to be a bug with partly-matching regexes in the biltrans
> functions.
>
> Testing the different functions, I get:
>
> biltransWithQueue: 
> ^Reykja/Reykja/Reykur$
>  qSize: 0
> biltransWithoutQueue: 
> ^Reykja/Reykja/Reykur$
> biltrans: 
> ^Reykja/Reykja/Reykur$
> biltransfull: ^$
>
> But, if I comment out the two regex entries
>
>   
>   
>
> at the end of apertium-is-en.is.dix, I get
>
> biltransWithQueue: @Reykjanghfghesi qSize: 0
> biltransWithoutQueue: @Reykjanghfghesi
> biltrans: @Reykjanghfghesi
> biltransfull: @Reykjanghfghesi
>
> Similarly on the command line with lt-proc -b (while regular lt-proc -a
> returns unknown, as it should – the persons/organisations regexes don't
> fully match either).

I put a patch up at
http://bugs.apertium.org/cgi-bin/bugzilla/show_bug.cgi?id=131 which
solves this for both lt-proc -b and biltransWithQueue. Please
test.

I haven't tried with the other biltrans* functions (I can't see that
they're actually used in the rest of Apertium, so I'm not sure what
they're there for).

It also fixes a problem where superfluous characters after tags would
pass as matches in lt-proc -b (this bug was not present in
biltransWithQueue). It's still possible to carry over _tags_ after the
analysis of course.


I guess it's not strange that this bug was here, since normally you
never have words without tags in bidix, but when using these functions
on a monodix it of course becomes a problem. (And, although it's not
recommended, if people really do want to have non-tagged lemmas in
bidix, lttoolbox should at least not give analyses for lemmas that are
_not_ in the bidix.)


best regards,
Kevin Brubeck Unhammer
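
The distinction drawn above, that trailing tags may carry over while other
trailing characters must not count as a match, can be illustrated with a toy
lookup. This is a sketch under stated assumptions, not the lttoolbox
implementation or the patch itself: the BIDIX table, its single entry and the
translate function are hypothetical, and a real bidix is a compiled
transducer rather than a Python dict.

```python
import re

# Hypothetical one-entry "bidix": source lemma+tags -> target lemma+tags.
BIDIX = {"hestur<n>": "horse<n>"}

# Zero or more complete tags such as <m><sg><nom>, and nothing else.
TRAILING_TAGS = re.compile(r"(?:<[^<>]+>)*")


def translate(lexical_unit):
    """Translate a lexical unit, carrying over unmatched trailing tags.

    Anything left after the matched entry must consist only of tags;
    otherwise the unit is treated as unknown and returned with an '@' prefix.
    """
    for source, target in BIDIX.items():
        if lexical_unit.startswith(source):
            rest = lexical_unit[len(source):]
            if TRAILING_TAGS.fullmatch(rest):
                return target + rest
    return "@" + lexical_unit
```

With this toy table, translate("hestur<n><m><sg><nom>") carries the extra
tags over to the target side, while translate("hestur<n>xyz") is rejected as
unknown because "xyz" is not a sequence of tags.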




Re: [Apertium-stuff] [libvoikko] Lttoolbox (Apertium) morphology backend

2011-08-16 Thread Kevin Brubeck Unhammer
Francis Tyers  writes:

> On Sun 28 Feb 2010 at 21:40 +0200, Harri Pitkänen wrote:
>> On Sunday 28 February 2010, Francis Tyers wrote:
>> > > I don't know Icelandic at all and therefore can't tell whether some of
>> > > the  words are accepted or rejected incorrectly.
>> > 
>> > Nice, it looks good. Some of the capitalised words should be recognised
>> > correctly, at least 'Bretlandi' and 'Norðmenn'.
>> 
>> I tried to fix the checking of capitalized words but started to run into
>> problems. It seems that the library API works in somewhat surprising (at
>> least to me) ways when you enter a word that starts with a capital letter
>> and ends with garbage.
>> 
>> The implementation is here
>> http://voikko.svn.sourceforge.net/viewvc/voikko/trunk/libvoikko/src/morphology/LttoolboxAnalyzer.cpp?revision=3182&view=markup
>> 
>> and test cases here
>> http://voikko.svn.sourceforge.net/viewvc/voikko/trunk/libvoikko/python/ApertiumIcelandicTest.py?revision=3183&view=markup
>> 
>> I was able to get all test cases implemented except the one with TODO in its
>> method name. How would you suggest fixing the code so that all tests would
>> pass? Of course a patch would be most welcome :)
>
> Hmm, strangely enough, when I try an unknown word I get similar strange
> output:
>
> $ ./test mor.bin 
> ^Reykjanghfghesi$ -->
> ^Reykja/Reykja/Reykur$

Seems to be a bug with partly-matching regexes in the biltrans
functions.

Testing the different functions, I get:

biltransWithQueue: 
^Reykja/Reykja/Reykur$
 qSize: 0
biltransWithoutQueue: 
^Reykja/Reykja/Reykur$
biltrans: 
^Reykja/Reykja/Reykur$
biltransfull: ^$

But, if I comment out the two regex entries

  
  

at the end of apertium-is-en.is.dix, I get

biltransWithQueue: @Reykjanghfghesi qSize: 0
biltransWithoutQueue: @Reykjanghfghesi
biltrans: @Reykjanghfghesi
biltransfull: @Reykjanghfghesi

Similarly on the command line with lt-proc -b (while regular lt-proc -a
returns unknown, as it should – the persons/organisations regexes don't
fully match either).


-- 
Kevin Brubeck Unhammer
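
The symptom described above, an analysis for a prefix of the input being
returned as if it covered the whole word, can be reproduced with a toy model.
This is a minimal sketch and not the lttoolbox code: LEXICON and both function
names are hypothetical, the single entry is made up to mirror the output shown
in the thread, and the real biltrans functions walk a compiled transducer
rather than a dict.

```python
# Hypothetical toy lexicon: surface form -> list of analyses.
LEXICON = {"Reykja": ["Reykja", "Reykur"]}


def format_unit(surface, analyses):
    """Render a lexical unit in the ^surface/analysis/...$ stream format."""
    return "^" + "/".join([surface] + analyses) + "$"


def lookup_longest_prefix(word):
    """Buggy variant: accept the longest known prefix and drop the rest."""
    for end in range(len(word), 0, -1):
        prefix = word[:end]
        if prefix in LEXICON:
            return format_unit(prefix, LEXICON[prefix])
    return "@" + word


def lookup_whole_word(word):
    """Stricter variant: a match counts only if it consumes the whole word."""
    if word in LEXICON:
        return format_unit(word, LEXICON[word])
    return "@" + word
```

With the garbage word from the thread, lookup_longest_prefix returns the
analysis of "Reykja" and silently discards "nghfghesi", whereas
lookup_whole_word marks the input as unknown with the "@" prefix, which is the
behaviour seen once the regex entries were commented out.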
