Re: [Apertium-stuff] Stop merging lines

2018-11-04 Thread Kevin Brubeck Unhammer
mansur <6688...@gmail.com> wrote:

> Hello!
>
> I have a very big file (some millions of lines) with one sentence per line.
>
> When I run Apertium's tagger, it sometimes merges those lines. I tried
> inserting an empty line between the real lines, and it merged fewer lines. I
> inserted 10 empty lines and it merged fewer still, but some merging remains,
> which is not acceptable for me. What can I do to stop the merging?
>
> cat file.txt | sed -r 's/$/\n\n\n\n\n\n\n\n\n\n/' | apertium -n -d
> ./apertium-tat tat-tagger | cg-proc ./apertium-tat/dev/mansur.bin > file.txt

Does tat-tagger without cg-proc do it too?


signature.asc
Description: PGP signature
___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff


Re: [Apertium-stuff] Stop merging lines

2018-11-04 Thread Kevin Brubeck Unhammer
Kevin Brubeck Unhammer wrote:

>>
>> cat file.txt | sed -r 's/$/\n\n\n\n\n\n\n\n\n\n/' | apertium -n -d
>> ./apertium-tat tat-tagger | cg-proc ./apertium-tat/dev/mansur.bin > file.txt
>
> Does tat-tagger without cg-proc do it too?

That is, you're running the tat-tagger pipeline (what's in
modes/tat-tagger.mode) on the deformatted text, but then you run
cg-proc on the reformatted output. I don't know whether this is what
causes the line merging, but it will lead to errors. Instead, you
should just

cp apertium-tat/modes/tat-tagger.mode apertium-tat/modes/tat-tagger-devcg.mode

and edit apertium-tat/modes/tat-tagger-devcg.mode and add
"| cg-proc dev/mansur.bin" to the pipeline, and use

apertium -n -d ./apertium-tat tat-tagger-devcg < input.txt > output.txt
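
The copy-and-edit step can also be scripted. What follows is only a minimal Python sketch, assuming the mode file holds a single shell pipeline and that the extra stage belongs at the very end. The function name append_stage and the example pipeline string are made up for illustration; the real tat-tagger.mode will look different.

```python
def append_stage(mode_text: str, stage: str) -> str:
    """Return the pipeline in mode_text with `| stage` appended at the end."""
    pipeline = mode_text.rstrip()
    return f"{pipeline} | {stage}\n"


if __name__ == "__main__":
    # Hypothetical mode contents, for illustration only.
    original = "lt-proc -w '/path/to/tat.automorf.bin' | cg-proc '/path/to/tat.rlx.bin'\n"
    print(append_stage(original, "cg-proc '/path/to/dev/mansur.bin'"), end="")
```

Using an absolute path for the .bin file in the appended stage avoids depending on the directory the mode is run from.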


Re: [Apertium-stuff] Fwd: Stop merging lines

2018-11-04 Thread Kevin Brubeck Unhammer
mansur <6688...@gmail.com> wrote:

> My previous letter shows that merging doesn't happen because of additional
> cg-proc, because I tried to remove that part completely.

I haven't seen this with just lt-proc + apertium-tagger before. Can you
paste the contents of apertium-tat/tat-tagger.mode ?

> By the way, I also tried your recommendation and it gives an error:
>
> root@apertium:~# apertium -n -d ./apertium-tat tat-tagger-devcg file.txt
> VISL CG-3 Disambiguator version 0.9.9.11656
> cg-proc: process a stream with a constraint grammar

That looks like it couldn't find the .bin file; try using the absolute
path (all of /home/…/dev/mansur.bin)


Re: [Apertium-stuff] Stop merging lines

2018-11-04 Thread Kevin Brubeck Unhammer
mansur <6688...@gmail.com> wrote:

> I corrected the rule and it works without errors. But I still have a
> problem with merging lines.

Worst case, you could install apy and send the lines through one at a
time, but it'll be slower, probably by at least 4x:
https://stackoverflow.com/a/47422332/69663


Re: [Apertium-stuff] Stop merging lines

2018-11-05 Thread Kevin Brubeck Unhammer
mansur <6688...@gmail.com> wrote:

> Hello!
>
> 1) I tried all the solutions recommended here to avoid merging lines, but
> nothing helped... The only thing I didn't try yet is apertium-apy, but
> Kevin said this way is at least 4 times slower.

With the tat-mansur mode (git pull && make) I get the same number of
lines for the txt files in dev:

$ cat dev/*.txt|wc -l
3866
$ cat dev/*.txt |apertium -d . tat-mansur |wc -l
3866

Can you try to figure out where in your test corpus this happens, and
give a minimal example?
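
One way to narrow a big corpus down to a small reproducer is to bisect on the line count. The following Python sketch assumes the symptom is a changed number of lines. The names find_smallest_failing_chunk and fake_tagger are invented for illustration, and fake_tagger only stands in for the real pipeline.

```python
from typing import Callable, List, Optional


def find_smallest_failing_chunk(
    lines: List[str], run: Callable[[List[str]], List[str]]
) -> Optional[List[str]]:
    """Bisect `lines` down to a small chunk for which `run` changes the line count.

    `run` takes input lines and returns output lines. Returns None if the
    whole input keeps its line count.
    """
    if len(run(lines)) == len(lines):
        return None
    chunk = lines
    while len(chunk) > 1:
        mid = len(chunk) // 2
        left, right = chunk[:mid], chunk[mid:]
        if len(run(left)) != len(left):
            chunk = left
        elif len(run(right)) != len(right):
            chunk = right
        else:
            # The problem only shows up when both halves are together.
            break
    return chunk


def fake_tagger(lines: List[str]) -> List[str]:
    """Stand-in for the real pipeline: glues the line "BAD" to the next line."""
    out, i = [], 0
    while i < len(lines):
        if lines[i] == "BAD" and i + 1 < len(lines):
            out.append(lines[i] + " " + lines[i + 1])
            i += 2
        else:
            out.append(lines[i])
            i += 1
    return out


if __name__ == "__main__":
    print(find_smallest_failing_chunk(["a", "b", "BAD", "c", "d"], fake_tagger))
```

For real use, run would wrap a call to the actual pipeline (for example via subprocess, feeding the lines to apertium -d . tat-mansur) and split its output on newlines.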



Re: [Apertium-stuff] Stop merging lines

2018-11-06 Thread Kevin Brubeck Unhammer
Francis Tyers wrote:

> Yes it does. It will put a sentence boundary after every word, meaning
> that you won't get reliable tagger output. Apertium as far as I know
> has no way to treat sentences as a sequence of lines. This is because
> of how the format handling works.
>
> I think it would really be an excellent feature though. Perhaps a
> GitHub issue? I do however think it would involve messing with quite a
> bit of the pipeline.

However, we *should* treat NUL as hard separators – if we don't,
apertium-apy (and thus www.apertium.org) will risk sending output meant
for person1 to person2. (I have an inkling there might still be bugs in
apertium-transfer related to this.)

Anyway, if we at least handle NULs correctly in lt-proc and cg-proc,
you could turn linebreaks into NULs (first deleting any existing NULs
in the corpus) and tag with the -z option to lt-/cg-proc:

cat corpus.txt \
  | tr -d '\0' \
  | tr '\n' '\0' \
  | apertium-deshtml -n \
  | lt-proc -z -w 'apertium-tat/tat.automorf.bin' \
  | cg-proc -z 'apertium-tat/tat.rlx.bin' \
  | cg-proc -z -w -1 'apertium-tat/dev/mansur.bin' \
  | tr '\0' '\n' \
  | apertium-rehtml-noent

… finally turning NULs back into newlines.

With apertium-nob, this doesn't seem to run slower than without -z, and
doesn't merge lines in my test corpus.
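
The two tr steps at either end of that pipeline are simple enough to express in Python as well, which may help when the corpus is already being handled from a script. This is just a sketch of the conversion itself; the function names are invented here and nothing in it calls Apertium.

```python
def lines_to_nul_stream(text: str) -> str:
    """Drop any existing NULs, then turn every newline into a NUL."""
    return text.replace("\0", "").replace("\n", "\0")


def nul_stream_to_lines(stream: str) -> str:
    """Turn NULs back into newlines."""
    return stream.replace("\0", "\n")


if __name__ == "__main__":
    corpus = "first sentence\nsecond sentence\n"
    stream = lines_to_nul_stream(corpus)
    print(repr(stream))
    print(nul_stream_to_lines(stream) == corpus)
```

Deleting stray NULs first matters because any NUL already in the corpus would otherwise turn into an extra linebreak on the way back.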


Re: [Apertium-stuff] Stop merging lines

2018-11-07 Thread Kevin Brubeck Unhammer
mansur <6688...@gmail.com> wrote:

> It turned out that the last token (in the Apertium sense) disappears,
> whether it is a word or punctuation: just the last part, like ^./.$ or
> ^word/lemma$

Hm, yeah, it seems the NUL needs to go after the `]' on each linebreak
(that's how apy does it). Something like a sed 's/^]/]\x00/' after
deformatting might work better.

I'm not sure how to avoid the final three NULs at end-of-file, though
they're easy enough to postprocess out.

I'd still like to see a minimal test case where the regular pipeline
merges lines, though; lt-proc and cg-proc really shouldn't do that
(unless you do things like REMCOHORT in CG).


Re: [Apertium-stuff] Stop merging lines

2018-11-08 Thread Kevin Brubeck Unhammer
mansur <6688000-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:

> Some examples of Apertium's tagger messing with lines.
>
> Original:
> Китаплар да, кешеләр
> дә кайтты.
>
> Аңа ярдәм
> итәргә кирәк.
>
> Output lines where partial merging occurred:
> ^Китаплар да/Китап+да$^,/,$ ^кешеләр
> дә/кеше+да$
> ^кайтты/кайт$^./.$
>
> ^Аңа/Ул$ ^ярдәм итәргә/ярдәм ит$
> ^кирәк/кирәк+и$^./.$
>
> It is very difficult to find such cases in the big corpus.
>
> Best!
> Mansur

OK, so this isn't actually two lines getting merged into one (that's why
the wc -l is the same), but a multiword where the latter part is moved
before the linebreak so it can actually be part of the analysis, i.e.

кешеләр
дә 

on two lines gets the analysis

^кешеләр дә/кеше+да$


where the linebreak is output *after* the analysis.
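
Since the line count stays the same, wc -l cannot find these cases in a large corpus, but a search for lexical units whose text contains a linebreak can. The sketch below is one hedged way to do that in Python: the name units_spanning_lines is invented, and the pattern is a simplification that ignores escaped \^ and \$ inside a unit.

```python
import re

# A lexical unit runs from ^ to $; this matches one whose text contains a newline.
# Simplification: escaped \^ and \$ inside a unit are not handled.
CROSS_LINE_UNIT = re.compile(r"\^[^$^\n]*\n[^$^]*\$")


def units_spanning_lines(tagged: str):
    """Return (line_number, unit) for each lexical unit that crosses a linebreak."""
    hits = []
    for match in CROSS_LINE_UNIT.finditer(tagged):
        line_number = tagged.count("\n", 0, match.start()) + 1
        hits.append((line_number, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = "^Китаплар да/Китап+да$^,/,$ ^кешеләр\nдә/кеше+да$\n^кайтты/кайт$^./.$\n"
    for number, unit in units_spanning_lines(sample):
        print(number, repr(unit))
```

The reported line number is the line on which the unit starts, which is the line that ends up missing its last word.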

Do you not want the multiword analysis here? In that case, putting some
noise like .@#@ at the end of lines should work, assuming you have no
multiwords with those characters (but when doing translation, the period
at least should get an analysis, since unanalysed noise can get moved
around (or deleted) by transfer rules).

The NUL solution also works, but it seems the tools expect the NUL to
come after a superblank like [][\n], so

$ sed 's/proc /proc -z /g' modes/tat-mansur.mode >modes/tat-mansur-z.mode

$ cat /tmp/test \
  | tr -d '\0' \
  | apertium-deshtml -n \
  | sed 's/\[$/[][/; s/^]/]\x00/' \
  | sh modes/tat-mansur-z.mode \
  | tr -d '\0' \
  | apertium-rehtml-noent
^Китаплар да/Китап+да$^,/,$ ^кешеләр/кеше$
 ^дә/да$ ^кайтты/кайт$^./.$

^Аңа/Ул$ ^ярдәм/ярдәм$
 ^итәргә/ит$ 
^кирәк/кирәк+и$^./.$
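
For readers who find the sed expression hard to parse, here is the same pair of substitutions written out in Python. It is a sketch only: the function name mark_line_ends is invented, and the sample input assumes that the deformatter wraps each linebreak in a superblank, so that one line ends in `[` and the next starts with `]`.

```python
import re


def mark_line_ends(deformatted: str) -> str:
    """Apply the two sed substitutions to deformatter output."""
    # s/\[$/[][/ : a '[' ending a line opens the newline superblank;
    # put an empty superblank '[]' in front of it.
    text = re.sub(r"\[$", "[][", deformatted, flags=re.MULTILINE)
    # s/^]/]\x00/ : a ']' starting a line closes that superblank;
    # put a NUL right after it.
    return re.sub(r"^\]", "]\x00", text, flags=re.MULTILINE)


if __name__ == "__main__":
    # Assumed shape of deformatted text for the two input lines "a" and "b".
    print(repr(mark_line_ends("a[\n]b[\n]")))
```

The result has the form [][\n] followed by NUL at every original linebreak, which is the ordering described above.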


Maybe it'd make sense to have that as an option to apertium-destxt or
similar? So "apertium -f lines -d . tat-mansur" would add the -z's and
run with NUL's on each line, making the tools treat each line
separately, as if you'd just typed 'echo "$line"|apertium -d . tat-mansur'
for every line.


Re: [Apertium-stuff] Stop merging lines

2018-11-08 Thread Kevin Brubeck Unhammer
Francis Tyers wrote:

[...]

>>> That would be a good feature, but wouldn't get past the issue of the
>>> tagger/cg. E.g. if we do that then the tagger can't take into account
>>> context.
>>
>> Isn't that the whole point? (Ie. treat each line as completely
>> independent, no context.)
>
> I don't think so, I think Mansur wants the tagger to disambiguate
> according
> to the context, but have it in line-by-line output, like TreeTagger or
> UDpipe
> etc.

Well, it's only lt-proc doing the moving, so just move the NUL-deletion
before cg-proc:

   cat corpus.txt \
   | tr -d '\0' \
   | apertium-deshtml -n \
   | sed 's/\[$/[][/; s/^]/]\x00/' \
   | lt-proc -z -w 'tat.automorf.bin' \
   | tr -d '\0' \
   | cg-proc -z 'tat.rlx.bin' \
   | cg-proc -z -w -1 'dev/mansur.bin' \
   | apertium-rehtml-noent

Now only lt-proc should treat end-of-line as a stream delimiter.


Re: [Apertium-stuff] Stop merging lines

2018-11-08 Thread Kevin Brubeck Unhammer
Francis Tyers wrote:

>>
>> Maybe it'd make sense to have that as an option to apertium-destxt or
>> similar? So "apertium -f lines -d . tat-mansur" would add the -z's and
>> run with NUL's on each line, making the tools treat each line
>> separately, as if you'd just typed 'echo "$line"|apertium -d
>> . tat-mansur'
>> for every line.
>
> That would be a good feature, but wouldn't get past the issue of the
> tagger/cg. E.g. if we do that then the tagger can't take into account
> context.

Isn't that the whole point? (Ie. treat each line as completely
independent, no context.)


Re: [Apertium-stuff] Stop merging lines

2018-11-09 Thread Kevin Brubeck Unhammer
mansur <6688000-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:

> One more example:
>
> - Фәнис Яруллин �
> - Фәнис Яруллинга багышланган чараларның һәрберсендә катнашырга тырышам, -
> диде әдипнең дусты Мохтар Афзалов.
>
> ^-/-$ ^Фәнис/Фәнис$
> ^Яруллин/Яруллин$ �-/-$
> ^Фәнис/Фәнис$ ^Яруллинга/Яруллин$
> ^багышланган/багышла$ ^чараларның/чара$
> ^һәрберсендә/*һәрберсендә$ ^катнашырга/катнаш$
> ^тырышам/тырыш$^,/,$ ^-/-$
> ^диде/ди$ ^әдипнең/әдип$
> ^дусты/дуст$ ^Мохтар/Мохтар$
> ^Афзалов/Афзалов+и$^./.$
>
> Here it happens because of some broken char... But why?

I can't reproduce it, but maybe the broken character didn't survive the
e-mail. Could you e.g. put a text file with it on https://filebin.net/ ?



Re: [Apertium-stuff] Stop merging lines

2018-11-09 Thread Kevin Brubeck Unhammer
I still get 5 lines for that, could you upload the output you get too?
I get:

http://sprunge.us/fJYZbm

-Kevin

mansur <6688000-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:

> Hi!
> I uploaded it here:
> https://filebin.net/46e383wip8h2qcrc
>
>


Re: [Apertium-stuff] Stop merging lines

2018-11-09 Thread Kevin Brubeck Unhammer
mansur <6688...@gmail.com> wrote:

> Hi!
>
> root@apertium:~# locale
> LANG=ru_RU.UTF-8
> LANGUAGE=
> LC_CTYPE="ru_RU.UTF-8"
> LC_NUMERIC="ru_RU.UTF-8"
> LC_TIME="ru_RU.UTF-8"
> LC_COLLATE=C
> LC_MONETARY="ru_RU.UTF-8"
> LC_MESSAGES="ru_RU.UTF-8"
> LC_PAPER="ru_RU.UTF-8"
> LC_NAME="ru_RU.UTF-8"
> LC_ADDRESS="ru_RU.UTF-8"
> LC_TELEPHONE="ru_RU.UTF-8"
> LC_MEASUREMENT="ru_RU.UTF-8"
> LC_IDENTIFICATION="ru_RU.UTF-8"
> LC_ALL=
>
> What did you mean by "each of you"?

I'm guessing "each of mansur & kevin" :-)

LANG=nn_NO.UTF-8
LANGUAGE=nn_NO:nn:no_NO:no:nb_NO:nb:en
LC_CTYPE="nn_NO.UTF-8"
LC_NUMERIC="nn_NO.UTF-8"
LC_TIME="nn_NO.UTF-8"
LC_COLLATE="nn_NO.UTF-8"
LC_MONETARY="nn_NO.UTF-8"
LC_MESSAGES="nn_NO.UTF-8"
LC_PAPER="nn_NO.UTF-8"
LC_NAME="nn_NO.UTF-8"
LC_ADDRESS="nn_NO.UTF-8"
LC_TELEPHONE="nn_NO.UTF-8"
LC_MEASUREMENT="nn_NO.UTF-8"
LC_IDENTIFICATION="nn_NO.UTF-8"
LC_ALL=

It seems we both have UTF-8; the only difference is in LANGUAGE and
LC_COLLATE – I wouldn't have thought either of them would matter, but I get
U+1F609 WINKING FACE 😉
(and a newline) where you get
U+FFFD REPLACEMENT CHARACTER �
so it definitely seems encoding-related.

Is it lt-proc or cg-proc that does it?
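
To locate such lines in a big corpus before tagging, the input can be scanned for characters that are likely candidates. The Python sketch below flags invalid UTF-8, literal U+FFFD, and characters outside the Basic Multilingual Plane such as U+1F609. Treating non-BMP characters as suspects is an assumption drawn from this one example, not a confirmed cause, and the function name is invented.

```python
def find_suspect_lines(data: bytes):
    """Return (line_number, reason) for lines that may trip up older tools."""
    suspects = []
    for number, raw in enumerate(data.split(b"\n"), start=1):
        try:
            line = raw.decode("utf-8")
        except UnicodeDecodeError:
            suspects.append((number, "invalid UTF-8"))
            continue
        if "\ufffd" in line:
            suspects.append((number, "contains U+FFFD"))
        elif any(ord(ch) > 0xFFFF for ch in line):
            suspects.append((number, "character outside the BMP"))
    return suspects


if __name__ == "__main__":
    sample = "abc\nhi \U0001F609\nok\n".encode("utf-8") + b"\xff\n"
    for number, reason in find_suspect_lines(sample):
        print(number, reason)
```

Reading the file as bytes and decoding line by line keeps one bad line from aborting the whole scan.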



> On Fri, 9 Nov 2018 at 11:44, Xavi Ivars wrote:
>
>> What are the encodings that each of you are using in the shell? Is it a
>> UTF one in both cases?
>>
>>
>> --
>> Xavi Ivars
>> < http://xavi.ivars.me >
>>
>> On Fri, 9 Nov 2018 at 09:41, mansur
>> <6688000-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
>>
>>> Strange.
>>> I uploaded my output here: https://filebin.net/c7mikerq2vwv08ql
>>>
>>>


Re: [Apertium-stuff] Stop merging lines

2018-11-09 Thread Kevin Brubeck Unhammer
mansur <6688...@gmail.com> wrote:

> I tried different combinations:
> apertium-destxt -n |
> lt-proc -z -w 'apertium-tat/tat.automorf.bin' |
> cg-proc -z 'apertium-tat/tat.rlx.bin' |
> cg-proc -z -w -1 'apertium-tat/dev/mansur.bin' |
> apertium-retxt |
>
> And it does not merge if I remove "cg-proc -z -w -1
> 'apertium-tat/dev/mansur.bin' | " from the pipeline. Could you take a look
> at the rules there?

I can't see anything in dev/mansur.rlx that would cause this.

You could try seeing what happens when you remove '-w' or '-1' options
from that part of the pipeline, or you could try a different locale
(e.g. nn_NO.UTF-8):

https://askubuntu.com/questions/89976/how-do-i-change-the-default-locale-in-ubuntu-server#89983


Re: [Apertium-stuff] Stop merging lines

2018-11-09 Thread Kevin Brubeck Unhammer
mansur <6688...@gmail.com> wrote:

> Merging occurs in these cases:
>
> apertium-destxt -n |
> lt-proc -z -w 'apertium-tat/tat.automorf.bin' |
> cg-proc -z 'apertium-tat/tat.rlx.bin' |
> cg-proc -z 'apertium-tat/dev/mansur.bin' |
> apertium-retxt |
>
> root@apertium:~# locale
> LANG=en_US.UTF-8
> LANGUAGE=
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_PAPER="en_US.UTF-8"
> LC_NAME="en_US.UTF-8"
> LC_ADDRESS="en_US.UTF-8"
> LC_TELEPHONE="en_US.UTF-8"
> LC_MEASUREMENT="en_US.UTF-8"
> LC_IDENTIFICATION="en_US.UTF-8"
> LC_ALL=en_US.UTF-8
>
> And it stops merging only when I remove mansur.bin from the pipeline...
>
> What else can I try?

Wild guess, but could you try recompiling mansur.bin after changing
the locale? (It's really strange that tat.rlx.bin doesn't merge the lines
when mansur.bin does, with the exact same options to cg-proc.)


Re: [Apertium-stuff] Stop merging lines

2018-11-09 Thread Kevin Brubeck Unhammer
mansur <6688...@gmail.com> wrote:

> root@apertium:~# locale
> LANG=en_US.UTF-8
> LANGUAGE=
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_PAPER="en_US.UTF-8"
> LC_NAME="en_US.UTF-8"
> LC_ADDRESS="en_US.UTF-8"
> LC_TELEPHONE="en_US.UTF-8"
> LC_MEASUREMENT="en_US.UTF-8"
> LC_IDENTIFICATION="en_US.UTF-8"
> LC_ALL=en_US.UTF-8
>
> cg-comp dev/mansur.rlx dev/mansur.bin
>
> apertium-destxt -n |
> lt-proc -z -w 'apertium-tat/tat.automorf.bin' |
> cg-proc -z 'apertium-tat/tat.rlx.bin' |
> cg-proc -z -w -1 'apertium-tat/dev/mansur.bin' |
> apertium-retxt |
>
> No success, it merges. What else can I try? :)

I really don't know, other than checking you have up-to-date apertium
nightly packages etc.; this is quite strange.


Re: [Apertium-stuff] Stop merging lines

2018-11-09 Thread Kevin Brubeck Unhammer
mansur <6688...@gmail.com> wrote:

[...]

> apertium/now 3.4.2~r68466-0ubuntu1~precise1 amd64 [installed,local]

This is out of date. Try adding the nightly apt repo as in
http://wiki.apertium.org/wiki/Debian

$ dpkg -l lttoolbox apertium | grep ^ii
ii  apertium   3.5.2+g712~31845949-1~bionic1 amd64  Shallow-transfer machine translation engine
ii  lttoolbox  3.5.0+g424~93dd6c96-1~bionic1 amd64  Apertium lexical processing modules and tools


Re: [Apertium-stuff] Stop merging lines

2018-11-09 Thread Kevin Brubeck Unhammer
mansur <6688...@gmail.com> wrote:

> Should I run 'cg-comp dev/mansur.rlx dev/mansur.bin' before 'autogen.sh &&
> make' or after?

If it's not in the makefiles anyway, it doesn't matter.


Re: [Apertium-stuff] Stop merging lines

2018-11-11 Thread Kevin Brubeck Unhammer
mansur <6688...@gmail.com> wrote:

> Can we untag the file using apertium
> to get the text close to the original?

Just strip the analysis?

The quick and hacky way would be something like this (untested):

sed 's%/[^$]*[$]%%g' | tr -d '^'

You'll remove a bit too much if there were slashes and such in input,
but maybe it doesn't matter too much if you're just checking things?
If you need it to be correct, you can do it in python with
apertium-streamparser – it'll be slower, but you should be able to get
back to the exact input you had.
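
The same quick approach in Python, for cases where the check is part of a larger script. This mirrors the sed and tr one-liner above and shares its limitation: a slash or `$` in the original text will confuse it. It does not use apertium-streamparser, and the function name is invented for illustration.

```python
import re


def strip_analyses(tagged: str) -> str:
    """Remove '/analysis$' from each lexical unit, then drop the '^' markers."""
    without_analyses = re.sub(r"/[^$]*\$", "", tagged)
    return without_analyses.replace("^", "")


if __name__ == "__main__":
    print(strip_analyses("^Аңа/Ул$ ^ярдәм/ярдәм$"))
    print(strip_analyses("^кайтты/кайт$^./.$"))
```

Comparing the stripped output against the original corpus line by line then shows where surface forms moved across a linebreak.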


Re: [Apertium-stuff] Special tag for all unrecognized symbols

2018-11-13 Thread Kevin Brubeck Unhammer
mansur <6688...@gmail.com> wrote:

> Hello!
>
> There are so many symbols that are not recognized by Apertium's tagger and
> not marked in any way. For example, apertium-tat does not recognize the
> following symbols:
> _ @ % ~ |
> and many others.
>
> Is it possible to use some special tag (^_/_$) for such cases?

Yes, just give them analyses in tat.dix, e.g.:

[_@%~|]

(untested)


Re: [Apertium-stuff] Special tag for all unrecognized symbols

2018-11-13 Thread Kevin Brubeck Unhammer
mansur <6688...@gmail.com> wrote:

> I don't know how and where to file an issue for that. I was going to
> create an issue in apertium-tat, but it is a global thing, not just for
> Tatar...

lttoolbox:

https://github.com/apertium/lttoolbox/issues/new


Re: [Apertium-stuff] case-of work

2018-11-29 Thread Kevin Brubeck Unhammer
Sevilay Bayatlı wrote:

> Hi all,
>
> how should <case-of> work inside <modify-case>?
> It seems illogical to place it there, since clip will do the job: case-of
> only returns one of three strings: aa, AA, Aa.
> The right thing to do is just to replace the string returned by case-of
> with the right part of modify-case, which makes it useless.

I'm not sure what you're asking, but this snippet will make the second
lexical unit get the case of the first one:

  


But I don't normally use it; I typically place the case in a variable,
and make that the chunk case, e.g.



  
  




  
  …

I suppose that  might have been written

  

instead, haven't tried.


Re: [Apertium-stuff] let issue

2018-12-01 Thread Kevin Brubeck Unhammer
Sevilay Bayatlı wrote:

> hi,
>
> In the let element, I know that the second ("right") part can be any part
> that generates a string, like: get-case-from, case-of, b, var, lit,
> lit-tag, concat and clip.
> But for the first ("left") part, it's written that it could be clip, var,
> etc.
> And I can't see anything other than clip and var that the right part could
> be assigned to. So what's meant by "etc."?
>
> Here, from the Documentation of the Open-Source Shallow-Transfer Machine
> Translation Platform Apertium:
> 3.5.4.37 Element for assignment <let>
> The assignment instruction <let> assigns the value of the right part of
> the assignment (a literal string, a clip, a variable, etc.) to the left
> part (a clip, a variable, etc.).

See the file /usr/share/apertium/transfer.dtd for what can actually go
in a valid t1x file (or interchunk/postchunk.dtd for those).

Relevant parts of the file:






so *only* var and clip, there are no cetera.


Re: [Apertium-stuff] Current GSOC ideas

2019-01-30 Thread Kevin Brubeck Unhammer
Hèctor Alòs i Font wrote:

> where can we read
> anything about weight in lttoolbox?

Put the w attribute on an <e> like



and use

lt-proc -L1

to output only the 1 best weight classes.

I believe unmarked  is the same as , and lower is better.

Use lt-proc -W to show the weights (see lt-proc --help; it's missing
from the manual https://github.com/apertium/lttoolbox/issues/40 ).

I've been considering turning it on in nno.dix generation, to prefer
forms that have the same upper/lower-case as their lemma. It comes at a
slight cost to compilation time and fst size though (and I compile quite
often …).


Re: [Apertium-stuff] Current GSOC ideas

2019-01-30 Thread Kevin Brubeck Unhammer
Hèctor Alòs i Font wrote:

> Thanks, Kevin, but that's really a short description. Is there any
> explanation in the wiki or elsewhere?
> Hèctor

What do you want to know about it?

I put my short explanation, plus the ideas for what you can do with it
from "Work to be done" at
http://wiki.apertium.org/wiki/User:Techievena/GSoC_2018_Work_Product_Submission ,
on the wiki at http://wiki.apertium.org/wiki/Lttoolbox/weights – feel free
to fill that in :)


Re: [Apertium-stuff] spanish->english- generate issue

2019-02-02 Thread Kevin Brubeck Unhammer
Those error messages should be investigated – are you calling a macro
with more params than it accepts? (npos attribute)

Sevilay Bayatlı wrote:

> hi,
>
> I am using this command to generate the English sentences for the
> spanish->english pair. Is this considered a problem, or is it fine?
>
> apertium-postchunk
> /media/sevilay/SAMSUNG/apertium-eng-spa/apertium-eng-spa.spa-eng.t3x
> /media/sevilay/SAMSUNG/apertium-eng-spa/spa-eng.t3x.bin
> interchunk_ax00.txt| lt-proc -g
> /media/sevilay/SAMSUNG/apertium-eng-spa/spa-eng.autogen.bin | lt-proc -p
> /media/sevilay/SAMSUNG/apertium-eng-spa/spa-eng.autopgen.bin >
> transfer_ax00.txt
>
>
> Error in
> /media/sevilay/SAMSUNG/apertium-eng-spa/apertium-eng-spa.spa-eng.t3x: line
> 182: index > limit
> Error in
> /media/sevilay/SAMSUNG/apertium-eng-spa/apertium-eng-spa.spa-eng.t3x: line
> 182: index > limit
> Error in
> /media/sevilay/SAMSUNG/apertium-eng-spa/apertium-eng-spa.spa-eng.t3x: line
> 182: index > limit
> Error in
> /media/sevilay/SAMSUNG/apertium-eng-spa/apertium-eng-spa.spa-eng.t3x: line
> 182: index > limit
> Error in
> /media/sevilay/SAMSUNG/apertium-eng-spa/apertium-eng-spa.spa-eng.t3x: line
> 228: index > limit
> Error in
> /media/sevilay/SAMSUNG/apertium-eng-spa/apertium-eng-spa.spa-eng.t3x: line
> 228: index > limit
> Error in
> /media/sevilay/SAMSUNG/apertium-eng-spa/apertium-eng-spa.spa-eng.t3x: line
> 228: index > limit
> Error in
> /media/sevilay/SAMSUNG/apertium-eng-spa/apertium-eng-spa.spa-eng.t3x: line
> 182: index > limit
>
>
>
> Sevilay
>


Re: [Apertium-stuff] Translator crash in apertium.org

2019-02-16 Thread Kevin Brubeck Unhammer
Hèctor Alòs i Font wrote:

> The translator from Catalan to French in apertium.org is not working for
> some days at least. Probably it crashed. Could someone restart it, please?
> It doesn't happen for the first time. Is it possible to get some
> information from the crash and open an issue in GitHub in the right place?

Catalan→French works, and the logs say it's been working at least since
February 12th – 351 requests, all marked as OK / http code 200.

French→Catalan however keeps giving empty output. I see the pipeline
includes "lsx-proc", which doesn't seem to handle NUL flushing: 

$ echo -e '^avoir$[][\n]\0^avoir$[][\n]\0' | lsx-proc -z fra-cat.autosep.bin
[][
][][
]

$ lsx-proc --help
Error: Cannot open file '--help'.

$ man lsx-proc 
No manual entry for lsx-proc
See 'man 7 undocumented' for help when manual pages are not available.





Re: [Apertium-stuff] phonology across word boundaries for HFST generator?

2019-03-01 Thread Kevin Brubeck Unhammer
Jonathan Washington wrote:

> Hi all,
>
> I have some students trying to trigger alternations across word boundaries
> like the following:
> inh japỹ / ijapỹ
> inh tũ / isũ
>
> These alternations consistently triggered with certain common words that
> end in "nh".
>
> They're using lexc/twol for the morphological generator.
>
> Our first approach was to put a literal ~ in "inh", i.e., the form was
> "i~nh".  This successfully triggered <a/> in the post-dix, though we got
> slightly mangled output:
> ij\/japỹ (or similar)

That slash seems like a bug. Could you post the exact input to
"lt-proc -p" (output of your generator) and the post.dix?

> Also, this isn't quite an ideal approach.  I suppose we could fairly easily
> automate the insertion of ~ before every nh in the lttoolbox (bin) version
> of the HFST transducer.  But it still seems to be somewhat buggy.
>
> Are there any other solutions that people have gotten to work?

IIUC, those kinds of word boundary-crossing changes are exactly what the
postgenerator is supposed to handle, though it is annoying to have to
insert the mark. I've been manually inserting the <a/> on double
consonants at the ends of words that can compound (to avoid getting
triple consonants if the next word starts with the same one), but manual
is error prone, and it's noisy in the .dix file.

Is there any reason postgen couldn't just run on *everything* LRLM
(left-to-right, longest match) and only apply the changes where it
matches (as if it were a version of sed that respects deformatting)?
Then you could just do
    <e><p><l>inh<b/>t</l><r>is</r></p></e>
in post.dix and have no changes to the hfst.
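
To make the "sed that respects deformatting" idea concrete, here is a minimal Python sketch. It is not how lt-proc -p works; the rule table is just the two alternations from this thread, and the names RULES, rewrite and postgen_everywhere are made up for illustration. It also ignores escaped brackets and rules that would have to match across a superblank.

```python
import re

# Hypothetical rules, taken from the two examples quoted in this thread.
RULES = {"inh t": "is", "inh j": "ij"}

# Superblanks ([...]) carry formatting and must pass through untouched.
SUPERBLANK = re.compile(r"(\[[^\]]*\])")


def rewrite(text, rules):
    """Left-to-right, longest-match replacement over plain text."""
    by_length = sorted(rules, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for left in by_length:
            if text.startswith(left, i):
                out.append(rules[left])
                i += len(left)
                break
        else:
            # No rule starts here: copy one character and move on.
            out.append(text[i])
            i += 1
    return "".join(out)


def postgen_everywhere(stream, rules=RULES):
    """Apply the rules to everything except superblanks."""
    parts = SUPERBLANK.split(stream)
    # Because the pattern has a capturing group, odd indices are superblanks.
    return "".join(p if i % 2 else rewrite(p, rules)
                   for i, p in enumerate(parts))
```

With something like this, no mark is needed in the generator output: text that matches no rule is simply copied through.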




Re: [Apertium-stuff] phonology across word boundaries for HFST generator?

2019-03-01 Thread Kevin Brubeck Unhammer
Francis Tyers wrote:

> On 2019-03-01 12:41, Kevin Brubeck Unhammer wrote:

[...]

>> IIUC, those kinds of word boundary-crossing changes are exactly what
>> the
>> postgenerator is supposed to handle, though it is annoying to have to
>> insert the mark. I've been manually inserting the <a/> on double
>> consonants at the ends of words that can compound (to avoid getting
>> triple consonants if the next word starts with the same one), but
>> manual
>> is error prone, and it's noisy in the .dix file.
>>
>> Is there any reason postgen couldn't just run on *everything* LRLM and
>> only apply the changes where it matches (as if it were a version of sed
>> that respects deformatting)? Then you could just do
>>     <e><p><l>inh<b/>t</l><r>is</r></p></e>
>> in post.dix and have no changes to the hfst.
>>
>
> I think that would be a wonderful idea!

https://github.com/apertium/lttoolbox/issues/42

(GSoC C++ applicants might want to try their hand at that one.)




Re: [Apertium-stuff] French to Catalan is not working

2019-03-06 Thread Kevin Brubeck Unhammer
Hèctor Alòs i Font wrote:

> Thanks, Tino. Could we know which is the module that crashes and, even
> better, at least one translation which makes it crash?

I wrote that here:
https://sourceforge.net/p/apertium/mailman/message/36588699/
The quick fix is to make a version of the translator that doesn't use
lsx-proc.

The better solution is to fix NUL flushing in apertium-separable.

GSoC applicants who wish to prove their C++ skills look here:
https://github.com/apertium/apertium-separable/issues/1#issuecomment-464338745




[Apertium-stuff] Released: nno-nob 1.2.0, swe-dan 0.8.0, swe-nor 0.3.0, dan-nor 1.4.0

2019-03-11 Thread Kevin Brubeck Unhammer
Good evening,

New versions of the four Scandinavian pairs are now available from
SourceForge, Github and apertium.org.

These releases come courtesy of Nynorsk pressekontor / NPK (an enclave
of Nynorsk journalists working within NTB, the Norwegian News
Agency[1]), with funding from the Norwegian Ministry of Culture. There
has been some press about the project.[2][3]

NPK have been using apertium-nno-nob in production since fall 2018 –
it's integrated into their translation/editing systems – and we've been
continually improving it with the help of their post-edits and
feedback. The form/spelling/style choices used by nob→nno are now more
modern and uniform (there was a major release of Nynorsk[4] back in
2012, while most style decisions in the translator were made in the
first release back in 2009).

Other major changes to nno-nob:
- 35 new transfer rules[5]
- 248 new lrx rules
- about 42.000 new names and 3.800 new non-names added to bidix
- regression testing by checking that WER does not increase
- lots of work on nob disambiguation
- we now do long-distance adjective congruence
- there's a post-nno.dix to get rid of triple consonants resulting from
  compounding
- compounding happens on proper nouns too now
- genitives are translated not just by preposition-rewriting, but we now
  also have:
  - lists of exceptions where we want to keep genitives
  - rewriting some nouns with relatives
  - rewriting nationalities with adjectives
  - rewriting some abstract nouns into compounds

The project is not yet done, but people have been asking about when the
fruits of it will show up on apertium.org :-)

The other three pairs have also had improvements since last release;
some were also getting pretty bad testvoc-issues due to changes in
dependencies[6], so they get releases too. Apart from testvoc, the pairs
have gotten some transfer rules and fixes merged in from nno-nob
(e.g. prop compounding, and handling genitives in coordinated NP's), and
various disambiguation and vocabulary updates.


-Kevin


[1] https://en.wikipedia.org/wiki/Norwegian_News_Agency
[2] https://www.medier24.no/artikler/na-blir-det-nynorsk-bonanza-i-ntb-splitter-ny-robot-oversetter-artikler-automatisk-fra-bokmal/440934
[3] https://framtida.no/2018/08/08/nynorskrobot-ei-god-loysing-for-a-dekke-nynorskprosenten
[4] http://www.sprakradet.no/upload/Brosjyrer/Ny%20nynorskrettskriving.pdf
[5] One of which required a bugfix to apertium-transfer:
    https://github.com/apertium/apertium/commit/542de014a93c96905198f193e0a62a89317fa8a9
[6] https://github.com/apertium/apertium-packaging/issues/12






Re: [Apertium-stuff] Released: nno-nob 1.2.0, swe-dan 0.8.0, swe-nor 0.3.0, dan-nor 1.4.0

2019-03-12 Thread Kevin Brubeck Unhammer
Ooh, forgot to mention, median/mean WER on test set of 1135 NTB news
articles that were post-edited with the version from this January,
evaluated with different versions of apertium-nno-nob:

  | git date   | median WER | mean WER | stdev |
  |------------+------------+----------+-------|
  | 2018-10-01 |      11.79 |    12.96 |  7.49 |
  | 2018-10-31 |       9.68 |    10.96 |  7.28 |
  | 2018-12-20 |       7.26 |     8.52 |  7.05 |
  | 2019-02-28 |       6.77 |     8.04 |  7.04 |

(apertium-eval-translator was run once for each of the 1135 articles,
for each of the checkouts of the translator+deps)
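
For readers unfamiliar with the metric, the numbers above can be read against this minimal sketch of WER and the per-article summary. It is not the apertium-eval-translator implementation; it assumes whitespace tokenisation, a non-empty reference and the sample standard deviation, and the function names are made up for illustration.

```python
import statistics


def wer(hypothesis, reference):
    """Word error rate in percent: word-level edit distance / reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # reference word missing
                           cur[j - 1] + 1,            # extra word in hypothesis
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return 100.0 * prev[-1] / len(ref)


def summarise(scores):
    """Median, mean and sample standard deviation of per-article scores."""
    return (statistics.median(scores),
            statistics.mean(scores),
            statistics.stdev(scores))
```

Here the hypothesis is the raw translator output and the reference is the post-edited article, so a lower score means fewer edits were needed.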




Re: [Apertium-stuff] add ambiguous weighted rules to apertium-transfer challenge

2019-03-12 Thread Kevin Brubeck Unhammer
Aboelhamd Aly wrote:

> I forked apertium core and then added and modified some files and it's now
> ready in my forked repo, you can take a look here
> https://github.com/aboelhamd/apertium

Just a little note on git usage: You should do your development on a
non-master branch. You can create one called e.g. ambigrules

git checkout -b ambigrules

and then, assuming your https://github.com/aboelhamd/apertium is called
"origin" in your checkout, do

git push --set-upstream origin ambigrules

once to make your https://github.com/aboelhamd/apertium the default
target of "git push" in that branch.

Then you can open a Pull Request from your ambigrules branch to the
master of https://github.com/apertium/apertium . You can open the PR
before your changes are ready to be merged, if you'd like some feedback
on it (maybe put what you posted on the list there even).

(Your own master branch you should reset to apertium/master, in case you
ever want to make unrelated changes (e.g. some quick fix that should be
in there before your ambigrules branch is merged), and to make it easier
to merge in changes from apertium/master.)


-Kevin




Re: [Apertium-stuff] apertium-fra-cat new version

2019-03-15 Thread Kevin Brubeck Unhammer
Hèctor Alòs i Font wrote:

> As Xavi Ivars said, the bug in apertium-separable is making the
> apertium-fra-cat language pair pretty unusable, so I have removed it from
> the pipeline in modes.xml. I've also passed a testvoc, so a new version is
> ready to be released. @Tino Didriksen, could you prepare it, please?
>
> In fact, apertium-separable was used only experimentally, for a few cases
> in both sides of this pair. It may be very useful for both sides of the
> pair, especially for dealing with French double negative clauses. So, when
> dropping its use, practically the quality of the translations does not
> decrease with respect to the previous version. On the contrary, the new
> version introduces a few more words, disambiguation rules, lexical
> selection rules and transfer rules.

🎉

Xavi has a fix on the way for the problem, btw:
https://github.com/apertium/apertium-separable/issues/1#issuecomment-471121936
but apertium-separable needs a new release I suppose





Re: [Apertium-stuff] Ubuntu 14.04 EOL. And 32bit EOL?

2019-05-21 Thread Kevin Brubeck Unhammer
Tommi A Pirinen wrote:

> On Tue, May 21, 2019 at 03:39:35PM +0200, Tino Didriksen wrote:
>> Ubuntu 14.04 Trusty Tahr reached EOL on April 30th, and thus removed from
>> packaging.
>
> I think some Travis configs, like ones that copy/pasted mine, will
> have dist: trusty set in their configs; this seems to fail now and
> needs to be replaced with dist: xenial. I do not remember why I had to
> hard-code the dist originally but I'm guessing we'll find out soon.

I've fixed some, but there are a few left:
https://github.com/search?l=&q=org%3Aapertium+trusty+filename%3A.travis.yml&type=Code




Re: [Apertium-stuff] [PATCH] supervised weighting of automata

2019-06-13 Thread Kevin Brubeck Unhammer
Nick Howell wrote:

>> +cat $CORPUS | sed -e 's/[ \t]//' | sed -e 's/\^.*\///' |
>
> unnecessary use of "cat"; instead use <"$CORPUS" (quoting in case of
> whitespace in the filename).

https://www.shellcheck.net/ would've told you that – please everybody
run shellcheck on any shell scripts you commit :)

You can get inline suggestions in Atom, VSCode, Emacs, Vim, Sublime,
Geany and probably other editors:
https://github.com/koalaman/shellcheck#user-content-in-your-editor




[Apertium-stuff] Does anyone need the &entities; from the "html" reformatter?

2019-06-25 Thread Kevin Brubeck Unhammer
Hi all,

Currently, there are two html-formats, "html" and "html-noent". This is
the difference between them:

$ echo å | apertium -f html-noent nob-nno
å

$ echo å | apertium -f html   nob-nno
&aring;

ie. the one named "html" replaces some (but not all!) non-ascii
characters with xml &entities.
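
To illustrate the "some but not all" part, here is a small Python sketch that imitates this kind of replacement. It is not the code of the Apertium reformatter; it assumes that only characters with a named HTML 4 entity are replaced, and to_entities is a made-up name.

```python
from html.entities import codepoint2name


def to_entities(text):
    """Replace non-ASCII characters that have a named HTML 4 entity."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if ord(ch) > 127 and name:
            out.append(f"&{name};")
        else:
            # ASCII, or no named entity available: leave the character alone.
            out.append(ch)
    return "".join(out)
```

Under that assumption "å" becomes an entity, while a character such as U+1EF9 (y with tilde), which has no named entity, passes through unchanged.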

I believe the "html" behaviour is more unexpected/surprising to new
users, and it would make sense to make the "html-noent" behaviour the
default, while renaming the other one to something like "html-ent", so
that we could get:

$ echo å | apertium -f html nob-nno
å

$ echo å | apertium -f html-ent nob-nno
&aring;

(and keeping "html-noent" around as an alias to avoid breakage)

But first: Does anyone have hard-to-change scripts or programs that
depend on the current behaviour (while still needing up-to-date apertium
versions)?





Re: [Apertium-stuff] question about -z and null flush behaviour

2019-07-16 Thread Kevin Brubeck Unhammer
Jonathan Washington wrote:

> Hi all,
>
> I understand that -z was added to a number of Apertium and Apertium-related
> programs, allowing them to continue to accept input and simply flush
> buffers when a null character is encountered.
>
> Question 1.  Is it right that Apertium itself does not follow this
> behaviour?  To use an existing mode with this behaviour, one would have to
> add -z to each relevant command (as done in APy -
> https://github.com/apertium/apertium-apy/blob/master/apertium_apy/utils/translation.py#L145
> ), including wrapping in the appropriate de- and reformatters?

If you mean the shell script /usr/bin/apertium, then it is correct that
it does not add -z anywhere.

> Question 2.  What sort of modification would need to be made to a
> miscellaneous program in the pipeline for it to behave this way?  Can
> anyone point to an example diff of this sort of modification?

https://github.com/khannatanmai/apertium-anaphora/commit/10ba079536c12e37d164c96804afbd3b2d568cf6
is one way, though I think more tools use the *_wrapper_null_flush idiom
as in
https://github.com/apertium/apertium/blob/master/apertium/transfer.cc#L1889
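
Independent of either linked implementation, the behaviour a tool needs can be sketched in a few lines of Python. This is only an illustration of the contract (process up to each NUL, write the result, write a NUL, flush, keep running); process and run_null_flush are hypothetical names, and the upper-casing stands in for whatever the tool really does.

```python
import io


def process(chunk: bytes) -> bytes:
    """Placeholder for the real per-request work of a pipeline stage."""
    return chunk.upper()


def run_null_flush(inp, out):
    """Read NUL-delimited requests from inp and answer each one immediately."""
    buf = bytearray()
    while True:
        byte = inp.read(1)
        if not byte:
            # EOF: handle any trailing input that had no NUL after it.
            if buf:
                out.write(process(bytes(buf)))
                out.flush()
            return
        if byte == b"\0":
            out.write(process(bytes(buf)))
            out.write(b"\0")   # pass the NUL on so the next stage flushes too
            out.flush()        # do not sit on buffered output
            buf.clear()
        else:
            buf += byte


# Two requests in, two NUL-terminated answers out.
demo_out = io.BytesIO()
run_null_flush(io.BytesIO(b"ab\0cd\0"), demo_out)
```

A stage that reads until EOF before writing anything, or that drops the NUL, stalls every stage after it, which is why one non-flushing program is enough to make a whole pipeline hang or return empty output.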




Re: [Apertium-stuff] Little problem in cat-ita make

2019-07-23 Thread Kevin Brubeck Unhammer
Hèctor Alòs i Font wrote:

> Hello, world!
>
> I am preparing the ita and cat-ita packages to publish them. In principle,
> everything seems to work fine, but from the beginning I get an error when I
> make the first "make" after installing or doing "make clean", for instance.
> I then run the make for the second time and it does not give any errors,
> but it's ugly. I suspect there must be some problem in file permissions.

Is this still an issue?




Re: [Apertium-stuff] Little problem in cat-ita make

2019-07-26 Thread Kevin Brubeck Unhammer
Hèctor Alòs i Font wrote:

> On Tue, 23 Jul 2019 at 14:31, Kevin Brubeck Unhammer wrote:
>
>> Hèctor Alòs i Font wrote:
>>
>> > Hello, world!
>> >
>> > I am preparing the ita and cat-ita packages to publish them. In
>> principle,
>> > everything seems to work fine, but from the beginning I get an error
>> when I
>> > make the first "make" after installing or doing "make clean", for
>> instance.
>> > I then run the make for the second time and it does not give any errors,
>> > but it's ugly. I suspect there must be some problem in file permissions.
>>
>> Is this still an issue?
>>
>
> I still have this kind of problem.

I'm not able to reproduce this – on fresh git clones of
{cat,ita,cat-ita}, I build as usual and try various combinations of
make, make clean and sudo make install. Could you have some root-owned
files left over there? (`ls -l *mode*` should show that) Or see if it
still happens on a clean checkout (`apertium-get cat-ita` is probably
the fastest way to get one)
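
If eyeballing the ls output is tedious, the same check can be scripted. This is a minimal sketch, assuming a Unix-like system; files_owned_by is a made-up helper, and uid 0 is root.

```python
import os


def files_owned_by(top=".", uid=0):
    """Return paths under top whose owner is uid (0 is root)."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat so that symlinks are reported themselves, not their targets
            if os.lstat(path).st_uid == uid:
                hits.append(path)
    return hits
```

Calling files_owned_by(".") from the pair directory lists anything left behind by an earlier sudo make install.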




Re: [Apertium-stuff] Little problem in cat-ita make

2019-07-28 Thread Kevin Brubeck Unhammer
Hèctor Alòs i Font wrote:

> On Fri, 26 Jul 2019 at 13:54, Kevin Brubeck Unhammer wrote:
>
>> I'm not able to reproduce this – on fresh git clones of
>> {cat,ita,cat-ita}, I build as usual and try various combinations of
>> make, make clean and sudo make install. Could you have some root-owned
>> files left over there? (`ls -l *mode*` should show that) Or see if it
>> still happens on a clean checkout (`apertium-get cat-ita` is probably
>> the fastest way to get one)
>>
>
> Sorry, Kevin. I'm not able to reproduce it myself now. I've also tried
> with apertium-ita, and it works fine. I can't understand how I got this
> error. In any case, it seems I did something wrong in my own copy. Sorry
> for the false alarm and the time you wasted.
>
> So, hopefully there are no problems with publishing this language pair.

No problem; seems publishable as far as makefiles go at least :-)




Re: [Apertium-stuff] Request for review - Unsupervised weighting of automata patches

2019-08-16 Thread Kevin Brubeck Unhammer
Tino Didriksen wrote:

> Regarding streamparser, yes you should reuse our existing packages. Use
> libraries, use packages, don't duplicate code.

But do check that it doesn't slow things down (if all you need is one
clean lexical unit per line, cleanstream is about 5x faster).

> As for cleanstream, I was sure that had been packaged, but seems not. I'll
> put that on the list...

Small enough to put in main apertium repo?
cf. https://github.com/apertium/organisation/issues/9

-Kevin




Re: [Apertium-stuff] Portuguese-Catalan (almost) ready for packaging

2019-08-28 Thread Kevin Brubeck Unhammer
This sounds great :-) Any idea about WER?

Hèctor Alòs i Font wrote:


[...]

> /usr/bin/install: no se sobreescriurà
> '/usr/local/share/apertium/apertium-por-cat/apertium-por-cat.cat-por.t1x',
> tot just creat, amb 'apertium-por-cat.cat-por.t1x'
> Makefile:378: recipe for target 'install-apertium_por_catDATA' failed
> make[1]: *** [install-apertium_por_catDATA] Error 1
> make[1]: Leaving directory '/home/hector/apertium/apertium-por-cat'
> Makefile:581: recipe for target 'install-am' failed
> make: *** [install-am] Error 2

This should be fixed in newest git :)

> An additional problem is what to do with the current apertium-pt-ca in
> GitHub. The existence of both apertium-pt-ca and apertium-por-cat is a mess
> for users, among others, for beta.apertium.org. At least apertium-pt-ca
> should lose its "trunk" label in GitHub.

I removed the «apertium-trunk» label – should it have some other label?




Re: [Apertium-stuff] Portuguese-Catalan (almost) ready for packaging

2019-08-28 Thread Kevin Brubeck Unhammer
Hèctor Alòs i Font wrote:

> Thanks, Kevin!
> About the other question, I have no idea, but the case of apertium-ca-it
> (which has to be replaced by apertium-cat-ita) is the same. I cannot
> understand what has been done with it:
> https://github.com/search?q=apertium-ca-it

It seems that ca-it has been *renamed* cat-ita, so now
https://github.com/apertium/apertium-ca-it redirects to
https://github.com/apertium/apertium-cat-ita

> By the way, apertium-por-cat should probably be labeled as trunk.

done :)




Re: [Apertium-stuff] genv*dix.py need conversion

2019-09-12 Thread Kevin Brubeck Unhammer
Jonathan Washington wrote:

> I'm curious about metalrx.py.  Where is it used?  What does it do?  Is it
> documented anywhere?  I've been using an xslt file I found in a Sámi pair
> for creating lrx from metalrx.

The Sámi pairs use an XSLT script
https://github.com/apertium/apertium-sme-sma/blob/master/metalrx-to-lrx.xslt
that adds two extra features to lrx files:

1. , see example at
   
https://github.com/apertium/apertium-sme-sma/blob/master/apertium-sme-sma.sme-sma.metalrx#L3
   for often-used sequences (that file just has a single , but you
   could have several in a row)

2.  you can wrap around a  or  or  to
   repeat it up to n times:
   
https://github.com/apertium/apertium-sme-sma/blob/master/apertium-sme-sma.sme-sma.metalrx#L2184





Re: [Apertium-stuff] genv*dix.py need conversion

2019-09-12 Thread Kevin Brubeck Unhammer
Tino Didriksen wrote:

> https://github.com/apertium/apertium-por-cat uses both metalrx.py and the
> XSLT.
>
> I have moved all these shared scripts and XSLTs to apertium (
> https://github.com/apertium/apertium/tree/master/scripts ), and will be
> updating languages/pairs to use them from there.

So it seems the python script 
https://github.com/apertium/apertium/blob/master/scripts/apertium-metalrx
lets you do  with templates like {{pfoo}}:
https://github.com/apertium/apertium-por-cat/blob/master/apertium-por-cat.cat-por.metalrx#L5
that the caller can replace:
https://github.com/apertium/apertium-por-cat/blob/master/apertium-por-cat.cat-por.metalrx#L3223

Perhaps
https://github.com/apertium/apertium/blob/master/scripts/apertium-metalrx-to-lrx.in
should call both the XSLT and the python script? It seems they should be
compatible, doing nothing if the special features aren't used. (If so,
scripts/apertium-metalrx should probably be named somethingelse.py.)




Re: [Apertium-stuff] genv*dix.py need conversion

2019-09-14 Thread Kevin Brubeck Unhammer
Daniel Swanson wrote:

> With regards to metalrx features, wouldn't it be better (and in some ways
> easier) to incorporate them directly into lrx-comp?
>
> I just opened a PR for adding :
> https://github.com/apertium/apertium-lex-tools/pull/32.  and
>  would be a bit trickier, but should still be fairly doable.

👍 I'd definitely prefer that – my metalrx.xslt was never meant as
anything but a stopgap until Someone fixed it in C++, thanks =D




Re: [Apertium-stuff] Adding invariant prefixes

2019-09-17 Thread Kevin Brubeck Unhammer
Jaume Ortolà i Font wrote:

> Hi,
>
> I would like to be able to translate automatically certain words formed by
> "a certain prefix + a certain POS" without having to add new entries to the
> dictionaries. For example, any word formed by "anti" + any valid adjective
> in translations spa<>cat:
>
> antihúngaro <> antihongarès
> antihúngaras <> antihongareses
> antialemán <> antialemany
> antipluvial <> antipluvial
> antiestatista <> antiestatista
> ...
>
> The word forms and the POS tags would remain unchanged. (But in some
> languages some spelling changes may be necessary. In Spanish: "anti + ruso
> " becomes antirruso.)
>
> This feature could be used in a lot of language pairs. Has it been
> implemented anywhere? How could it be done?

You could have a  prepended to every ,


  anti
  

alemán

That would be similar to what people do with HFST.

-

In nno-nob I use the compounding feature of lttoolbox instead. The
relevant parts of the pardefs:



  
  


   



 
  


 


  anti
alemán


Then "anti" alone doesn't get an analysis (compound-only-L can only give
an analysis in compounds), but it can be analysed as a
prefix, if you use lt-proc with the -e argument:
^anti+alemán$

Pretransfer turns this into two lu's

^anti$ ^alemán$

The tags <compound-only-L> and <compound-R> are "special" – a compound
analysis can be made of one or more L's followed by an R. The tags are
hidden from the output when you use lt-proc -e.
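
The splitting step can be pictured with a short Python sketch. It only imitates the one behaviour described here (one joined lexical unit becomes several, separated by a space), not the rest of what apertium-pretransfer does; the tags <pr> and <adj> are assumptions made up for the example, since the real tags depend on the dictionary, and escaped "+" characters are not handled.

```python
import re

# One lexical unit in the stream: everything between ^ and $.
LU = re.compile(r"\^([^$]*)\$")


def split_joined(stream):
    """Turn ^a<t>+b<t>$ into ^a<t>$ ^b<t>$ for every lexical unit."""
    def split_one(match):
        parts = match.group(1).split("+")
        return " ".join(f"^{part}$" for part in parts)
    return LU.sub(split_one, stream)
```

After this step the transfer rules see the prefix and the adjective as two separate units, which is why the space has to be removed again later.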


The downside to this method is that every right-hand-side needs the tag
<compound-R> on it, so if you had


 


that needs to be


 


etc.

You will also need transfer rules to remove the space added by
pretransfer, and chunk it etc.

The upside is that you can combine words without listing everything
twice. If you've only got one prefix, the HFST-like method is probably
better. If you're combining lots, compounding may be worth considering.




Re: [Apertium-stuff] Adding invariant prefixes

2019-09-18 Thread Kevin Brubeck Unhammer
Jaume Ortolà i Font wrote:

> I have tried adding a mark to the newly formed words and removing it with
> CG if necessary. It works fine.

Why not keep it all the way through the translator? That seems safer to
me, and you don't have to worry that they may not be synonymous.





Re: [Apertium-stuff] Apertium Python Module Names

2019-10-14 Thread Kevin Brubeck Unhammer
Tino Didriksen wrote:

> https://www.debian.org/doc/packaging-manuals/python-policy/ch-module_packages.html#s-package_names
>
> The package python3-apertium must provide the Python module apertium, but
> it provides apertium_core. I can fix this by either adding an alias
> apertium.py with 'from apertium_core import *' or by renaming the package
> to python3-apertium_core.
>
> I would want source package apertium to own python3-apertium and module
> apertium. That just looks nicer and follows logically. But that's in
> conflict with apertium-python also wanting to own Python module name
> apertium.
>
> The name "apertium" is just too overloaded, and now it's starting to be an
> issue.
>
> What solution are people in favour of?
>
> (same issue with python3-cg3 module constraint_grammar, and python3-hfst
> module libhfst)

I'm guessing python3-apertium is the name of the new library from this
GSoC. What is this other package, apertium-python? apt show gives me
nothing.




Re: [Apertium-stuff] Apertium Python Module Names

2019-10-14 Thread Kevin Brubeck Unhammer
Sushain Cherivirala wrote:

>>
>> I'm guessing python3-apertium is the name of the new library from this
>> GSoC. What is this other package, apertium-python? apt show gives me
>> nothing.
>
>
> python3-apertium as Tino is referring to it is the SWIG/Python bindings for
> https://github.com/apertium/apertium. They currently export a
> `apertium_core` module.
>
> This allows for apertium-python (https://github.com/apertium/apertium-python),
> the
> GSoC project (from this year and last) to export an `apertium` module that
> under the
> hood uses bindings from the `apertium_core`, `lttoolbox`, and
> `constraint_grammar`
> Python packages (aspirationally also `hfst`).
>
> Since "apertium-python" is the more user facing version, having wrappers
> for analysis,
> taggers, translation, etc, it should in my opinion own the `apertium`
> module. Today,
> if you run `pip install apertium`, it is what you get. Running `apt-get
> install python3-apertium`
> should be consistent with it and it would be very odd otherwise.
>
> We chose `apertium_core` in order to avoid the naming conflict but I don't
> have a great
> solution as far as the Debian package goes. There weren't plans to export
> the package
> to Pip as well so we didn't really focus on it.

OK, I agree it'd be nice if the user could do "import apertium" and get
the kitchen sink. Could the debian package names be based on that,
e.g. python3-apertium is the kitchen sink package that depends on
python3-apertium-core?




Re: [Apertium-stuff] odt translation not working

2019-10-23 Thread Kevin Brubeck Unhammer
Xavi Ivars wrote:

> I found the issue: it's due to a bug introduced last May (), while doing
> some improvements to the `apertium` main script.

Wops, that was me, sorry!

Great that you added tests for it :)




Re: [Apertium-stuff] error

2019-11-21 Thread Kevin Brubeck Unhammer
kiran srigiri wrote:

> Trying to make after adding words in .dix but hit with this error

Hi,

It takes some practice to read error messages. This one says that it
failed to parse somewhere on/after line 6. In your <sdefs>, you should
have only <sdef> elements (that's what "expecting (sdef)+" means), but
instead it found a CDATA in between all those sdefs. CDATA is e.g. plain
text, not xml. You would get such an error if you had e.g.

<sdefs>
  <sdef n="n"/>
  <sdef n="adj"/>
SOME TEXT HERE
  <sdef n="vblex"/>
</sdefs>

(You may want to try http://wiki.apertium.org/wiki/Apertium-viewer which
has an XML editor that can catch dix errors.)
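
If you just want to locate the offending text, a few lines of Python will do. This is a minimal sketch using only the standard library; it assumes the file is otherwise well-formed XML, and the sample dictionary and the symbol names in it are made up.

```python
import xml.etree.ElementTree as ET

# Made-up example of a dictionary with stray text among the sdefs.
BROKEN = """<dictionary>
  <sdefs>
    <sdef n="n"/>
    SOME TEXT HERE
    <sdef n="adj"/>
  </sdefs>
</dictionary>"""


def stray_text_in_sdefs(dix_xml):
    """Return any non-whitespace text found directly inside <sdefs>."""
    root = ET.fromstring(dix_xml)
    found = []
    for sdefs in root.iter("sdefs"):
        # Text can sit before the first child or after any child (its tail).
        chunks = [sdefs.text] + [child.tail for child in sdefs]
        found += [c.strip() for c in chunks if c and c.strip()]
    return found
```

For a real file, read it with open(...).read() and pass the contents in; an empty list means the <sdefs> section contains only elements.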




[Apertium-stuff] osx packages now require macos 10.14 or higher

2019-11-25 Thread Kevin Brubeck Unhammer
If you're using Tino Didriksen's nightly/release packages on Mac, you'll
need macos 10.14 or higher to run the newest ones. Otherwise you'll get 
errors like

dyld: lazy symbol binding failed: Symbol not found: chkstk_darwin
  Referenced from: /usr/local/bin/../lib/libpcre.1.dylib (which was built for Mac OS X 10.15)
  Expected in: /usr/lib/libSystem.B.dylib

(But you may want to hold off on updating all the way to 10.15, which
apparently removes support for 32-bit programs and other things.)
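
For anyone scripting installs, a rough sketch of the version check in Python. The 10.14 minimum is from this announcement; the function name and the messages are just illustrative assumptions:

```python
import platform

MINIMUM = (10, 14)  # minimum macOS version named in the announcement


def meets_minimum(version, minimum=MINIMUM):
    """Compare the major.minor part of a version string such as '10.13.6'."""
    major_minor = tuple(int(part) for part in version.split(".")[:2])
    return major_minor >= minimum


current = platform.mac_ver()[0]  # empty string when not running on macOS
if current:
    status = "ok" if meets_minimum(current) else "too old for the newest packages"
    print(current, status)
```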








[Apertium-stuff] Released: nno-nob 1.3.0, swe-dan 0.8.1, swe-nor 0.3.1, dan-nor 1.4.1

2019-11-29 Thread Kevin Brubeck Unhammer
Hi,

I've just tagged new versions for release of the Scandinavian pairs,
they should be heading up to apertium.org and Github soon.

As before[1], the work comes courtesy of Nynorsk pressekontor / NPK and
the Norwegian News Agency / NTB, with funding from the Norwegian
Ministry of Culture; this fall they also hired Anja[2] to help out with
nno-nob. NPK has been using apertium-nno-nob successfully for over a
year now in order to create more Nynorsk news content.[3]

Some changes since last March in nno-nob:
- ~600 new names and more than 2000 new non-names added to bidix
- 270 new lrx rules (and we fixed an lrx-proc bug that would sometimes
  let the wrong rule apply)
- 37 new transfer rules, including better handling of coordinations,
  genitives and passives
- corpus-generated bigram-rules for choice of preposition when rewriting
  genitives to prepositional phrases
- compounding on digits
- many fixed expressions added
- many compound epenthetics fixed, partly automatically from corpus
  analyses
- support for using headline markup in disambiguation (if
  apertium-deshtml uses the -o switch)
- more consistent upper/lower-case handling (required a fix[4] to
  cg-proc) 
- lots more work on Bokmål disambiguation (which of course helps any
  pair translating from nob), including some frequency-based fallback
  rules generated from corpus. The rlx file is about 2500 lines longer …
  and split into two in order to do some sentence segmentation first.

In the previous release we had a median WER just below 7; now it is below 4
(median of 1898 WER tests on 1898 NTB news articles is 3.77 when
comparing post-edits to their inputs; stddev 4.73).
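
For reference, a minimal sketch of how such a number can be computed. This is not the script actually used for the figures above; it assumes WER is the word-level edit distance between the MT output and its post-edit, divided by the length of the post-edit and given as a percentage. The function names and the toy sentences are made up:

```python
from statistics import median


def word_edit_distance(hyp, ref):
    """Levenshtein distance over word lists (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        cur = [i]
        for j, ref_word in enumerate(ref, start=1):
            cur.append(min(
                prev[j] + 1,                           # delete hyp_word
                cur[j - 1] + 1,                        # insert ref_word
                prev[j - 1] + (hyp_word != ref_word),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def wer(mt_output, post_edit):
    """Word error rate in percent, with the post-edit as reference."""
    hyp, ref = mt_output.split(), post_edit.split()
    if not ref:
        raise ValueError("empty reference")
    return 100.0 * word_edit_distance(hyp, ref) / len(ref)


# Toy (MT output, post-edit) pairs, purely illustrative.
pairs = [("a b c d", "a b c d"), ("a b c d", "a x c d"), ("a b", "a c")]
scores = [wer(mt, pe) for mt, pe in pairs]
print(median(scores))
```

Taking the median rather than the mean keeps a few very bad articles from dominating the figure.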

The other Scandinavian pairs and monolingual dependencies have gotten
maintenance releases. There aren't many changes there, though all have
some new words, and passives should behave a bit better in nor→dan.


-Kevin

[1] https://sourceforge.net/p/apertium/mailman/message/36609798/
[2] https://github.com/anjazp
[3] https://journalisten.no/karoline-riise-kristiansen-martin-eide-npk/jeg-opplever-at-det-er-gode-vilkar-for-nynorsk-om-dagen/382345
[4] https://github.com/TinoDidriksen/cg3/commit/492ecebff80d2bbc68742d01e9cba1c1891d2121





Re: [Apertium-stuff] macOS Package Download Helper

2019-12-10 Thread Kevin Brubeck Unhammer
Nice =D

I put some pointers at
http://wiki.apertium.org/wiki/Apertium_on_Mac_OS_X#Language_data_packages

Tino Didriksen wrote:

> Two new scripts:
> https://apertium.projectjj.com/osx/install-nightly-data.sh
> https://apertium.projectjj.com/osx/install-release-data.sh
>
> Run as e.g.:
> ./install-nightly-data.sh apertium-eng-deu
>
> After which one can run the translation as
> echo 'Hello world' | apertium -d /usr/local/share/apertium eng-deu
>
> Or if one has used install-nightly.sh to get the tools in the same prefix,
> then simply:
> echo 'Hello world' | apertium eng-deu
>
> The scripts can only fetch what's there, which one can see via:
> https://apertium.projectjj.com/osx/nightly/data.php
> https://apertium.projectjj.com/osx/release/data.php
> https://apertium.projectjj.com/pkgs.php (machine readable)
>
> These helpers only fetch data - the normal install-nightly.sh (-release)
> script is needed to fetch tools.
>
> -- Tino Didriksen
>





Re: [Apertium-stuff] Separate Corpus Repos

2019-12-12 Thread Kevin Brubeck Unhammer
Tino Didriksen wrote:

> I would like for corpus and other indirect data to go in separate
> repositories. Basically, if the data is not used during the build, it
> should go elsewhere.

What if it's used during `make test`?

By the same argument, should we remove scripts that are used during
development, but not required for build (stuff that is kept in the dev/
subfolder)? If we get too strict on the requirement of "only things
necessary for build", people may start just not checking in useful
scripts, which to me seems worse. And it's already quite annoying having
to check out three repos just to work on one language pair; if
development depends on corpora repos, you have not just three, but *six*
places where you can forget to git push, or where you have to compare
git logs to review changes.

> We need corpus data under Apertium's control so that we don't rely on 3rd
> parties. However, bundling this data in the languages' and pairs' repos
> means that those repos grow unbounded, especially when the data is changed.

I agree that "big" data shouldn't be in the regular repos, since it
slows down checking them out. But less than a few megabytes of text
won't make much difference to a repo with tens of MB's of .dix entries.

> It also messes up the changelog. I use a script to generate AUTHORS from
> the changelog, because nobody keeps that up to date. But this gets muddied
> when unnecessary data is in the repo.

In general I would want to include annotators as authors, though I can
imagine situations where it's not clear-cut, e.g. where the dataset is
too large or is not quite relevant for developing the rest of the repo.

I think having corpus-xxx and corpus-xxx-yyy repos could be a good
thing, but I don't think we should have a hard requirement of moving
data over there, especially if the data is useful during testing and
development. I do think it makes sense to move larger corpora out, for
faster cloning.


-Kevin




Re: [Apertium-stuff] Separate Corpus Repos

2019-12-16 Thread Kevin Brubeck Unhammer
Nick Howell wrote:

>> I'm not sure if corpora-xxx in the github is the right way to go though.
>>
>> I think it would be better to store them on a web server and either:
>>
>> 1) Have apertium-xxx/text that has a script that will download the corpus
>> from the server and a gitignore to not have it in the repo.
>> 2) Use something like git-annex (this is bit more involved)
>
> git-annex is essentially designed for exactly our use-case. Github and
> Gitlab natively speak a protocol called "git LFS" which git-annex supports.
> So I would be highly supportive of moving in that direction.
>
> I would be happy to help put together a proposal for what that would look
> like, but probably not before the end of the month. Potential problems I
> can see with such a plan are:
> - git-annex has a heavy build dependency set (Haskell)

As with git itself, I hope we don't expect people to build it? :)

> - git-annex depends on stable hashes of the corpus data
> - git-annex packages can be out-of-date outside of debian

Ubuntu 19.10: 7.20190912-1

Fedora 30: 7.20191114

CentOS 7: 5.20140221 (so probably newer than most CentOS software)

OS X: dmg's "autobuilt" from every git version if I read
https://git-annex.branchable.com/install/OSX/ correctly

Windows: "beta" says https://git-annex.branchable.com/install/Windows/
but there are pre-built packages.

That doesn't seem too bad? 


-Kevin




Re: [Apertium-stuff] New site design

2019-12-18 Thread Kevin Brubeck Unhammer
Scoop Gracie wrote:

> Please check out https://github.com/scoopgracie/apertium-site/! It is my
> new design for an Apertium site. It's a Gatsby  app
> based on React. I haven't documented it yet, but a look at
> https://www.gatsbyjs.org/docs/ should show the basics. Thanks, and please
> submit PRs!

Is it live anywhere for easy testing?




Re: [Apertium-stuff] Translators on www.apertium.org

2010-04-19 Thread Kevin Brubeck Unhammer
2010/4/19 Felipe Sánchez Martínez :
>
> Hi all,
>
> Prompsit did not go down, why? because the language pairs offered there
> are stable and tested.
>
> I would like to rise a question. Should we offer the translation between
> developing language pairs at the webpage? IMHO we shouldn't.

But what's the measure? Released pairs? The ones at apertium.org have
all had a release. All language pairs that have reached version 1.0?
That's rather arbitrary… All that have had a thorough testvoc?  All
released pairs _should_ have this. Should one simply let, say, half a
year pass before putting a release on the server, to collect bug
reports? How many people actually download the language packages, run
lots of text through them, and then _report the bugs_?

It seems to me like a better solution is to use ScaleMT, and perhaps
let those language pairs that we, for whatever reason, consider too
untested run on a different server. Unless I completely misunderstood
Victor's presentation last fall, using ScaleMT it should be possible
to keep the web page going even though one server goes down (do you
even need ScaleMT to do that?). Thus developers can get quick feedback
on what's wrong (oh, and apertium.org gets to offer more language
pairs). Of course, this assumes that there is the possibility of
having yet another server…


best regards,
Kevin Brubeck Unhammer



Re: [Apertium-stuff] issues with apertium service ?

2010-04-20 Thread Kevin Brubeck Unhammer
2010/4/20 Francis Tyers :
> Friedel has noticed a change in the listPairs method, it doesn't seem to
> list pairs, is he doing anything wrong ?
>
>  spectie: Hi.
>  spectie: Aware of any issues with your service at the moment?
>  curl http://api.apertium.org/json/listPairs
> 
> {"responseData":[],"responseDetails":null,"responseStatus":200}

Is api.apertium.org running apertium-service? (there it's
languagePairs, not listPairs) I can't test from my IP ;-)

-Kevin Unhammer



Re: [Apertium-stuff] how to group categories in monodix

2010-04-23 Thread Kevin Brubeck Unhammer
2010/4/23 sriram :
> Hi,
>
> I want to group two categories into one in monodix.
>
> for e.g
>
> 
> 
>
> into one . Is there any methods to do it in monodix.
>
> I can group the categories in t1x file
>
> for e.g . I can group different kind of verbs  into one "verb".
>
>   
>     
>     
>     
>     
>     
>
>
> Reason: In Hindi many nouns behave as adjectives. Hence to provide both
> "adjective" and "noun" analysis for these words , we can assign them a
> new-category "adj-noun" or something like this.
>
> Thanks,
> Sriram

You can have pardefs that refer to pardefs, if that helps… but could
you give an example?


If I were to group these for Norwegian, where e.g. the lemma "norsk" can
be both adjective and noun (meaning "Norwegian"), I could either
use one big pardef, like this:



 are   
   
   
 e 
 e 
 aste  
 ast   

 ane   
 ar
 en
   

   norsk


Or I could use a pardef that refers to two other pardefs, like this:



 are   
   
   
 e 
 e 
 aste  
 ast   


 ane   
 ar
 en
   


       
   

   norsk


(With the latter method you might save some lines in your dix?)
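
To make the second method concrete, here is a small Python sketch of what a pardef that refers to other pardefs amounts to: the analyses of the referenced paradigms are pooled. It is a toy model, not lttoolbox code. The paradigm names, the tag names and the reduced set of endings are all hypothetical, and it only models the case where each reference stands in its own entry:

```python
# Each entry is ("form", suffix, tags) or ("par", name_of_other_paradigm).
# All names and tags below are illustrative assumptions.
PARDEFS = {
    "norsk__adj": [
        ("form", "", "<adj><pos>"),
        ("form", "are", "<adj><comp>"),
        ("form", "aste", "<adj><sup>"),
    ],
    "norsk__n": [
        ("form", "", "<n><sg><ind>"),
        ("form", "en", "<n><sg><def>"),
        ("form", "ar", "<n><pl><ind>"),
        ("form", "ane", "<n><pl><def>"),
    ],
    # The combined paradigm has no forms of its own, only references.
    "norsk__adj_n": [
        ("par", "norsk__adj"),
        ("par", "norsk__n"),
    ],
}


def expand(pardefs, name):
    """Flatten a paradigm into (suffix, tags) pairs, following references."""
    forms = []
    for entry in pardefs[name]:
        if entry[0] == "par":
            forms.extend(expand(pardefs, entry[1]))
        else:
            _, suffix, tags = entry
            forms.append((suffix, tags))
    return forms


for suffix, tags in expand(PARDEFS, "norsk__adj_n"):
    print("norsk" + suffix, "->", "norsk" + tags)
```

The lemma entry then only needs to point at the combined paradigm, and the adjective and noun paradigms stay reusable for words that are only one or the other.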


hope this helped,
Kevin Brubeck Unhammer



[Apertium-stuff] bug leading to hang in lt-proc

2010-04-23 Thread Kevin Brubeck Unhammer
Hi,

I thought I should mention this bug (or, these bugs?)

http://bugs.apertium.org/cgi-bin/bugzilla/show_bug.cgi?id=104

on the mailing list since it can introduce a hang in lt-proc.


I came across it while messing around with an XSLT to remove certain
's from pardefs.

The first issue reported there is odd but not critical; you can get
spurious analyses if you have a pardef calling two pardefs, where the
last one has only a single, empty .

The second issue, however, can lead to hangs. Neither lt-proc nor
lt-comp nor validation reports anything here (so watch out if you have
empty elements like this).


best regards,
Kevin Brubeck Unhammer



Re: [Apertium-stuff] Request for comments/testing: Rule exceptions.

2010-07-11 Thread Kevin Brubeck Unhammer
2010/7/11 Jimmy O'Regan :
> The attached patch adds a new mechanism to transfer rules: 

This has been on my wishlist for a while =D

> Exception can contain a single  -- if the test evaluates to
> 'true', the current rule is ignored, and the last applicable rule is
> used instead (the implication being that it should only be used in
> rules whose  contains more than one ).
>
[snip]
> Motivation:
>
> The primary motivation was in dealing with Polish: highly inflected
> (few 'markers'), adjectives can come before or after the noun.
> Inflection *usually* gives enough information for proper segmentation,
> but handling it properly would be a matter of having individual rules
> for each gender, case, and number + each combination of words (i.e.,
> multiply number of NP rules by 70). I've seen recently that it would
> help in less inflected languages, so it's probably generally useful.

I just tested it for nb->nn, where I used it to avoid chunking adj.ind
n.def (the adjective is used adverbially, not modifying the noun),
which in some cases can be quite important:

Before:
$ echo Ledelsen liker dårlig fokuset på utøvere som Tommy Ingebrigtsen | apertium -d . nb-nn
Leiinga likar det dårlege fokuset på utøvarar som Tommy Ingebrigtsen
≈ The management likes the bad focus on athletes such as Tommy Ingebrigtsen

After, correct meaning:
$ echo Ledelsen liker dårlig fokuset på utøvere som Tommy Ingebrigtsen | apertium -d . nb-nn
Leiinga likar dårleg fokuset på utøvarar som Tommy Ingebrigtsen
≈ The management doesn't like ("likes badly") the focus on athletes
such as Tommy Ingebrigtsen


Of course, one can always achieve the same as <exception> by using
 and duplicating the contents of the single-item rules,
but, well, that means duplicating content… this looks like it would be
a lot simpler to maintain (and less ugly than output macros).
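
As I understand the proposal, the selection logic is roughly the following. This is a simplified Python sketch, not the patch itself: it treats "the last applicable rule" as the next-longest rule that matches at the same position, and the rule names, the tag strings and the dictionary layout are assumptions made up for the example:

```python
# Words are represented by tag strings such as "adj.ind"; a pattern item
# matches on the part before the first dot. All of this is illustrative.
RULES = [
    {"name": "adj", "pattern": ["adj"]},
    {
        "name": "adj_n",
        "pattern": ["adj", "n"],
        # Do not chunk an indefinite adjective with a definite noun.
        "exception": lambda ws: ws[0] == "adj.ind" and ws[1] == "n.def",
    },
]


def matches(pattern, words):
    """True if the first len(pattern) words fit the pattern items."""
    return len(words) >= len(pattern) and all(
        word.split(".")[0] == item for item, word in zip(pattern, words)
    )


def choose_rule(rules, words):
    """Longest match wins, unless its exception test evaluates to true."""
    applicable = [rule for rule in rules if matches(rule["pattern"], words)]
    for rule in sorted(applicable, key=lambda r: len(r["pattern"]), reverse=True):
        exception = rule.get("exception")
        if exception is None or not exception(words):
            return rule["name"]
    return None


print(choose_rule(RULES, ["adj.ind", "n.def"]))  # falls back to the shorter rule
print(choose_rule(RULES, ["adj.def", "n.def"]))  # the two-word rule applies
```

The point is that the fallback rule is written once and reused, instead of copying its contents into the longer rule.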


I'm still trying to make it break :)


--
Kevin Brubeck Unhammer



Re: [Apertium-stuff] Request for comments/testing: Rule exceptions.

2010-07-11 Thread Kevin Brubeck Unhammer
2010/7/11 Jacob Nordfalk :
>
>
> 2010/7/11 Jimmy O'Regan 
>>
>> On 11 July 2010 22:22, Jacob Nordfalk  wrote:
>> >
>> >
>> > 2010/7/11 Jimmy O'Regan 
>> >
>> >>
>> >>  
>> >>    
>> >>      
>> >>        
>> >>        
>> >>      
>> >>      
>> >>        
>> >>          
>> >>            
>> >>              
>> >>              
>> >>            
>> >>          
>> >>        
>> >
>> > Now, I do understand why you chose that way of writing it (its the
>> > easiest
>> > way to implement), but if we adopt  I it would make more
>> > sense to
>> > make the exception a part of the  element, like this:
>>
>> Pattern exceptions is what LanguageTool uses, and there's scope for
>> *also* having that, but I'm interested in runtime-based exceptions
>> that have access to all matched words, to check for agreement.
>
> I think we should stick to runtime exceptions, and not consider pattern
> exceptions at all.
> My point is that for transfer rule developers it would make much more sense
> to percieve it as 'an exception to the pattern'. Now it looks like 'an
> exception to an action', which doesent make too much sense.

You could call it "continue" instead? (Or would that be even more confusing…)

I can imagine scenarios where having it in the action would make the
rule a bit more compact, eg.

…


--
Kevin Brubeck Unhammer



Re: [Apertium-stuff] Language Pair Roundup: fr-pt (GSOC)

2010-07-17 Thread Kevin Brubeck Unhammer
2010/7/17 Sean Healy :
> On 2010-07-16 18:06, Francis Tyers wrote:
>
>>> * What translation errors does it have ?
>>>       The biggest problem in general seems to be the adding of pronouns
>>> to any verb forms in the pt-fr direction. In many instances, the fr
>>> already has a subject, and does not need the pronoun added. Another
>>> large issue is the differing usage of articles between the two languages.
>>> * What things can be fixed, and how ?
>>>       These two problems can most likely be fixed in the transfer rules.
>>
>> Any ideas for what this might look like, describing in words or
>> pseudo-code.
>
> Without having looked at how the rules function at the moment, I'm
> thinking I'd check for nouns that precede 3rd-person verbs in the
> sentence. In most cases, a noun phrase preceding a 3rd-person verb will
> be its subject. Of course, I'd also have to check whether those nouns
> are part of a prepositional phrase, in which case they would not be a
> subject.
>
> The most glaring issue with articles in the example paragraph is that
> the pt paragraph opens with a noun that has no article; the fr wants an
> article. So I'd articles to any noun that's the first word in the
> sentence (and doesn't already have one).
>
>>
>>> * What things probably can't be easily fixed and why ?
>>>       There will probably always be incorrect word choices made by the
>>> system, because even in similar languages such as these two, words will
>>> not transfer one-to-one, and sometimes the system will pick the wrong
>>> word. Altering it to pick the right word in one situation will cause it
>>> to pick the wrong word in another, so the best we can do is reduce the
>>> error rate.
>>
>> How about this difference:
>>
>>   é importante ->  c'est important
>>   é demasiado tarde ->  c'est trop tard
>>
>> Any thoughts on how to deal with that ?
>
> I'm tempted to simply translate "é" at the beginning of a sentence as
> "c'est"; in most cases this will probably work. I'm not sure how I'd
> check whether "é" needed a real subject (in which case it would be "il"
> or "elle", depending on the referent) versus whether it needs a dummy
> subject ("ce").
>
> What kind of referent-checking does apertium do?

Aren't those dummy subject constructions dependent on the main
predicate? Like "it's raining", "it's difficult to explain", the
reason you need "it" is because of "raining" and "difficult to
explain". In sme-nob we tag predicates like that with "impers", and
check for that tag to avoid inserting animate subjects in these
constructions. (Of course then you have to match both eg. the finite
verb and the main verb in transfer.)

--
Kevin Brubeck Unhammer



Re: [Apertium-stuff] Request for comments/testing: Rule exceptions.

2010-07-24 Thread Kevin Brubeck Unhammer
2010/7/11 Kevin Brubeck Unhammer :
> 2010/7/11 Jimmy O'Regan :
>> The attached patch adds a new mechanism to transfer rules: 
>
> This has been on my wishlist for a while =D
>
>> Exception can contain a single  -- if the test evaluates to
>> 'true', the current rule is ignored, and the last applicable rule is
>> used instead (the implication being that it should only be used in
>> rules whose  contains more than one ).
>>
> [snip]
>> Motivation:
>>
>> The primary motivation was in dealing with Polish: highly inflected
>> (few 'markers'), adjectives can come before or after the noun.
>> Inflection *usually* gives enough information for proper segmentation,
>> but handling it properly would be a matter of having individual rules
>> for each gender, case, and number + each combination of words (i.e.,
>> multiply number of NP rules by 70). I've seen recently that it would
>> help in less inflected languages, so it's probably generally useful.
>
> I just tested it for nb->nn, where I used it to avoid chunking adj.ind
> n.def (the adjective is used adverbially, not modifying the noun),
> which in some cases can be quite important:
>
> Before:
> $ echo Ledelsen liker dårlig fokuset på utøvere som Tommy
> Ingebrigtsen|apertium -d . nb-nn
> Leiinga likar det dårlege fokuset på utøvarar som Tommy Ingebrigtsen
> ≈ The management likes the bad focus on athletes such as Tommy Ingebrigtsen
>
> After, correct meaning:
> $ echo Ledelsen liker dårlig fokuset på utøvere som Tommy
> Ingebrigtsen|apertium -d . nb-nn
> Leiinga likar dårleg fokuset på utøvarar som Tommy Ingebrigtsen
> ≈ The management doesn't like ("likes badly") the focus on athletes
> such as Tommy Ingebrigtsen
>
>
> Of course, one can always acheive the same as  by using
>  and duplicating the contents of the single-item rules,
> but, well, that means duplicating content… this looks like it would be
> a lot simpler to maintain (and less ugly than output macros).

I've noticed a lot more rules that all could do with this <exception>,
at least a fifth of the sme-nob chunking rules have possibilities for
mis-chunking (eg. det.loc + n.ill should not be chunked, but most
other cases of det and n should be chunked), the same for all the
conjunction rules (in the first interchunk) that merge two chunks.

In the above example I could have just added extra almost-identical
rules to cover all the patterns (involving a lot of redundancy), but
if the exception depends on target-language information even that
wouldn't do it. Eg. most verbs both in Bokmål and Sámi have adjective
forms, so we allow Sámi  to enter into ADJ NOM rules. But some
Sámi verbs translate to a certain class of Bokmål verbs (lexicalised
passives) that don't have adj forms, these get the tag  in
bidix, but we can't know that from the ; here the <exception>
would be great.


-Kevin



Re: [Apertium-stuff] Request for comments/testing: Rule exceptions.

2010-07-24 Thread Kevin Brubeck Unhammer
2010/7/24 Jimmy O'Regan :
> On 24 July 2010 13:15, Kevin Brubeck Unhammer  wrote:
>> I've noticed a lot more rules that all could do with this ,
>> at least a fifth of the sme-nob chunking rules have possibilities for
>> mis-chunking (eg. det.loc + n.ill should not be chunked, but most
>> other cases of det and n should be chunked), the same for all the
>> conjunction rules (in the first interchunk) that merge two chunks.
>>
>
> Not exactly sure what you're saying here.
>
> The main point of this is not to throw lookahead all over place; but
> to stop a pattern from 'stealing' from a real chunk; if there's
> nothing that could follow n.ill that would form a proper chunk, then
> you don't need this, just check as normal and output two chunks from
> the existing rule - probably what you're already doing.

In the same way, in your example with ADJ N ADJ you could have one
rule matching the full three-part thing and outputting either {ADJ N}
ADJ or ADJ {N ADJ}, or even three chunks -- you just need another rule
ADJ N ADJ N for when the second adj modifies a noun.

My DET NOMCMP NOM rule can give one chunk on seeing any of

  
  
  
  
  
  
  

etc., or two chunks on seeing
  

(perhaps also with other combinations that I haven't discovered yet)

and if the last noun is also a compound part, I just need a DET NOMCMP
NOMCMP NOM rule too, which can output either one or two chunks.

It's always possible to fix things with more redundancy. I was just
trying to make a point that this <exception> could lead to much more
maintainable transfer rules.


>> In the above example I could have just added extra almost-identical
>> rules to cover all the patterns (involving a lot of redundancy), but
>> if the exception depends on target-language information even that
>> wouldn't do it. Eg. most verbs both in Bokmål and Sámi have adjective
>> forms, so we allow Sámi  to enter into ADJ NOM rules. But some
>> Sámi verbs translate to a certain class of Bokmål verbs (lexicalised
>> passives) that don't have adj forms, these get the tag  in
>> bidix, but we can't know that from the ; here the 
>> would be great.
>
> It sounds to me like you're just not being precise enough in the
> pattern items, but the whole area of derivational morphology bores me
> to sleep, so maybe I missed something.

Using pattern-items here would mean either adding all verb lemmas
apart from some hundred from the sme dictionary into a pattern-item,
or adding tags to all these in the _sme_ dictionary which record what
they will turn into in bidix. I wouldn't tag nouns in English with
what their gender is in Spanish, that's a bidix job.


--
Kevin Brubeck Unhammer



[Apertium-stuff] strange goings on in post-generation

2010-07-28 Thread Kevin Brubeck Unhammer
Hi,

I tried making a post-generation dictionary with just one rule




  
  

  
  
  

  e
  e

  
  



but I get slashed output when I try running it:


$ echo '~el' | lt-proc -p foo.autopgen.bin
e\/el


is this a bug or am I missing something?


--
Kevin Brubeck Unhammer



Re: [Apertium-stuff] strange goings on in post-generation

2010-07-29 Thread Kevin Brubeck Unhammer
2010/7/29 Francis Tyers :
> El dc 28 de 07 de 2010 a les 16:54 +0200, en/na Kevin Brubeck Unhammer
> va escriure:
>> Hi,
>>
>> I tried making a post-generation dictionary with just one rule
>>
>>
>> 
>> 
>>   
>>   
>>     
>>   
>>   
>>       
>>         
>>           e
>>           e
>>         
>>       
>>   
>> 
>>
>>
>> but I get slashed output  when I try running it:
>>
>>
>> $ echo '~el' | lt-proc -p foo.autopgen.bin
>> e\/el
>>
>>
>> is this a bug or am I missing something?
>
> The problem is you need more than one character (for some reason), so
> for example this dictionary would work:
>
> 
> 
>  
>  
>    
>  
>  
>    
>      ll
>    
>  
>  
>      
>        
>          el
>          e
>        
>        
>      
>  
> 
>
> $ echo "~ell" | lt-proc -p /tmp/foo.bin
> el

OK so the workaround if you want to match a single character is using
.:




  
  

  
  
  

  e
  e

.
  
  




$ lt-comp lr foo.dix foo.bin
m...@standard 4 3

$  echo '~e' | lt-proc -p foo.bin
e



--
Kevin Brubeck Unhammer



Re: [Apertium-stuff] Interchunk t2x specification

2010-08-02 Thread Kevin Brubeck Unhammer
2010/8/2 Stephen Tigner :
> Okay, I have a question, as I haven't had much of any luck finding it
> on the wiki. Is there anywhere where the the format of the t2x file
> used by interchunk is explained? Especially the differences between it
> and the t1x files used by transfer?
>
> I'm specifically looking for information on the tags, their
> attributes, and what they mean.
>
> Thanks to anyone who can point me in the right direction. ^^

http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/apertium/apertium/interchunk.dtd
is nice as a starting point…


-Kevin



[Apertium-stuff] Arch Linux PKGBUILD's now available for (almost) all released language pairs

2010-09-15 Thread Kevin Brubeck Unhammer
Hi,

I just wanted to let people know that I've uploaded AUR packages for
Arch Linux for all released language pairs in Apertium (except for
is-en, which seems to require a newer version of apertium-pretransfer
than what is in apertium-3.1.1). If anyone's running Arch Linux, I'd
be very happy if they could give them a try and let me know where the
bugs are hiding :-)


Also, in making the packages I discovered some problems with certain
pairs; I had to apply the following patches to make these pairs
compile:
apertium-es-ro:
http://aur.archlinux.org/packages/apertium-es-ro/apertium-es-ro/trules.patch
apertium-oc-ca:
http://aur.archlinux.org/packages/apertium-oc-ca/apertium-oc-ca/t1x.patch
apertium-oc-es:
http://aur.archlinux.org/packages/apertium-oc-es/apertium-oc-es/oc-es.t1x.patch
http://aur.archlinux.org/packages/apertium-oc-es/apertium-oc-es/es-oc.t1x.patch
...I'm not sure I got the logic here as intended, these should
probably have a "maintenance release".


For all the pairs, I had to modify the Makefile.am in this manner:

-   $(INSTALL_DATA) $(BASENAME).$(PREFIX2).t1x $(apertium_nn_nbdir)
+   $(INSTALL_DATA) $(BASENAME).$(PREFIX2).t1x $(DESTDIR)$(apertium_nn_nbdir)

I think $(DESTDIR) could be in the svn Makefile.am's without causing
any trouble, seems to not make a difference except when creating these
packages.
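
To illustrate why this should be harmless, here is a tiny Python sketch of how the install target ends up being composed. It relies on make expanding $(DESTDIR)$(dir) by plain string concatenation; the directory and file names below are hypothetical stand-ins for what the variables would hold:

```python
import posixpath


def staged_path(destdir, install_dir, filename):
    """Mimic $(DESTDIR)$(dir)/file: DESTDIR is simply prepended."""
    return posixpath.join(destdir + install_dir, filename)


# Hypothetical values for $(apertium_nn_nbdir) and the .t1x file name.
install_dir = "/usr/share/apertium/apertium-nn-nb"
filename = "apertium-nn-nb.nn-nb.t1x"

# Ordinary "make install": DESTDIR is empty, so the path is unchanged.
print(staged_path("", install_dir, filename))

# Packaging: files are staged under a temporary root first.
print(staged_path("/tmp/pkg", install_dir, filename))
```

With an empty DESTDIR the result is identical to the old rule, which is why only package builds notice the difference.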


best regards,
Kevin B. Unhammer



Re: [Apertium-stuff] maintenance release of lttoolbox and Apertium

2010-09-21 Thread Kevin Brubeck Unhammer
2010/9/21 Francis Tyers :
> For some reason, we had versioned Apertium and lttoolbox as 3.2 in SVN,
> but never got around to making a 3.2 release. There have been some minor
> bugfixes and improvements -- fixing an issue in pretransfer and updating
> the DTDs, and I think it is worth making a 3.2 release -- not least
> because Unhammer wants to release apertium-nn-nb 0.7.0 ;)

Thanks =D
Also, the code for  was added to interchunk/postchunk.cc (it
was in the DTD's but not in the code).


-Kevin



Re: [Apertium-stuff] apertium-es-an 0.1.0 released

2010-09-26 Thread Kevin Brubeck Unhammer
2010/9/26 Jimmy O'Regan :
> Sun Sep 26 15:42:44 IST 2010
>  * Initial release (0.1.0)
>  * Caveats:
>   - Functions only in an->es direction
>   - Several closed category words missing from an analyser
>     (including "ir")
>   - "Cowboys, Ted!"
>     This system has been put together in a very shoddy, MacGyver-ish
>     way:
>     The majority of the lexicon has been composed on the basis
>     of presumed cognates. For the most part, this has been
>     restricted to Latin derivatives, but on more than one occasion, I
>     simply went nuts and pulled in anything the Spanish analyser would
>     recognise.
>     The only bitexts available were the UN Declaration of Human Rights
>     and the welcome message for new users of the Aragonese Wikipedia.
>     Statistical methods were not widely employed.
>     To deal with the spelling variations, I abused the heck out of sed,
>     filtering unknowns repeatedly before passing the result through the
>     analyser, to pluck out the results. Much of the ~8000 words in the
>     bilingual lexicon are mere variations. (In a particularly ironic
>     twist, it has 3 variations of 'normalización'). These variants will
>     need to be sorted out to have es->an: the first translation made with
>     this system before release was of the document on an.wikipedia
>     describing the new spelling rules.
>     Although I got some notes from Juan Pablo Martínez on the equivalents
>     of ser and estar, I was not able to get further information. My
>     "solution" is to ignore the issue and come back to it later.
>     Also, Juan Pablo added some vocabulary to the analyser, most of which
>     I have not been able to use for lack of translations. Hopefully, we
>     can get these reinstated soon.
>     A tagger has yet to be trained for Aragonese; during development, I found
>     the Spanish tagger to be sufficient, and so have used that. This is a
>     temporary measure.
>
>
> The release is a little premature, perhaps, but I want a release to
> mark the European Day of Languages. It's not bad for approximately 3
> weeks' work :)
>

Congrats!

And happy Language Day, all :-)

-Kevin



[Apertium-stuff] Drupal module?

2010-09-27 Thread Kevin Brubeck Unhammer
Hi,

I got a question about whether there are any Drupal modules for
Apertium, for making a translated copy of a page (which may then be
post-edited); does anyone know of anything?


best regards,
Kevin B. Unhammer



Re: [Apertium-stuff] Bug in apertium-en-es

2010-10-06 Thread Kevin Brubeck Unhammer
2010/10/6 Miquel Esplà :
> Hi everybody,
> I've found a problem with the version of apertium-en-es in the
> SVN (https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-en-es)
> at revision 25956. It happens that, when I try to translate a text in
> English with a $ symbol, it disappears in the translation. I've tried to
> translate a file with the single sentence
> hello $ world
> and the result is:
> hello world.
> When I tried the translation from Spanish to English it worked, but for
> English to Spanish it fails.
> I am using lttoolbox-3.2.0 and apertium-3.2.0 and the version of
> apertium-en-es in the SVN.
> Can anybody help, please? Cheers,
> Miquel.

If you add $ to the , it will work (and $ will be marked
unknown if you don't use -u). But I'm not sure if this causes other
problems?

best regards,
Kevin Brubeck Unhammer



Re: [Apertium-stuff] News from the mentor summit about GCI -- make tasks specific/Taskset 1: crossdics

2010-10-27 Thread Kevin Brubeck Unhammer
2010/10/27 Jacob Nordfalk :
> I've looked at http://wiki.apertium.org/wiki/Ideas_for_Google_Code-in
> Do you really think anyone can:
> 1) translate a text of 34,268 bytes (the new language pair HOWTO) into
> another language
> 2)  go through it for a new pair of languages.
> 3) When finished, upload to the Incubator.
> in 2-3 HOURS!??!??
> Well, I've tried that task when I started out. I might be extraordinarily
> slow but just doing step 1) would take me at least half a day for Esperanto.
> Same goes for the other proposals: These <=18 age students must be really
> bright, but in general I would multiply all your estimates by a
> factor >3.
>
> Here is a proposal for what I would consider a realistic task for GCI:
> Add 50 nouns to apertium-sv-da. Check that the words work in both
> directions (from Swedish to Danish and from Danish to Swedish).
> Time: 14 hours
> (install & compile: 4 hours. Understand the format of the 3 .dix files to
> edit: 2 hours. Adding the words: 4 hours. Checking translation in both
> directions and fix problems: 4 hours).

The time estimates do seem rather low yes. However, I think they're
supposed to reflect only the work that's on that specific task (since
students can work on several tasks, so they won't install apertium for
each task...)

The wiki page also does say "The time column gives the minimum
estimated amount of time that should be spent on the task. It does not
include time taken to install / set up apertium." (now boldfaced, as I
missed it the first time too)



-Kevin



Re: [Apertium-stuff] News from the mentor summit about GCI -- make tasks specific/Taskset 1: crossdics

2010-10-27 Thread Kevin Brubeck Unhammer
2010/10/27 Kevin Brubeck Unhammer :
> 2010/10/27 Jacob Nordfalk :
>> I've looked at http://wiki.apertium.org/wiki/Ideas_for_Google_Code-in
>> Do you really think anyone can:
>> 1) translate a text of 34,268 bytes (the new language pair HOWTO) into
>> another language
>> 2)  go through it for a new pair of languages.
>> 3) When finished, upload to the Incubator.
>> in 2-3 HOURS!??!??
>> Well, I've tried that task when I started out. I might be extraordinarily
>> slow but just doing step 1) would take me at least half a day for Esperanto.
>> Same goes for the other proposals: These <=18 age students must be really
>> bright, but in general I would multiply all your estimates by a
>> factor >3.
>>
>> Here is a proposal for what I would consider a realistic task for GCI:
>> Add 50 nouns to apertium-sv-da. Check that the words work in both
>> directions (from Swedish to Danish and from Danish to Swedish).
>> Time: 14 hours
>> (install & compile: 4 hours. Understand the format of the 3 .dix files to
>> edit: 2 hours. Adding the words: 4 hours. Checking translation in both
>> directions and fix problems: 4 hours).
>
> The time estimates do seem rather low yes. However, I think they're
> supposed to reflect only the work that's on that specific task (since
> students can work on several tasks, so they won't install apertium for
> each task...)
>
> The wiki page also does say "The time column gives the minimum
> estimated amount of time that should be spent on the task. It does not
> include time taken to install / set up apertium." (now boldfaced, as I
> missed it the first time too)

Regarding that specific task: would it be better to split it into two?
One for translating the howto, another for going through it for a new
pair? I imagine that most of those who translate it into a language the
mentors know will themselves only know languages that the mentors know,
and thus languages that are already well represented in Apertium... ;-)


-Kevin



Re: [Apertium-stuff] diccionarios de Apertium

2010-11-25 Thread Kevin Brubeck Unhammer
jgime...@lsi.upc.edu writes:

[...]

>>> On 24/11/10 17:15, Jesús Giménez wrote:
>>> > for now, I've been having a look at it and I think the simplest
>>> > thing will be to use apertium-dixtools to read the .dix files
>>> >
>>> > needless to say, any suggestion on your part will be very welcome!
>>> >
>>> > many thanks,
>>> >
>>> > jesus
>>> >
>>> >
>>> > PS: by the way, when checking out the whole Apertium Subversion
>>> > repository I got an encoding problem -->
>>> >
>>> > svn: Can't convert string from 'UTF-8' to native encoding:
>>> > svn: apertium/apertium-nn-nb/dev/dansknorsk-h?\195?\184gnorsk-todo.dix

(Sorry for replying in English)

Does the problem only occur with this file? That is only a "scratch"
file which should not compile in any case (it does not even validate);
in general, files in "dev" folders are likely to have errors...

best regards,
Kevin Brubeck Unhammer



Re: [Apertium-stuff] Compound words and dix format

2010-12-19 Thread Kevin Brubeck Unhammer
Francis Tyers  writes:

> Now we have the java compound word implementation ported to C++ we can
> probably consider this 'de facto' how we are going to do compounds in
> lttoolbox -- it is _in use_ and there have been _no alternatives_. 
>
> So it is probably worth looking at how we are going to represent this
> nicely in the .dix format. At the moment we use two 'special' symbols:
>
> 
> 
>
> I propose making a new element  for compound, and having one
> attribute "r" for restriction.
>
>  would be replaced with  and 
> 

I think it would be better if elements with  are, like
, "compound-only". As the examples below show, an element
marked  now both allows use in compounds and out
of compounds, while  marks a path that's only
reachable in compounds. I think new users would find it less confusing
if they mean the same thing, even though it requires a slightly more
explicit dix file. So instead of

>   plastplast n="ind"/>
>   plastplast n="ind"/>
>   kortetkort n="def"/>

you would have to have

>   plastplast n="ind"/>
>   plastplast n="ind"/>
>   kortetkort n="def"/>
>   kortetkort n="def"/>

(Note the beautiful symmetry.)


The original reason for having this difference was that we so far have
no examples of forms that can be compound-R but not words on their own,
so having those extra identical lines means longer dix files. 

However, lttoolbox has this wonderful feature called pardefs :) So what
the line for "kortet" really looks like is this:

 kortetkort

where 


   

   




So, if we're deciding on specifications, that's the only thing I'd like
to see changed. 


-Kevin


-- 

Sent from my Emacs




Re: [Apertium-stuff] Compound words and dix format

2010-12-21 Thread Kevin Brubeck Unhammer
Francis Tyers  writes:

> Hi!
>
> The problem with this is that there are so many different metadix
> formats that it will be impossible to come up with one that covers them
> all. For example if I remember correctly how the "alt" works is
> different in es-pt and in oc-es. I think it was decided that it was
> desirable to have them functioning differently, or at least would
> require substantial changes in either language pair to get a unified
> format -- changes that without some push (and let's face it, cash) are
> not going to get made. 
>
> On the other hand, implementing compound words gives us the chance to
> strike while the iron is hot! We can make a (fairly innocuous change --
> any language pair that does not have compounding will be unaffected)
> before getting a plethora of different options and thus avoiding the
> metadix problem for another set of issues.
>
> Btw, thinking about metadix I have some probably unpopular ideas,
> that would preclude any standardisation. I think that maybe we should not
> have one format, but rather many _codified_ formats depending on the
> language(group). For example how to include a verb would be different in
> Tajik and Dutch, because different things are important. Unnecessary
> examples:
>
>  pp="aangezeten"/>
>
> Giving:
>
> aanz
> zaanz n="z/itten#_aan__vblex_sep"/>aan
> aangezetenaanzitten n="gesproken__vblex_sep"/>
>
> Or in Tajik:
>
> 

In the unification proposal from

http://wiki.apertium.org/wiki/Unification_of_metadix_and_parametrized_dictionaries#A_unifying_proposal

the calls would look like



and




Are there good reasons not to go with that kind of syntax?


-- 
Kevin Brubeck Unhammer



[Apertium-stuff] modify-case "aa" on uppercased input

2011-01-05 Thread Kevin Brubeck Unhammer

Hi,

Is there a bug in 


  
  


when the input is all uppercase, or am I using it wrong?


wget http://apertium.codepad.org/GdrOe3nL/raw.txt -O problem.t1x
wget http://apertium.codepad.org/wo597sse/raw.txt -O problem.dix
lt-comp lr problem.dix problem.dix.bin
apertium-preprocess-transfer problem.t1x problem.t1x.bin
echo '^GUOKTE$' | apertium-transfer problem.t1x problem.t1x.bin \
    problem.dix.bin


gives


^det{^tO$}$


whereas I was expecting to see


^det{^to$}$


-- 
best regards,
Kevin Brubeck Unhammer




Re: [Apertium-stuff] modify-case "aa" on uppercased input

2011-01-05 Thread Kevin Brubeck Unhammer
Francis Tyers  writes:

> El dc 05 de 01 de 2011 a les 09:32 +0100, en/na Kevin Brubeck Unhammer
> va escriure:
>> Hi,
>> 
>> Is there a bug in 
>> 
>> 
>>   
>>   
>> 
>> 
>> when the input is all uppercase, or am I using it wrong?
>> 
>> 
>> wget http://apertium.codepad.org/GdrOe3nL/raw.txt -O problem.t1x
>> wget http://apertium.codepad.org/wo597sse/raw.txt -O problem.dix
>> lt-comp lr problem.dix problem.dix.bin
>> apertium-preprocess-transfer problem.t1x problem.t1x.bin
>> echo '^GUOKTE$' | apertium-transfer problem.t1x problem.t1x.bin \
>>     problem.dix.bin
>> 
>> 
>> gives
>> 
>> 
>> ^det{^tO$}$
>> 
>> 
>> whereas I was expecting to see
>> 
>> 
>> ^det{^to$}$
>
> I think the code that deals with this is in transfer.cc 
>
> string
> Transfer::copycase(string const &source_word, string const &target_word)
>
> I'm struggling to make heads or tails of that though. In the en-ca
> rules, you find:
>
>   
> 
> 
>   
>
> and in the es-ca rules too. So I guess you are calling it right.
>
> It would seem to be a bug of some description.

With s_word == "aa" and t_word == "TO", the flags computed for s_word
are: firstupper is false, uppercase is false, sizeone is false. Then

  if(!uppercase || (sizeone && uppercase))
  {
    result = t_word;
    result[0] = towlower(result[0]);
    //result = StringUtils::tolower(t_word);
  }
  else
  {
    result = StringUtils::toupper(t_word);
  }

  if(firstupper)
  {
    result[0] = towupper(result[0]);
  }

gives us "tO" (the first test passes, so only the first character is
lowercased). If we change the first branch to

  if(!uppercase || (sizeone && uppercase))
  {
    result = t_word;
    //result[0] = towlower(result[0]);
    result = StringUtils::tolower(t_word);
  }

we get the expected "to". Does anyone know why we would want to only
lowercase the first character?
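
The difference between the two variants can be reproduced outside
Apertium with a small Python sketch. This is not the transfer.cc code:
the function names are made up, and the way the three flags are derived
(first character uppercase; first and last character uppercase; length
one) is an assumption chosen to match the flag values stated above.

```python
def case_flags(s_word):
    """Assumed flag definitions; they reproduce the values given for "aa"."""
    firstupper = s_word[:1].isupper()
    uppercase = firstupper and s_word[-1:].isupper()
    sizeone = len(s_word) == 1
    return firstupper, uppercase, sizeone


def copycase_current(s_word, t_word):
    """Mirror of the quoted logic: only the first character is lowercased."""
    firstupper, uppercase, sizeone = case_flags(s_word)
    if not uppercase or (sizeone and uppercase):
        result = t_word[:1].lower() + t_word[1:]
    else:
        result = t_word.upper()
    if firstupper:
        result = result[:1].upper() + result[1:]
    return result


def copycase_lower_all(s_word, t_word):
    """Same logic with the proposed change: lowercase the whole word."""
    firstupper, uppercase, sizeone = case_flags(s_word)
    if not uppercase or (sizeone and uppercase):
        result = t_word.lower()
    else:
        result = t_word.upper()
    if firstupper:
        result = result[:1].upper() + result[1:]
    return result
```

In this sketch copycase_current("aa", "TO") returns "tO", while
copycase_lower_all("aa", "TO") returns "to"; both return "TO" for the
pattern "AA".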



On a related note, why is sizeone&&uppercase treated as if it were
lowercase? Isn't it safer to simply ignore sizeone words passed to
modify-case? E.g.

  if(!sizeone){
    if(!uppercase) { tolower }
    else { toupper }
    if(firstupper) { toupper [0] }
  }
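
As a runnable version of that pseudocode, here is a Python sketch. It
assumes that "ignore" means returning the target word unchanged when the
case pattern has length one, and it uses the same assumed flag
definitions as before (first character uppercase; first and last
character uppercase). The function name is hypothetical.

```python
def copycase_skip_sizeone(s_word, t_word):
    """Apply the case pattern of s_word to t_word, but leave t_word
    untouched when the pattern is a single character."""
    if len(s_word) == 1:
        # A one-character pattern cannot distinguish "Aa" from "AA",
        # so this sketch does nothing rather than guess.
        return t_word
    firstupper = s_word[:1].isupper()
    uppercase = firstupper and s_word[-1:].isupper()
    result = t_word.upper() if uppercase else t_word.lower()
    if firstupper:
        result = result[:1].upper() + result[1:]
    return result
```

Whether leaving the word unchanged is the right fallback is exactly the
open question in the paragraph above; the sketch only shows what that
choice would look like.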



-Kevin



[Apertium-stuff] Election of new Apertium PMC

2011-01-25 Thread Kevin Brubeck Unhammer

Hi all, 

The Apertium Project Management Committee has just been elected by the
census of Committers, as per the Apertium By-laws[1]. 

According to the by-laws, the responsibilities of the PMC include
deciding what is suitable for release as an Apertium product,
maintaining the repositories and web sites, speaking on behalf of the
project, resolving license disputes, granting commit access, maintaining
the by-laws, promoting Apertium and attracting and distributing funds of
the project.


The newly elected PMC members are:

 Mikel (president)
 Jacob
 Juan Antonio
 Jim
 Felipe
 Sergio
 Fran


Congratulations to them all :-) 



best regards,
Kevin Brubeck Unhammer, of the Election Board


Footnotes: 
[1]  http://wiki.apertium.org/wiki/By-laws



[Apertium-stuff] Null character makes lt-proc (without -z option) exit

2011-01-26 Thread Kevin Brubeck Unhammer
Hi,

The -z option makes lt-proc flush whenever it sees the null character,
which is nice. But if you don't give it -z, it exits on the null
character -- I'm guessing it shouldn't... 

Added a bug here: 
http://bugs.apertium.org/cgi-bin/bugzilla/show_bug.cgi?id=108

(I got a null character out when converting a pdf to text, so they do
occur in the wild.)


--
Kevin Brubeck Unhammer







Re: [Apertium-stuff] Null character makes lt-proc (without -z option) exit

2011-01-26 Thread Kevin Brubeck Unhammer
"Jimmy O'Regan"  writes:

> On 26 January 2011 09:34, Kevin Brubeck Unhammer  wrote:
>> Hi,
>>
>> The -z option makes lt-proc flush whenever it sees the null character,
>> which is nice. But if you don't give it -z, it exits on the null
>> character -- I'm guessing it shouldn't...
>>
>
> Yeah, though I think it's one of those things that falls into the
> category of "if this has happened, you have bigger problems than the
> translator not working".
>
> It would probably be enough to either escape or discard nulls in the
> deformatter. Is there any compelling reason to not simply discard
> them?

Only if you want to use lt-proc -z. That is, removing nulls in the
deformatter would have to be optional, so it can still work with
lt-proc -z.

Of course you can just run everything with lt-proc -z anyway... but
maybe that gives other side effects?
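
A minimal Python sketch of what "optional" could mean here: a filter
that drops NUL bytes unless the caller asks to keep them for null-flush
mode. The function and its keep_nulls parameter are hypothetical and are
not part of any Apertium deformatter.

```python
def filter_nulls(data, keep_nulls=False):
    """Drop NUL bytes from deformatter input unless keep_nulls is set."""
    if keep_nulls:
        # Null-flush mode (lt-proc -z): NUL bytes delimit requests
        # and therefore have to survive untouched.
        return data
    return data.replace(b"\x00", b"")
```

For example, filter_nulls(b"a\x00b") gives b"ab", while passing
keep_nulls=True returns the input unchanged.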

>> Added a bug here:
>> http://bugs.apertium.org/cgi-bin/bugzilla/show_bug.cgi?id=108
>>
>> (I got a null character out when converting a pdf to text, so they do
>> occur in the wild.)
>>
>
> Seems to me to be a double bug -- whatever you were using almost
> certainly should not have given you a null in its output.

Of course; notified pdfminer of the bug too.


-Kevin




[Apertium-stuff] lt-proc -b patch

2011-02-01 Thread Kevin Brubeck Unhammer
Hi,

I attached a small patch to lt-proc -b at:
http://bugs.apertium.org/cgi-bin/bugzilla/show_bug.cgi?id=106#c1

I'm not sure if anyone actually uses the --bilingual mode, but it seems
handy for debugging, since it simply does a lookup in the bilingual
dictionary without any transfer rules. However, currently it returns as
unknown anything that has extra symbols, e.g. if your bidix specifies

tenerhave

then lt-proc -b will tell you that ^tener$ is
unknown and give you

^tener/@tener$

which is not very useful for debugging transfer. Transfer with no
transfer rules will return the longest match and then just append any
following tags, giving

^have$

The patch changes lt-proc -b so that it works like transfer in appending
the superfluous tags, but otherwise works like lt-proc -b in that it
also outputs the source language analysis:

^tener/have$
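
The behaviour described above (longest match, then append the leftover
tags) can be sketched in Python as follows. This is not the
fstprocessor.cc patch: the bidix is modelled as a plain dict, the
function names are made up, and the tags in the usage note are
hypothetical stand-ins chosen only for illustration.

```python
import re


def split_tags(lu):
    """Split 'lemma<t1><t2>' into ('lemma', ['<t1>', '<t2>'])."""
    m = re.match(r"([^<]*)((?:<[^>]*>)*)$", lu)
    if m is None:
        raise ValueError("not a lemma followed by tags: " + lu)
    return m.group(1), re.findall(r"<[^>]*>", m.group(2))


def bilingual_lookup(lu, bidix):
    """Look up the longest lemma+tags prefix found in bidix and append
    the remaining (superfluous) tags to the translation."""
    lemma, tags = split_tags(lu)
    for n in range(len(tags), -1, -1):
        key = lemma + "".join(tags[:n])
        if key in bidix:
            return "^" + lu + "/" + bidix[key] + "".join(tags[n:]) + "$"
    # Nothing matched at all: mark the whole unit as unknown.
    return "^" + lu + "/@" + lu + "$"
```

With bidix = {"tener<vblex>": "have<vblex>"}, looking up
"tener<vblex><pri><p1><sg>" yields
"^tener<vblex><pri><p1><sg>/have<vblex><pri><p1><sg>$" in this sketch,
i.e. the source analysis is kept and the extra tags are carried over.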


(Since there was an unused "queue" variable in fstprocessor.cc:bilingual
I assume this was the intended behaviour, since that also makes it
possible to use lt-proc -b as a module before a bidix-free transfer
module.)

Please let me know if it works...


-Kevin




[Apertium-stuff] en-es-generador gives a bit too many alternatives on all-caps regexp names

2011-02-06 Thread Kevin Brubeck Unhammer

Hi,

This oddness happens in apertium-en-es, current svn revision:

$ echo Mrs. FOOBAR|apertium -d /l/a/apertium-en-es en-es-generador 
FOOBAR/FOOBAr/FOOBaR/FOOBar/FOObAR/FOObAr/FOObaR/FOObar/FOoBAR/FOoBAr/FOoBaR/FOoBar/FOobAR/FOobAr/FOobaR/FOobar/FoOBAR/FoOBAr/FoOBaR/FoOBar/FoObAR/FoObAr/FoObaR/FoObar/FooBAR/FooBAr/FooBaR/FooBar/FoobAR/FoobAr/FoobaR/Foobar

It seems to be fine up until postchunk:

$ echo Mrs. FOOBAR|apertium -d /path/to/apertium-en-es en-es-postchunk 
^Pn000FOOBAR$^.$

(and the web gives Señora FOOBAR so I guess it did work before).
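
One way to characterise the output above: the 32 alternatives are the
2^5 upper/lower combinations of the last five letters, with the first
letter kept as it is. The Python sketch below only reproduces that list;
it makes no claim about where in the generator the alternatives come
from, and the function name is made up.

```python
from itertools import product


def case_variants(word):
    """All upper/lower combinations of every letter except the first."""
    head, tail = word[:1], word[1:]
    choices = [(c.upper(), c.lower()) for c in tail]
    return [head + "".join(combo) for combo in product(*choices)]


variants = case_variants("FOOBAR")
```

Joining variants with "/" gives the same sequence as the generator
output quoted above, starting with FOOBAR and ending with Foobar.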



-- 
Kevin Brubeck Unhammer


http://donttrack.us/ -- because you're worth it




Re: [Apertium-stuff] Release of apertium-br-fr 0.4.0 Breton->French

2011-02-07 Thread Kevin Brubeck Unhammer

Congrats :) 

-Kevin

Francis Tyers  writes:
> Hello all
>
> This is just to announce the release of the Apertium Breton->French
> system version 0.4.0. This work has been done by Fulup Jakez of Ofis ar
> Brezhoneg, myself and more recently Guillaume Morin also of Ofis ar
> Brezhoneg.
>
> Coverage: 
>
>  Wikipedia: 89.12%  (0.3.0: 88.0%)
>
> Dictionary stats:
>
>  Breton dictionary: 17,078  (0.3.0: 16,065)
>  Breton-French dictionary: 26,106   (0.3.0: 24,849)
>
> Rule stats:
>
>  Disambiguation: 256  (0.3.0: 239)
>  Transfer:
>    - t1x: 166  (0.3.0: 163)
>    - t2x: 79   (0.3.0: 72)
>    - t3x: 6    (0.3.0: 4)
>
> You can download it from the usual places, and it's installed on xixona.
>
> There has been no post-edition evaluation conducted yet.
>
> Fran
>
>




Re: [Apertium-stuff] Proposal: alpha releases/'staging' area in SVN

2011-02-08 Thread Kevin Brubeck Unhammer
Gema Ramírez-Sánchez  writes:

> So, correct me if I'm wrong...
>
>  TRUNK: would contain pairs in "release" and "[alpha|pre]-release"
> status and no pair without a release.
>
>  STAGING: would contain pairs that build and have an advanced status
> of all modules (dictionaries with closed categories and a decent
> coverage, an "ad hoc" PoS tagset and .prob, good coverage of main
> contrastive phenomena, testvoc clean, and a post-generator if needed).
> There should be a "PROBLEMS" file saying, IMO, status and major
> problems found while reading translations done with this pair (so, not
> a complete evaluation but a general compilation of big problems
> detected at the output). That could be a middle solution between Fran's
> and Jim's points of view. Would [alpha|pre]-releases also be allowed for
> pairs in staging?
>
> NURSERY: would contain pairs that build but that have not been
> developed deeply or maybe data copied from other pairs that needs work
> or very poor data on some modules, etc.
>
>  INCUBATOR: would contain pairs with pieces of translators
>
> I like the idea in general, how should we go ahead? making a proposal
> and submitting it to the PMC?

Agreed. I like the idea too. Having pairs that are a little testvoc away
from release in the same place as pairs that have 20 lines of bidix is
both confusing and demotivating.

I'm for doing prereleases too; and not only because the first week after
a release you remember that all-important bug you forgot to fix... I
think it can be motivating to someone who's stuck in testvoc to get
something out there and hopefully some feedback.  Also, there are very
few releases compared to the amount of work going on in SVN; if nothing
else, I think more releases could be good for publicity...  Of course,
as with moving pairs around in SVN, it should be up to the language pair
maintainers.



--
Kevin Brubeck Unhammer


Sent from my emacs.



Re: [Apertium-stuff] AWK learning for a begineer working on a language pair

2011-02-26 Thread Kevin Brubeck Unhammer
Sumit Bhandari  writes:

> Hi, I am a beginner with Apertium and want to ask if we need to learn AWK
> to add a new language pair.
> Please help.

You don't need to learn AWK :) Of course, AWK can be very *useful* to
learn, but you can manage perfectly well without it. You do need to
learn some XML, but mostly it's very simple.

I see the page
http://wiki.apertium.org/wiki/Apertium_New_Language_Pair_HOWTO has a
little awk/gawk script. This script just pretends to do what the program
apertium-tagger does (http://wiki.apertium.org/wiki/Apertium-tagger);
you wouldn't use it in a real language pair though. The reason it's used
in the HOWTO is just that making the data to run apertium-tagger is a
whole other job...




-- 
Kevin Brubeck Unhammer



Re: [Apertium-stuff] GSoC,New Language Pair tr-ky.

2011-03-28 Thread Kevin Brubeck Unhammer
mirlan  writes:

>> ** If you use trmorph, how will you trim the lemmas to the contents of
>> the bilingual dictionary ?
>>
> I am working on it.

Please explain how ;-) 

The regular method[1] is to take an lttoolbox analyser, and find the
full set of possible input-output pairs using the program lt-expand, and
run that through the translator to check for errors. Unfortunately, when
your analyser is in SFST/HFST format -- which allows lots of "loops"
in the analyser -- things get a bit more complicated. Brian Croom's
hfst-fst2strings[2] attempts to do something similar to lt-expand, while
providing some ways to filter the possibilities.
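
A rough sketch of the "check for errors" step, in Python, assuming the
translator output has already been saved one form per line. In Apertium
output a leading '@' usually marks a lemma missing from the bilingual
dictionary and a leading '#' a form the generator could not produce. The
function name and the sample lines are made up for illustration; this is
not the actual testvoc script.

```python
import re

# '@' = lemma missing from the bilingual dictionary, '#' = generation failure.
# Only match the mark at the start of a token, not in the middle of a word.
ERROR_MARK = re.compile(r"(?:^|\s)([@#])\S")


def find_testvoc_errors(translated_lines):
    """Return (line_number, line) pairs whose output contains an error mark."""
    return [
        (number, line)
        for number, line in enumerate(translated_lines, start=1)
        if ERROR_MARK.search(line)
    ]


# Hypothetical translator output, one expanded form per line.
sample = [
    "the house",   # clean
    "the @ev",     # lemma not found in the bilingual dictionary
    "#house the",  # generator could not produce the form
]

for number, line in find_testvoc_errors(sample):
    print(number, line)
```

With an HFST analyser the hard part is producing the input lines in the
first place, which is what the filtering in hfst-fst2strings is meant to
help with; the check itself stays the same.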


> * How will you make the bilingual lexicon? I presume there are few
> freely-available (e.g. open-source/free software) dictionaries, so you
> will probably have to build your own. Someone with experience of
> Apertium can do ~400 words in a day, so we would like to see a start on
> the lexicon to make sure you understand the problems involved.
>
> Right now I have a StarDict tr-ky dictionary; I hope it could help me.

Is there a link? Does it have part-of-speech (word class) information?
(That would make it a lot easier to use.)

> * It would be a good idea to start looking at any transfer
> (syntactic/morphological) issues between the two languages.
>
> tr-ky have some similarities […]

We are more interested in the differences ;) E.g. differences in case
system, inflection, word order, etc.

The best way to document such differences (or similarities) is to make a
page like http://wiki.apertium.org/wiki/English_and_French/Pending_tests
which you can then test your language pair on.


Do come on IRC more so we can discuss the issues and any possible
problems you have; we don't want anyone to waste lots of time on
something that could be solved by discussing it on IRC :)




best regards,
Kevin Brubeck Unhammer



Footnotes: 

[1]  http://wiki.apertium.org/wiki/Testvoc

[2]  
http://sourceforge.net/mailarchive/forum.php?thread_name=AANLkTinYnDtHehxWWAJf25JVXKYaM0Uw95Kzr41jgKZo%40mail.gmail.com&forum_name=apertium-stuff




Re: [Apertium-stuff] GSoC11 Draft Proposal: Rule-based finite-state disambiguation

2011-04-01 Thread Kevin Brubeck Unhammer
> in Apertium.
>
>   4 week community Bonding: Reading around the subject area and
> acquiring specific skills such as C++, parser design, language
> processing and computational linguistics
>   12 week coding period: pursuant primarily to detail item (5)
>   1 week sprint: final polish, debugging and documentation effort

I'd like to see a more detailed plan, especially wrt. which features
should be implemented and prioritised. Some of the CG functions
implemented by e.g. vislcg3[1] are a lot more important than others, so
think about the feature set and test cases for that. E.g. LIST,
SELECT/REMOVE, star (*), BARRIER, Careful (C) are important. Things like
spanning window boundaries, setting marks or making dependency trees
should be deferred until much later. Unification can be avoided by
just writing more rules.
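
To make SELECT/REMOVE concrete for test cases, here is a toy model in
Python. It is only a sketch of the semantics, not how vislcg3 implements
them: contextual tests (star, BARRIER, Careful) are left out entirely,
readings are plain tuples of tags, and the sample cohort is invented.

```python
def select(readings, target):
    """Keep only the readings that carry `target`.

    If no reading matches, the rule does not apply and the cohort is
    returned unchanged.
    """
    kept = [reading for reading in readings if target in reading]
    return kept if kept else readings


def remove(readings, target):
    """Drop the readings that carry `target`.

    A rule never removes the last remaining reading of a cohort.
    """
    kept = [reading for reading in readings if target not in reading]
    return kept if kept else readings


# Hypothetical ambiguous cohort: "book" as noun or as verb.
cohort = [("book", "n", "sg"), ("book", "vblex", "inf")]

print(select(cohort, "n"))
print(remove(cohort, "n"))
print(remove([("book", "n", "sg")], "n"))
```

A prioritised feature list could come with small input/output pairs like
these for each feature, so it is clear when a feature counts as done.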


[...]

> Recently I’ve been working on an online chessboard (jQuery/node.js),

Include the URL in your proposal, if you can ;) 







-- 
Kevin Brubeck Unhammer



Re: [Apertium-stuff] Update proposal for GSoC 2011 Apertium tr-ky language pair.

2011-04-04 Thread Kevin Brubeck Unhammer
mirlan  writes:

> Hi,
> Please find attached my proposal for GSoC 2011.

Looks promising, but please make sure you answer all the questions in
http://wiki.apertium.org/wiki/Top_tips_for_GSOC_applications#Template
(and in the same order).

What do you plan to do in the Community Bonding period?



best regards,
Kevin Brubeck Unhammer



Re: [Apertium-stuff] Which package to download?

2011-05-03 Thread Kevin Brubeck Unhammer
Congmin min  writes:

> Hi, I am new to Apertium and have two questions for your help with:
> 1) It seems there is not a single bundled package on sourceforge for
> downloading. Then which ones should I download for Linux or Windows? For
> example, I want to download and install, and then try out the
> English-spanish translation first.

lttoolbox, apertium, apertium-en-es 
(install them in that order)

However, if you're planning on developing a language pair, it would be
better to install from SVN:
http://wiki.apertium.org/wiki/Minimal_installation_from_SVN

> 2) Is it possible to develop an English-Chinese language pair, without
> significantly changing the system?

There might _eventually_ be a problem with handling the alphabet size in
lttoolbox, although if I remember the conversation from last time,
jimregan said it shouldn't be too much trouble to fix… Other than that,
I can't foresee any technical issues.


-- 
Kevin Brubeck Unhammer


Sent from my emacs.




Re: [Apertium-stuff] Apertium-stuff Digest, Vol 49, Issue 4

2011-05-24 Thread Kevin Brubeck Unhammer
Aish Raj Dahal  writes:

>> In the above example, I have noticed that the "ukar" symbol of
>> Devnagari is not being rendered so ? is being seen as ???.
>> There is also a problem with rendering of "half letters" (sorry, i do
>> not know the linguistic term for it). Here is an example of what I
>> mean:
>>
>> echo "computer"|apertium en-ne
>> ??
>> In the above example the word "computer" should have given ?
>
> I get ? -- the problem is with your terminal not rendering the
> combining characters, not with Apertium. gnome-terminal is known to
> have issues with Devanagari, is that what you're using?
>
> Well, I guessed so. I am using the terminal "Konsole" under KDE 4.6
> (Kubuntu 11.04). Is there a way to work around this problem?

I get the same behaviour under Konsole on Arch Linux with KDE 4.6:

$ echo computer | apertium -d . en-ne
कमपयटर

while piping into a file gives me कम्प्युटर
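
To see what gets lost, one can list the code points of the string that
ends up in the file. A small Python sketch follows (the function name is
just for illustration): dropping every combining mark from the correct
string gives exactly what Konsole displays, which suggests the data is
intact and only the display of the marks fails.

```python
import unicodedata

# "computer" in Devanagari, as written to the file (see above).
correct = "\u0915\u092e\u094d\u092a\u094d\u092f\u0941\u091f\u0930"


def drop_combining_marks(text):
    """Remove all characters in the Unicode 'Mark' categories (Mn, Mc, Me)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("M")
    )


# Show each code point with its category and name; the virama (U+094D)
# and the vowel sign U (U+0941) are the combining marks here.
for ch in correct:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

# What is left when the terminal fails to draw the marks.
print(drop_combining_marks(correct))
```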

It seems to be a known bug, with a patch (last one from 2 years ago?) if
you feel like recompiling: http://bugs.kde.org/show_bug.cgi?id=156071

But it might be quicker to just install gnome-terminal/xterm/something
else. Or open emacs and do M-x shell, which displays it correctly :)


-- 
Kevin Brubeck Unhammer



Re: [Apertium-stuff] fun with configure (on OS X)

2011-06-06 Thread Kevin Brubeck Unhammer
Sagie Maoz  writes:

> Hello,
>
> So after re-shuffling some paths in my OS X system (specifically, switching
> from using MacPorts to Homebrew), I ran into all sorts of problems running
> and compiling apertium tools that wasted a lot of my time.
> I eventually found most of the solutions in... the Apertium on Mac OS X page
> at the wiki [1]. The biggest problem though, other than I wasn't aware of
> this page until Google found it for me, is that the instructions can be very
> confusing;
>
> 1. Most of the requirements listed aren't needed for a successful build.
> I had so much trouble installing libxml2 until I figured out that OS X comes
> bundled with it and installing it myself causes a conflict. However;
> 2. ./configure consistently fails with the error message "syntax error near
> unexpected token", on both lttoolbox and apertium, probably because of using
> the bundled library.
> The wiki page has a solution for that (running aclocal, automake, etc.
> manually) and it works, but it seems funny that I'm now banned from using
> autogen.sh in all of the projects.
> I'm not at all familiar with configure and make scripts, but I hope these
> sorts of things can be fixed within the scripts.

When you're running these commands manually, do you paste _everything_
from the autogen.sh script? And it gives different results? In that
case, running 

source autogen.sh

should also work (more or less equivalent to pasting the whole script in
the terminal), even though running ./autogen.sh does not.

Following that line of thought, perhaps there are two versions of
automake etc. on your system; you could try adding the line

which aclocal autoconf automake 

to the script, and compare what that gives with what you get when
running that line in the terminal.
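
If you want to see every copy rather than just the first hit, something
like the Python sketch below lists all executables with a given name along
PATH, in search order. The function name is made up for illustration, and
the result depends on the PATH of whatever process runs it, so it would
need to be run both from the terminal and from the environment the script
sees to be comparable.

```python
import os


def find_all_on_path(tool, path=None):
    """Return every executable named `tool` found along PATH, in search order."""
    path = os.environ.get("PATH", "") if path is None else path
    matches = []
    for directory in path.split(os.pathsep):
        # An empty PATH entry means the current directory.
        candidate = os.path.join(directory or ".", tool)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            matches.append(candidate)
    return matches


for tool in ("aclocal", "autoconf", "automake"):
    # More than one entry means two installations are competing;
    # the first one listed is the one that gets run.
    print(tool, find_all_on_path(tool))
```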

-- 
Kevin Brubeck Unhammer



Re: [Apertium-stuff] "Official" Apertium buttons

2011-06-07 Thread Kevin Brubeck Unhammer
Fajro  writes:

> On Tue, Jun 7, 2011 at 3:16 PM, Mikel Forcada  wrote:
>> Hi Apertiumers,
>> would HTML/javascript "buttons" such as the one below (which of course
>> can easily be improved) be acceptable to the Apertium community.
>
> +1.
>
>
> I made a facebook page 2 years ago: http://www.facebook.com/Apertium
>
> Still less than 50 fans :(  Anyone want to be admin?
>
>
> Apertium also should have a cool blog; something like
> http://googletranslate.blogspot.com/ but better.

Or at least a planet (blog aggregator) ?
(see https://secure.wikimedia.org/wikipedia/en/wiki/Planet_%28software%29)

-Kevin



[Apertium-stuff] Constraint Grammar infrastructure at risk

2011-06-09 Thread Kevin Brubeck Unhammer

The University of Southern Denmark has decided to cut financial support
of the VISL Constraint Grammar infrastructure, and the developers are
calling for moral/financial contributions or lobbying initiatives:

https://groups.google.com/group/constraint-grammar/browse_thread/thread/515081fab2b2797d



-- 
Kevin Brubeck Unhammer


