Francis Tyers <fty...@prompsit.com> writes:

> On Thu, 9 Aug 2012 at 10:35 +0200, Per Tunedal wrote:
>> Hi,
>> I consider Apertium suitable for translating the pair Swedish -
>> Norwegian for the following reasons:
>> 
>> 1. They are closely related.
>>  
>> 2. You don't have an abundance of free bilingual resources, as Norway
>> doesn't belong to the EU. Thus, a statistical approach would be difficult.
>> 
>> 3. You might use a level 1 translation (without constraint grammar),
>> like the pair Swedish - Danish. In that case, you could make the
>> translation usable for a wide audience by adding the pair to Apertium
>> Caffeine and the new OmegaT plug-in.
>
> In any case there is no free constraint grammar of Swedish currently 
> available.
>
>> Is anyone working on the pair at the moment? I might start some work
>> to begin familiarizing myself with Apertium.
>
> No-one is currently working on the pair.
>
>> Some considerations:
>> 
>> A. Written Norwegian is in fact two different languages: Bokmål (nb) and
>> Nynorsk (nn). If I simplify a lot, the former is basically Danish
>> written by Norwegians (some words are completely different from Danish)
>> and the latter is a codification of the spoken traditional Norwegian
>> (different words and a more complicated grammar). Both languages are
>> official in Norway, but one variant or the other is preferred in certain
>> areas and by certain individuals. However, Bokmål is the dominant
>> variant (80-90 %).
>> 
>> How should this be handled when translating from Norwegian to Swedish?
>> If a user encounters some text in Norwegian, he doesn't know if it's
>> Bokmål or Nynorsk. He just surfed to some page with some interesting
>> facts about bird watching, cod fishing, hiking in the mountains or
>> whatever he is interested in. He just wants to translate the content.

What you're describing is gisting/translation for understanding; I can't
imagine gisting MT would be very useful for sv-nb/nn (and I suspect
people would use Google for that anyway). But with these closely related
languages, it's possible to get to a standard good enough for
post-editing (pre-publishing), e.g. with OmegaT as you mentioned, and in
that case the users definitely know which language it is already.

> There are three possibilities. 
>
> (1) You can make an sv-nb (or sv-nn) translator, and then include a
> subset of the nn-nb translator in it, piping the output of sv-nb into
> nb-nn. (here you would have an sv-nb dictionary and an nb-nn dictionary)
>
> (2) You make two translators in parallel.
>
> (3) You make the two translators in one pair. For this, you could
> have the same Swedish dictionary, but would need different nb and nn
> dictionaries, different sv-nb and sv-nn dictionaries and different sv-nb
> and sv-nn transfer rules.
>
> I think that (3) is probably best, but would like input from others
> (e.g. Unhammer or Trond).

(3) sounds best to me too. Perhaps you could even do with one bidix, and
just use the alt="nn" vs alt="nb" attribute; a rough and dirty count
shows that the majority of entries in the nn-nb bidix carry over the
same lemma/tag:

$ lt-expand apertium-nn-nb.nn-nb.dix | grep -v ':[<>]:' | awk -F: '$1==$2' | wc -l
71628
$ lt-expand apertium-nn-nb.nn-nb.dix | grep -v ':[<>]:' | awk -F: '$1!=$2' | wc -l
11365
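
In case it helps, here is a rough sketch of the same count done in one pass, with the share of identical entries worked out as well. The function name bidix_overlap is made up for illustration; it assumes lt-expand output on stdin in the usual left:right form, with direction-restricted entries marked :>: or :<: as in the grep above.

```shell
# Summarise lt-expand output of a bidix read from stdin: how many entries
# keep the same lemma and tags on both sides, and how many differ.
# (bidix_overlap is a hypothetical helper name, not part of lttoolbox.)
bidix_overlap() {
  awk -F: '
    /:[<>]:/ { next }            # skip direction-restricted entries
    NF < 2   { next }            # skip lines without a left:right pair
    $1 == $2 { same++; next }    # identical lemma and tags on both sides
             { diff++ }          # everything else differs somewhere
    END {
      total = same + diff
      if (total > 0)
        printf "same=%d diff=%d share=%.1f%%\n", same, diff, 100 * same / total
    }'
}
```

Usage would be `lt-expand apertium-nn-nb.nn-nb.dix | bidix_overlap`. With the two counts above, the share works out to 71628 / (71628 + 11365), roughly 86 %.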


That said, I would pick one first and get the system up and running,
then expand to both later on.


>> Perhaps Apertium could do some test-translation to see if the text is
>> written in Bokmål or Nynorsk? And then use the most fruitful translation
>> pair for the translation to Swedish. Or just ignore Nynorsk? Wouldn't
>> that be a shame?

There are really good language identification programs out there that
figure out the source language in a much simpler way, see e.g.
http://software.wise-guys.nl/libtextcat/ or the external links on
https://en.wikipedia.org/wiki/Language_identification

Using a library like that is general (you can use it for lots of
languages) and a *lot* faster than translating everything twice (or
thrice or …).
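
To give an idea of how cheap this step is compared to translating, here is a toy stand-in that guesses nn vs nb by counting a few common function words. This is not how libtextcat works (it compares character n-gram profiles); the function name guess_norwegian and the two short marker-word lists are my own assumptions, picked only for illustration. Ties fall back to nb, since it's the dominant variant.

```shell
# Toy guesser: reads text on stdin, prints "nn" or "nb".
# guess_norwegian and the word lists are illustrative assumptions only.
guess_norwegian() {
  # One lowercased word per line, then count hits against each list.
  tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' | awk '
    BEGIN {
      split("ikkje eg ein kva berre noko dei korleis", a, " ")
      for (i in a) nn[a[i]] = 1
      split("ikke jeg en hva bare noe hun hvordan", b, " ")
      for (i in b) nb[b[i]] = 1
    }
    $0 in nn { nn_hits++ }
    $0 in nb { nb_hits++ }
    END { print (nn_hits > nb_hits ? "nn" : "nb") }'
}
```

You would then pick the nn or nb translator based on the answer. For real use, a proper library will do far better than eight marker words per language.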

> Ignoring Nynorsk would be a great shame! Especially since it is the
> favoured variant of Norwegian speakers working on Apertium ;)
>
>> B. I have looked in the repository and found that some work has been
>> done on the following dictionaries:
>> 
>> Danish (da) - Norwegian Bokmål (nb) - nursery
>> Swedish (sv) - Norwegian Bokmål (nb) - incubator
>>
>> Tihomir told me he's working on Swedish-Icelandic and has expanded the
>> Swedish monolingual dictionary from sv-da. But which is the most
>> complete Norwegian Bokmål (nb) monolingual dictionary? The one from the
>> pair Norwegian Bokmål (nb) - Norwegian Nynorsk (nn)?
>
> Yes, I would take the Swedish dictionary from sv-is and the Norwegian
> dictionar(y,ies) from nn-nb.
>
>> C. Is it possible to reuse some transfer rules?
>
> The transfer rules are the least of your worries. sv-da has a grand
> total of 6, and nn-nb 13. 
>
>> If Danish and Norwegian Bokmål are very similar, perhaps it's possible
>> to reuse the transfer rules da-sv from the pair Danish (da) - Swedish
>> (sv) for the translation from Swedish to Norwegian Bokmål (nb)? And the
>> same in the other direction (i.e. convert the transfer rules for sv-da
>> to rules for sv-nb)?
>
> Reusing transfer rules probably isn't necessary. If you don't feel like
> writing them, then you can write testcases on the Wiki and ask someone
> on the list to write them. 

Well, from nb to sv you could copy-paste some of the compound chunking
rules, but yeah transfer rules don't take very long to write.

>> Perhaps the maintainer of Danish (da) - Norwegian Bokmål (nb) can give
>> me a hint? He's probably very up to date on the differences between the
>> two languages.
>
> There is no maintainer that I know of. 

And I don't think that pair has any work done apart from bidix entries …

>> D. Linguistic resources for Norwegian.
>> 
>> I have found frequency word lists for Norwegian Bokmål (nb) at
>> http://helmer.aksis.uib.no/nta/ and can thus prioritize my work to the
>> most important words.

http://www.nb.no/spraakbanken/tilgjengelege-ressursar/tekstressursar has
more frequency lists (they also taunt you with this enormous corpus, but
it's currently "in beta", very messy, and best avoided for now).


[…]

>> E. Any advice for me if I start working on the pair Swedish (sv) -
>> Norwegian Bokmål (nb)? Have I missed something I need to know? Any other
>> resources I can use?
>
> My advice would be to start small, to avoid getting overwhelmed. 
>
> Start from scratch on a small task. For example translating this short
> story: 
>
> http://www.unilang.org/ulrview.php?res=422,416
>
> Once you have managed to get the system to translate this without any
> system errors (the @, * and # marks you see, not necessarily translation
> errors), then you should have a good understanding of the system, and be
> well placed to start working with the other resources.
>
> It shouldn't take longer than a week, and some have done it in a couple
> of days.

+1 on that.
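
While working through the story, a quick tally of the marks in the output shows how far along you are. A minimal sketch, assuming the usual Apertium conventions (* for a word unknown to the analyser, @ for a lemma missing from the bidix, # for a form the generator couldn't produce); count_markers is a hypothetical helper name, and it only looks at the first character of each whitespace-separated token.

```shell
# Count tokens in translator output (stdin) that start with an error mark.
# count_markers is an illustrative helper, not an Apertium tool.
count_markers() {
  tr -s '[:space:]' '\n' | awk '
    /^\*/ { unknown++ }   # word not found by the analyser
    /^@/  { bidix++ }     # lemma not found in the bilingual dictionary
    /^#/  { gen++ }       # form the generator could not produce
    END { printf "*=%d @=%d #=%d\n", unknown, bidix, gen }'
}
```

Pipe the output of your translator into it; when all three counts are zero for the story, you're done with that exercise.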


-- 
Kevin Brubeck Unhammer

GPG: 0x766AC60C

