Kevin Brubeck Unhammer wrote on 9 Aug 2012 at 14:54:

> Francis Tyers <fty...@prompsit.com> writes:
>> On Thu, 09 Aug 2012 at 10:35 +0200, Per Tunedal wrote:
>>> I consider Apertium suitable for translating the pair Swedish -
>>> Norwegian

Yes.

>>> 3. You might use a level 1 translation (without constraint grammar),
>>> like the pair Swedish - Danish. In that case, you could make the
>>> translation usable for a wide audience by adding the pair to Apertium
>>> Caffeine and the new OmegaT plug-in.
>> 
>> In any case there is no free constraint grammar of Swedish currently 
>> available.

The lack of a CG for Swedish is a problem. My suggestion would be to write one. 
To be a bit more specific: write the 100 or so rules needed to remove the bulk, 
say 80(?)%, of the ambiguity.

> What you're describing is gisting/translation for understanding; I can't
> imagine gisting MT would be very useful for sv-nb/nn (and I suspect
> people would use Google for that anyway).

From the Norwegian side, we cannot imagine the need for a sv-nb/nn gisting 
system. The most help we would need is, in rare cases, a dictionary 
translating a small number of hard words.

How hard Norwegian is for Swedes is of course up to the Swedes to judge. But 
the competition will be between understanding the Norwegian text and 
understanding (sic) the MT output.

> But with these closely related
> languages, it's possible to get to a standard good enough for
> post-editing (pre-publishing), e.g. with OmegaT as you mentioned, and in
> that case the users definitely know which language it is already.

Yes, a production system (say, I want to translate a sv article to nn on 
Wikipedia) is a different matter. My experience from nn-nb translation is that 
post-editing saves around 80% of the time compared with rewriting/translating 
from scratch.

So yes, that can be a good idea. __But__ the nb and nn lexicons and orthographic 
principles are the same, so more often than not unknown words will come out as 
free rides. For sv-nn/nb that will __not__ be the case (to the same extent), 
since both vocabulary and orthography deviate more. So, fewer free rides for 
unknown words. This implies that the transfer lexicon must be __much__ bigger 
than the nb-nn one in order to get results as good as those we have for nb-nn. 
The good news is that such an enlarged transfer lexicon can in part be built 
automatically and then manually post-edited.

>> 
>> (3) You make the two translators in the one pair. For this, you could
>> have the same Swedish dictionary, but would need different nb and nn
>> dictionaries, different sv-nb and sv-nn dictionaries and different sv-nb
>> and sv-nn transfer rules.

> (3) sounds best to me too.

I agree.

> Perhaps you could even do with one bidix, and
> just use the alt="nn" vs alt="nb" attribute; a rough and dirty count
> shows that the majority of entries in the nn-nb bidix carry over the
> same lemma/tag:

This could very well be the case, yes (cf. my experiences with free rides).
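
A rough count of this kind could be sketched as follows. This is only a minimal 
illustration, not the count Kevin actually ran: the miniature bidix and the 
function names are made up here, and an entry is counted as a "free ride" when 
both sides have the same lemma and tags (or when it is an identity <i> entry).

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature bidix for illustration; real bidixes are far larger.
SAMPLE_BIDIX = """<dictionary>
  <section id="main" type="standard">
    <e><p><l>hus<s n="n"/></l><r>hus<s n="n"/></r></p></e>
    <e><p><l>ikkje<s n="adv"/></l><r>ikke<s n="adv"/></r></p></e>
    <e><i>bil<s n="n"/></i></e>
    <e><p><l>veke<s n="n"/></l><r>uke<s n="n"/></r></p></e>
  </section>
</dictionary>"""


def side_key(elem):
    """Return (lemma text, tuple of tag names) for an <l>, <r> or <i> element."""
    lemma = "".join(elem.itertext()).strip()
    tags = tuple(s.get("n") for s in elem.findall("s"))
    return lemma, tags


def count_free_rides(bidix_xml):
    """Count entries whose two sides are identical; return (same, total)."""
    root = ET.fromstring(bidix_xml)
    same = total = 0
    for entry in root.iter("e"):
        total += 1
        if entry.find("i") is not None:
            # An identity entry maps a string to itself on both sides.
            same += 1
            continue
        left, right = entry.find("p/l"), entry.find("p/r")
        if left is not None and right is not None:
            if side_key(left) == side_key(right):
                same += 1
    return same, total


same, total = count_free_rides(SAMPLE_BIDIX)
print(f"{same} of {total} entries carry over unchanged")
```

Running the same count on a sv-nb or sv-nn word list would give a first 
estimate of how many entries really need separate treatment.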

> That said, I would pick one first and get the system up and running,
> then expand to both later on.

This is also a possibility, yes. But the expansion to both languages should be 
taken into account in the setup phase.


> https://en.wikipedia.org/wiki/Language_identification
> Using a library like that makes general (you can use it for lots of
> languages) and is a *lot* faster than translating everything twice (or
> thrice or …).

Yes, language identification is the way to go.
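
To illustrate why identification is so much cheaper than translating 
everything twice, here is a toy sketch that guesses nb vs. nn from a handful of 
marker words. The word lists and the function name are assumptions made for 
this example; a real language identification library would use statistical 
models trained on much more data.

```python
import re

# Tiny hand-picked marker word sets (illustrative assumption, far from complete).
MARKERS = {
    "nb": {"ikke", "jeg", "hva", "hvordan", "fra"},
    "nn": {"ikkje", "eg", "kva", "korleis", "frå"},
}


def guess_variant(text):
    """Return "nb" or "nn" if one scores strictly highest, otherwise None."""
    words = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(1 for w in words if w in marks)
        for lang, marks in MARKERS.items()
    }
    best = max(scores, key=scores.get)
    # Refuse to guess when nothing matched or when the top score is tied.
    if scores[best] == 0 or list(scores.values()).count(scores[best]) > 1:
        return None
    return best


print(guess_variant("Eg veit ikkje kva det er"))
print(guess_variant("Jeg vet ikke hva det er"))
```

A single pass over the words is enough to pick the input variant, after which 
only the matching translator needs to run.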


> http://www.nb.no/spraakbanken/tilgjengelege-ressursar/tekstressursar has
> more frequency lists (they also taunt you with this enormous corpus, but
> it's currently "in beta", very messy, and best avoided for now).

The best resource is the NoWaC corpus; it also has frequency lists, both for 
lemmata and for word forms.

My final comment would be that the work will be:

1. in the analysis/generation of Swedish
2. … and in the bidix.

As for 1, we should look around in the Swedish language technology landscape 
and look for open resources, e.g. in Gothenburg (Aarne Ranta, also Språkbanken).

As for 2, Lexin might be one resource. I am at Euralex in Oslo right now, and 
will ask around.

Trond.





