Hi, I consider Apertium suitable for translating the pair Swedish - Norwegian for the following reasons:
1. They are closely related.
2. You don't have an abundance of free bilingual resources, as Norway doesn't belong to the EU. Thus, a statistical approach would be difficult.
3. You might use a level 1 translation (without constraint grammar), like the pair Swedish - Danish. In that case, you could make the translation usable for a wide audience by adding the pair to Apertium Caffeine and the new OmegaT plug-in.

Is anyone working on the pair at the moment? I might start some work on it to familiarize myself with Apertium.

Some considerations:

A. Written Norwegian is in fact two different languages: Bokmål (nb) and Nynorsk (nn). If I simplify a lot, the former is basically Danish written by Norwegians (some words are completely different from Danish) and the latter is a codification of traditional spoken Norwegian (different words and a more complicated grammar). Both languages are official in Norway, but one variant or the other is preferred in certain areas and by certain individuals. However, Bokmål is the dominant variant (80-90 %).

How should this be handled when translating from Norwegian to Swedish? If a user encounters some text in Norwegian, he doesn't know whether it's Bokmål or Nynorsk. He just surfed to some page with interesting facts about bird watching, cod fishing, hiking in the mountains or whatever he is interested in, and he just wants to translate the content. Perhaps Apertium could do a test translation to see whether the text is written in Bokmål or Nynorsk, and then use the most fruitful translation pair for the translation to Swedish? Or should we just ignore Nynorsk? Wouldn't that be a shame?

B. I have looked in the repository and found that some work has been done on the following dictionaries:
- Danish (da) - Norwegian Bokmål (nb) - nursery
- Swedish (sv) - Norwegian Bokmål (nb) - incubator

Tihomir told me he's working on Swedish-Icelandic and has expanded the Swedish monolingual dictionary from sv-da.
But which is the most complete Norwegian Bokmål (nb) monolingual dictionary? The one from the pair Norwegian Bokmål (nb) - Norwegian Nynorsk (nn)?

C. Is it possible to reuse some transfer rules? If Danish and Norwegian Bokmål are very similar, perhaps it's possible to reuse the da-sv transfer rules from the pair Danish (da) - Swedish (sv) for the translation from Norwegian Bokmål (nb) to Swedish? And the same in the other direction (i.e. convert the transfer rules for sv-da to rules for sv-nb)? Perhaps the maintainer of Danish (da) - Norwegian Bokmål (nb) can give me a hint? He is probably well informed about the differences between the two languages.

D. Linguistic resources for Norwegian. I have found frequency word lists for Norwegian Bokmål (nb) at http://helmer.aksis.uib.no/nta/ and can thus prioritize my work on the most important words. Online dictionaries and grammatical resources can be found at the site of the Norwegian Language Council (Språkrådet), http://www.sprakrad.no/ .

What about corpora? I have found some bilingual data at Uppsala University, http://opus.lingfil.uu.se/ (very low quality!). Has anyone found any other bilingual resources for nb-sv? Any monolingual data?

Does anyone know of a good tool for extracting texts from the internet? I have tried a lot of them. The most promising are Corpuscatcher (the Yahoo API is obsolete and searching from a list of URLs doesn't work as expected), Webharvest (I haven't figured out the syntax yet) and Webextractor360 (I need to brush up on my regular expressions). Corpuscatcher would do all the steps, if it worked: find promising websites, download the pages and convert the pages to text.

By the way, what about copyright issues? What can I do with downloaded web pages?
As far as I know, it wouldn't be any problem to:
- make word lists
- count words to build word frequency lists
- make transfer rules or a constraint grammar
- study any linguistic (or other) properties of the texts
- test a translation engine on the texts

I suppose it would be more difficult if I worked with statistical machine translation, e.g. Moses, as this implies that I build a database of phrases and even use those phrases in the produced translations. I suppose it would be possible to claim copyright on a phrase, but not on a word. What's the practice?

E. Any advice for me if I start working on the pair Swedish (sv) - Norwegian Bokmål (nb)? Have I missed something I need to know? Are there any other resources I can use?

Yours,
Per Tunedal

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff