In my opinion, the only thing that is going to work in the short term is a
guided rule-based system. We need that to be able to reuse values from
Wikidata in running text. That is, a template text must be transformed
according to gender, plurality, etc., and the values must also be adjusted
into genitive, locative, illative, etc., forms.
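To make concrete what I mean by a guided rule-based system, here is a
minimal sketch in Python. Everything in it is a hypothetical
simplification: the template syntax is invented for the example, and real
Finnish inflection would also need vowel harmony, consonant gradation, and
per-lemma exceptions.

    # A minimal sketch of guided, rule-based synthesis: a template whose
    # slots are filled with Wikidata values, each inflected by a
    # hand-written rule. The rule table and slot syntax are hypothetical.

    import re

    CASE_SUFFIXES = {
        "genitive": "n",    # Espoo -> Espoon
        "inessive": "ssa",  # Espoo -> Espoossa ("in Espoo")
    }

    def inflect(value, case):
        # A real system would consult a morphology engine or a
        # community-maintained rule set instead of bare suffixing;
        # cases like the illative need more than a fixed suffix.
        return value + CASE_SUFFIXES[case]

    def render(template, values):
        # Slots look like {label:case}, e.g. {birthplace:inessive}.
        def fill(match):
            label, case = match.group(1), match.group(2)
            return inflect(values[label], case)
        return re.sub(r"\{(\w+):(\w+)\}", fill, template)

    # "He/she was born in Espoo." with the place taken from Wikidata
    print(render("Hän syntyi {birthplace:inessive}.",
                 {"birthplace": "Espoo"}))

The point is the division of labour: the values come from Wikidata, the
templates from editors, and the inflection rules from a per-language,
community-maintained table.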
https://meta.wikimedia.org/wiki/Wikimedia_Fellowships/Project_Ideas/Tools_for_text_synthesis

On Sat, Jul 27, 2013 at 5:40 PM, David Cuenca <dacu...@gmail.com> wrote:
> On Sat, Jul 27, 2013 at 10:39 AM, C. Scott Ananian
> <canan...@wikimedia.org> wrote:
>
>> My main point was just that there is a chicken-and-egg problem here.
>> You assume that machine translation can't work because we don't have
>> enough parallel texts. But, to the extent that machine-aided
>> translation of WP is successful, it creates a large amount of
>> parallel text. I agree that there are challenges. I simply disagree,
>> as a matter of logic, with the blanket dismissal of the chickens
>> because there aren't yet any eggs.
>
> I think we both agree about the need and usefulness of having a
> copious amount of parallel text. The main difficulty is how to get
> there from scratch. As I see it, there are several possible paths:
> - volunteers creating the corpus manually (some work done, however
>   not properly tagged)
> - using a statistical approach to create the base text, with
>   volunteers improving that text only
> - using rules and statistics to create the base text, with volunteers
>   improving the text and optionally the rules
>
> The end result of all options is the creation of a parallel corpus
> that can be reused for statistical translation. In my opinion, the
> effectiveness of giving users the option to improve/select the rules
> is much greater than that of improving the text only. It complements
> statistical analysis rather than replacing it, and it provides a good
> starting point to solve the chicken-and-egg conundrum, especially in
> small Wikipedias.
>
> Currently translatewiki relies on external tools over which we don't
> have much control; besides being proprietary, they risk being
> disabled at any time.
>
>> I think you're attributing the faults of a single implementation/UX
>> to the technique as a whole. (Which is why I felt that "step 1"
>> should be to create better tools for maintaining information about
>> parallel structures in Wikidata.)
>
> Good call. Now that you mention it, yes, it would be great to have a
> place to keep a parallel corpus, and it would be even more useful if
> it could be annotated with wikidata-wiktionary senses. A wikibase
> repo might be the way to go. No idea if Wikidata or Translatewiki are
> the right places to store/display it. Maybe it would be a good time
> to discuss it during Wikimania. I have added it to the "elements"
> section.
>
>> In a world with an active Moore's law, WP *does* have the computing
>> power to approximate this effort. Again, the beauty of the
>> statistical approach is that it scales.
>
> My main concern about statistical machine translation is that it
> needs volume to be effective, hence the proposal to use rule-based
> translation to reach the critical point faster than by using
> statistics on existing text alone.
>
>> I'm sure we can agree to disagree here. Probably our main
>> differences are in answers to the question, "where should we start
>> work"? I think annotating parallel texts is the most interesting
>> research question ("research" because I agree that wiki editing by
>> volunteers makes the UX problem nontrivial). I think your suggestion
>> is to start work on the "semantic multilingual dictionary"?
>
> It is quite possible to have multiple developments in parallel. That
> a semantic dictionary is in development doesn't hinder the creation
> of a parallel corpus or an interface for annotating.
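As an aside, to make the annotated-corpus idea concrete: one possible
shape for a corpus entry, sketched in Python. The schema is purely
illustrative and not an existing format; the two Wikidata item IDs are
real, everything else is invented for the example.

    # One possible shape for an entry in an annotated parallel corpus.

    entry = {
        "source": {"lang": "en",
                   "text": "Helsinki is the capital of Finland."},
        "target": {"lang": "fi",
                   "text": "Helsinki on Suomen pääkaupunki."},
        # token-level alignments: (source token index, target token index)
        "alignment": [(0, 0), (3, 3), (5, 2)],
        # sense anchors into Wikidata, so annotations survive rewording
        "senses": {"Helsinki": "Q1757", "Finland": "Q33"},
    }

Entries like this would serve both approaches at once: the text pairs
feed statistical alignment, while the sense anchors feed the semantic
dictionary.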
> The same applies to statistics/rules: they are not incompatible; in
> fact, they complement each other pretty well.
>
>> ps. note that the inter-language links in the sidebar of wikipedia
>> articles already comprise a very interesting corpus of noun
>> translations. I don't think this dataset is currently exploited
>> fully.
>
> I couldn't agree more. I would ask to take a close look at CoSyne.
> I'm sure some of it can be reused:
> http://www.cosyne.eu/index.php/Main_Page
>
> Cheers,
> David
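On the ps about interlanguage links: harvesting them as a
noun-translation corpus is already possible through the standard
MediaWiki API (prop=langlinks). A minimal sketch, with continuation
handling and error checking omitted and a placeholder User-Agent:

    import requests

    def title_translations(title, lang="en"):
        """Return {language code: translated title} for one article."""
        resp = requests.get(
            "https://%s.wikipedia.org/w/api.php" % lang,
            params={
                "action": "query",
                "prop": "langlinks",
                "titles": title,
                "lllimit": "max",
                "format": "json",
            },
            # placeholder identifier, not a real tool name
            headers={"User-Agent": "parallel-corpus-sketch/0.1"},
        )
        page = next(iter(resp.json()["query"]["pages"].values()))
        return {ll["lang"]: ll["*"] for ll in page.get("langlinks", [])}

    print(title_translations("Cat"))
    # e.g. {'fi': 'Kissa', 'sv': 'Katt', ...}

Running this over a dump of article titles would already yield a large,
if noisy, multilingual lexicon of noun translations.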