In my opinion, the only thing that is going to work in the short term is a
guided rule-based system. We need that to be able to reuse values from
Wikidata in running text. That is, a template text must be transformed
according to gender, plurality, etc., but the values must also be adjusted
to genitive, locative, illative, etc., forms.

https://meta.wikimedia.org/wiki/Wikimedia_Fellowships/Project_Ideas/Tools_for_text_synthesis
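
As a rough illustration of the kind of rule-guided realization I have in
mind (the case table and function names below are made up for the sketch,
not an existing library):

    # Minimal sketch: a template names the grammatical case it needs, and the
    # Wikidata value is inflected to that case before insertion.  The case
    # table is a hypothetical stand-in for real per-language inflection rules.

    CASE_FORMS = {
        # lemma -> {case: inflected form}, Finnish-style example
        "Helsinki": {"nominative": "Helsinki",
                     "genitive": "Helsingin",
                     "illative": "Helsinkiin"},
    }

    def inflect(lemma, case):
        """Return the requested case form, falling back to the bare lemma."""
        return CASE_FORMS.get(lemma, {}).get(case, lemma)

    def realize(template, values):
        """Fill slots of the form {label:case} with inflected values."""
        out = template
        for label, lemma in values.items():
            for case in ("nominative", "genitive", "illative"):
                out = out.replace("{%s:%s}" % (label, case),
                                  inflect(lemma, case))
        return out

    print(realize("Hän muutti {city:illative}.", {"city": "Helsinki"}))
    # -> Hän muutti Helsinkiin.

Gender and plural agreement would need similar per-language rules; this only
shows the case-inflection part.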

On Sat, Jul 27, 2013 at 5:40 PM, David Cuenca <dacu...@gmail.com> wrote:
> On Sat, Jul 27, 2013 at 10:39 AM, C. Scott Ananian
> <canan...@wikimedia.org>wrote:
>
>> My main point was just that there is a chicken-and-egg problem here.  You
>> assume that machine translation can't work because we don't have enough
>> parallel texts.  But, to the extent that machine-aided translation of WP is
>> successful, it creates a large amount of parallel text.   I agree that
>> there are challenges.  I simply disagree, as a matter of logic, with the
>> blanket dismissal of the chickens because there aren't yet any eggs.
>>
>
> I think we both agree about the need for and usefulness of having a copious
> amount of parallel text. The main difficulty is how to get there from
> scratch. As I see it, there are several possible paths:
> - volunteers create the corpus manually (some work has been done, though it
> is not properly tagged)
> - a statistical approach creates the base text, and volunteers only improve
> that text
> - rules and statistics create the base text, and volunteers improve the text
> and optionally the rules
>
> The end result of all options is the creation of a parallel corpus that can
> be reused for statistical translation. In my opinion, giving users the
> option to improve/select the rules is far more effective than having them
> improve the text only. It complements statistical analysis rather than
> replacing it, and it provides a good starting point to solve the
> chicken-and-egg conundrum, especially in small Wikipedias.
>
> Currently translatewiki relies on external tools over which we don't have
> much control, besides their being proprietary and at risk of being disabled
> at any time.
>
> I think you're attributing the faults of a single implementation/UX to the
>> technique as a whole.  (Which is why I felt that "step 1" should be to
>> create better tools for maintaining information about parallel structures
>> in the wikidata.)
>>
>
> Good call. Now that you mention it, yes, it would be great to have a place
> to keep a parallel corpus, and it would be even more useful if it could be
> annotated with wikidata-wiktionary senses. A wikibase repo might be the
> way to go. No idea whether Wikidata or Translatewiki is the right place to
> store/display it. Maybe Wikimania will be a good time to discuss it. I have
> added it to the "elements" section.
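
To make that concrete, a sentence pair in such a corpus could carry its
alignment and sense annotations roughly like this (the field names are
invented for the sketch, not an existing Wikibase schema; only the Q-ids
are real Wikidata items):

    # Hypothetical record for one aligned sentence pair, with selected tokens
    # annotated by Wikidata items (a Wiktionary sense id could go in the same
    # slot once such ids exist).  Sketch only, not an existing schema.
    segment = {
        "source": {"lang": "en", "text": "Helsinki is the capital of Finland."},
        "target": {"lang": "fi", "text": "Helsinki on Suomen pääkaupunki."},
        "alignments": [
            # (source token index, target token index)
            (0, 0),  # Helsinki <-> Helsinki
            (3, 3),  # capital  <-> pääkaupunki
            (5, 2),  # Finland  <-> Suomen
        ],
        "senses": {
            # source token index -> Wikidata item
            0: "Q1757",  # Helsinki
            5: "Q33",    # Finland
        },
    }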
>
>
>>
>> In a world with an active Moore's law, WP *does* have the computing power
>> to approximate this effort.  Again, the beauty of the statistical approach
>> is that it scales.
>>
>
> My main concern about statistics-based machine translation is that it needs
> volume to be effective, hence the proposal to use rule-based translation to
> reach the critical point faster than relying on statistics over the existing
> text alone.
>
>
>>
>> I'm sure we can agree to disagree here.  Probably our main differences are
>> in answers to the question, "where should we start work"?  I think
>> annotating parallel texts is the most interesting research question
>> ("research" because I agree that wiki editing by volunteers makes the UX
>> problem nontrivial).  I think your suggestion is to start work on the
>> "semantic multilingual dictionary"?
>>
>
> It is quite possible to have multiple developments in parallel. That a
> semantic dictionary is in development doesn't hinder the creation of a
> parallel corpus or an interface for annotating it. The same applies to
> statistics and rules: they are not incompatible; in fact, they complement
> each other pretty well.
>
>
>> ps. note that the inter-language links in the sidebar of wikipedia articles
>> already comprise a very interesting corpus of noun translations.  I don't
>> think this dataset is currently exploited fully.
>>
>
> I couldn't agree more. I would suggest taking a close look at CoSyne. I'm
> sure some of it can be reused:
> http://www.cosyne.eu/index.php/Main_Page
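
On the interlanguage-link point above, harvesting that dataset is already
straightforward with the standard MediaWiki API (prop=langlinks); a minimal
sketch, with continuation and error handling omitted:

    # Fetch the interlanguage links of one article as title translations.
    import json
    import urllib.parse
    import urllib.request

    def title_translations(title, site="en.wikipedia.org"):
        params = urllib.parse.urlencode({
            "action": "query",
            "titles": title,
            "prop": "langlinks",
            "lllimit": "max",
            "format": "json",
        })
        req = urllib.request.Request(
            "https://%s/w/api.php?%s" % (site, params),
            headers={"User-Agent": "parallel-corpus-sketch/0.1"})
        with urllib.request.urlopen(req) as resp:
            data = json.loads(resp.read().decode("utf-8"))
        pairs = {}
        for page in data["query"]["pages"].values():
            for link in page.get("langlinks", []):
                pairs[link["lang"]] = link["*"]
        return pairs

    print(title_translations("Helsinki"))
    # e.g. {'fi': 'Helsinki', 'ru': 'Хельсинки', ...}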
>
> Cheers,
> David

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
