Since I have been thinking about how to do this for some time, I have written some developers' notes at http://meta.wikimedia.org/wiki/Wikidata/Notes/Article_generation, so feel free to comment if anything is unclear or undesirable.

On 26/04/13 14:10, Jane Darnell wrote:
Well, I am going to come out of the closet here and admit that I for
one will sometimes want to read that machine-generated text over the
human-written English one. Sometimes to uncover the real little gems
of Wikipedia, you need to have a lot of patience with Google Translate
options.

2013/4/26, Delirium <delir...@hackish.org>:
This is a very interesting proposal. I think how well it will work may
vary considerably based on the language.

The strongest case in favor of machine-generating stubs, imo, is in
languages where there are many monolingual speakers and the Wikipedia is
already quite large and active. In that case, machine-generated stubs
can help promote expansion into not-yet-covered areas, plus provide
monolingual speakers with information they would otherwise either not
get, or have to get in worse form via a machine-translated article.

At the other end of the spectrum you have quite small Wikipedias, and
Wikipedias which are both small and read/written mostly/entirely by
bilingual readers. In these Wikipedias, article-writing tends to focus
on things more specifically relevant to a certain culture and history.
Suddenly creating tens or hundreds of thousands of stubs in such
languages might serve to dilute a small Wikipedia more than strengthen
it: if you take a Wikipedia with 10,000 articles, and it gains 500,000
machine-generated stubs, *almost every* article that comes up in search
engines will be machine-generated, making it much less obvious what
parts of the encyclopedia are actually active and human-written amidst
the sea of auto-generated content.

Plus, from a reader's perspective, it may not even improve the
availability of information. For example, I doubt there are many
speakers of Bavarian who would prefer to read a machine-generated
bar.wiki article, over a human-written de.wiki article. That may even be
true for some less-related languages: most Danes I know would prefer a
human-written English article over a machine-generated Danish one.

-Mark


On 4/25/13 8:16 PM, Erik Moeller wrote:
Millions of Wikidata stubs invade small Wikipedias .. Volapük
Wikipedia now best curated source on asteroids .. new editors flood
small wikis .. Google spokesperson: "This is out of control. We will
shut it down."

Denny suggested:

II) develop a feature that blends into Wikipedia's search if an article
about a topic does not exist yet, but we have data on Wikidata about
that topic

Andrew Gray responded:

I think this would be amazing. A software hook that says "we know X
article does not exist yet, but it is matched to Y topic on Wikidata"
and pulls out core information, along with a set of localised
descriptions... we gain all the benefit of having stub articles
(scope, coverage) without the problems of a small community having to
curate a million pages. It's not the same as hand-written content, but
it's immeasurably better than no content, or even an attempt at
machine-translating free text.

XXX is [a species of: fish] [in the: Y family]. It [is found in: Laos,
Vietnam]. It [grows to: 20 cm]. (pictures)

This seems very doable. Is it desirable?
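
To make that a bit more concrete: a pattern like the one above could be
written down declaratively, for instance as a small Lua table that a
rendering module (sketched further below) would consume. Everything in
the following sketch is an illustrative assumption: the property IDs,
field names and placeholder syntax are not an existing schema.

    -- Hypothetical, declarative description of the fish-stub pattern quoted
    -- above: each entry maps a sentence fragment to the Wikidata property
    -- whose value fills the %s placeholder. $label stands for the item's own
    -- label. Property IDs are illustrative, not a vetted mapping.
    local fishStub = {
        { property = 'P171',  fragment = '$label is a species of fish in the %s family.' },
        { property = 'P9000', fragment = 'It is found in %s.' },  -- hypothetical "found in" property
        { property = 'P2048', fragment = 'It grows to %s.' },     -- assumed "length" property
    }

    return fishStub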

For many languages, it would allow hundreds of thousands of
pseudo-stubs (not real articles stored in the DB, but generated from
Wikidata) to be served to readers and crawlers that would otherwise
not exist in that language.

Looking back 10 years, User:Ram-Man was one of the first to generate
thousands of en.wp articles from, in this case, US census data. It was
controversial at the time, but it stuck. Other Wikipedias have since
then either allowed or prohibited bot-creation of articles on a
project-by-project basis. It tends to lead to frustration when folks
compare article counts and see artificial inflation by bot-created
content.

Does anyone know if the impact of bot-creation on (new) editor
behavior has been studied? I do know that many of the Rambot articles
were expanded over time, and I suspect many wouldn't have been if they
hadn't turned up in search engines in the first place. On the flip
side, a large "surface area" of content being indexed by search
engines will likely also attract a fair bit of drive-by vandalism that
may not be detected because those pages aren't watched.

A model like the proposed one might offer a solution to a lot of these
challenges. How I imagine it could work:

* Templates could be defined for different Wikidata entity types. We
could make it possible to let users add links from items in Wikidata to
Wikipedia articles that don't exist yet. (Currently this is
prohibited.) If such a link is added, _and_ a relevant template is
defined for the Wikidata entity type (perhaps through an entity
type->template mapping), WP would render an article using that
template, pulling structured info from Wikidata.

* A lot of the grammatical rules would be defined in the template
using checks against the Wikidata result. Depending on the complexity
of grammatical variations beyond basics such as singular/plural, this
might require Lua scripting; a rough sketch of what such a module could
look like follows this list.

* The article is served as a normal HTTP 200 result, cached, and
indexed by search engines. In WP itself, links to the article might
have some special affordance that suggests that they're neither
ordinary red links nor existing articles.

* When a user tries to edit the article, wikitext (or visual edit
mode) is generated, allowing the user to expand or add to the
automatically generated prose and headings. Such edits are tagged so
they can more easily be monitored (they could also be gated by default
if the vandalism rate is too high).

* We'd need to decide whether we want these pages to show up in
searches on WP itself.
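
As referenced in the template and grammar points above, here is a
minimal sketch of what such a rendering module could look like as a Lua
(Scribunto) module. It is a sketch under assumptions, not working code:
the mw.wikibase calls assume the current Wikibase client Lua interface,
the property IDs are illustrative, and real grammatical handling would
need far more than the simple singular/plural check shown.

    -- Sketch of a Scribunto module that renders a one-sentence-per-fact stub
    -- for a fish species from Wikidata. ASSUMPTIONS: mw.wikibase.getEntity(),
    -- entity:getLabel() and entity:formatPropertyValues() behave as in the
    -- current Wikibase Lua client; property IDs below are illustrative only.
    local p = {}

    -- Tiny grammar helper: phrase a list of countries, switching wording
    -- between one value and several. Real languages need much richer rules.
    local function foundIn( values )
        if #values == 1 then
            return 'It is found in ' .. values[1] .. '.'
        end
        return 'It is found in ' .. table.concat( values, ', ', 1, #values - 1 )
            .. ' and ' .. values[#values] .. '.'
    end

    function p.fishStub( frame )
        local entity = mw.wikibase.getEntity()   -- item linked to the current page
        if not entity then
            return ''
        end

        local label = entity:getLabel() or mw.title.getCurrentTitle().text
        local parts = {}

        -- "XXX is a species of fish in the Y family."
        local family = entity:formatPropertyValues( 'P171' ).value  -- parent taxon (assumed)
        if family and family ~= '' then
            table.insert( parts, label .. ' is a species of fish in the ' .. family .. ' family.' )
        end

        -- "It is found in Laos and Vietnam." formatPropertyValues() joins
        -- multiple values with commas, so split them again for phrasing.
        local countries = entity:formatPropertyValues( 'P9000' ).value  -- hypothetical "found in" property
        if countries and countries ~= '' then
            table.insert( parts, foundIn( mw.text.split( countries, ',%s*' ) ) )
        end

        return table.concat( parts, ' ' )
    end

    return p

The entity type->template mapping from the first bullet would then just
decide which such module a given item is rendered with.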

Advantages:

* These pages wouldn't inflate page counts, but they would offer
useful information to readers and be higher quality than machine
translation.

* They could serve as powerful lures for new editors in languages that
are currently underrepresented on the web.

Disadvantages/concerns:

* Depending on implementation, I continue to have some concern about
{{#property}} references ending up in article text (as opposed to
templates); these concerns are consistent with the ones expressed in
the en.wp RFC [1]. This might be mitigated if Visual Editor offers a
super-intuitive in-place editing method. {{#property}} references in
text could also be converted to their plain-text representation the
moment a page is edited by a human being (which would have its own set
of challenges, of course); a rough sketch of such a conversion follows
this list.

* How massive would these sets of auto-generated articles get? I
suspect the technical complexity of setting up the templates and
adding the links in Wikidata itself would act as a bit of a barrier to
entry. But vast pseudo-article sets in tiny languages could pose
operational challenges without adding a lot of value.

* Would search engines penalize WP for such auto-generated content?
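
Coming back to the {{#property}} point above: one way to picture the
"convert to plain text when a human edits the page" idea is a naive
substitution pass over the wikitext. The sketch below only handles the
bare {{#property:Pnnn}} form and assumes a lookup function that
resolves a property ID to rendered text; a real implementation would
also have to cope with parameters, templates and the usual edge cases.

    -- Naive sketch: replace bare {{#property:Pnnn}} calls in a wikitext string
    -- with plain text. `lookup` is a stand-in for whatever would actually
    -- resolve a property ID to its rendered value for the page's item.
    local function substituteProperties( wikitext, lookup )
        return ( wikitext:gsub( '{{#property:(P%d+)}}', function ( propertyId )
            -- Keep the original reference if the value can't be resolved.
            return lookup( propertyId ) or ( '{{#property:' .. propertyId .. '}}' )
        end ) )
    end

    -- Example with a hard-coded table standing in for Wikidata:
    local values = { P2048 = '20 cm' }
    print( substituteProperties( 'It grows to {{#property:P2048}}.',
        function ( id ) return values[id] end ) )
    -- prints: It grows to 20 cm.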

Overall, I think it's an area where experimentation is merited, as it
could not only expand information in languages that are
underrepresented on the web, but also act as a force multiplier for
new editor entrypoints. It also seems that a proof-of-concept for
experimentation in a limited context should be very doable.

Erik

[1]
https://en.wikipedia.org/wiki/Wikipedia:Requests_for_comment/Wikidata_Phase_2#Use_of_Wikidata_in_article_text

