Hi Amir, I have some experience with topic modeling, though this may not be a direct answer.
The most widely adopted techniques for modeling the topics of documents are LDA[1] and LSI[2]. Under these techniques, a document is viewed as a mixture of topics, and a topic as a mixture of words. Both methods have good implementations in several languages, for example gensim[3] in Python. But these methods are relatively expensive.

Last year Google introduced a word vector model, word2vec[4]. By combining word vectors with a topic catalog, we can easily decide which topic an article belongs to. The topic catalog is just a list of topics, and each topic is a list of related words.

We have released one open-source project in this direction:

* https://github.com/guokr/simbase

And another project, still in the planning stage, for the topic catalog:

* https://github.com/guokr/opentopics

We will update the catalog in the coming weeks and give more details.

[1] https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
[2] https://en.wikipedia.org/wiki/Latent_semantic_indexing
[3] https://en.wikipedia.org/wiki/Gensim
[4] https://code.google.com/p/word2vec/

On Mon, Mar 17, 2014 at 11:21 PM, Amir E. Aharoni <[email protected]> wrote:

> Hallo,
>
> Is there any known easy way to classify Wikipedia articles into a
> relatively small number of types?
>
> By "relatively small" I mean no more than twenty, and by "types" I mean
> things that are intuitively clear to readers, for example:
> * Biographies
> * Articles about scientific phenomena (can be sub-grouped to math,
> astronomy, physics, geology, medicine)
> * Articles about works of art (paintings, movies, books, records, statues)
> * Articles about places
> * Articles about historical events
> * Articles about biological species
> * Articles that mostly present data, such as demography or results of
> competitions (sports, elections, game shows)
>
> There are a few more, but not much. I hope that you get the idea.
>
> We have categories, but I'm not sure that it's easy to use categories for
> such things because of the very loose category structure. For example,
> [[Eurovision 2007]] is somewhere under [[Category:Humans]], even though
> it's not an article about a human.
>
> Such information can be useful for study about the types of articles that
> different people write. In particular, I thought about it in the context of
> analyzing the types of articles that people are translating now (manually)
> and will translate in the future using the ContentTranslation, which is in
> its early stages of development.
>
> Thanks,
>
> --
> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
> http://aharoni.wordpress.com
> “We're living in pieces,
> I want to live in peace.” – T. Moore
>
> _______________________________________________
> Wiki-research-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>
>
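
P.S. A rough sketch of the catalog idea from the reply above, in case it helps. This is not how simbase or opentopics work internally; it only illustrates "average the word vectors of an article and pick the closest topic". The three-dimensional vectors, the catalog entries, and the function names (mean_vector, cosine, classify) are all made up for the example; real vectors would come from a trained word2vec model.

```python
import math
import re

# Toy 3-dimensional word vectors, invented for illustration only.
# In practice these would be loaded from a trained word2vec model.
WORD_VECTORS = {
    "born":    [0.9, 0.1, 0.0],
    "died":    [0.8, 0.2, 0.1],
    "career":  [0.7, 0.1, 0.2],
    "river":   [0.1, 0.9, 0.1],
    "city":    [0.2, 0.8, 0.0],
    "region":  [0.1, 0.7, 0.2],
    "species": [0.0, 0.1, 0.9],
    "genus":   [0.1, 0.0, 0.8],
    "habitat": [0.1, 0.3, 0.8],
}

# Hypothetical topic catalog: each topic is a list of related words.
TOPIC_CATALOG = {
    "biography": ["born", "died", "career"],
    "place": ["river", "city", "region"],
    "species": ["species", "genus", "habitat"],
}


def mean_vector(words):
    """Average the vectors of the known words; None if no word is known."""
    vectors = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    if not vectors:
        return None
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(text):
    """Return the catalog topic closest to the text, or None if no word is known."""
    article_vec = mean_vector(re.findall(r"[a-z]+", text.lower()))
    if article_vec is None:
        return None
    scores = {
        topic: cosine(article_vec, mean_vector(words))
        for topic, words in TOPIC_CATALOG.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(classify("She was born in 1900 and died in 1980."))
    print(classify("The river flows through the city."))
```

With a real model the only parts that change are where the vectors come from and how large the catalog is; the per-article cost is one average and one similarity per topic, which is why this is cheaper than fitting LDA or LSI.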
