Hi, Amir,

I have some experience with topic modeling, though this may not be a
direct answer.

The most widely adopted techniques for modeling the topics of documents
are LDA[1] and LSI[2].
Under these techniques, a document is viewed as a mixture of topics, while
a topic is a mixture of words.
Both methods have good implementations in various languages, for example
gensim[3] in Python.
But these methods are relatively expensive to compute.
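
To make the mixture view concrete, here is a minimal sketch in plain
Python. The topic names, words, and probabilities are made up for
illustration; in real LDA they are learned from a corpus (for example by
gensim), and this sketch only shows how the two mixtures combine.

```python
# Toy numbers, invented for illustration; real LDA learns them from a corpus.
# Each topic is a probability distribution over words.
topic_words = {
    "sports": {"match": 0.5, "team": 0.4, "vote": 0.1},
    "politics": {"match": 0.1, "team": 0.2, "vote": 0.7},
}

# Each document is a probability distribution over topics.
doc_topics = {"sports": 0.75, "politics": 0.25}


def word_probability(word, doc_topics, topic_words):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        weight * topic_words[topic].get(word, 0.0)
        for topic, weight in doc_topics.items()
    )


if __name__ == "__main__":
    for word in ("match", "team", "vote"):
        print(word, round(word_probability(word, doc_topics, topic_words), 3))
```

Because both layers are probability distributions, the word
probabilities for a document again sum to 1.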

Last year Google introduced a word vector model, word2vec[4].
By combining it with a topic catalog, we can easily decide which topic an
article belongs to.
The topic catalog is just a list of topics, where each topic is a list of
related words.
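
One possible way to combine word vectors with such a catalog is sketched
below. This is an assumption for illustration, not necessarily how
simbase or opentopics work: the tiny 3-dimensional vectors, the catalog
entries, and the helper names (mean_vector, cosine, classify) are all
hypothetical. An article is assigned to the topic whose averaged word
vector is closest, by cosine similarity, to the averaged vector of the
article's words.

```python
import math

# Hypothetical 3-dimensional word vectors; real ones would come from a
# trained word2vec model and have hundreds of dimensions.
WORD_VECTORS = {
    "born": [0.9, 0.1, 0.0],
    "died": [0.8, 0.2, 0.1],
    "career": [0.7, 0.1, 0.2],
    "river": [0.1, 0.9, 0.1],
    "city": [0.2, 0.8, 0.0],
    "population": [0.1, 0.7, 0.3],
}

# Hypothetical topic catalog: each topic is a list of related words.
CATALOG = {
    "biography": ["born", "died", "career"],
    "place": ["river", "city", "population"],
}


def mean_vector(words):
    """Average the vectors of all known words; None if no word is known."""
    vectors = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    if not vectors:
        return None
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(article_words):
    """Return the catalog topic most similar to the article, or None."""
    article_vec = mean_vector(article_words)
    if article_vec is None:
        return None
    scores = {}
    for topic, words in CATALOG.items():
        topic_vec = mean_vector(words)
        if topic_vec is not None:
            scores[topic] = cosine(article_vec, topic_vec)
    return max(scores, key=scores.get) if scores else None


if __name__ == "__main__":
    print(classify(["born", "career", "city"]))  # prints "biography"
```

No model is trained on the articles themselves here; only vector
lookups and averages are needed per article.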

We have released one open-source project in this direction:
* https://github.com/guokr/simbase

And we are planning another project for the topic catalog:
* https://github.com/guokr/opentopics

We will update the catalog in the coming weeks and give more details.

[1] https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
[2] https://en.wikipedia.org/wiki/Latent_semantic_indexing
[3] https://en.wikipedia.org/wiki/Gensim
[4] https://code.google.com/p/word2vec/






On Mon, Mar 17, 2014 at 11:21 PM, Amir E. Aharoni <
[email protected]> wrote:

> Hallo,
>
> Is there any known easy way to classify Wikipedia articles into a
> relatively small number of types?
>
> By "relatively small" I mean no more than twenty, and by "types" I mean
> things that are intuitively clear to readers, for example:
> * Biographies
> * Articles about scientific phenomena (can be sub-grouped to math,
> astronomy, physics, geology, medicine)
> * Articles about works of art (paintings, movies, books, records, statues)
> * Articles about places
> * Articles about historical events
> * Articles about biological species
> * Articles that mostly present data, such as demography or results of
> competitions (sports, elections, game shows)
>
> There are a few more, but not much. I hope that you get the idea.
>
> We have categories, but I'm not sure that it's easy to use categories for
> such things because of the very loose category structure. For example,
> [[Eurovision 2007]] is somewhere under [[Category:Humans]], even though
> it's not an article about a human.
>
> Such information can be useful for study about the types of articles that
> different people write. In particular, I thought about it in the context of
> analyzing the types of articles that people are translating now (manually)
> and will translate in the future using the ContentTranslation, which is in
> its early stages of development.
>
> Thanks,
>
> --
> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
> http://aharoni.wordpress.com
> ‪“We're living in pieces,
> I want to live in peace.” – T. Moore‬
>
> _______________________________________________
> Wiki-research-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>
>