Hi Leila,

I did something similar before. I was trying to assign "top-level" category labels, such as history, society, and technology, to articles. I parsed the wikitext in the dump data to extract all the subcategory labels of each article. Then, by parsing the pages in namespace 14, I built a category-relation graph over all the category labels in which, ideally, every subcategory can reach some "top-level" category. For each article, you can then look up its subcategory labels in the graph and follow the parent links to find the top-level categories they reach. More detail can be found in the subsection "3.3.2 Independent Variables - Identity-based Attachment" of the paper.

Hope it helps!
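
In case a concrete sketch is useful, the lookup step could look roughly like the Python below. This is only an illustration, not the code from the paper: the function name, the toy edges, the choice of top-level labels, and the depth cap are all assumptions made for the example. The `seen` set is there because the category graph on en.wiki is not guaranteed to be acyclic.

```python
from collections import deque


def top_level_categories(article_cats, parents, top_level, max_depth=10):
    """Return the top-level categories reachable from an article's categories.

    article_cats: category titles attached to one article
    parents:      dict mapping a category title to its parent category titles
    top_level:    set of category titles treated as "top-level" labels
    max_depth:    hypothetical cap on how far up the graph to climb
    """
    found = set()
    seen = set(article_cats)  # guards against cycles in the category graph
    queue = deque((cat, 0) for cat in article_cats)
    while queue:
        cat, depth = queue.popleft()
        if cat in top_level:
            found.add(cat)
            continue  # stop climbing once a top-level label is reached
        if depth >= max_depth:
            continue
        for parent in parents.get(cat, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return found


# Toy edges (child -> parents); illustrative only, not taken from the dump.
parents = {
    "Flow_cytometry": ["Cell_biology"],
    "Bioinformatics": ["Computational_biology"],
    "Cell_biology": ["Biology"],
    "Computational_biology": ["Biology", "Computer_science"],
    "Biology": ["Science"],
    "Computer_science": ["Technology"],
}
top_level = {"Science", "Technology"}

labels = top_level_categories(["Flow_cytometry", "Bioinformatics"], parents, top_level)
print(sorted(labels))
```

With real data, `parents` would be built from the namespace-14 edges, and maintenance categories would simply never reach a topical top-level label, so they contribute nothing to the result.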

On Mon, Jul 10, 2017 at 8:45 PM, Stuart A. Yeates <syea...@gmail.com> wrote:

> The category system on en.wiki is not an IS-A system and there have been
> several discussions about making it based on mathematical principles
> which have come to nothing because the consensus of editors is against it.
> The best way to think about categories is as a locally-faceted related
> links system.
>
> Having said that, Category:Wikipedia maintenance is an important root
> probably useful for separating the wheat from the chaff. Most of these are
> also hidden categories. I'm not sure whether this flag appears in the SQL,
> but see
> https://en.wikipedia.org/wiki/Wikipedia:Categorization#Hiding_categories
>
> cheers
> stuart
>
> --
> ...let us be heard from red core to black sky
>
> On 11 July 2017 at 13:20, Leila Zia <le...@wikimedia.org> wrote:
>
> > Hi all,
> >
> > [If you are not interested in discussions related to the category system
> > (on English Wikipedia), you can stop here. :)]
> >
> > We have run into a problem that some of you may have thought about or
> > addressed before. We are trying to clean up the category system on English
> > Wikipedia by turning the category structure into an IS-A hierarchy. (The
> > output of this work can be useful for the research on template
> > recommendation [1], for example, but the use-cases won't stop there). One
> > issue that we are facing is the following:
> >
> > We are currently using SQL dumps to extract categories associated with
> > every article on English Wikipedia (main namespace). [2]
> > Using this approach, we get 5 categories associated with the Flow
> > cytometry bioinformatics article [3]:
> >
> > Flow_cytometry
> > Bioinformatics
> > Wikipedia_articles_published_in_peer-reviewed_literature
> > Wikipedia_articles_published_in_PLOS_Computational_Biology
> > CS1_maint:_Multiple_names:_authors_list
> >
> > The problem is that only the first two categories are the ones we are
> > interested in. We have one cleaning step through which we only keep
> > categories that belong to category Article and that step removes the last
> > category above, but the other two Wikipedia_... remain there. We need to
> > somehow prune the data and remove those two categories from it.
> >
> > One way we could do the above would be to parse wikitext instead of the
> > SQL dumps and focus on extracting categories marked by pattern
> > [[Category:XX]], but in that case, we would lose a good category such as
> > Guided_missiles_of_Norway because that's generated by a template.
> >
> > Any ideas on how we can start with a "cleaner" dataset of categories
> > related to the topic of the articles as opposed to maintenance related or
> > other types of categories?
> >
> > Thanks,
> > Leila
> >
> > [1] https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_stubs_across_languages
> >
> > [2] The exact code we use is
> >
> > SELECT p.page_id id, p.page_title title, cl.cl_to category
> > FROM categorylinks cl
> > JOIN page p
> > ON cl.cl_from = p.page_id
> > WHERE cl_type = 'page'
> > AND page_namespace = 0
> > AND page_is_redirect = 0
> >
> > and the edges of the category graph are extracted with
> >
> > SELECT p.page_title category, cl.cl_to parent
> > FROM categorylinks cl
> > JOIN page p
> > ON p.page_id = cl.cl_from
> > WHERE p.page_namespace = 14
> >
> > [3] https://en.wikipedia.org/wiki/Flow_cytometry_bioinformatics
> > _______________________________________________
> > Wiki-research-l mailing list
> > Wiki-research-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l