Re: [Wiki-research-l] category extraction question

Bowen Yu Mon, 10 Jul 2017 18:53:19 -0700

Hi Leila,

I did something similar before. I was trying to assign "top-level" category
labels (history, society, technology, etc.) to articles. I parsed the
wikitext in the dump data to extract all the subcategory labels of each
article. Then, by parsing the pages in namespace 14, I built a
category-relation graph over all the category labels, in which, ideally,
each subcategory can reach some "top-level" category. For each article, you
can then look up its subcategory labels in the graph and follow the parent
links to find the top-level categories they reach. More detail can be found
in the "3.3.2 Independent Variables - Identity-based Attachment" subsection
of the paper. Hope it helps!
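
The lookup step described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used for the paper: the function name top_level_labels, the choice of top-level set, and the toy edges are all hypothetical, and the (category, parent) pairs are assumed to come from something like the namespace-14 query quoted further down. It walks parent links breadth-first and tracks visited nodes, since the category graph is not guaranteed to be acyclic.

```python
from collections import deque


def top_level_labels(article_categories, parent_edges, top_level):
    """Return the top-level categories reachable from an article's categories.

    article_categories: category titles attached to one article
    parent_edges: iterable of (category, parent) pairs
    top_level: set of titles chosen as top-level labels (an assumption)
    """
    # Map each category to the set of its direct parents.
    parents = {}
    for child, parent in parent_edges:
        parents.setdefault(child, set()).add(parent)

    found = set()
    seen = set(article_categories)
    queue = deque(seen)
    while queue:
        cat = queue.popleft()
        if cat in top_level:
            found.add(cat)
            continue  # stop climbing once a top-level label is reached
        for parent in parents.get(cat, ()):
            if parent not in seen:  # guard against cycles in the graph
                seen.add(parent)
                queue.append(parent)
    return found


# Invented toy edges for illustration only, not real Wikipedia data.
toy_edges = [
    ("Flow_cytometry", "Cell_biology"),
    ("Cell_biology", "Biology"),
    ("Biology", "Science"),
    ("Biology", "Cell_biology"),  # artificial cycle
    ("Bioinformatics", "Computational_science"),
    ("Computational_science", "Science"),
    ("Computational_science", "Technology"),
]
toy_top_level = {"Science", "Technology"}

labels = top_level_labels(["Flow_cytometry", "Bioinformatics"],
                          toy_edges, toy_top_level)
print(sorted(labels))
```

A category that cannot reach any top-level label simply yields an empty set, which may be a useful signal for spotting maintenance-only branches.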


On Mon, Jul 10, 2017 at 8:45 PM, Stuart A. Yeates <syea...@gmail.com> wrote:

> The category system on en.wiki is not an IS-A system, and there have been
> several discussions about basing it on mathematical principles, which
> have come to nothing because the consensus of editors is against it.
> The best way to think about categories is as a locally-faceted related
> links system.
>
> Having said that, Category:Wikipedia maintenance is an important root that
> is probably useful for separating the wheat from the chaff. Most of the
> categories under it are also hidden categories. I'm not sure whether this
> flag appears in the SQL, but see
> https://en.wikipedia.org/wiki/Wikipedia:Categorization#Hiding_categories
>
> cheers
> stuart
>
> --
> ...let us be heard from red core to black sky
>
> On 11 July 2017 at 13:20, Leila Zia <le...@wikimedia.org> wrote:
>
> > Hi all,
> >
> > [If you are not interested in discussions related to the category system
> > (on English Wikipedia), you can stop here. :)]
> >
> > We have run into a problem that some of you may have thought about or
> > addressed before. We are trying to clean up the category system on English
> > Wikipedia by turning the category structure into an IS-A hierarchy. (The
> > output of this work can be useful for the research on template
> > recommendation [1], for example, but the use cases won't stop there.) One
> > issue that we are facing is the following:
> >
> > We are currently using SQL dumps to extract the categories associated with
> > every article on English Wikipedia (main namespace). [2] Using this
> > approach, we get 5 categories associated with the Flow cytometry
> > bioinformatics article [3]:
> >
> > Flow_cytometry
> > Bioinformatics
> >
> > Wikipedia_articles_published_in_peer-reviewed_literature
> > Wikipedia_articles_published_in_PLOS_Computational_Biology
> > CS1_maint:_Multiple_names:_authors_list
> >
> > The problem is that only the first two categories are the ones we are
> > interested in. We have one cleaning step in which we keep only the
> > categories that belong to category Article; that step removes the last
> > category above, but the other two Wikipedia_... categories remain. We
> > need to somehow prune those two categories from the data.
> >
> > One way we could do the above would be to parse wikitext instead of the SQL
> > dumps and focus on extracting categories marked by the pattern
> > [[Category:XX]], but in that case we would lose a good category such as
> > Guided_missiles_of_Norway because that's generated by a template.
> >
> > Any ideas on how we can start with a "cleaner" dataset of categories
> > related to the topic of the articles as opposed to maintenance related or
> > other types of categories?
> >
> > Thanks,
> > Leila
> >
> > [1] https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_stubs_across_languages
> >
> > [2] The exact code we use is
> >
> > SELECT p.page_id id, p.page_title title, cl.cl_to category
> > FROM categorylinks cl
> > JOIN page p
> > ON cl.cl_from = p.page_id
> > WHERE cl.cl_type = 'page'
> > AND p.page_namespace = 0
> > AND p.page_is_redirect = 0
> >
> > and the edges of the category graph are extracted with
> >
> > SELECT p.page_title category, cl.cl_to parent
> > FROM categorylinks cl
> > JOIN page p
> > ON p.page_id = cl.cl_from
> > WHERE p.page_namespace = 14
> >
> >
> > [3] https://en.wikipedia.org/wiki/Flow_cytometry_bioinformatics
> > _______________________________________________
> > Wiki-research-l mailing list
> > Wiki-research-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
> >
