On 13 February 2012 13:01, Gregor Trefs <gtr...@rumms.uni-mannheim.de> wrote:
> Hi DBPedia-Community,
>
> I'm currently writing my Master-Thesis in the field of DBPedia and SPARQL.
> One of my subgoals is to find out how many categories are present in both
> Wikipedia and DBPedia. Therefore, I wrote a little tool which identifies all
> categories having at least one resource in the unspecific mapping based part
> of DBPedia (If I refer to DBPedia in this mail, I usually mean this part of
> DBPedia not the whole one.). It searches the file
> mapping_based_properties_en.nt and looks whether or not the object and
> subject of each statement is linked to a category in the file
> article_categories_en.nt. If there is a link, the tool considers the
> corresponding category to be 'present' in DBPedia.
>
> On the other hand, the same tool searches the page_links_en.nt file to find
> all categories of Wikipedia. That is, all triples which relate a resource to
> a category or (if present at all) a category to any object. According to the
> description of the 'Page Links Extractor' it 'Extracts internal links
> between DBpedia instances from the internal pagelinks between Wikipedia
> articles.'. As Wikipedia pages normally link to their categories, I assumed
> that these links are also included and, thus, all categories in Wikipedia
> are captured.
>
Categories can also be added by templates.

> Unfortunately, this is only true for almost all categories. I found 127
> categories which are present in DBPedia but not in Wikipedia, compared to
> 59099 categories present in Wikipedia and not in DBPedia. This is strange,
> as the set of DBPedia categories must be a subset of Wikipedia categories.
> Otherwise, some magic added some new categories during extraction and I
> doubt that.

As Yury said, it's more likely that those articles have changed since the last extraction.

> I made sure, it was not my fault and had a look on the data. One
> of the suddenly appeared categories is
> http://dbpedia.org/resource/Category:Alaska_elections,_1996. On the
> DBPediasian side, there is a triple
> (<http://dbpedia.org/resource/United_States_Senate_election_in_Alaska,_1996>
> <http://purl.org/dc/terms/subject>
> <http://dbpedia.org/resource/Category:Alaska_elections,_1996> .) which
> relates this category to the United states Senate election in Alaska in
> 1996. The resource itself is subject of two statements in
> mapping_based_properties_en.nt. On the Wikipediasian side,

As an aside, I originally thought that you were talking about some Asia-specific version of Wikipedia, and now put it down to some sort of interlanguage effect. If it's the latter, adjectives formed from English nouns ending -a typically have the ending -an (Wikipedia -> Wikipedian), but it's generally preferable, especially with proper nouns, to just use the noun as a modifier ('On the Wikipedia side').

> I did not find
> any triple in page_links_en.nt which contained the category. But I did find
> the United states senate election in Alaska in 1996 resource. The
> corresponding Wikipedia page also includes a link to the category. It is
> present since page creation.

page_links is meant to capture _normal_ wiki links found in the body of the text; article_categories is specifically for categories.
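To make the distinction concrete, here is a rough Python sketch of how the 'present' categories could be collected from article_categories alone, without going through page_links at all. It is only an illustration of the idea, not the tool described above: the function names are made up, the regular expression handles only simple N-Triples lines, and the single mapping triple (including its property URI) is a hypothetical stand-in for real data.

```python
import re

SUBJECT = "http://purl.org/dc/terms/subject"
RES = "http://dbpedia.org/resource/United_States_Senate_election_in_Alaska,_1996"

# Simplified N-Triples pattern: IRI subject, IRI predicate, raw object term.
TRIPLE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+(.*?)\s*\.\s*$")


def triples(lines):
    """Yield (subject, predicate, raw_object) for every line that parses."""
    for line in lines:
        match = TRIPLE.match(line)
        if match:
            yield match.groups()


def iri(term):
    """Return the IRI inside <...>, or None if the term is a literal."""
    if term.startswith("<") and term.endswith(">"):
        return term[1:-1]
    return None


def categories_by_resource(category_lines):
    """Map each resource to its categories (dcterms:subject triples only)."""
    result = {}
    for subj, pred, obj in triples(category_lines):
        category = iri(obj)
        if pred == SUBJECT and category is not None:
            result.setdefault(subj, set()).add(category)
    return result


def present_categories(mapping_lines, by_resource):
    """Collect categories of every subject or IRI object in the mapping triples."""
    present = set()
    for subj, _pred, obj in triples(mapping_lines):
        present |= by_resource.get(subj, set())
        target = iri(obj)
        if target is not None:
            present |= by_resource.get(target, set())
    return present


# Category triples as they appear in article_categories_en.nt.
CATEGORY_LINES = [
    f"<{RES}> <{SUBJECT}> <http://dbpedia.org/resource/Category:United_States_Senate_elections,_1996> .",
    f"<{RES}> <{SUBJECT}> <http://dbpedia.org/resource/Category:United_States_Senate_elections_in_Alaska> .",
    f"<{RES}> <{SUBJECT}> <http://dbpedia.org/resource/Category:Alaska_elections,_1996> .",
]

# Hypothetical mapping-based triple; the property URI is invented for the example.
MAPPING_LINES = [
    f'<{RES}> <http://dbpedia.org/ontology/exampleProperty> "1996" .',
]

if __name__ == "__main__":
    by_resource = categories_by_resource(CATEGORY_LINES)
    for category in sorted(present_categories(MAPPING_LINES, by_resource)):
        print(category)
```

For the real dumps, the in-memory lists could be replaced by file handles, e.g. bz2.open('article_categories_en.nt.bz2', 'rt', encoding='utf-8') from the standard library, since both functions only iterate over lines.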
$ bzgrep 'http://dbpedia.org/resource/United_States_Senate_election_in_Alaska,_1996' article_categories_en.nt.bz2 | grep 'http://purl.org/dc/terms/subject'
<http://dbpedia.org/resource/United_States_Senate_election_in_Alaska,_1996> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:United_States_Senate_elections,_1996> .
<http://dbpedia.org/resource/United_States_Senate_election_in_Alaska,_1996> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:United_States_Senate_elections_in_Alaska> .
<http://dbpedia.org/resource/United_States_Senate_election_in_Alaska,_1996> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:Alaska_elections,_1996> .

I believe these are the triples you're looking for.

If you find yourself wondering if you're looking in the right file, bear in mind that you can always use the website:

$ curl http://dbpedia.org/data/United_States_Senate_election_in_Alaska,_1996.ntriples | grep 'http://purl.org/dc/terms/subject'
<http://dbpedia.org/resource/United_States_Senate_election_in_Alaska,_1996> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:United_States_Senate_elections_in_Alaska> .
<http://dbpedia.org/resource/United_States_Senate_election_in_Alaska,_1996> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:Alaska_elections,_1996> .
<http://dbpedia.org/resource/United_States_Senate_election_in_Alaska,_1996> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:United_States_Senate_elections,_1996> .

--
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you

_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion