Hey everyone,
I’m trying to set up the DBpedia extraction framework, as I’m interested in 
extracting structured data from already-downloaded Wikipedia dumps. As I 
understand it, I need to work in the ‘dump’ directory of the codebase. I have 
tried to reverse-engineer the code (Scala is new to me), but I need some help.


 1.  First of all, is there more detailed documentation somewhere about 
setting up and running the pipeline? The documentation available on dbpedia.org 
seems insufficient.
 2.  I understand that I first need to create a config.properties file where I 
set the input/output locations, the list of extractors, and the languages. I 
tried working from the config.properties.default provided with the code. There 
seems to be a typo in the extractor list: 
‘org.dbpedia.extraction.mappings.InterLanguageLinksExtractorExtractor’ gives a 
‘class not found’ error, so I changed it to 
‘org.dbpedia.extraction.mappings.InterLanguageLinksExtractor’. Is that OK? (My 
current config is pasted after this list.)
 3.  I can’t find documentation on how to set up the input directory. Can 
someone fill in the details? From what I gather, the input directory should 
contain a ‘commons’ directory plus one directory per language listed in 
config.properties. Each of these directories must have a subdirectory named in 
YYYYMMDD format, and inside that go the XML files, e.g. 
enwiki-20111111-pages-articles.xml. Am I right? Does the framework require any 
particular dump of Wikipedia? Also, what goes in the commons branch? (My 
current layout is sketched after this list.)
 4.  I ran the framework after copying a sample dump, 
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2,
 into both the en and commons branches, unzipping it and renaming it as 
described above (the exact commands are after this list). For now I’m working 
with the en language only. It works with the default 19 extractors, but starts 
failing once I include AbstractExtractor. It seems AbstractExtractor needs a 
running MediaWiki instance to parse MediaWiki syntax; the source file itself 
says a “DBpedia-customized MediaWiki instance is required.” Can someone shed 
some more light on this? What customization is required, and where can I get 
such an instance?
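
For reference, here is the config I am currently running with (re: point 2). I 
adapted it from config.properties.default; the key names and paths below are my 
best reading of that file, so please correct me if I have any of them wrong:

    # adapted from config.properties.default; key names and local paths
    # are my best guess -- corrections welcome
    dumpDir=/data/dbpedia/dumps
    outputDir=/data/dbpedia/output
    languages=en
    # default extractor list with the apparent typo fixed; 17 of the 19
    # default entries are elided here for brevity
    extractors=org.dbpedia.extraction.mappings.InterLanguageLinksExtractor,\
               org.dbpedia.extraction.mappings.LabelExtractor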
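
Re: point 3, this is the input directory layout I ended up with, based on my 
reading above. The commonswiki file name is a guess on my part (it is just a 
renamed copy of the en sample):

    /data/dbpedia/dumps/
        commons/
            20111111/
                commonswiki-20111111-pages-articles.xml
        en/
            20111111/
                enwiki-20111111-pages-articles.xml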
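
And re: point 4, these are the exact steps I used to fetch the sample dump and 
place it in both branches (target paths match the guessed layout above):

    wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
    bunzip2 enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
    mkdir -p /data/dbpedia/dumps/en/20111111
    mkdir -p /data/dbpedia/dumps/commons/20111111
    cp enwiki-latest-pages-articles1.xml-p000000010p000010000 \
       /data/dbpedia/dumps/en/20111111/enwiki-20111111-pages-articles.xml
    cp enwiki-latest-pages-articles1.xml-p000000010p000010000 \
       /data/dbpedia/dumps/commons/20111111/commonswiki-20111111-pages-articles.xml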


Sorry if these questions are too basic or already answered somewhere; I have 
looked but couldn’t find anything myself.
One more question: is there a reason for the delay between successive DBpedia 
releases? If the code is already there, why does it take six months between 
releases? Is there a manual editorial step involved, or is the time spent on 
development/changes in the framework code that are collated into each 
release?


Thanks and regards,

Amit
Tech Lead
Cloud and Platform Group
Yahoo!