Hey everyone,
I’m trying to set up the DBpedia extraction framework, as I’m interested in
getting structured data from already-downloaded Wikipedia dumps. As I
understand it, I need to work in the ‘dump’ directory of the codebase. I have
tried to reverse-engineer the code (Scala is new to me), but I need some help.
1. First of all, is there more detailed documentation somewhere about setting
up and running the pipeline? The documentation available on dbpedia.org seems
insufficient.
2. I understand that I first need to create a config.properties file, where I
set the input/output locations, the list of extractors, and the languages. I
tried working from the config.properties.default provided in the code. There
seems to be a typo in the extractor list: using
‘org.dbpedia.extraction.mappings.InterLanguageLinksExtractorExtractor’ gives a
‘class not found’ error, so I changed it to
‘org.dbpedia.extraction.mappings.InterLanguageLinksExtractor’. Is that OK?
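For reference, here is the kind of minimal config.properties I ended up with. Please treat the key names and paths as illustrative only; I copied the pattern from config.properties.default rather than from any authoritative documentation, so verify them against your checkout:

```properties
# Illustrative sketch only - check key names against config.properties.default
# in your copy of the framework before using.
dumpDir=/data/dbpedia/dumps
outputDir=/data/dbpedia/output
languages=en

# Extractor list, with the class name corrected as described above
extractors=org.dbpedia.extraction.mappings.LabelExtractor,\
           org.dbpedia.extraction.mappings.InterLanguageLinksExtractor
```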
3. I can’t find documentation on how to set up the input directory. Can
someone fill in the details? From what I gather, the input directory should
contain a ‘commons’ directory plus one directory for each language set in
config.properties. Each of these directories must have a subdirectory named in
YYYYMMDD format, and inside that you place the XML files, such as
enwiki-20111111-pages-articles.xml. Am I right? Does the framework require any
particular dump of Wikipedia? Also, what goes in the commons branch?
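To make the layout concrete, here is a small shell sketch of the tree I described above (the base path and the 20111111 date are placeholders of my own choosing, not anything mandated by the framework):

```shell
# Build the input layout described above: a commons directory plus one
# directory per configured language, each with a YYYYMMDD subdirectory.
BASE=./dbpedia-dumps
for wiki in commons en; do
  mkdir -p "$BASE/$wiki/20111111"
done

# The uncompressed dump then goes inside the dated directory, e.g.:
touch "$BASE/en/20111111/enwiki-20111111-pages-articles.xml"

# Show the resulting tree
find "$BASE" | sort
```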
4. I ran the framework after copying a sample dump,
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2,
into both the en and commons branches, unzipping it, and renaming it as
required. For now I’m working with the en language only. It works with the
default 19 extractors, but starts failing if I include AbstractExtractor. It
seems AbstractExtractor requires a running MediaWiki instance to parse
MediaWiki syntax; the file itself says a “DBpedia-customized MediaWiki
instance is required.” Can someone shed some more light on this? What
customization is required, and where can I get such an instance?
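For completeness, here is the unzip-and-rename step I used, sketched with a locally created stand-in file so it runs anywhere (in practice the .bz2 comes from the dumps.wikimedia.org URL above; the date and paths are my own placeholders):

```shell
# Stand-in for the downloaded dump (replace with the real .bz2 from
# dumps.wikimedia.org in actual use).
mkdir -p en/20111111 commons/20111111
echo '<mediawiki></mediawiki>' > enwiki-latest-pages-articles1.xml
bzip2 -f enwiki-latest-pages-articles1.xml     # produces the .bz2

# Decompress, then rename to the enwiki-YYYYMMDD-pages-articles.xml pattern
# the framework expects, and copy into both branches.
bunzip2 -f enwiki-latest-pages-articles1.xml.bz2
mv enwiki-latest-pages-articles1.xml \
   en/20111111/enwiki-20111111-pages-articles.xml
cp en/20111111/enwiki-20111111-pages-articles.xml commons/20111111/
```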
Sorry if the questions are too basic and already answered somewhere. I have
looked but couldn’t find the answers myself.
Also, another question: is there a reason for the delay between successive
DBpedia releases? If the code is already there, why does it take six months
between releases? Is there a manual editorial step involved, or is it due to
development/changes in the framework code that are collated into each
release?
Thanks and regards,
Amit
Tech Lead
Cloud and Platform Group
Yahoo!
_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion