Hi Pablo,
For the continuous extraction we are trying to set up a pipeline which polls
and downloads the Wikipedia data, passes it through the DEF (DBpedia Extraction
Framework), and then creates knowledge bases. Much of the plumbing is handled by
Yahoo!-internal tools and platforms, but there are some pieces which might be
useful for the DBpedia community. I'm mentioning some below. Let me know if you
think you can use any of them; if yes, I will contact our Open Source Working
Group Manager to take it forward.
1. Wiki Downloader: We have two components.
* Full Downloader: A basic bash script which polls the latest folder of the
Wikipedia dumps, checks whether a new dump is available, and downloads it to a
dated folder.
* Incremental Downloader: An IRC bot which listens to the Wikipedia IRC
channel and builds a list of pages that were updated. It de-dups that list and
downloads those pages every few hours while respecting the Wikipedia QPS
limits.
2. DEF Wrapper: A bash script which invokes the DEF on the data generated by
the downloader.
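As a rough illustration of the two downloader ideas (this is a sketch, not the actual Yahoo! scripts; the dump URL, directory layout, and state file below are made-up assumptions):

```shell
#!/usr/bin/env bash
# Sketch of the "Full Downloader" poll-and-fetch idea. The dump URL and
# directory layout are illustrative assumptions, not the real script.
set -euo pipefail

DUMP_URL="${DUMP_URL:-https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2}"
OUT_ROOT="${OUT_ROOT:-dumps}"
STATE_FILE="${STATE_FILE:-.last_seen}"

# Target folder named after today's date, as described above.
dated_dir() {
    echo "$OUT_ROOT/$(date +%Y%m%d)"
}

# Compare the server's Last-Modified header with what we saw last time;
# only download when a new dump has appeared.
poll_once() {
    local stamp
    stamp=$(curl -fsI "$DUMP_URL" \
        | awk -F': ' 'tolower($1)=="last-modified" {print $2}' | tr -d '\r')
    if [ -f "$STATE_FILE" ] && [ "$(cat "$STATE_FILE")" = "$stamp" ]; then
        echo "no new dump"
        return 0
    fi
    mkdir -p "$(dated_dir)"
    curl -fsS -o "$(dated_dir)/$(basename "$DUMP_URL")" "$DUMP_URL"
    printf '%s' "$stamp" > "$STATE_FILE"
    echo "downloaded into $(dated_dir)"
}

# The incremental side boils down to de-duplicating the page titles the
# IRC bot collected before fetching them:
dedup_titles() {
    sort -u "$1"
}
```

Running `poll_once` from cron every few hours gives the polling behavior; the de-dup step is just a `sort -u` over the title list the bot accumulated.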
Both of these have some basic notifications and error handling. There are some
stages after the DEF, but they are quite internal to Yahoo!.
I think you already have a download.scala which downloads the DBpedia dumps;
there were a few mails about it last week. If you are facing a particular
issue with DBpedia Portuguese, do let me know. If we have faced the same
issue, we will let you know.
Regards
Amit
On 3/19/12 3:45 PM, "Pablo Mendes" <pablomen...@gmail.com> wrote:
Hi Amit,
>"We have been trying to set up an instance of DBpedia to continuously extract
>data from Wikipedia dumps/updates. While"
We would like to do the same for DBpedia Portuguese. If you can share any
code, it would be much appreciated.
Cheers
Pablo
On Mar 19, 2012 10:38 AM, "Amit Kumar" <amitk...@yahoo-inc.com> wrote:
Hi,
We have been trying to set up an instance of DBpedia to continuously extract
data from Wikipedia dumps/updates. While going through the output we observed
that the image extractor was only picking up the first image for any page.
I can see commented-out code in the ImageExtractor which seems to pick up all
images; in its place we have code which returns on the first image it
encounters. My questions are:
1. Does the commented-out code actually work? Does it really pick up all the
images on a particular page?
2. Why was this change made in the code?
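To make the difference concrete, here is a toy contrast between "first image only" and "all images" (the wikitext sample and the regex are illustrative assumptions, not the ImageExtractor's actual logic):

```shell
#!/usr/bin/env bash
# Toy contrast: extracting every image link from a page vs. only the first.
# The sample wikitext and the File:/Image: regex are made-up illustrations.
set -euo pipefail

page='Intro [[File:A.jpg|thumb]] more [[Image:B.png]] end [[File:C.svg|x]]'

# Pull out every File:/Image: link target on the page.
all_images() {
    printf '%s\n' "$1" | grep -oE '\[\[(File|Image):[^]|]+' | sed 's/^\[\[//'
}

# The current behavior described above: stop at the first image.
first_image() {
    all_images "$1" | head -n 1
}
```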
Thanks and Regards
Amit
------------------------------------------------------------------------------
This SF email is sponsored by:
Try Windows Azure free for 90 days Click Here
http://p.sf.net/sfu/sfd2d-msazure
_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion