Hi Pablo,
For the continuous extraction we are trying to set up a pipeline which polls
and downloads the Wikipedia data, passes it through the DEF (DBpedia Extraction
Framework), and then creates knowledge bases. Much of the plumbing is handled by
Yahoo!-internal tools and platforms, but there are some pieces which might be
useful for the DBpedia community. I'm mentioning some below. Let me know if you
think you can use any of them; if yes, I will contact our Open Source Working
Group Manager to take it forward.
1. Wiki Downloader: We have two components.
* Full Downloader: A basic bash script which polls the latest folder of the
Wikipedia dumps, checks whether a new dump is available, and downloads it to a
dated folder.
* Incremental Downloader: An IRC bot which listens to the Wikipedia IRC
channel and builds a list of pages that were updated. It de-dups that list and
downloads those pages every few hours while respecting the Wikipedia QPS
limits.
2. DEF Wrapper: A bash script which invokes the DEF on the data generated by
the downloader.
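As a rough illustration of the two downloader ideas (this is a sketch, not the actual Yahoo! scripts; the dump URL, directory layout, and state file below are made-up assumptions):

```shell
#!/usr/bin/env bash
# Sketch of the "Full Downloader" poll-and-fetch idea. The dump URL and
# directory layout are illustrative assumptions, not the real script.
set -euo pipefail

DUMP_URL="${DUMP_URL:-https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2}"
OUT_ROOT="${OUT_ROOT:-dumps}"
STATE_FILE="${STATE_FILE:-.last_seen}"

# Target folder named after today's date, as described above.
dated_dir() {
    echo "$OUT_ROOT/$(date +%Y%m%d)"
}

# Compare the server's Last-Modified header with what we saw last time;
# only download when a new dump has appeared.
poll_once() {
    local stamp
    stamp=$(curl -fsI "$DUMP_URL" \
        | awk -F': ' 'tolower($1)=="last-modified" {print $2}' | tr -d '\r')
    if [ -f "$STATE_FILE" ] && [ "$(cat "$STATE_FILE")" = "$stamp" ]; then
        echo "no new dump"
        return 0
    fi
    mkdir -p "$(dated_dir)"
    curl -fsS -o "$(dated_dir)/$(basename "$DUMP_URL")" "$DUMP_URL"
    printf '%s' "$stamp" > "$STATE_FILE"
    echo "downloaded into $(dated_dir)"
}

# The incremental side boils down to de-duplicating the page titles the
# IRC bot collected before fetching them:
dedup_titles() {
    sort -u "$1"
}
```

Running `poll_once` from cron every few hours gives the polling behavior; the de-dup step is just a `sort -u` over the title list the bot accumulated.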
Both of these have some basic notifications and error handling. There are some
stages after the DEF, but they are quite internal to Yahoo!.
I think you already have a download.scala which downloads the DBpedia dumps;
there were a few mails about it last week. If you are facing a particular
issue with DBpedia Portuguese, do let me know. If we have faced the same
issue, we will let you know.
Regards
Amit
On 3/19/12 3:45 PM, "Pablo Mendes" <pablomen...@gmail.com> wrote:
Hi Amit,
>"We have been trying to set up an instance of DBpedia to continuously extract
>data from Wikipedia dumps/updates. While"
We would like to do the same for DBpedia Portuguese. If you can share any
code, it would be much appreciated.
Cheers
Pablo
On Mar 19, 2012 10:38 AM, "Amit Kumar" <amitk...@yahoo-inc.com> wrote:
Hi,
We have been trying to set up an instance of DBpedia to continuously extract
data from Wikipedia dumps/updates. While going through the output we observed
that the image extractor was only picking up the first image for any page.
I can see commented-out code in the ImageExtractor which seems to pick up all
images; in its place we have code which returns on the first image it
encounters. My questions are:
1. Does the commented-out code actually work? Does it really pick up all the
images on a particular page?
2. Why was this change made in the code?
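To make the difference concrete, here is a toy contrast between "first image only" and "all images" (the wikitext sample and the regex are illustrative assumptions, not the ImageExtractor's actual logic):

```shell
#!/usr/bin/env bash
# Toy contrast: extracting every image link from a page vs. only the first.
# The sample wikitext and the File:/Image: regex are made-up illustrations.
set -euo pipefail

page='Intro [[File:A.jpg|thumb]] more [[Image:B.png]] end [[File:C.svg|x]]'

# Pull out every File:/Image: link target on the page.
all_images() {
    printf '%s\n' "$1" | grep -oE '\[\[(File|Image):[^]|]+' | sed 's/^\[\[//'
}

# The current behavior described above: stop at the first image.
first_image() {
    all_images "$1" | head -n 1
}
```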
Thanks and Regards
Amit
------------------------------------------------------------------------------
This SF email is sponsored by:
Try Windows Azure free for 90 days Click Here
http://p.sf.net/sfu/sfd2d-msazure
_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion