Hi,

This is a difficult problem in MapReduce because one image URL may be embedded 
in many documents. There are various methods you could use to aggregate the 
records, but none I can think of works very well or is straightforward to 
implement.

I think the most straightforward and easiest method to implement is to create a 
new key/value pair to store the surrounding text for each image, and to do this 
during the parse. This means emitting a Text,Text pair for each image in every 
HTML page, with the image's URL as key and the surrounding text as value. You 
will also have to modify the indexer to ingest that structure during indexing. 
This way, existing CrawlDatums for existing images will end up in the reducer 
together with zero or more of your new key/value pairs, and in IndexerMapReduce 
you can deal with them appropriately.

This method works well with MapReduce and does not require too much 
programming. The downside is that you cannot build it as just a parse plugin 
and an indexing plugin, because plugins cannot handle your new key/value pair.

Good luck, and let us know what you come up with :)
 
-----Original message-----
> From:Jorge Luis Betancourt Gonzalez <jlbetanco...@uci.cu>
> Sent: Thu 29-Nov-2012 19:53
> To: user@nutch.apache.org
> Subject: Re: Access crawled content or parsed data of previous crawled url
> 
> For now I don't see any way of accessing metadata for a previously parsed 
> document. Or am I mistaken?
> 
> ----- Original Message -----
> From: alx...@aim.com
> To: user@nutch.apache.org
> Sent: Thursday, November 29, 2012 13:38:43
> Subject: Re: Access crawled content or parsed data of previous crawled url
> 
> Hi,
> 
> Unfortunately, my employer does not want me to disclose details of the plugin 
> at this time.
> 
> Alex.
> 
> -----Original Message-----
> From: Jorge Luis Betancourt Gonzalez <jlbetanco...@uci.cu>
> To: user <user@nutch.apache.org>
> Sent: Wed, Nov 28, 2012 6:20 pm
> Subject: Re: Access crawled content or parsed data of previous crawled url
> 
> 
> Hi Alex:
> 
> What you've done is basically what I'm trying to accomplish: I'm trying to get 
> the text surrounding the img tags to improve the image search engine we're 
> building (this is done when the HTML page containing the img tag is parsed), 
> and when the image URL itself is parsed we generate thumbnails and extract 
> some metadata. But how do you keep these 2 pieces of data linked together 
> inside your index (Solr in my case)? The thing is that I'm getting two 
> documents inside Solr (one containing the text surrounding the img tag, and 
> another document with the thumbnail). So what brings me trouble is: when the 
> thumbnail is being generated, how can I get the surrounding text detected 
> when the HTML was parsed?
> 
> Thanks a lot for all the replies!
> 
> P.S.: Alex, can you share some piece of code (if it's possible) of your 
> working plugins? Or walk me through what you've come up with?
> 
> ----- Original Message -----
> From: alx...@aim.com
> To: user@nutch.apache.org
> Sent: Wednesday, November 28, 2012 19:54:07
> Subject: Re: Access crawled content or parsed data of previous crawled url
> 
> It is not clear what you are trying to achieve. We have done something 
> similar with regard to indexing img tags. We retrieve img tag data while 
> parsing the HTML page and keep it in metadata, and when parsing the img URL 
> itself we create the thumbnail.
> 
> hth.
> Alex.
> 
> -----Original Message-----
> From: Jorge Luis Betancourt Gonzalez <jlbetanco...@uci.cu>
> To: user <user@nutch.apache.org>
> Sent: Wed, Nov 28, 2012 2:58 pm
> Subject: Re: Access crawled content or parsed data of previous crawled url
> 
> 
> Any documentation about the crawldb API? I'm guessing it shouldn't be so hard 
> to retrieve a document by its URL (which is basically what I need). I'm also 
> open to any suggestion on this matter, so if anyone has done something 
> similar or has any thoughts on this and can share them, I'll be very grateful.
> 
> Greetings!
> 
> ----- Original Message -----
> From: "Stefan Scheffler" <sscheff...@avantgarde-labs.de>
> To: user@nutch.apache.org
> Sent: Wednesday, November 28, 2012 15:04:44
> Subject: Re: Access crawled content or parsed data of previous crawled url
> 
> Hi,
> I think this is possible, because you can write a ParserPlugin which
> accesses the already stored documents via the segments/crawldb API.
> But I'm not sure how exactly it will work.
> 
> Regards
> Stefan
> 
> On 28.11.2012 20:59, Jorge Luis Betancourt Gonzalez wrote:
> > Hi:
> >
> > From what I've seen, in Nutch plugins there is a philosophy of one 
> > NutchDocument per URL, but I was wondering if there is any way of accessing 
> > parsed/crawled content of a previously fetched/parsed URL. Let's say, for 
> > instance, that I have an HTML page with an embedded image: the start point 
> > will be http://host.com/test.html, which is the first document that gets 
> > fetched/parsed; then the outlink extractor will detect the embedded image 
> > inside test.html and add the URL from the src attribute of the <img> tag, 
> > so the image URL will then be fetched and parsed. My question: is it 
> > possible, when the image is being parsed, to access the content and parsed 
> > data of test.html? I'm trying to add some data present on the HTML page as 
> > a new metadata field of the image, and I'm not quite sure how to accomplish 
> > this.
> >
> > Greetings in advance!
> > 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS
> INFORMATICAS...
> > CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION
> >
> > http://www.uci.cu
> > http://www.facebook.com/universidad.uci
> > http://www.flickr.com/photos/universidad_uci
> 
> 
> --
> Stefan Scheffler
> Avantgarde Labs GbR
> Löbauer Straße 19, 01099 Dresden
> Telefon: + 49 (0) 351 21590834
> Email: sscheff...@avantgarde-labs.de
> 
