So far I don't see any way of accessing the metadata of a previously parsed document. Am I mistaken?

----- Original Message -----
From: alx...@aim.com
To: user@nutch.apache.org
Sent: Thursday, November 29, 2012 13:38:43
Subject: Re: Access crawled content or parsed data of previous crawled url

Hi,

Unfortunately, my employer does not want me to disclose details of the plugin at this time.

Alex.

-----Original Message-----
From: Jorge Luis Betancourt Gonzalez <jlbetanco...@uci.cu>
To: user <user@nutch.apache.org>
Sent: Wed, Nov 28, 2012 6:20 pm
Subject: Re: Access crawled content or parsed data of previous crawled url

Hi Alex:

What you've done is basically what I'm trying to accomplish: I'm trying to get the text surrounding the img tags to improve the image search engine we're building (this is done when the HTML page containing the img tag is parsed), and when the image URL itself is parsed we generate thumbnails and extract some metadata.

But how do you keep these two pieces of data linked together inside your index (Solr in my case)? The thing is that I'm getting two documents inside Solr: one containing the text surrounding the img tag, and another with the thumbnail. So what is giving me trouble is this: when the thumbnail is being generated, how can I get the surrounding text that was detected when the HTML was parsed?

Thanks a lot for all the replies!

P.S.: Alex, can you share some piece of code (if possible) from your working plugins? Or walk me through what you've come up with?

----- Original Message -----
From: alx...@aim.com
To: user@nutch.apache.org
Sent: Wednesday, November 28, 2012 19:54:07
Subject: Re: Access crawled content or parsed data of previous crawled url

It is not clear what you are trying to achieve. We have done something similar with regard to indexing img tags. We retrieve the img tag data while parsing the HTML page and keep it in metadata, and when parsing the img URL itself we create the thumbnail.

hth.
Alex.

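A note on the approach Alex describes (collect the img tag data while the HTML page is parsed, then reuse it when the image URL is handled): the following is only a minimal sketch of that idea, not Alex's plugin (which was not disclosed) and not Nutch API. It uses the Python standard library only. The names ImgContextParser, merge_image_record, the window parameter, and the surrounding_text and thumbnail fields are all hypothetical, chosen for illustration. The assumption is that the absolute image URL is the key shared by both parse steps.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgContextParser(HTMLParser):
    """Collect the words that appear around each <img> tag of one HTML page.

    Hypothetical helper for illustration; it does not filter out script or
    style content, which a real implementation would want to skip.
    """

    def __init__(self, page_url, window=5):
        super().__init__()
        self.page_url = page_url
        self.window = window   # number of words kept on each side of the image
        self.words = []        # running list of words seen so far
        self.images = []       # (absolute image URL, position in self.words)

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative src values against the page URL so the key
                # matches the URL under which the image is fetched later.
                absolute = urljoin(self.page_url, src)
                self.images.append((absolute, len(self.words)))

    def handle_data(self, data):
        self.words.extend(data.split())

    def contexts(self):
        """Map each image URL to the words surrounding its <img> tag."""
        result = {}
        for url, pos in self.images:
            start = max(0, pos - self.window)
            result[url] = " ".join(self.words[start:pos + self.window])
        return result


def merge_image_record(image_record, context_by_url):
    """Attach the surrounding text to the record built when the image was parsed."""
    merged = dict(image_record)
    merged["surrounding_text"] = context_by_url.get(image_record["url"], "")
    return merged


# Step 1: the HTML page is parsed and the context is stored by image URL.
html = '<p>A red fox in the snow <img src="img/fox.jpg"> photographed in winter</p>'
parser = ImgContextParser("http://host.com/test.html")
parser.feed(html)
parser.close()
context_by_url = parser.contexts()

# Step 2: later, the image itself is parsed (thumbnail, metadata) and the
# stored context is looked up by the same URL, giving a single record.
image_record = {"url": "http://host.com/img/fox.jpg", "thumbnail": "fox_thumb.jpg"}
merged = merge_image_record(image_record, context_by_url)
print(merged)
```

In this sketch the lookup table lives in memory. In a crawl, the two parse steps may run at different times, so the table would have to be kept somewhere both steps can reach; how best to do that in Nutch is exactly the open question of this thread.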
-----Original Message-----
From: Jorge Luis Betancourt Gonzalez <jlbetanco...@uci.cu>
To: user <user@nutch.apache.org>
Sent: Wed, Nov 28, 2012 2:58 pm
Subject: Re: Access crawled content or parsed data of previous crawled url

Is there any documentation on the crawldb API? I'm guessing it shouldn't be so hard to retrieve a document by its URL (which is basically what I need). I'm also open to any suggestion on this matter, so if anyone has done something similar or has any thoughts on this and can share them, I'll be very grateful.

Greetings!

----- Original Message -----
From: "Stefan Scheffler" <sscheff...@avantgarde-labs.de>
To: user@nutch.apache.org
Sent: Wednesday, November 28, 2012 15:04:44
Subject: Re: Access crawled content or parsed data of previous crawled url

Hi,

I think this is possible, because you can write a ParserPlugin which accesses the already stored documents via the segments/crawldb API. But I'm not sure exactly how it would work.

Regards
Stefan

On 28.11.2012 20:59, Jorge Luis Betancourt Gonzalez wrote:
> Hi:
>
> From what I've seen, Nutch plugins follow the philosophy of one NutchDocument per URL, but I was wondering if there is any way of accessing the parsed/crawled content of a previously fetched/parsed URL. Say, for instance, that I have an HTML page with an embedded image. The starting point will be http://host.com/test.html, which is the first document that gets fetched/parsed. The OutLink extractor will then detect the embedded image inside test.html and add the URL in the src attribute of the <img> tag, so the image URL will in turn be fetched and parsed. My question: is it possible, when the image is being parsed, to access the content and parsed data of test.html? I'm trying to add some data present on the HTML page as a new metadata field of the image, and I'm not quite sure how to accomplish this.
>
> Thanks in advance!
> 10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS...
> CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
>
> http://www.uci.cu
> http://www.facebook.com/universidad.uci
> http://www.flickr.com/photos/universidad_uci

--
Stefan Scheffler
Avantgarde Labs GbR
Löbauer Straße 19, 01099 Dresden
Phone: + 49 (0) 351 21590834
Email: sscheff...@avantgarde-labs.de