Re: Access crawled content or parsed data of previous crawled url

Jorge Luis Betancourt Gonzalez Thu, 29 Nov 2012 10:47:18 -0800

For now I don't see any way of accessing the metadata of a previously parsed
document. Am I mistaken?


----- Original Message -----
From: alx...@aim.com
To: user@nutch.apache.org
Sent: Thursday, 29 November 2012 13:38:43
Subject: Re: Access crawled content or parsed data of previous crawled url

Hi,

Unfortunately, my employer does not want me to disclose details of the plugin 
at this time.

Alex.







-----Original Message-----
From: Jorge Luis Betancourt Gonzalez <jlbetanco...@uci.cu>
To: user <user@nutch.apache.org>
Sent: Wed, Nov 28, 2012 6:20 pm
Subject: Re: Access crawled content or parsed data of previous crawled url


Hi Alex:

What you've done is basically what I'm trying to accomplish. I'm trying to get
the text surrounding the img tags to improve the image search engine we're
building (this is done when the HTML page containing the img tag is parsed), and
when the image URL itself is parsed we generate thumbnails and extract some
metadata. But how do you keep these two pieces of data linked together inside
your index (Solr in my case)? The thing is that I'm getting two documents inside
Solr: one containing the text surrounding the img tag, and another with the
thumbnail. So what gives me trouble is this: when the thumbnail is being
generated, how can I get the surrounding text detected when the HTML was parsed?
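
One possible way to end up with a single Solr document instead of two is to
let both parse steps write to the same unique key (the image URL) using Solr
atomic updates, which are available from Solr 4.0. The sketch below only
builds the JSON payload and sends nothing. It assumes that "id" is the
schema's unique key and that it holds the URL; the field names
surrounding_text, thumbnail and width are hypothetical, and atomic updates
generally require the affected fields to be stored. This is not what Alex's
plugin does, only an illustration of linking by a shared key.

```python
import json


def atomic_update(image_url, fields):
    """Build a Solr atomic-update document for the record keyed by image_url.

    Each field is wrapped in {"set": value}, so only that field is replaced
    and fields written earlier by the other parse step are left alone.
    """
    doc = {"id": image_url}  # assumption: "id" is the uniqueKey and holds the URL
    for name, value in fields.items():
        doc[name] = {"set": value}
    return doc


# Written while test.html is parsed (hypothetical field name).
html_side = atomic_update(
    "http://host.com/fox.jpg",
    {"surrounding_text": "A red fox in the snow."},
)

# Written later, while the image itself is parsed (hypothetical field names).
image_side = atomic_update(
    "http://host.com/fox.jpg",
    {"thumbnail": "thumbs/fox.png", "width": 640},
)

# Body that could be POSTed to the Solr update handler as JSON.
payload = json.dumps([html_side, image_side])
```

Because both updates carry the same id, the order in which the page and the
image are processed should not matter for the merged result.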

Thanks a lot for all the replies!

P.S.: Alex, can you share some code (if possible) from your working plugins?
Or walk me through what you've come up with?

----- Original Message -----
From: alx...@aim.com
To: user@nutch.apache.org
Sent: Wednesday, 28 November 2012 19:54:07
Subject: Re: Access crawled content or parsed data of previous crawled url

It is not clear what you are trying to achieve. We have done something similar
with regard to indexing img tags. We retrieve the img tag data while parsing the
HTML page and keep it in metadata, and when parsing the img URL itself we create
the thumbnail.
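
A minimal sketch of the first half of that idea (collecting img tag data while
the HTML page is parsed), written in plain Python with the standard library
rather than as a Nutch plugin. The class name ImgContextParser, the choice of
block tags, and the definition of "surrounding text" as the text of the
enclosing block are all assumptions made for illustration, not details of the
plugin described above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgContextParser(HTMLParser):
    """Map each image URL to the text of the block that encloses its <img> tag."""

    # Assumption: these tags delimit the "surrounding" text of an image.
    BLOCK_TAGS = {"p", "div", "li", "td", "figure"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []  # text seen in the current block
        self.pending = []     # image URLs seen in the current block
        self.context = {}     # absolute image URL -> surrounding text

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._flush()
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative src values against the page URL so the key
                # matches the URL under which the image is fetched later.
                self.pending.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())

    def _flush(self):
        # Assign the text collected so far to every image seen in this block.
        text = " ".join(self.text_parts)
        for url in self.pending:
            self.context[url] = text
        self.text_parts, self.pending = [], []

    def close(self):
        super().close()
        self._flush()


page = (
    "<html><body>"
    '<p>A red fox <img src="fox.jpg"> in the snow.</p>'
    "<p>Unrelated.</p>"
    "</body></html>"
)
parser = ImgContextParser("http://host.com/test.html")
parser.feed(page)
parser.close()
```

After this runs, parser.context maps http://host.com/fox.jpg to the text of the
paragraph containing the image. Keying by the absolute image URL is what makes
the data retrievable later, when that URL is parsed as a document of its own.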

hth.
Alex.







-----Original Message-----
From: Jorge Luis Betancourt Gonzalez <jlbetanco...@uci.cu>
To: user <user@nutch.apache.org>
Sent: Wed, Nov 28, 2012 2:58 pm
Subject: Re: Access crawled content or parsed data of previous crawled url


Is there any documentation about the crawldb API? I'm guessing that it shouldn't
be so hard to retrieve a document by its URL (which is basically what I need).
I'm also open to any suggestion on this matter, so if anyone has done something
similar or has any thoughts on this and can share them, I'll be very grateful.

Greetings!

----- Original Message -----
From: "Stefan Scheffler" <sscheff...@avantgarde-labs.de>
To: user@nutch.apache.org
Sent: Wednesday, 28 November 2012 15:04:44
Subject: Re: Access crawled content or parsed data of previous crawled url

Hi,
I think this is possible, because you can write a ParserPlugin which
accesses the already stored documents via the segments/crawldb API.
But I'm not sure exactly how it will work.

Regards
Stefan

On 28.11.2012 20:59, Jorge Luis Betancourt Gonzalez wrote:
> Hi:
>
> From what I've seen, Nutch plugins follow the philosophy of one NutchDocument
> per URL, but I was wondering if there is any way of accessing the
> parsed/crawled content of a previously fetched/parsed URL. Say, for instance,
> that I have an HTML page with an embedded image. The starting point will be
> http://host.com/test.html, which is the first document that gets
> fetched/parsed. The OutLink extractor will then detect the embedded image
> inside test.html and add the URL in the src attribute of the <img> tag, so the
> image URL will then be fetched and parsed. My question: is it possible, when
> the image is being parsed, to access the content and parsed data of test.html?
> I'm trying to add some data present on the HTML page as a new metadata field
> of the image, and I'm not quite sure how to accomplish this.
>
> Greetings in advance!
> 10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS
> INFORMATICAS...
> CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
>
> http://www.uci.cu
> http://www.facebook.com/universidad.uci
> http://www.flickr.com/photos/universidad_uci


--
Stefan Scheffler
Avantgarde Labs GbR
Löbauer Straße 19, 01099 Dresden
Telefon: + 49 (0) 351 21590834
Email: sscheff...@avantgarde-labs.de



