Hi, I'm working on indexing HTML meta tags. I know that Nutch does not add meta tag information to parse.MetaData by itself, so I need to write a plugin to do that.

The problem is that the code I found for doing this,

  parse.getData().getMeta().put(.......);

is rejected by the compiler, and I can't work out why.
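
Looking at the ParseData source, I suspect getMeta(String) only *returns* a String (which would explain the compiler error), and that writes should go through getParseMeta() instead. Here is roughly what I'm trying now - only a sketch, and I'm not sure I have the 0.9 plugin API exactly right (the "keywords" tag is just an example):

  import java.util.Properties;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.parse.HTMLMetaTags;
  import org.apache.nutch.parse.HtmlParseFilter;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.protocol.Content;
  import org.w3c.dom.DocumentFragment;

  public class MetaTagFilter implements HtmlParseFilter {

    private Configuration conf;

    public Parse filter(Content content, Parse parse,
                        HTMLMetaTags metaTags, DocumentFragment doc) {
      // the HTML parser collects the generic <meta> tags here
      Properties generalTags = metaTags.getGeneralTags();
      String keywords = generalTags.getProperty("keywords"); // example tag
      if (keywords != null) {
        // ParseData.getMeta(String) only reads a value, so there is no
        // put() on it; writes go through the Metadata object returned
        // by getParseMeta()
        parse.getData().getParseMeta().add("keywords", keywords);
      }
      return parse;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }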

Thanks very much!

Mark





From: Andrzej Bialecki <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: Re: Crawling the web and going into depth
Date: Sun, 10 Jun 2007 18:58:40 +0200

Enzo Michelangeli wrote:
----- Original Message ----- From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
Sent: Sunday, June 10, 2007 5:48 PM

Enzo Michelangeli wrote:
----- Original Message ----- From: "Berlin Brown" <[EMAIL PROTECTED]>
Sent: Sunday, June 10, 2007 11:24 AM

Yes, but how do you crawl the actual pages the way you would in an intranet crawl? For example, let's say I have 20 URLs in my set from the DmozParser, and I want to go 3 levels deep into those 20 URLs. Is that possible?

For example, with an intranet crawl I would start with some seed URL and then crawl to some depth. How would I do that with URLs fetched from, for example, DMOZ?

The only way I can imagine is doing it on a host-by-host basis, restricting the hosts you crawl at various stages with a URLFilter, e.g. by changing the content of regex-urlfilter.txt.
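
For instance, to restrict one stage to a single host, regex-urlfilter.txt could look like this (example.com standing in for the real host; the final rule rejects everything else):

  # accept only pages on example.com and its subdomains
  +^http://([a-z0-9]*\.)*example.com/
  # reject everything else
  -.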

One simple and efficient way to limit the maximum depth (i.e. the number of path elements) for any given site is to ... count the slashes ;) You can do it in a regex, or you can implement your own URLFilter plugin that does exactly this.
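
For example, a single rule like this near the top of regex-urlfilter.txt (just a sketch - tune the segment count to taste) rejects any URL with more than three path elements:

  # reject URLs nested more than three path segments deep
  -^http://[^/]*(/[^/]+){4,}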

Well, it depends on what you mean by "depth": maybe Berlin wants to limit the length of the chain of recursion (page1.html links to page2.html, which links to page3.html, and we stop there). Also, these days many sites, like blogs or CMS-based ones, have dynamically generated content, with no relationship between '/' and the tree structure of the server's filesystem.

Yes, there could be different definitions of depth.

When it comes to depth in the sense of proximity, i.e. how many levels removed a page is from the starting point - no problem with that either ;) Here's how you can do it: put a counter in CrawlDatum.metadata and pass it along to newly discovered pages, incrementing it by one at each hop. When the counter reaches a limit, you stop adding outlinks from such pages.

If I'm not mistaken, it could be handled throughout the whole cycle if you use a ScoringPlugin.
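
Just to sketch the counter-passing part (the metadata key here is made up, and in a real plugin this logic would sit inside the scoring hook that processes outlinks):

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;

  public class DepthLimiter {
    // made-up key under which the hop counter travels in CrawlDatum.metadata
    private static final Text DEPTH_KEY = new Text("_depth_");
    private static final int MAX_DEPTH = 3;

    /**
     * Stamps an outlink's CrawlDatum with the parent's depth + 1.
     * Returns false when the parent is already at the limit,
     * i.e. the outlink should not be added.
     */
    public static boolean passDepth(CrawlDatum parent, CrawlDatum outlink) {
      IntWritable d = (IntWritable) parent.getMetaData().get(DEPTH_KEY);
      int depth = (d == null) ? 0 : d.get();
      if (depth >= MAX_DEPTH) {
        return false; // stop adding outlinks from pages at the limit
      }
      outlink.getMetaData().put(DEPTH_KEY, new IntWritable(depth + 1));
      return true;
    }
  }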


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


