Hi, I'm working on indexing HTML meta tags. I know that Nutch does not add meta tag information to parse.MetaData by itself, so I need to write a plugin to do that.

The problem is that the code I found for doing this,

  parse.getData().getMeta().put(.......);

is rejected by the compiler, and I can't work out why.
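
Looking at the ParseData source, I suspect getMeta(String) only *returns* a String (which would explain the compiler error), and that writes should go through getParseMeta() instead. Here is roughly what I'm trying now - only a sketch, and I'm not sure I have the 0.9 plugin API exactly right (the "keywords" tag is just an example):

  import java.util.Properties;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.parse.HTMLMetaTags;
  import org.apache.nutch.parse.HtmlParseFilter;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.protocol.Content;
  import org.w3c.dom.DocumentFragment;

  public class MetaTagFilter implements HtmlParseFilter {

    private Configuration conf;

    public Parse filter(Content content, Parse parse,
                        HTMLMetaTags metaTags, DocumentFragment doc) {
      // the HTML parser collects the generic <meta> tags here
      Properties generalTags = metaTags.getGeneralTags();
      String keywords = generalTags.getProperty("keywords"); // example tag
      if (keywords != null) {
        // ParseData.getMeta(String) only reads a value, so there is no
        // put() on it; writes go through the Metadata object returned
        // by getParseMeta()
        parse.getData().getParseMeta().add("keywords", keywords);
      }
      return parse;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }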

Thanks very much!

Mark





From: Andrzej Bialecki <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: Re: Crawling the web and going into depth
Date: Sun, 10 Jun 2007 18:58:40 +0200

Enzo Michelangeli wrote:
----- Original Message ----- From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
Sent: Sunday, June 10, 2007 5:48 PM

Enzo Michelangeli wrote:
----- Original Message ----- From: "Berlin Brown" <[EMAIL PROTECTED]>
Sent: Sunday, June 10, 2007 11:24 AM

Yes, but how do you crawl the actual pages the way you would in an intranet crawl? For example, let's say I have 20 URLs in my set from the DmozParser, and I want to go 3 levels deep into those 20 URLs. Is that possible?

For example, with an intranet crawl I would start with some seed URL and then crawl to some depth. How would I do that with URLs fetched from, for example, DMOZ?

The only way I can imagine is doing it on a host-by-host basis, restricting the hosts you crawl at various stages with a URLFilter, e.g. by changing the content of regex-urlfilter.txt.
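
For instance, to restrict one stage to a single host, regex-urlfilter.txt could look like this (example.com standing in for the real host; the final rule rejects everything else):

  # accept only pages on example.com and its subdomains
  +^http://([a-z0-9]*\.)*example.com/
  # reject everything else
  -.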

One simple and efficient way to limit the maximum depth (i.e. the number of path elements) for any given site is to ... count the slashes ;) You can do it in a regex, or you can implement your own URLFilter plugin that does exactly this.
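
For example, a single rule like this near the top of regex-urlfilter.txt (just a sketch - tune the segment count to taste) rejects any URL with more than three path elements:

  # reject URLs nested more than three path segments deep
  -^http://[^/]*(/[^/]+){4,}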

Well, it depends on what you mean by "depth": maybe Berlin wants to limit the length of the chain of recursion (page1.html links to page2.html, which links to page3.html, and we stop there). Also, these days many sites, like blogs or CMS-based ones, have dynamically generated content, with no relationship between '/' and the tree structure of the server's filesystem.

Yes, there could be different definitions of depth.

When it comes to depth in the sense of proximity, i.e. how many levels removed a page is from the starting point - no problem with that either ;) Here's how you can do it: put a counter in CrawlDatum.metadata and pass it along to newly discovered pages, incrementing it by one at each hop. When the counter reaches a limit, you stop adding outlinks from such pages.

If I'm not mistaken, it could be handled throughout the whole cycle if you use a ScoringPlugin.
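
Just to sketch the counter-passing part (the metadata key here is made up, and in a real plugin this logic would sit inside the scoring hook that processes outlinks):

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;

  public class DepthLimiter {
    // made-up key under which the hop counter travels in CrawlDatum.metadata
    private static final Text DEPTH_KEY = new Text("_depth_");
    private static final int MAX_DEPTH = 3;

    /**
     * Stamps an outlink's CrawlDatum with the parent's depth + 1.
     * Returns false when the parent is already at the limit,
     * i.e. the outlink should not be added.
     */
    public static boolean passDepth(CrawlDatum parent, CrawlDatum outlink) {
      IntWritable d = (IntWritable) parent.getMetaData().get(DEPTH_KEY);
      int depth = (d == null) ? 0 : d.get();
      if (depth >= MAX_DEPTH) {
        return false; // stop adding outlinks from pages at the limit
      }
      outlink.getMetaData().put(DEPTH_KEY, new IntWritable(depth + 1));
      return true;
    }
  }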


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


