Hi Jason,

I have some beginner Nutch questions that I am hoping to get some insight on.

I want to start at Dmoz.org and follow links for entertainment (like concerts, art gallery events, etc.) and examine each link to decide whether I should collect data about it and from the page it points to.

My questions:

First, I assume you've read through the pages referenced from http://wiki.apache.org/nutch/ - they have a lot of valuable information.

1. Can Nutch start at a given URL and examine every link (based upon my criteria)? (obviously I can write Case or If/Else or While to do this)

Nutch can do a general crawl starting from a set of "seed" URLs.

For URL filtering (which is what I think you want), you'd have to define your criteria more precisely. You can set up regular expressions to filter out URLs, but it sounds like you want something different.

2. If I find a link that has certain keywords that I find of interest, can I hit that link of interest and get information from that page?

Not sure what you mean by "get information from that page". You can probably define a set of URL filters that will only pass through links with your keywords, and then Nutch will (eventually) crawl those pages.
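
To make the keyword idea concrete, here is a rough Python sketch of how an ordered list of accept/reject regular expressions behaves. It is only an illustration, not Nutch code: the rule list, the keywords and the accept() helper are all made up for this example, and it assumes the first matching rule decides, in the same "+" / "-" style as Nutch's regex URL filter file.

```python
import re

# Hypothetical rules: "+" accepts, "-" rejects. The keywords are examples only.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$", re.IGNORECASE)),  # skip images, styles, scripts
    ("+", re.compile(r"concert|gallery|exhibit", re.IGNORECASE)),  # keywords of interest
    ("-", re.compile(r".")),                                       # reject everything else
]


def accept(url):
    """Return True if the first rule that matches the URL is an accept rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched at all


if __name__ == "__main__":
    for url in (
        "http://example.com/events/concerts.html",
        "http://example.com/gallery/photo.jpg",
        "http://example.com/about.html",
    ):
        print(url, accept(url))
```

One thing to keep in mind: a filter this strict also rejects pages that merely link to the interesting ones, so you would probably need looser rules for the pages you start from.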

3. How do I get the information about the link of interest and its content of interest into a MySQL database? (I know ColdFusion and MySQL and PHP). I think what I am asking is how do I get back to my database from a crawler?

You'd need to dump the contents of the CrawlDB and the "segments" where Nutch has stored its fetched content, and use that to import into your DB.

I'm not current with what Nutch offers in the way of exporting content for re-import into a SQL DB, others would know better.
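
As a very rough sketch of the import side only: assuming you have already converted the dumped data into lines of "URL<TAB>text" (a hypothetical intermediate format, not what Nutch writes out itself), loading it into a table could look something like the Python below. It uses sqlite3 from the standard library as a stand-in for MySQL, and the table name, column names and load_pages() helper are invented for the example.

```python
import sqlite3


def load_pages(conn, lines):
    """Insert 'url<TAB>content' lines into a pages table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, content TEXT)"
    )
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the content survive.
        url, _, content = line.partition("\t")
        rows.append((url, content))
    # Re-importing the same URL overwrites the earlier row.
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, content) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    sample = ["http://example.com/concerts\tJazz night on Friday\n"]
    print(load_pages(conn, sample))
```

With MySQL you'd swap in a MySQL driver and adjust the SQL accordingly; the overall shape (parse the dump, then batch the inserts) should stay about the same.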

4. As I know Nutch is Java, which is fine, I will need Tomcat running etc. Are there other java App Servers out there as well for OS X?

You don't need a Java webapp container (server) for Nutch, just a server or set of servers with a current version of Java installed. See pages like http://lucene.apache.org/nutch/tutorial.html for what you'd need for a valid server configuration.

5. Does anyone have deployment instructions for OS X?

http://wiki.apache.org/nutch/GettingNutchRunningWithMacOsx

Am I making any sense?

Yes :)

But I think working with Nutch might be challenging if you don't know at least something about Java (and Linux/Bash).

-- Ken
--
Ken Krugler
+1 530-210-6378
