Hi Jason,
I have some starting Nutch questions that I am hoping to gain insight about.
I want to start at Dmoz.org and follow links for entertainment (like
concerts, art gallery events, etc.) and examine each link to see if I
should get data back about it and from it.
My questions:
Before I answer, I'm assuming you've read through the pages referenced
from http://wiki.apache.org/nutch/ - they have a lot of valuable
information.
1. Can Nutch start at a given URL and examine every link (based upon
my criteria)? (obviously I can write Case or If/Else or While to do
this)
Nutch can do a general crawl starting from a set of "seed" URLs.
For URL filtering (which is what I think you want), you'd have to
define more precisely what your criteria are. You can set up regular
expressions to filter out URLs, but it sounds like you want something
different.
2. If I find a link that has certain keywords that I find of
interest, can I hit that link of interest and get information from
that page?
Not sure what you mean by "get information from that page". You can
probably define a set of URL filters that will only pass through
links with your keywords, and then Nutch will (eventually) crawl
those pages.
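To make the URL-filter idea concrete, here is a minimal sketch in plain
Java of keyword-based filtering with "+" (accept) and "-" (reject)
regular-expression rules, where the first matching rule decides and a
URL that matches no rule is rejected. As far as I know that is the style
of rule Nutch's regex URL filter reads, but this is not Nutch's plugin
code. The class name KeywordUrlFilterSketch, the keywords, and the
example.com URLs are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: imitates "+regex" / "-regex" rules, first match wins.
// Not Nutch's actual filter plugin; all names here are hypothetical.
class KeywordUrlFilterSketch {

    // One rule: accept (+) or reject (-) any URL in which the pattern is found.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Each rule line starts with '+' or '-', followed by a regular expression.
    KeywordUrlFilterSketch(List<String> ruleLines) {
        for (String line : ruleLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    // Returns true if the URL passes the filter. No matching rule means reject.
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        KeywordUrlFilterSketch filter = new KeywordUrlFilterSketch(Arrays.asList(
            "# reject image files first, then accept URLs containing a keyword",
            "-\\.(gif|jpg|png)$",
            "+(concert|gallery|exhibit)"));

        System.out.println(filter.accepts("http://example.com/concerts/2008"));      // true
        System.out.println(filter.accepts("http://example.com/gallery/poster.jpg")); // false
        System.out.println(filter.accepts("http://example.com/sports/scores"));      // false
    }
}
```

Note that this only looks at the URL text. Filtering on keywords in the
page content itself would have to happen after the page is fetched.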
3. How do I get the information about the link of interest and its
content of interest into a MySQL database? (I know ColdFusion and
MySQL and PHP). I think what I am asking is how do I get back to my
database from a crawler?
You'd need to dump the contents of the CrawlDB and the "segments"
where Nutch has stored its fetched content, and use that to import
into your DB.
I'm not current with what Nutch offers in the way of exporting
content for re-import into a SQL DB; others would know better.
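As a rough sketch of the import side only, the Java below reads records
from a text file and inserts them into MySQL through plain JDBC. It
assumes you have already turned the Nutch dump into a tab-separated
file with one "url<TAB>title" pair per line; that intermediate format,
the table name "pages", its columns, and the class name
CrawlExportToSqlSketch are all hypothetical, not something Nutch
produces or requires.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: loads "url<TAB>title" lines into a hypothetical table.
// The input format and the table layout are assumptions, not Nutch output.
class CrawlExportToSqlSketch {

    // One crawled page, reduced to the two fields this sketch stores.
    static final class Page {
        final String url;
        final String title;

        Page(String url, String title) {
            this.url = url;
            this.title = title;
        }
    }

    // Hypothetical table: CREATE TABLE pages (url VARCHAR(1024), title TEXT)
    static final String INSERT_SQL = "INSERT INTO pages (url, title) VALUES (?, ?)";

    // Parses tab-separated lines; lines without a tab are skipped.
    static List<Page> parse(List<String> lines) {
        List<Page> pages = new ArrayList<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue;
            }
            pages.add(new Page(line.substring(0, tab), line.substring(tab + 1)));
        }
        return pages;
    }

    // Inserts all pages in one batch and returns how many statements were sent.
    static int insertAll(Connection conn, List<Page> pages) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            for (Page page : pages) {
                ps.setString(1, page.url);
                ps.setString(2, page.title);
                ps.addBatch();
            }
            return ps.executeBatch().length;
        }
    }

    public static void main(String[] args) {
        // Parsing demo only. To actually insert, open a Connection with
        // java.sql.DriverManager and a MySQL JDBC driver on the classpath,
        // then call insertAll(conn, pages).
        List<Page> pages = parse(Arrays.asList(
            "http://example.com/concerts\tConcert listings",
            "http://example.com/gallery\tGallery events"));
        for (Page page : pages) {
            System.out.println(page.url + " -> " + page.title);
        }
    }
}
```

Since you already know MySQL and PHP, the same two steps (parse the
dump, run parameterized inserts) would work just as well from a PHP
script.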
4. As I know Nutch is Java, which is fine, I will need Tomcat
running, etc. Are there other Java app servers out there as well for
OS X?
You don't need a Java webapp container (server) for Nutch, just a
server or set of servers with a current version of Java installed.
See pages like http://lucene.apache.org/nutch/tutorial.html for what
you'd need for a valid server configuration.
5. Does anyone have deployment instructions for OS X?
http://wiki.apache.org/nutch/GettingNutchRunningWithMacOsx
Am I making any sense?
Yes :)
But I think working with Nutch might be challenging if you don't know
at least something about Java (and Linux/Bash).
-- Ken
--
Ken Krugler
+1 530-210-6378