Hi All: I have Nutch 2.2.1 with Cassandra as the backend and Hadoop 1.2.1, running in local mode (runtime/local) on a CentOS box.
I'm trying to do a very simple test: crawl a dataset1 of urls and, once done, crawl another dataset2 of urls without touching the results of dataset1. I don't want to go to depth 2 of the first dataset, and I want all data to live within the same database (keyspace, in Cassandra). Here is what I did:

1- Dropped the keyspace "webpage" from Cassandra.
2- Changed the value of the property "storage.schema.webpage" from "webpage" to "database1" in both nutch-default.xml and nutch-site.xml.
3- Reran "ant runtime" to make sure these changes are reflected in my local deployment.
4- Ran the nutch crawl script.

But the fetch results are still written to a newly created keyspace "webpage", not the one I specified in the conf files, so I'm unable to change the destination of where the data goes.

I also tried to google the "crawlId" parameter. I can pass it to the crawl script, but I'm unable to figure out how to read it back from Cassandra; the Gora schema mapping does not mention it anywhere, it only mentions "bas", which is the batchId. I thought I could use two different crawl ids for the two different url datasets I have, so that later I can query each set separately.

Either solution would help me a lot: figuring out how to change the destination keyspace in Cassandra, or whether the crawlId can help in identifying results from two distinct crawls. Any hints will be appreciated! Thanks.
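For reference, here is the override as I have it in nutch-site.xml (the property name is taken from nutch-default.xml; the keyspace name "database1" is just my choice):

```xml
<!-- nutch-site.xml: attempting to redirect storage away from the default "webpage" keyspace -->
<property>
  <name>storage.schema.webpage</name>
  <value>database1</value>
  <description>Keyspace/table name I want Gora to use instead of "webpage".</description>
</property>
```

I made the same edit in nutch-default.xml as well, just in case, before rerunning "ant runtime".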
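In case it matters, this is roughly how I invoked the crawl script for the two datasets (arguments per the 2.2.1 bin/crawl usage as I understand it; the seed directories, crawl ids "crawl1"/"crawl2", and the Solr URL are just placeholders from my setup):

```shell
# Nutch 2.2.1 crawl script, run from runtime/local:
# bin/crawl <seedDir> <crawlId> <solrURL> <numberOfRounds>
bin/crawl urls/dataset1 crawl1 http://localhost:8983/solr 1
bin/crawl urls/dataset2 crawl2 http://localhost:8983/solr 1
```

Both runs complete, but I can't see where the crawl id ends up on the Cassandra side, so I don't know how to query each dataset's results separately.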

