Hi All:
I have Nutch 2.2.1 with Cassandra as the backend and Hadoop 1.2.1, running 
in local mode (runtime/local) on a CentOS box.

I'm trying to do a very simple test: crawl a dataset1 of URLs and, once done, 
crawl another dataset2 of URLs without touching the results of dataset1. I want 
to avoid going through depth 2 of the first dataset, and I want all data to live 
within the same database (or keyspace in Cassandra).
Here is what I did:

1-      dropped the keyspace "webpage" from Cassandra

2-      changed the value of the property "storage.schema.webpage" from 
"webpage" to "database1" in both nutch-default.xml and nutch-site.xml.

3-      Reran "ant runtime"  just to make sure these changes are reflected in 
my local deployment.

4-      ran the nutch crawl script.
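
For step 2, here is the override as I have it in conf/nutch-site.xml (a sketch 
of my local file; "database1" is just the keyspace name I picked):

```xml
<!-- conf/nutch-site.xml: my attempted override of the default keyspace name -->
<property>
  <name>storage.schema.webpage</name>
  <value>database1</value>
  <description>Schema/keyspace name for the webpage store.</description>
</property>
```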

But fetch results are still written to a newly created keyspace "webpage", not 
the one I specified in the conf files; I'm unable to change the database 
destination that the data goes to.
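
Could the mapping file matter here? If I understand correctly, 
conf/gora-cassandra-mapping.xml also carries a keyspace name, and mine still 
says "webpage" (a sketch of what I see locally; the cluster/host values are 
from my setup, so treat them as assumptions):

```xml
<!-- conf/gora-cassandra-mapping.xml: the keyspace element still names "webpage" -->
<keyspace name="webpage" cluster="Test Cluster" host="localhost">
  <family name="p"/>
  <family name="f"/>
  <family name="sc" type="super"/>
</keyspace>
```

Maybe gora-cassandra reads the keyspace name from here rather than from 
"storage.schema.webpage"?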

I tried to google the "crawlid" parameter. I can pass it to the crawl script, 
but I'm unable to figure out how to read it back from Cassandra; the Gora 
schema mapping does not mention it anywhere, only "bas", which is the batch id. 
I thought I could use two different crawl ids for the two different URL 
datasets I have, so that later I can query each set separately...

So either solution would help me a lot: figuring out how to change the 
destination database in Cassandra, or learning whether the crawlid can help in 
identifying results from two distinct crawls.
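
For context, what I had in mind was roughly this (a sketch; the bin/crawl 
argument order is my reading of the script's usage message in 2.2.1, so treat 
it as an assumption, as are the seed directories and Solr URL):

```shell
# Sketch: two crawls kept apart by distinct crawl ids, assuming the usage
#   bin/crawl <seedDir> <crawlId> <solrUrl> <numberOfRounds>
bin/crawl urls/dataset1 crawl1 http://localhost:8983/solr/ 1
bin/crawl urls/dataset2 crawl2 http://localhost:8983/solr/ 1
```

My hope is that each crawl id would let me tell the two result sets apart 
later when querying Cassandra.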


Any hints will be appreciated!

Thanks.
