Hi all,

I've noticed that when dumping a segment with readseg, several CrawlDatum entries can be present in a given record. For example, I have a segment containing a single URL (http://www.moma.org), and the dump is shown below. I ran the following command:

  nutch readseg -dump segments/20070517113941 segdump -nocontent -noparsedata -noparsetext

Here is the first record:

  Recno:: 0
  URL:: http://www.moma.org/

  CrawlDatum::
  Version: 5
  Status: 1 (db_unfetched)
  Fetch time: Thu May 17 11:39:34 EDT 2007
  Modified time: Wed Dec 31 19:00:00 EST 1969
  Retries since fetch: 0
  Retry interval: 30.0 days
  Score: 1.0
  Signature: null
  Metadata: _ngt_:1179416381663

  CrawlDatum::
  Version: 5
  Status: 65 (signature)
  Fetch time: Thu May 17 11:39:51 EDT 2007
  Modified time: Wed Dec 31 19:00:00 EST 1969
  Retries since fetch: 0
  Retry interval: 0.0 days
  Score: 1.0
  Signature: fe47b3db7c988541287fc6412ce0b923
  Metadata: null

  CrawlDatum::
  Version: 5
  Status: 33 (fetch_success)
  Fetch time: Thu May 17 11:39:49 EDT 2007
  Modified time: Wed Dec 31 19:00:00 EST 1969
  Retries since fetch: 0
  Retry interval: 30.0 days
  Score: 1.0
  Signature: fe47b3db7c988541287fc6412ce0b923
  Metadata: _ngt_:1179416381663 _pst_:success(1), lastModified=0

Why are there 3 CrawlDatum fields? I assumed there would be only one CrawlDatum with status 33 (fetch_success). What is the purpose of the other two?

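To see at a glance how many CrawlDatum entries each record holds, a minimal Python sketch along the following lines could tally them by status. It assumes the dump text uses the Recno::, URL::, CrawlDatum:: and Status: labels exactly as they appear in this message. The function name count_statuses is made up for illustration and is not part of Nutch, and the sample string is a trimmed copy of the two records quoted in this message, with most fields left out.

```python
import re
from collections import Counter


def count_statuses(dump_text):
    """Map each record's URL to a Counter of its CrawlDatum status names."""
    result = {}
    # Every record begins with a "Recno::" label; skip whatever precedes the first one.
    for chunk in re.split(r"Recno::", dump_text)[1:]:
        url_match = re.search(r"URL::\s*(\S+)", chunk)
        if url_match is None:
            continue
        # Status lines look like "Status: 33 (fetch_success)"; capture the name.
        statuses = re.findall(r"Status:\s*\d+\s*\((\w+)\)", chunk)
        result[url_match.group(1)] = Counter(statuses)
    return result


# Trimmed sample (assumption: only the labels needed for counting are kept).
sample = (
    "Recno:: 0 URL:: http://www.moma.org/ "
    "CrawlDatum:: Version: 5 Status: 1 (db_unfetched) "
    "CrawlDatum:: Version: 5 Status: 65 (signature) "
    "CrawlDatum:: Version: 5 Status: 33 (fetch_success) "
    "Recno:: 5 URL:: http://www.moma.org/application/x-shockwave-flash "
    + "CrawlDatum:: Version: 5 Status: 67 (linked) " * 6
)

for url, counts in count_statuses(sample).items():
    print(url, dict(counts))
```

Because the patterns match on labels rather than on line breaks, the same function should work whether the dump is read as one long line or with one field per line.
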
Now, here is record number 5:

  Recno:: 5
  URL:: http://www.moma.org/application/x-shockwave-flash

  CrawlDatum::
  Version: 5
  Status: 67 (linked)
  Fetch time: Thu May 17 11:39:51 EDT 2007
  Modified time: Wed Dec 31 19:00:00 EST 1969
  Retries since fetch: 0
  Retry interval: 30.0 days
  Score: 0.03846154
  Signature: null
  Metadata: null

  CrawlDatum::
  Version: 5
  Status: 67 (linked)
  Fetch time: Thu May 17 11:39:51 EDT 2007
  Modified time: Wed Dec 31 19:00:00 EST 1969
  Retries since fetch: 0
  Retry interval: 30.0 days
  Score: 0.03846154
  Signature: null
  Metadata: null

  CrawlDatum::
  Version: 5
  Status: 67 (linked)
  Fetch time: Thu May 17 11:39:51 EDT 2007
  Modified time: Wed Dec 31 19:00:00 EST 1969
  Retries since fetch: 0
  Retry interval: 30.0 days
  Score: 0.03846154
  Signature: null
  Metadata: null

  CrawlDatum::
  Version: 5
  Status: 67 (linked)
  Fetch time: Thu May 17 11:39:51 EDT 2007
  Modified time: Wed Dec 31 19:00:00 EST 1969
  Retries since fetch: 0
  Retry interval: 30.0 days
  Score: 0.03846154
  Signature: null
  Metadata: null

  CrawlDatum::
  Version: 5
  Status: 67 (linked)
  Fetch time: Thu May 17 11:39:51 EDT 2007
  Modified time: Wed Dec 31 19:00:00 EST 1969
  Retries since fetch: 0
  Retry interval: 30.0 days
  Score: 0.03846154
  Signature: null
  Metadata: null

  CrawlDatum::
  Version: 5
  Status: 67 (linked)
  Fetch time: Thu May 17 11:39:51 EDT 2007
  Modified time: Wed Dec 31 19:00:00 EST 1969
  Retries since fetch: 0
  Retry interval: 30.0 days
  Score: 0.03846154
  Signature: null
  Metadata: null

There are 6 CrawlDatum fields, and all of them are exactly identical. Is this a bug, or am I missing something here?

Any light on this matter would be greatly appreciated.

Thank you.
Florent

Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
