As the size of my data keeps growing, and the indexing time grows even faster, I'm trying to switch from a "reindex all at every crawl" model to an incremental indexing one. I intend to keep the segments separate, but I want to index only the segment fetched during the last cycle, and then merge indexes and perhaps linkdb. I have a few questions:
1. In an incremental scenario, how do I remove from the indexes the references to segments that have expired?

2. Looking at http://wiki.apache.org/nutch/MergeCrawl, it would appear that I can call "bin/nutch merge" with only two parameters: the original index directory as the destination, and the directory to be merged into it:

    $nutch_dir/nutch merge $index_dir $new_indexes

But when I do that, the merged data are left in a subdirectory called $index_dir/merge_output. Shouldn't I instead create a new, empty destination directory, do the merge, and then replace the original with the newly merged directory?

    merged_indexes=$crawl_dir/merged_indexes
    rm -rf $merged_indexes # just in case it's already there
    $nutch_dir/nutch merge $merged_indexes $index_dir $new_indexes
    rm -rf $index_dir.old # just in case it's already there
    mv $index_dir $index_dir.old
    mv $merged_indexes $index_dir
    rm -rf $index_dir.old

3. Regarding linkdb, does it make sense to run "$nutch_dir/nutch invertlinks" on the latest segment only, and then merge the resulting linkdb into the current one with "$nutch_dir/nutch mergelinkdb", rather than recreating linkdb from scratch from the whole set of segments every time? In other words, can invertlinks work incrementally, or does it need a view of all segments in order to work correctly?

Thanks,
Enzo

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
