> DeleteDuplicates removes documents that have the same digest or the same
> URL. If you use TextProfileSignature instead of MD5Signature, it will
> also remove near-duplicate documents. The MD5Signature class sets the
> digest to the MD5 of the entire content, whereas TextProfileSignature
> sets the digest to the MD5 of the significant terms. You should check the
> classes for implementation details. Also look at SignatureFactory for how
> to change the configuration.
DeleteDuplicates does NOT delete documents that share a URL; it compares only the digest. See NUTCH-371:
http://www.mail-archive.com/[email protected]/msg04635.html

In fact, I have some important URLs in every single segment, although this should not happen because I generate with the topN option. Maybe topN doesn't consult the crawldb.
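
To make the digest point concrete, here is a rough sketch in plain Java. It is not the Nutch code: the class DigestSketch, the method names, and the minCount cut-off are all made up for illustration, and the real TextProfileSignature builds its term profile differently. It only shows the idea that a full-content MD5 changes with any edit to the text, while a digest over the frequent terms can stay the same for near-duplicates.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class DigestSketch {

    // Hex-encoded MD5 of the given text (UTF-8 bytes).
    public static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : hash) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Digest over the whole content: any change in the text changes the digest.
    public static String fullDigest(String content) {
        return md5Hex(content);
    }

    // Hypothetical, simplified "profile" digest: lowercase the text, split it
    // into tokens, keep tokens that occur at least minCount times, and hash
    // the sorted list of surviving tokens.
    public static String profileDigest(String content, int minCount) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String tok : content.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (tok.length() < 2) {
                continue; // skip empty and one-character tokens
            }
            counts.merge(tok, 1, Integer::sum);
        }
        StringBuilder profile = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) {
                profile.append(e.getKey()).append('\n');
            }
        }
        return md5Hex(profile.toString());
    }

    public static void main(String[] args) {
        String a = "nutch crawl nutch crawl index index";
        String b = "Nutch crawl, nutch crawl; index index today";

        // Same content fetched under two different URLs would share this digest.
        System.out.println("full digests equal:    " + fullDigest(a).equals(fullDigest(b)));
        // Near-duplicates can share a profile digest even when the full digests differ.
        System.out.println("profile digests equal: " + profileDigest(a, 2).equals(profileDigest(b, 2)));
    }
}
```

Note that the URL never enters either digest, which is why the same URL fetched with different content in several segments would not be caught by a digest-only comparison.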
