[ https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann updated NUTCH-455:
------------------------------------

    Fix Version/s:     (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> dedup on tokenized fields is faulty
> -----------------------------------
>
>                 Key: NUTCH-455
>                 URL: https://issues.apache.org/jira/browse/NUTCH-455
>             Project: Nutch
>          Issue Type: Bug
>          Components: searcher
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: IndexSearcherCacheWarm.patch
>
>
> (From LUCENE-252)
> Nutch uses several index servers, and the search results from these servers
> are merged using a dedup field for deleting duplicates. The values of this
> field are cached by Lucene's FieldCacheImpl. The default is the site field,
> which is indexed and tokenized. However, for a tokenized field (for example
> "url" in Nutch), FieldCacheImpl returns an array of terms rather than an
> array of field values, so dedup'ing becomes faulty. The current FieldCache
> implementation does not respect tokenized fields and, as described above,
> caches only terms.
> So when we search using "url" as the dedup field and a Hit is constructed
> in IndexSearcher, the dedupValue becomes a token of the url (such as "www"
> or "com") rather than the whole url. This prevents using tokenized fields
> as the dedup field.
> I have written a patch for Lucene and attached it in
> http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the
> aforementioned issue with tokenized field caching. However, building such a
> cache for about 1.5M documents takes 20+ secs. The code in
> IndexSearcher.translateHits() starts with
>     if (dedupField != null)
>       dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);
> and the cache is built on the first call to search in IndexSearcher.
> Long story short, I have written a patch against IndexSearcher which warms
> up the caches of the wanted fields (configurable) in the constructor.
> I think we should vote for LUCENE-252, and then commit the above patch
> with the latest version of Lucene.
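
For illustration only, the failure mode and the warm-up idea described in the issue can be modeled without Lucene. The sketch below is a simplified, hypothetical model in plain Java, not the attached patch and not Lucene's or Nutch's code: `TokenizedFieldCacheSketch`, `addField`, `getStrings`, `warm`, and the `rawUrl` field are invented names, and the tokenizer is a crude split on non-alphanumeric characters. It assumes the cache is filled by walking the field's terms in sorted order and overwriting each document's slot, so a tokenized field ends up with a single token per document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for an index reader plus a term-based string cache.
class TokenizedFieldCacheSketch {

    // field -> (term -> ids of documents containing the term), terms kept sorted
    private final Map<String, TreeMap<String, List<Integer>>> postings = new HashMap<>();
    // field -> cached per-document values
    private final Map<String, String[]> cache = new HashMap<>();
    private final int maxDoc;
    int cacheBuilds = 0; // how many times a cache array was actually built

    TokenizedFieldCacheSketch(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    // Index one field value. A tokenized field is split into lower-cased tokens;
    // an untokenized field is stored as one term holding the whole value.
    void addField(int doc, String field, String value, boolean tokenized) {
        String[] terms = tokenized
                ? value.toLowerCase().split("[^a-z0-9]+")
                : new String[] { value };
        TreeMap<String, List<Integer>> byTerm =
                postings.computeIfAbsent(field, f -> new TreeMap<>());
        for (String term : terms) {
            if (!term.isEmpty()) {
                byTerm.computeIfAbsent(term, t -> new ArrayList<>()).add(doc);
            }
        }
    }

    // Build (or reuse) the per-document array by walking terms in sorted order.
    // Each term overwrites the slot of every document that contains it, so a
    // tokenized field keeps only its last term in sort order.
    String[] getStrings(String field) {
        String[] cached = cache.get(field);
        if (cached != null) {
            return cached;
        }
        cacheBuilds++;
        String[] values = new String[maxDoc];
        TreeMap<String, List<Integer>> byTerm =
                postings.getOrDefault(field, new TreeMap<>());
        for (Map.Entry<String, List<Integer>> entry : byTerm.entrySet()) {
            for (int doc : entry.getValue()) {
                values[doc] = entry.getKey();
            }
        }
        cache.put(field, values);
        return values;
    }

    // Pay the build cost up front (e.g. at construction time) for the given fields.
    void warm(String... fields) {
        for (String field : fields) {
            getStrings(field);
        }
    }

    public static void main(String[] args) {
        TokenizedFieldCacheSketch index = new TokenizedFieldCacheSketch(2);
        index.addField(0, "url", "http://www.example.com/a", true);
        index.addField(1, "url", "http://www.example.org/b", true);
        index.addField(0, "rawUrl", "http://www.example.com/a", false);
        index.addField(1, "rawUrl", "http://www.example.org/b", false);

        index.warm("url", "rawUrl"); // caches are built here, not on first search

        String[] tokenized = index.getStrings("url");
        String[] whole = index.getStrings("rawUrl");
        for (int doc = 0; doc < 2; doc++) {
            System.out.println(doc + ": url=" + tokenized[doc] + " rawUrl=" + whole[doc]);
        }
    }
}
```

In this model both documents get the same dedup value from the tokenized `url` field even though their URLs differ, while the untokenized `rawUrl` field keeps them apart. Calling `warm` first means later `getStrings` calls only hit the cache.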