[ https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478854 ]

Doug Cutting commented on NUTCH-455:
------------------------------------

Alternatively, we could define it as an error to attempt to dedup by a tokenized 
field.  That's the (undocumented) expectation of FieldCache.  Using documents 
to populate a FieldCache for tokenized fields is very slow.  It's better to add 
an untokenized version and use that, no?  If you agree, then the more 
appropriate fix is to document the restriction and try to check for it at 
runtime.
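
A minimal sketch of the untokenized-copy approach, assuming the Lucene 2.x Field
API and a hypothetical "site_dedup" field name (not part of the Nutch schema);
the dedup field would then point at the untokenized copy, which FieldCache can
load cheaply:

    // Index the site value twice: tokenized for searching, untokenized for dedup.
    // For "site_dedup", FieldCache.getStrings() then sees exactly one term per document.
    doc.add(new Field("site", site, Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("site_dedup", site, Field.Store.NO, Field.Index.UN_TOKENIZED));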

> dedup on tokenized fields is faulty
> -----------------------------------
>
>                 Key: NUTCH-455
>                 URL: https://issues.apache.org/jira/browse/NUTCH-455
>             Project: Nutch
>          Issue Type: Bug
>          Components: searcher
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>             Fix For: 0.9.0
>
>         Attachments: IndexSearcherCacheWarm.patch
>
>
> (From LUCENE-252) 
> Nutch uses several index servers, and the search results from these servers 
> are merged using a dedup field for deleting duplicates. The values of this 
> field are cached by Lucene's FieldCacheImpl. The default is the site field, 
> which is indexed and tokenized. However, for a tokenized field (for example 
> "url" in Nutch), FieldCacheImpl returns an array of terms rather than an 
> array of field values, so dedup'ing becomes faulty. The current FieldCache 
> implementation does not handle tokenized fields and, as described above, 
> caches only terms. 
> So when we search with "url" as the dedup field, the dedupValue set when a 
> Hit is constructed in IndexSearcher is a single token of the URL (such as 
> "www" or "com") rather than the whole URL. This prevents using tokenized 
> fields as the dedup field. 
> I have written a patch for Lucene and attached it to 
> http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the 
> aforementioned issue with tokenized field caching. However, building such a 
> cache for about 1.5M documents takes 20+ seconds. The code in 
> IndexSearcher.translateHits() starts with
> if (dedupField != null) 
>       dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);
> so the cache is built on the first search call in IndexSearcher. 
> Long story short, I have written a patch against IndexSearcher which warms 
> up the caches of the wanted fields (configurable) in the constructor. I 
> think we should vote on LUCENE-252 and then commit the attached patch along 
> with the latest version of Lucene.
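
A minimal sketch of the warm-up idea described above, assuming a hypothetical,
configurable warmUpFields array (the actual change is in the attached
IndexSearcherCacheWarm.patch); it simply forces FieldCache population in the
constructor so the 20+ second cache build does not hit the first query:

    // Hypothetical warm-up loop, run once when the searcher is constructed.
    // Each getStrings() call populates FieldCache for that field up front
    // (IOException handling omitted for brevity).
    for (int i = 0; i < warmUpFields.length; i++) {
      FieldCache.DEFAULT.getStrings(reader, warmUpFields[i]);
    }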

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
