[Nutch-dev] [jira] Commented: (NUTCH-455) dedup on tokenized fields is faulty

Enis Soztutar (JIRA) Thu, 08 Mar 2007 00:32:51 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479262
 ]


Enis Soztutar commented on NUTCH-455:
-------------------------------------

(from LUCENE-252)

In Nutch we have three options: the 1st is to disallow deleting duplicates on 
tokenized fields (because of FieldCache), the 2nd is to index the tokenized 
field twice (once tokenized and once untokenized), and the 3rd is to use 
LUCENE-252 and the above patch, and warm the cache initially in the index 
servers.

I am in favor of the 3rd option. 
I think resolving LUCENE-252 first, and then proceeding with NUTCH-455, is 
more sensible. 
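
For illustration, the warm-up idea behind the 3rd option could look roughly 
like the sketch below. This is plain Java and not the attached 
IndexSearcherCacheWarm.patch: WarmSearcherSketch and FieldLoader are 
hypothetical names, and FieldLoader only stands in for the expensive 
per-field scan (the FieldCache.DEFAULT.getStrings call in the real code).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class WarmSearcherSketch {

    /** Hypothetical stand-in for the slow per-field cache build. */
    interface FieldLoader {
        String[] load(String field);
    }

    private final Map<String, String[]> cache = new ConcurrentHashMap<>();
    private final FieldLoader loader;

    /** Builds the caches for the configured fields up front, in the constructor. */
    WarmSearcherSketch(FieldLoader loader, List<String> fieldsToWarm) {
        this.loader = loader;
        for (String field : fieldsToWarm) {
            values(field);
        }
    }

    /** Returns the cached per-document values, loading them only on first use. */
    String[] values(String field) {
        return cache.computeIfAbsent(field, loader::load);
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        FieldLoader slow = field -> {
            loads.incrementAndGet(); // count how often the expensive scan runs
            return new String[] {field + "-value-0", field + "-value-1"};
        };
        WarmSearcherSketch searcher =
                new WarmSearcherSketch(slow, Arrays.asList("site", "url"));
        System.out.println("loads after construction: " + loads.get());
        searcher.values("url"); // already warm, so no further scan
        System.out.println("loads after first search: " + loads.get());
    }
}
```

With this shape the cost of the scan is paid once at index server start-up 
instead of on the first user query.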

> dedup on tokenized fields is faulty
> -----------------------------------
>
>                 Key: NUTCH-455
>                 URL: https://issues.apache.org/jira/browse/NUTCH-455
>             Project: Nutch
>          Issue Type: Bug
>          Components: searcher
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>             Fix For: 0.9.0
>
>         Attachments: IndexSearcherCacheWarm.patch
>
>
> (From LUCENE-252) 
> Nutch uses several index servers, and the search results from these servers 
> are merged using a dedup field for deleting duplicates. The values of 
> this field are cached by Lucene's FieldCacheImpl. The default is the site 
> field, which is indexed and tokenized. However, for a tokenized field (for 
> example "url" in Nutch), FieldCacheImpl returns an array of terms rather than 
> an array of field values, so dedup'ing becomes faulty. The current FieldCache 
> implementation does not respect tokenized fields and, as described above, 
> caches only terms. 
> So when we search using "url" as the dedup field and a Hit is constructed 
> in IndexSearcher, the dedupValue becomes a token of the url (such as "www" 
> or "com") rather than the whole url. This prevents using tokenized fields 
> as the dedup field. 
> I have written a patch for Lucene and attached it to 
> http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the 
> aforementioned issue with tokenized field caching. However, building such a 
> cache for about 1.5M documents takes 20+ secs. The code in 
> IndexSearcher.translateHits() starts with
> if (dedupField != null) 
>       dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);
> and the cache is built on the first call to search in IndexSearcher. 
> Long story short, I have written a patch against IndexSearcher that warms 
> up the caches of the wanted fields (configurable) in the constructor. I 
> think we should vote for LUCENE-252, and then commit the above patch with 
> the latest version of Lucene.
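
The term-versus-value problem described in the issue can be reproduced 
without Lucene. The sketch below is a simplified model, not Lucene code: it 
assumes the cache is filled by walking the field's terms in sorted order and 
writing each term into the slot of every document that contains it, so on a 
tokenized field the last token wins. The class, the tokenize() rule and the 
sample URLs are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class TokenizedDedupSketch {

    /** Hypothetical analyzer: lower-case and split on anything that is not a letter or digit. */
    static List<String> tokenize(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Builds a sorted term -> document ids map, like an inverted index for one field. */
    static TreeMap<String, List<Integer>> invert(String[] docs, boolean tokenized) {
        TreeMap<String, List<Integer>> index = new TreeMap<>();
        for (int doc = 0; doc < docs.length; doc++) {
            List<String> terms = tokenized
                    ? tokenize(docs[doc])
                    : Collections.singletonList(docs[doc]);
            for (String term : terms) {
                index.computeIfAbsent(term, k -> new ArrayList<>()).add(doc);
            }
        }
        return index;
    }

    /** One slot per document; terms are visited in order, so later terms overwrite earlier ones. */
    static String[] cacheStrings(TreeMap<String, List<Integer>> index, int maxDoc) {
        String[] values = new String[maxDoc];
        for (Map.Entry<String, List<Integer>> e : index.entrySet()) {
            for (int doc : e.getValue()) {
                values[doc] = e.getKey();
            }
        }
        return values;
    }

    /** Number of distinct dedup values, i.e. how many hits would survive dedup. */
    static int distinct(String[] values) {
        return new HashSet<>(Arrays.asList(values)).size();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.apache.org/a",
            "http://www.apache.org/b",
            "http://lucene.apache.org/a"
        };
        String[] tokenized = cacheStrings(invert(urls, true), urls.length);
        String[] whole = cacheStrings(invert(urls, false), urls.length);
        System.out.println("tokenized cache:   " + Arrays.toString(tokenized));
        System.out.println("untokenized cache: " + Arrays.toString(whole));
        System.out.println("hits kept: " + distinct(tokenized) + " vs " + distinct(whole));
    }
}
```

In this model the first two URLs both end up with the dedup value "www", so 
two different pages are treated as duplicates, while the untokenized variant 
keeps all three.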

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

