[
https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479262
]
Enis Soztutar commented on NUTCH-455:
-------------------------------------
(from LUCENE-252)
In Nutch we have three options: the first is to disallow deleting duplicates
on tokenized fields (because of FieldCache); the second is to index the
tokenized field twice (once tokenized and once untokenized); the third is to
use LUCENE-252 and the above patch, and to warm the cache initially in the
index servers.
I am in favor of the 3rd option.
I think first resolving LUCENE-252, and then proceeding with NUTCH-255 is more
sensible.
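To illustrate why a term-driven cache breaks dedup on a tokenized field, here
is a small self-contained sketch. It does not use Lucene: the tokenizer and the
buildTermCache helper are hypothetical stand-ins, and the assumption is that
the cache is filled by walking the terms in sorted order and assigning each
term to every document that contains it, so the last term wins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TermCacheDemo {
    // Crude stand-in for an analyzer (assumption): lowercase, split on non-alphanumerics.
    static List<String> tokenize(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Fills a per-document array by iterating terms in sorted order and
    // overwriting the slot of every document that contains the term.
    static String[] buildTermCache(List<List<String>> docTerms) {
        TreeMap<String, List<Integer>> postings = new TreeMap<>();
        for (int doc = 0; doc < docTerms.size(); doc++) {
            for (String term : docTerms.get(doc)) {
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(doc);
            }
        }
        String[] cache = new String[docTerms.size()];
        for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
            for (int doc : e.getValue()) {
                cache[doc] = e.getKey(); // a later term replaces an earlier one
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        String a = "http://www.example.com/a";
        String b = "http://www.example.com/b";

        // Tokenized field: each document contributes several terms.
        String[] tokenized = buildTermCache(Arrays.asList(tokenize(a), tokenize(b)));
        // Untokenized field: each document contributes exactly one term, the whole value.
        String[] untokenized = buildTermCache(
            Arrays.asList(Collections.singletonList(a), Collections.singletonList(b)));

        System.out.println("tokenized:   " + Arrays.toString(tokenized));
        System.out.println("untokenized: " + Arrays.toString(untokenized));
    }
}
```

With the tokenized field both documents end up with the same token as their
dedup value, so two distinct urls would be treated as duplicates. With the
untokenized copy (the second option above) each document keeps its whole url.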
> dedup on tokenized fields is faulty
> -----------------------------------
>
> Key: NUTCH-455
> URL: https://issues.apache.org/jira/browse/NUTCH-455
> Project: Nutch
> Issue Type: Bug
> Components: searcher
> Affects Versions: 0.9.0
> Reporter: Enis Soztutar
> Fix For: 0.9.0
>
> Attachments: IndexSearcherCacheWarm.patch
>
>
> (From LUCENE-252)
> Nutch uses several index servers, and the search results from these servers
> are merged using a dedup field for deleting duplicates. The values of
> this field are cached by Lucene's FieldCacheImpl. The default is the site
> field, which is indexed and tokenized. However, for a tokenized field (for
> example "url" in Nutch), FieldCacheImpl returns an array of terms rather than
> an array of field values, so dedup'ing becomes faulty. The current FieldCache
> implementation does not respect tokenized fields and, as described above,
> caches only terms.
> So when we search using "url" as the dedup field and a Hit is constructed
> in IndexSearcher, the dedupValue becomes a token of the url (such as "www"
> or "com") rather than the whole url. This prevents using tokenized fields
> as the dedup field.
> I have written a patch for Lucene and attached it to
> http://issues.apache.org/jira/browse/LUCENE-252. This patch fixes the
> aforementioned issue with tokenized field caching. However, building such a
> cache for about 1.5M documents takes 20+ secs. The code in
> IndexSearcher.translateHits() starts with
>     if (dedupField != null)
>       dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);
> and the cache is built on the first call to search in IndexSearcher.
> Long story short, I have written a patch against IndexSearcher which warms
> up the caches of the wanted fields (configurable) in the constructor. I think
> we should vote for LUCENE-252, and then commit the above patch with the
> latest version of Lucene.
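
The warm-up idea in the quoted description can be sketched roughly as follows.
This is not the attached IndexSearcherCacheWarm.patch and it does not use
Lucene: LazyFieldCache, Searcher and the warmFields list are hypothetical names
chosen for illustration. The only point is that touching the cache in the
constructor moves the build cost from the first search to startup.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class WarmupDemo {
    // Hypothetical stand-in for a lazily populated per-field value cache.
    static class LazyFieldCache {
        private final Map<String, String[]> byField = new HashMap<>();
        private final Function<String, String[]> loader;
        int loads = 0; // counts how often the expensive build ran

        LazyFieldCache(Function<String, String[]> loader) {
            this.loader = loader;
        }

        synchronized String[] getStrings(String field) {
            String[] values = byField.get(field);
            if (values == null) {
                loads++;
                values = loader.apply(field); // the slow part for large indexes
                byField.put(field, values);
            }
            return values;
        }
    }

    // Hypothetical searcher that warms the configured fields when constructed.
    static class Searcher {
        private final LazyFieldCache cache;

        Searcher(LazyFieldCache cache, List<String> warmFields) {
            this.cache = cache;
            for (String field : warmFields) {
                cache.getStrings(field); // build now instead of on the first search
            }
        }

        String dedupValue(String dedupField, int doc) {
            return cache.getStrings(dedupField)[doc];
        }
    }

    public static void main(String[] args) {
        LazyFieldCache cache = new LazyFieldCache(
            field -> new String[] {"example.com", "example.org"});
        Searcher searcher = new Searcher(cache, Arrays.asList("site"));
        System.out.println("loads after construction: " + cache.loads);
        System.out.println("dedup value of doc 1: " + searcher.dedupValue("site", 1));
        System.out.println("loads after first search: " + cache.loads);
    }
}
```

Fields that are not listed for warming are still built lazily on first use, so
the list of warmed fields only needs to cover the dedup fields actually used.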
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers