[jira] Commented: (NUTCH-684) Dedup support for Solr

Andrzej Bialecki (JIRA) Fri, 20 Feb 2009 02:03:28 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675309#action_12675309
 ]


Andrzej Bialecki  commented on NUTCH-684:
-----------------------------------------

A few comments on this patch (and on other closely related classes in 
o.a.n.i.solr):

* we need javadocs in this patch - both class-level and for public methods. The 
class-level javadoc should contain pseudo-code to illustrate the selection 
process (see o.a.n.i.DeleteDuplicates for an example).

* there is a silent assumption that the Solr schema uses the "id" field as its 
unique key, and that this field contains the URL of the document. First, 
shouldn't this be the "url" field? As far as I can see, the field name "id" is 
not used anywhere in SolrIndexer/SolrWriter - please correct me if I missed 
something. At the very least, this assumption should be spelled out in 
javadocs, both on the indexing side and on the dedup side. (Actually, we 
should have added an example of the minimum required Solr schema when the 
original Nutch/Solr integration was committed.)

* field names should be constants and not magic literals; they should come 
either from o.a.n.metadata.Nutch or be defined in SolrConstants.

* SolrServer.deleteById() creates and sends an UpdateRequest containing just 
that single id. This is inefficient, especially in our case, where the number 
of deletes may be significant. Perhaps this patch works sufficiently well for 
now, but it should be improved (either here or in a separate issue) by using a 
single UpdateRequest per reduce task, and calling 
SolrServer.request(UpdateRequest) with the accumulated ids.
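
To illustrate the last point, here is a minimal sketch of accumulating ids and 
sending them in batches. It deliberately does not use SolrJ: the class names 
BatchedDeleteSketch and DeleteBuffer, the batchSize parameter and the Consumer 
callback are all hypothetical, and the callback only stands in for whatever 
would build one UpdateRequest from the ids and pass it to 
SolrServer.request(UpdateRequest). It is not the code in the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Sketch only: buffers document ids and sends them in groups instead of
 * one request per id. The Consumer is a placeholder for the real Solr call
 * (an assumption for illustration, not SolrJ API).
 */
class BatchedDeleteSketch {

  /** Collects ids and hands them to the sender at most batchSize at a time. */
  static final class DeleteBuffer {
    private final int batchSize;
    private final Consumer<List<String>> sender;
    private final List<String> pending = new ArrayList<>();

    DeleteBuffer(int batchSize, Consumer<List<String>> sender) {
      if (batchSize < 1) {
        throw new IllegalArgumentException("batchSize must be >= 1");
      }
      this.batchSize = batchSize;
      this.sender = sender;
    }

    /** Queues one id and flushes as soon as the buffer is full. */
    void delete(String id) {
      pending.add(id);
      if (pending.size() >= batchSize) {
        flush();
      }
    }

    /** Sends whatever is pending as a single request; a no-op when empty. */
    void flush() {
      if (pending.isEmpty()) {
        return;
      }
      sender.accept(new ArrayList<>(pending)); // copy, so the sender owns its list
      pending.clear();
    }
  }

  public static void main(String[] args) {
    List<List<String>> requests = new ArrayList<>();
    DeleteBuffer buffer = new DeleteBuffer(2, requests::add);
    for (String url : new String[] {"http://a/", "http://b/", "http://c/"}) {
      buffer.delete(url);
    }
    buffer.flush(); // would be called once when the reduce task closes
    System.out.println(requests.size() + " requests for 3 ids");
  }
}
```

With a batch size larger than the number of deletes in a task, the final 
flush() sends everything in one request, which matches the "single 
UpdateRequest per reduce task" suggestion; a smaller batch size only bounds 
how many ids are held in memory at once.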

> Dedup support for Solr
> ----------------------
>
>                 Key: NUTCH-684
>                 URL: https://issues.apache.org/jira/browse/NUTCH-684
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>         Attachments: NUTCH-684_bin_nutch.patch, NUTCH-684_solrdedup_v2.patch, 
> solrdedup.patch
>
>
> After NUTCH-442, nutch now can index to both solr and lucene. However, 
> duplicate deletion feature (based on digests) is only available in lucene. It 
> should also be available for solr.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
