[ 
https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675315#action_12675315
 ] 

Doğacan Güney commented on NUTCH-684:
-------------------------------------

I wasn't planning to put this in for 1.0, but if people want this feature I 
will get it ready for 1.0.

bq.    *  there is a silent assumption that Solr schema uses "id" field as 
unique key, and that this field contains the URL of the document. First, 
shouldn't this be "url" field? Because as far as I can see the field name "id" 
is not used anywhere in SolrIndexer/SolrWriter - please correct me if I missed 
something. At least this assumption should be spelled out in javadocs, both on 
the indexing side and on the dedup side. (Actually, we should have added an 
example of the minimum required Solr schema when the original Nutch/Solr 
integration was committed)

    * field names should be constants and not magic literals, they should come 
either from o.a.n.metadata.Nutch or be defined in SolrConstants.

This is something I have been thinking about for a while. My assumption was 
that you shouldn't have to use the "url" field as the unique field in your 
Solr server, so I added an extra "id" field (which in NUTCH-442's schema.xml 
is copied from "url"). But I am no longer sure the flexibility is worth the 
cost of an extra field.

I agree with you that we should have an officially blessed Solr schema XML 
file somewhere in our codebase. I guess NUTCH-442's schema is a good starting 
point for that, but I am open to suggestions. I will create a new issue for 
it.

bq.  SolrServer.deleteById() creates and sends UpdateRequest containing just 
this single id. This is inefficient, especially in our case where the number of 
deletes may be significant. Perhaps this patch works sufficiently well for now, 
but it should be improved (either here or in a separate issue) by using a 
single UpdateRequest per reduce task, and calling 
SolrServer.request(UpdateRequest) with the accumulated id-s.

Good point. I will send an improved patch.
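
For reference, a minimal sketch of the batching idea quoted above. It is not 
the patch itself: DeleteSender and BatchedDeleter are hypothetical names, and 
the maxBatch cap is an added assumption (use Integer.MAX_VALUE to get exactly 
one request per reduce task). In the real reducer, the sender would presumably 
build one UpdateRequest from the accumulated ids and pass it to 
SolrServer.request(UpdateRequest), as suggested in the quote.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the code that sends one delete request to Solr. */
interface DeleteSender {
    void send(List<String> ids);
}

/** Accumulates document ids and sends them in batches instead of one by one. */
class BatchedDeleter {
    private final DeleteSender sender;
    private final int maxBatch; // assumption: optional cap on ids per request
    private final List<String> pending = new ArrayList<String>();

    BatchedDeleter(DeleteSender sender, int maxBatch) {
        if (maxBatch < 1) {
            throw new IllegalArgumentException("maxBatch must be >= 1");
        }
        this.sender = sender;
        this.maxBatch = maxBatch;
    }

    /** Queues an id; sends the batch only once the cap is reached. */
    void delete(String id) {
        pending.add(id);
        if (pending.size() >= maxBatch) {
            flush();
        }
    }

    /** Sends whatever is queued; call this when the reduce task closes. */
    void flush() {
        if (pending.isEmpty()) {
            return; // nothing queued, so no request is sent
        }
        // Hand over a copy so clearing the queue does not affect the sender.
        sender.send(new ArrayList<String>(pending));
        pending.clear();
    }
}
```

The reducer would call delete(id) for each duplicate and flush() once at the 
end, so the number of requests depends on the batch size and not on the 
number of deleted documents.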


> Dedup support for Solr
> ----------------------
>
>                 Key: NUTCH-684
>                 URL: https://issues.apache.org/jira/browse/NUTCH-684
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>         Attachments: NUTCH-684_bin_nutch.patch, NUTCH-684_solrdedup_v2.patch, 
> solrdedup.patch
>
>
> After NUTCH-442, Nutch can now index to both Solr and Lucene. However, the 
> duplicate deletion feature (based on digests) is only available for Lucene. 
> It should also be available for Solr.

