[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675748#action_12675748
 ] 

Lance Norskog commented on SOLR-799:
------------------------------------

I came into Solr with no search experience, and the learning curve was steep. 
The modular design of the configuration really helped, and we should maintain 
that modularity. There are two separate designs to consider: the design of the 
configuration and the design of the implementation. This comment addresses only 
the design of the configuration files.

The patch as committed moves the specification of one field out of schema.xml 
and into another file. This breaks the modularity of the configuration. I 
suggest that the files should look like this:

schema.xml:
<field name="signatureField" type="signatureField" indexed="true"
       stored="false" signature="solr.TextProfileSignature"
       fields="product_name, model_t, *_s" />

solrconfig.xml:
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <str name="signatureField">signatureField</str>
    <bool name="enabled">false</bool>
    <bool name="overwriteDupes">true</bool>
  </processor>

  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

That is, the definition of the signature field should go in schema.xml, and 
each updateRequestProcessorChain section should describe only how the field is 
used by the chain with that section's declared name. Also, there should be no 
default field, since every field in the schema should be described in 
schema.xml.
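
As a rough sketch of what this split would allow, two chains in solrconfig.xml 
could share the single field definition in schema.xml and differ only in how 
they use it. This assumes the layout proposed above, which is not what the 
committed patch does, and the chain names "dedupe-overwrite" and "dedupe-keep" 
are hypothetical:

```xml
<!-- Hypothetical sketch of the proposed layout, not the committed patch.
     Both chains point at the same signatureField declared in schema.xml;
     only the usage settings differ. Chain names are made up for illustration. -->
<updateRequestProcessorChain name="dedupe-overwrite">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <str name="signatureField">signatureField</str>
    <bool name="enabled">true</bool>
    <!-- replace an existing document that has the same signature -->
    <bool name="overwriteDupes">true</bool>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<updateRequestProcessorChain name="dedupe-keep">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <str name="signatureField">signatureField</str>
    <bool name="enabled">true</bool>
    <!-- compute and store the signature, but do not overwrite on a match -->
    <bool name="overwriteDupes">false</bool>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

Neither chain would repeat the signature class or the list of source fields, 
since under this proposal those would live only in schema.xml.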



> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Assignee: Yonik Seeley
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: SOLR-799.patch, SOLR-799.patch, SOLR-799.patch, 
> SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking 
> as well as field collapsing. Let's put it into Solr. 
> http://wiki.apache.org/solr/Deduplication

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
