[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

Hoss Man (JIRA) Mon, 23 Feb 2009 11:58:29 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676041#action_12676041
 ]


Hoss Man commented on SOLR-799:
-------------------------------

The separation of concerns between schema.xml and solrconfig.xml has always 
been...

 * schema.xml: what is the data, what is its nature, what are its intrinsic 
properties?
 * solrconfig.xml: what can people do with your data, how can they use it?

fields, fieldTypes, analyzers, copyFields go in the schema.xml because they are 
(in theory) intrinsic to the nature of your data regardless of where a given 
document comes from: 
 * documents should only have one author
 * categoryName should always be tokenized in a particular way
 * prices need to sort numerically, not lexicographically
 * any text indexed in the shortSummary field should also be indexed in the 
searchableAbstract field
 * etc...

Request handlers that dictate how people can use the data are specified in 
solrconfig.xml -- when searching data, request handlers (which may leverage 
search components) dictate what a user is allowed to get/see; when modifying an 
index, request handlers (which may leverage update processors) dictate what data 
is allowed to come from various sources and in what formats.

In short: as far as document indexing goes, the options configured in 
solrconfig.xml specify how to "build up" a Document object from user input, 
while the options in schema.xml specify how to "tear it down" into its 
individual terms and values for indexing.
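
To make the "build up" / "tear down" split concrete, here is a minimal sketch in 
Python. It is not Solr code: the processor, the SCHEMA table and the function 
names are all hypothetical stand-ins, chosen only to show which side owns which 
step.

```python
from typing import Callable, Dict, List

Document = Dict[str, str]
Processor = Callable[[Document], Document]


def add_default_category(doc: Document) -> Document:
    """Hypothetical update processor: fills in a missing field."""
    out = dict(doc)
    out.setdefault("categoryName", "uncategorized")
    return out


def build_up(user_input: Document, chain: List[Processor]) -> Document:
    """solrconfig.xml side (assumed model): run input through a processor chain."""
    doc = dict(user_input)
    for processor in chain:
        doc = processor(doc)
    return doc


# Hypothetical per-field analyzers, standing in for schema.xml field types.
SCHEMA: Dict[str, Callable[[str], List[str]]] = {
    "title": lambda value: value.lower().split(),  # tokenized and lowercased
    "categoryName": lambda value: [value],         # kept as a single term
}


def tear_down(doc: Document) -> Dict[str, List[str]]:
    """schema.xml side (assumed model): break each field value into terms."""
    return {field: SCHEMA[field](value) for field, value in doc.items()}


doc = build_up({"title": "Near Duplicate Detection"}, [add_default_category])
terms = tear_down(doc)
```

The point of the sketch is that tear_down never needs to know where the 
document came from, while build_up is free to vary per source.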

With the near duplicate detection code, it is the schema's job to say which 
fields can exist in the input documents, including a signature field -- but it 
is the solrconfig's job to decide how to compute that signature field ... after 
all: the computation might be different depending on the source of the data 
(i.e., different processor chains could be configured for different request 
handlers).
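
As a rough illustration of that last point, the sketch below wires two 
different signature computations to two different handlers. Everything in it is 
an assumption made for the example: the handler paths, the field names and the 
use of SHA-256 are invented, it only detects exact duplicates of the chosen 
fields, and it is not the code in the attached patch.

```python
import hashlib
from typing import Callable, Dict, List

Document = Dict[str, str]
Processor = Callable[[Document], Document]


def signature_processor(fields: List[str], target: str = "signature") -> Processor:
    """Build a processor that hashes the given fields into the target field."""
    def process(doc: Document) -> Document:
        digest = hashlib.sha256()
        for name in fields:
            # Missing fields contribute an empty string to the hash.
            digest.update(doc.get(name, "").encode("utf-8"))
        out = dict(doc)  # leave the caller's document untouched
        out[target] = digest.hexdigest()
        return out
    return process


# Hypothetical handler paths, each with its own processor chain.
CHAINS: Dict[str, List[Processor]] = {
    "/update/feed": [signature_processor(["title", "body"])],
    "/update/crawl": [signature_processor(["url"])],
}


def handle_update(handler: str, doc: Document) -> Document:
    """Run a document through the chain configured for this handler."""
    for processor in CHAINS[handler]:
        doc = processor(doc)
    return doc
```

Both chains write to the same signature field, which is all the schema would 
need to declare; which input fields feed the hash is decided per handler.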

> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Assignee: Yonik Seeley
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: SOLR-799.patch, SOLR-799.patch, SOLR-799.patch, 
> SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking 
> as well as field collapsing. Let's put it into Solr. 
> http://wiki.apache.org/solr/Deduplication

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
