[ 
https://issues.apache.org/jira/browse/NUTCH-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398836#comment-13398836
 ] 

Markus Jelsma commented on NUTCH-1407:
--------------------------------------

We usually filter subscribers by host or by a small group of hosts. That is not 
feasible, however, for subscribers with millions of subdomains. In Solr the same 
result can be achieved with copyFields and some regular expressions, or with a 
custom update processor, but that is cumbersome. Doing it in Nutch with URLUtil 
also has the advantage that it understands domains with more than one 
extension/suffix (co.uk, for example).
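
To illustrate why multi-part suffixes matter, here is a minimal sketch of the 
idea. It is not Nutch's URLUtil code: the function name domain_of and the tiny 
SUFFIXES set are hypothetical and exist only for this example. A real setup 
would rely on a complete suffix list.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately tiny suffix set for illustration only.
# A real deployment would load a complete list of domain suffixes.
SUFFIXES = {"com", "org", "nl", "uk", "co.uk"}


def domain_of(url):
    """Return the longest known suffix plus one label, or the host as fallback."""
    host = urlparse(url).hostname
    if not host:
        return None
    labels = host.lower().split(".")
    # Check the longest candidate suffix first so "co.uk" wins over "uk".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in SUFFIXES:
            return ".".join(labels[i - 1:])
    # No known suffix found: fall back to the whole host.
    return host


if __name__ == "__main__":
    for url in ("http://a.b.example.co.uk/page", "http://www.example.com/"):
        print(url, "->", domain_of(url))
```

With a field like this in the index, all subdomains of one subscriber can be 
matched with a single filter value instead of a list of hosts.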
                
> BasicIndexingFilter to optionally add domain field
> --------------------------------------------------
>
>                 Key: NUTCH-1407
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1407
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.6
>
>         Attachments: NUTCH-1407-1.6-1.patch
>
>
> The basic indexing filter already adds the host field to a NutchDocument but 
> not a domain field. In Solr you can copyField the host field to obtain a 
> domain field, but this is a bit cumbersome and not very user-friendly.

