[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Enis Soztutar updated NUTCH-445:
--------------------------------

    Attachment: index_query_domain_v1.2.patch

This patch is an update of the previous three patches. In this patch:
1. TranslatingRawFieldQueryFilter is added as an abstract implementation for searching certain fields in the index under a different query field name.
2. index-basic indexes the domain and all "super domains" in the domain field.
3. query-site is changed so that site:<site_name> searches domain:<site_name>.

With this plugin we can search site:apache.org and get results from http://issues.apache.org, etc., or we can search site:com to retrieve all .com domains.

> Domain İndexing / Query Filter
> ------------------------------
>
>                 Key: NUTCH-445
>                 URL: https://issues.apache.org/jira/browse/NUTCH-445
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, searcher
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: index_query_domain_v1.0.patch, index_query_domain_v1.1.patch, index_query_domain_v1.2.patch, TranslatingRawFieldQueryFilter_v1.0.patch
>
>
> Hostnames contain information about the domain of the host and all of its subdomains. Indexing and searching the domains is important for intuitive behavior.
> From the DomainIndexingFilter javadoc:
> Adds the domain(hostname) and all super domains to the index.
>  * <br> For http://lucene.apache.org/nutch/ the
>  * following will be added to the index : <br>
>  * <ul>
>  * <li>lucene.apache.org </li>
>  * <li>apache</li>
>  * <li>org </li>
>  * </ul>
>  * All hostnames are domain names, but not all the domain names are
>  * hostnames. In the above example hostname lucene is a
>  * subdomain of apache.org, which is itself a subdomain of
>  * org <br>
>  *
>
> Currently the basic indexing filter indexes the hostname in the site field, and the query-site plugin allows searching in the site field. However, site:apache.org will not return http://lucene.apache.org.
> By indexing the domain, we can search by domain. Unlike a search on the site field (indexed by BasicIndexingFilter), a search on the domain field retrieves lucene.apache.org for the query apache.org.
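
The "super domain" expansion described above can be illustrated with a small sketch. This is not the code from the attached patch: the class name SuperDomains and the method expand are hypothetical, and the sketch assumes each super domain is the full dotted suffix of the host name (apache.org rather than apache), which is what a query such as site:apache.org would need to match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper (not part of the patch): lists a host name and its super domains. */
final class SuperDomains {

    /** Returns the host followed by each shorter dotted suffix, e.g. a.b.c -> [a.b.c, b.c, c]. */
    static List<String> expand(String host) {
        List<String> domains = new ArrayList<>();
        String current = host.toLowerCase(Locale.ROOT);
        while (!current.isEmpty()) {
            domains.add(current);
            int dot = current.indexOf('.');
            if (dot < 0) {
                break; // reached the top-level domain
            }
            current = current.substring(dot + 1); // drop the leftmost label
        }
        return domains;
    }

    public static void main(String[] args) {
        // Values that would go into the domain field for this host (assumption).
        List<String> indexed = expand("lucene.apache.org");
        System.out.println(indexed);                        // [lucene.apache.org, apache.org, org]
        // A site:apache.org query translated to domain:apache.org would match.
        System.out.println(indexed.contains("apache.org")); // true
        // A site:com query would not match this host.
        System.out.println(indexed.contains("com"));        // false
    }
}
```

Indexing every suffix as its own term keeps the query side simple: a translated domain:<site_name> query is an exact term match, so no wildcard or suffix search is needed at query time.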