[
https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Enis Soztutar updated NUTCH-445:
--------------------------------
Attachment: index_query_domain_v1.2.patch
This patch updates the previous three patches.
The patch:
1. contains TranslatingRawFieldQueryFilter, an abstract implementation for
searching certain fields in the index under a different query field name.
2. changes index-basic to index the domain and all "super domains" in the
domain field.
3. changes query-site so that site:<site_name> searches domain:<site_name>.
With these changes we can search site:apache.org and get results from
http://issues.apache.org, etc., or search site:com to retrieve all .com
domains.
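
To make the "super domains" idea concrete, here is a minimal standalone
sketch. It is not the code from the patch: the class DomainSketch and the
methods superDomains and matchesSite are hypothetical names chosen for
illustration, and the real plugins work on Lucene index fields rather than
in-memory lists. The sketch only shows which values would end up in the
domain field for a host, and why a site: query then matches subdomains.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not part of Nutch or of the attached patch.
public class DomainSketch {

    /** Returns the host followed by each of its super domains, most specific first. */
    static List<String> superDomains(String host) {
        List<String> domains = new ArrayList<>();
        String current = host.toLowerCase(Locale.ROOT);
        while (true) {
            domains.add(current);
            int dot = current.indexOf('.');
            if (dot < 0) {
                break; // no label left to strip, e.g. "org"
            }
            current = current.substring(dot + 1); // drop the leftmost label
        }
        return domains;
    }

    /** Assumed semantics: site:<siteQuery> matches if it equals any indexed domain value. */
    static boolean matchesSite(String url, String siteQuery) {
        String host = URI.create(url).getHost();
        return host != null
            && superDomains(host).contains(siteQuery.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // prints [lucene.apache.org, apache.org, org]
        System.out.println(superDomains("lucene.apache.org"));
        // prints true
        System.out.println(matchesSite("http://issues.apache.org/jira/", "apache.org"));
        // prints false
        System.out.println(matchesSite("http://issues.apache.org/jira/", "com"));
    }
}
```

Indexing every suffix costs a few extra terms per document, but it lets the
query side stay a plain exact-term match on the domain field.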
> Domain Indexing / Query Filter
> ------------------------------
>
> Key: NUTCH-445
> URL: https://issues.apache.org/jira/browse/NUTCH-445
> Project: Nutch
> Issue Type: New Feature
> Components: indexer, searcher
> Affects Versions: 0.9.0
> Reporter: Enis Soztutar
> Attachments: index_query_domain_v1.0.patch,
> index_query_domain_v1.1.patch, index_query_domain_v1.2.patch,
> TranslatingRawFieldQueryFilter_v1.0.patch
>
>
> Hostnames contain information about the domain of the host and all of its
> subdomains. Indexing and searching the domains is important for intuitive
> behavior.
> From the DomainIndexingFilter javadoc:
> Adds the domain(hostname) and all super domains to the index.
> * <br> For http://lucene.apache.org/nutch/ the
> * following will be added to the index : <br>
> * <ul>
> * <li>lucene.apache.org </li>
> * <li>apache.org </li>
> * <li>org </li>
> * </ul>
> * All hostnames are domain names, but not all the domain names are
> * hostnames. In the above example hostname lucene is a
> * subdomain of apache.org, which is itself a subdomain of
> * org <br>
> *
>
> Currently the basic indexing filter indexes the hostname in the site field,
> and the query-site plugin allows searching the site field. However,
> site:apache.org will not return http://lucene.apache.org.
> By indexing the domain, we can search by domain. Unlike a search on
> the site field (indexed by BasicIndexingFilter), a search on the
> domain field retrieves lucene.apache.org for the query
> apache.org.
>
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers