[ http://issues.apache.org/jira/browse/NUTCH-47?page=comments#action_63322 ]

byron miller commented on NUTCH-47:
-----------------------------------
I will be adding another ~20 million pages to my index over the next few days, and will rerun this query to see whether these results naturally filter out as other ranked pages come in.

When I run this search on Yahoo/Google I see some of the major sections such as fedora.redhat.com and rhn.redhat.com, but I don't necessarily see all of the individual country codes. I wonder if some of these are filtered out on Yahoo/Google because the default language is set to en; as you can see, some of the sites showing up now are non-English.

> Configure host filter to do wildcard prefixes - *.redhat.com
> ------------------------------------------------------------
>
>          Key: NUTCH-47
>          URL: http://issues.apache.org/jira/browse/NUTCH-47
>      Project: Nutch
>         Type: Improvement
>   Components: searcher
>  Environment: Linux
>     Reporter: byron miller
>     Priority: Minor
>
> Right now you can configure the max results per host for query response, but
> that seems limited to exact host matches such as "www.redhat.com".
> In many ways it would be nice to include the capability to match hosts by
> wildcard.
> For example, search for redhat on mozdex.com:
> http://www.mozdex.com/search.jsp?query=redhat
> And you will see:
> www.apac.redhat.com
> www.europe.redhat.com
> www.in.redhat.com
> Could this be fixed so that *.redhat.com is under "find more sources under
> redhat.com" or something like that?
> I may be able to tweak the other processes, but I can envision a problem of
> people creating www1, www2, www3 or using other country codes for the
> same/similar content, filling up pages of SERPs that could otherwise hold
> other relevant information.
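
For illustration, the wildcard grouping requested in NUTCH-47 could be sketched roughly as below. This is a minimal sketch in Java and not Nutch code: the class WildcardHostDedup and its methods matches, groupKey and filter are hypothetical names, and it assumes the searcher can see hits as a ranked list of host names and that wildcard patterns take the form "*.domain". Hosts matching a pattern share one group, and only the first maxPerGroup hits per group are kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Nutch): limits hits per wildcard host group. */
public class WildcardHostDedup {

    /** True if host matches pattern; "*.redhat.com" matches "redhat.com" and any subdomain. */
    public static boolean matches(String pattern, String host) {
        String p = pattern.toLowerCase();
        String h = host.toLowerCase();
        if (!p.startsWith("*.")) {
            return h.equals(p);              // plain pattern: exact host match only
        }
        String domain = p.substring(2);      // "*.redhat.com" -> "redhat.com"
        return h.equals(domain) || h.endsWith("." + domain);
    }

    /** Grouping key for a hit: the first matching pattern, else the host itself. */
    public static String groupKey(String host, List<String> patterns) {
        for (String p : patterns) {
            if (matches(p, host)) {
                return p;
            }
        }
        return host.toLowerCase();
    }

    /** Keeps at most maxPerGroup hosts per group, preserving the ranked order. */
    public static List<String> filter(List<String> rankedHosts,
                                      List<String> patterns,
                                      int maxPerGroup) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String host : rankedHosts) {
            int count = seen.merge(groupKey(host, patterns), 1, Integer::sum);
            if (count <= maxPerGroup) {
                kept.add(host);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList(
                "www.redhat.com", "www.apac.redhat.com",
                "www.europe.redhat.com", "www.in.redhat.com",
                "fedoraproject.org");
        // prints [www.redhat.com, www.apac.redhat.com, fedoraproject.org]
        System.out.println(filter(hits, Arrays.asList("*.redhat.com"), 2));
    }
}
```

The dropped hosts are the ones that would sit behind a "find more sources under redhat.com" link. Matching on a leading "." before the domain keeps an unrelated host such as notredhat.com out of the redhat.com group.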
