[ http://issues.apache.org/jira/browse/NUTCH-47?page=comments#action_63322 
]
     
byron miller commented on NUTCH-47:
-----------------------------------

I will be adding another ~20 million pages to my index over the next few 
days and will rerun this query to see whether these results naturally filter 
out as other ranked pages come in.

When I run this search on Yahoo/Google, I see some of the major sections, such 
as fedora.redhat.com and rhn.redhat.com, but I don't necessarily see all of the 
individual country codes.

I wonder if some of these are filtered out on Yahoo/Google because the default 
language is set to en; as you can see, some of the sites showing up now are 
non-English.

> Configure host filter to do wildcard prefixes - *.redhat.com
> ------------------------------------------------------------
>
>          Key: NUTCH-47
>          URL: http://issues.apache.org/jira/browse/NUTCH-47
>      Project: Nutch
>         Type: Improvement
>   Components: searcher
>  Environment: Linux
>     Reporter: byron miller
>     Priority: Minor

>
> Right now you can configure the max results per host for query response, but 
> that seems limited to exact host matches such as "www.redhat.com".
> In many ways it would be nice to include the capability to match hosts by 
> wildcard.
> For example, search for redhat on mozdex.com:
> http://www.mozdex.com/search.jsp?query=redhat
> And you will see:
> www.apac.redhat.com 
> www.europe.redhat.com 
> www.in.redhat.com 
> Could this be fixed so that *.redhat.com is under "find more sources under 
> redhat.com" or something like that?
> I may be able to tweak the other processes, but I can envision a problem of 
> people creating www1, www2, www3 or using other country codes for the 
> same/similar content, filling up pages of SERPs that could otherwise show 
> other relevant information.
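
To make the request above concrete, here is a minimal sketch of wildcard
host grouping. It is not Nutch code: the class WildcardHostGrouper and its
methods are hypothetical names invented for illustration, and it assumes
that a pattern such as *.redhat.com should also cover the bare host
redhat.com. It only shows the matching and per-group capping idea, not how
the searcher would be wired up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not part of Nutch): groups hosts under wildcard prefixes. */
final class WildcardHostGrouper {

    /**
     * Returns true if host matches pattern. A pattern starting with "*."
     * matches every subdomain of the remaining domain and, by assumption,
     * the bare domain itself. Any other pattern must match the host exactly.
     */
    static boolean matches(String pattern, String host) {
        String p = pattern.toLowerCase(Locale.ROOT);
        String h = host.toLowerCase(Locale.ROOT);
        if (!p.startsWith("*.")) {
            return p.equals(h);
        }
        String domain = p.substring(2); // "*.redhat.com" -> "redhat.com"
        return h.equals(domain) || h.endsWith("." + domain);
    }

    /** Grouping key for a host: the first matching pattern, else the host itself. */
    static String groupKey(List<String> patterns, String host) {
        for (String pattern : patterns) {
            if (matches(pattern, host)) {
                return pattern;
            }
        }
        return host.toLowerCase(Locale.ROOT);
    }

    /** Keeps at most maxPerGroup hosts per group, preserving the input order. */
    static List<String> limitPerGroup(List<String> patterns, List<String> hosts, int maxPerGroup) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String host : hosts) {
            int count = counts.merge(groupKey(patterns, host), 1, Integer::sum);
            if (count <= maxPerGroup) {
                kept.add(host);
            }
        }
        return kept;
    }
}
```

With the pattern *.redhat.com and a limit of 2, the three hosts listed above
would collapse to the first two, and the remainder could be offered behind a
"more from redhat.com" style link. The trailing dot in the suffix check is
what keeps an unrelated host such as notredhat.com out of the group.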



