Thank you very much.
It really worked!!!!

Like this

+http://[^/]*\.(com|org|net|biz|mil|us|info|cc)/
-.*
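
For anyone wondering why these two lines act as a whitelist, here is a minimal Python sketch of first-match filtering. It assumes that rules are tried from top to bottom, that the first pattern found in the URL decides, and that a URL matching no rule is dropped. The names `RULES` and `accepts` are made up for this illustration, and Python's `re` module stands in for the Java regex engine that Nutch uses.

```python
import re

# The two rules from the message above: '+' accepts, '-' rejects.
RULES = [
    ("+", r"http://[^/]*\.(com|org|net|biz|mil|us|info|cc)/"),
    ("-", r".*"),
]

def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is ignored

for url in ("http://www.argosoft.com/RootPages/Default.aspx",
            "http://www.bbc.co.uk/",
            "http://www.example.ge/"):
    print(url, accepts(url))
```

Under these assumptions only the .com URL gets through; the other two fall to the catch-all `-.*` line because their top-level domains are not in the list.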

see:
http://www.mail-archive.com/[email protected]/msg00479.html

Dima Mazmanov wrote:
I'm not adding URLs to the urlfilter files.
Besides, I still don't understand how to allow only one zone in the urlfilter.
Let's say I want to index only the ".ge" zone.
Which of the following filters is correct?

+^http://([a-z0-9]*\.)*([a-z0-9]*\.).ge/
+^http://([a-z0-9\-\.]*\.)*.ge/
+^http://([a-z0-9\-\.])*.ge/
+^http://www\..*\.ge/
+^http://www\..*\.*\.ge/
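
One way to answer this is to run each candidate against a few sample URLs. The sketch below does that in Python, assuming the filter searches for the pattern in the URL the way `re.search` does. The sample URLs are made up, and `SUGGESTED` is not from this thread: it is only one possible pattern modelled on the `([a-z0-9]*\.)*` idiom used elsewhere in the discussion.

```python
import re

# The five candidates from the question, without the leading '+'.
CANDIDATES = {
    1: r"^http://([a-z0-9]*\.)*([a-z0-9]*\.).ge/",
    2: r"^http://([a-z0-9\-\.]*\.)*.ge/",
    3: r"^http://([a-z0-9\-\.])*.ge/",
    4: r"^http://www\..*\.ge/",
    5: r"^http://www\..*\.*\.ge/",
}

# Hypothetical alternative: one or more dot-terminated labels, then "ge/".
SUGGESTED = r"^http://([a-z0-9-]+\.)+ge/"

# Made-up sample URLs: two we want, two we do not.
SAMPLES = [
    "http://www.example.ge/",
    "http://example.ge/",
    "http://www.example.com/foo.ge/",
    "http://www.orange/",
]

def matched(pattern):
    """Return the sample URLs in which the pattern is found."""
    return [url for url in SAMPLES if re.search(pattern, url)]

for number, pattern in CANDIDATES.items():
    print(number, matched(pattern))
print("suggested", matched(SUGGESTED))
```

With these samples, candidates 1 and 2 match nothing, because the group must end in a literal dot and the unescaped `.` before `ge` then has to consume one more character. Candidate 3 matches both .ge hosts but also `http://www.orange/`, since its unescaped `.` accepts any character. Candidates 4 and 5 require a `www.` prefix and let `.*` run across slashes, so they miss `http://example.ge/` and accept `http://www.example.com/foo.ge/`. The suggested pattern matches only the two .ge hosts.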

By the way, if the site you are indexing is dynamic, you can simply exclude
www.bbc.co.uk from the index and index only the other one.


So what filter settings do you use?
Like this: +^http://([a-z0-9]*\.)*bbc.co.uk/
Then you will get both bbc.co.uk and www.bbc.co.uk, and since this site is
dynamic, the content might be different.
I have the same problem myself :-(
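
The suggestion of excluding www.bbc.co.uk depends on rule order. Here is a small Python sketch of that, again assuming first-match-wins filtering. The rule names and the `accepts` helper are hypothetical, and the dots in the host name are escaped here, unlike in the pattern quoted above.

```python
import re

# Hypothetical rule pair: reject the www host, accept every other bbc.co.uk host.
EXCLUDE_WWW = ("-", r"^http://www\.bbc\.co\.uk/")
ALLOW_BBC = ("+", r"^http://([a-z0-9]*\.)*bbc\.co\.uk/")

def accepts(url, rules):
    """First rule whose pattern is found in the URL decides; no match means reject."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False

urls = ["http://bbc.co.uk/", "http://www.bbc.co.uk/"]

# Exclusion listed first: only the bare host survives.
exclude_first = [u for u in urls if accepts(u, [EXCLUDE_WWW, ALLOW_BBC])]

# Exclusion listed second: the '+' rule wins for both hosts.
allow_first = [u for u in urls if accepts(u, [ALLOW_BBC, EXCLUDE_WWW])]

print(exclude_first)
print(allow_first)
```

If the filter really does stop at the first match, the `-` line has to come before the broader `+` line, otherwise it never takes effect.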




-----------------------------------
Well my script already contains this command....




   Run bin/nutch dedup segments dedup.tmp


   Dima Mazmanov wrote:

       Hi all!! I'm running on nutch-0.7.1.

       Here is result of my search.


       ArGo Software Design Homepage [html] - 30.2 k - ... Look of our
       Web Site Our web site has new look and ... link on the ...
       http://www.argosoft.org/RootPages/Default.aspx (Cached)

       ArGo Software Design Homepage [html] - 30.2 k - ... Look of our
       Web Site Our web site has new look and ... link on the ...
       http://www.argosoft.com/rootpages/Default.aspx (Cached)

       ArGo Software Design Homepage [html] - 30.2 k - ... Look of our
       Web Site Our web site has new look and ... link on the ...
       http://www.argosoft.com/RootPages/Default.aspx (Cached)

       ArGo Software Design Homepage [html] - 30.2 k - ... Look of our
       Web Site Our web site has new look and ... link on the ...
       http://www.argosoft.org/rootpages/Default.aspx (Cached)

       As you can see, one result is shown multiple times.
       Why is that? What is the difference between these links? I don't
       see any.
       So, how can I avoid this problem?
       Thanks, Regards, Dima
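
The four links do differ: two use the .org domain and two use .com, and the path is spelled both RootPages and rootpages, so as URL strings they are all distinct. The Python sketch below illustrates why a content-based dedup step may still leave several of them. It assumes duplicates are detected by comparing an MD5 hash of the page body; the page bodies are invented for the example and `group_by_content` is a hypothetical helper, not part of Nutch.

```python
import hashlib
from collections import defaultdict

def group_by_content(pages):
    """Map the MD5 digest of each page body to the URLs that returned it."""
    groups = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.md5(body.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return dict(groups)

# URLs from the search results; the bodies are made up for illustration.
# A dynamic page may embed something that changes per request (here a fake timestamp).
pages = {
    "http://www.argosoft.org/RootPages/Default.aspx": "<html>ArGo Homepage</html>",
    "http://www.argosoft.com/RootPages/Default.aspx": "<html>ArGo Homepage</html>",
    "http://www.argosoft.com/rootpages/Default.aspx": "<html>ArGo Homepage <!-- 12:00:01 --></html>",
    "http://www.argosoft.org/rootpages/Default.aspx": "<html>ArGo Homepage <!-- 12:00:07 --></html>",
}

groups = group_by_content(pages)
for digest, urls in groups.items():
    print(digest[:8], urls)
```

In this made-up data only the two byte-identical bodies collapse into one group. Any page whose body changes between fetches gets its own hash, which would explain duplicates surviving even when the dedup command is run.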






__________ NOD32 1.1497 (20060419) Information __________

This message was checked by NOD32 antivirus system.
http://www.eset.com











--
Regards,
Dima                          mailto:[EMAIL PROTECTED]








_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
