Thank you very much.
It really worked!
Like this:
+http://[^/]*\.(com|org|net|biz|mil|us|info|cc)/
-.*
see:
http://www.mail-archive.com/[email protected]/msg00479.html
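For anyone reading this later, here is a rough Python sketch of how a rule list like the one above behaves. It assumes the usual urlfilter semantics: rules are tried top to bottom, the first pattern found anywhere in the URL decides, and a URL that matches nothing is dropped. The helper names (parse_rules, accepts) and the example.ge URL are made up for illustration. This is not Nutch's own code, and Python's regex dialect may differ in small ways from the one Nutch uses.

```python
import re

# Rules copied from the message above: '+' accepts, '-' rejects.
RULES = r"""
+http://[^/]*\.(com|org|net|biz|mil|us|info|cc)/
-.*
"""

def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(url, rules):
    """Assumed semantics: the first pattern found in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL matching no rule is dropped

rules = parse_rules(RULES)
for url in ("http://www.argosoft.com/RootPages/Default.aspx",
            "http://www.bbc.co.uk/",
            "http://www.example.ge/"):
    print(url, accepts(url, rules))
```

Only the first URL is accepted here; the other two fall through to the catch-all "-.*" rule because their top-level domains are not in the list.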
Dima Mazmanov wrote:
I'm not adding urls into urlfilter files.
Besides, I still don't understand how to allow only one zone in
urlfilter.
Let's say I want to index only ".ge" zone.
Which one of the following filters is correct?
+^http://([a-z0-9]*\.)*([a-z0-9]*\.).ge/
+^http://([a-z0-9\-\.]*\.)*.ge/
+^http://([a-z0-9\-\.])*.ge/
+^http://www\..*\.ge/
+^http://www\..*\.*\.ge/
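One way to answer this kind of question is to run each candidate against a few sample URLs. The sketch below does that in Python. The sample URLs, the check helper, and the STRICT pattern are all assumptions added for illustration, not something from the thread, and Python's regex engine is only a stand-in for whatever engine the urlfilter uses.

```python
import re

# The five candidates quoted above, without the leading '+'.
CANDIDATES = [
    r"^http://([a-z0-9]*\.)*([a-z0-9]*\.).ge/",
    r"^http://([a-z0-9\-\.]*\.)*.ge/",
    r"^http://([a-z0-9\-\.])*.ge/",
    r"^http://www\..*\.ge/",
    r"^http://www\..*\.*\.ge/",
]

# Hypothetical alternative (not from the thread): one or more
# dot-terminated host labels followed by the literal "ge/".
STRICT = r"^http://([a-z0-9-]+\.)+ge/"

# Made-up sample URLs for the ".ge only" goal.
SHOULD_MATCH = ["http://www.example.ge/", "http://example.ge/index.html"]
SHOULD_NOT_MATCH = ["http://www.example.com/page.ge/", "http://www.orange/"]

def check(pattern):
    """True if the pattern accepts every wanted URL and no unwanted one."""
    rx = re.compile(pattern)
    ok = all(rx.search(u) for u in SHOULD_MATCH)
    return ok and not any(rx.search(u) for u in SHOULD_NOT_MATCH)

for p in CANDIDATES + [STRICT]:
    print("pass" if check(p) else "FAIL", p)
```

With these particular samples, all five quoted candidates fail and only the hypothetical strict pattern passes. The first two require an extra character between the last dot and "ge" (the unescaped "." before "ge" matches any character), the third also accepts hosts that merely end in "ge", and the last two insist on a "www." prefix.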
By the way, if the site you are indexing is dynamic, you may just
disallow indexing of
www.bbc.co.uk and index only the second one.
So what filter settings do you use?
Like this: +^http://([a-z0-9]*\.)*bbc.co.uk/
Then you will get both bbc.co.uk and www.bbc.co.uk, and
since this site is dynamic, the content might be different.
Have the same problem myself :-(
-----------------------------------
Well my script already contains this command....
Run bin/nutch dedup segments dedup.tmp
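As far as I understand, dedup treats two documents as duplicates when they share a URL or when their fetched content has the same hash. The four URLs in the quoted search results differ in top-level domain (.org vs .com) and in path case (RootPages vs rootpages), so they are four distinct URLs, and only the content check can collapse them. The Python sketch below illustrates that idea. The page bodies are placeholder assumptions, dedup_by_content is a hypothetical helper, and none of this is Nutch's actual implementation.

```python
import hashlib
from collections import defaultdict

# Placeholder body standing in for the fetched page content (assumption).
BODY = b"<html>ArGo Software Design Homepage</html>"

# The four URLs from the quoted search results.
PAGES = {
    "http://www.argosoft.org/RootPages/Default.aspx": BODY,
    "http://www.argosoft.com/rootpages/Default.aspx": BODY,
    "http://www.argosoft.com/RootPages/Default.aspx": BODY,
    "http://www.argosoft.org/rootpages/Default.aspx": BODY,
}

def dedup_by_content(pages):
    """Keep one URL (the first in sorted order) per distinct content digest."""
    groups = defaultdict(list)
    for url, body in pages.items():
        groups[hashlib.md5(body).hexdigest()].append(url)
    return {digest: sorted(urls)[0] for digest, urls in groups.items()}

kept = dedup_by_content(PAGES)
print(len(PAGES), "URLs ->", len(kept), "distinct page(s)")

# If a dynamic page differs by even one byte between fetches,
# the digests differ and nothing is collapsed.
changed = dict(PAGES)
changed["http://www.argosoft.org/rootpages/Default.aspx"] = BODY + b"<!-- 12:00 -->"
print(len(dedup_by_content(changed)), "distinct page(s) after a one-byte-level change")
```

That second case may be why the duplicates survive on a dynamic site: a timestamp or session id in the page is enough to make the hashes differ.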
Dima Mazmanov wrote:
Hi all! I'm running nutch-0.7.1.
Here is the result of my search.
ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web Site Our web site has new look and ... link on the ...
http://www.argosoft.org/RootPages/Default.aspx (Cached)

ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web Site Our web site has new look and ... link on the ...
http://www.argosoft.com/rootpages/Default.aspx (Cached)

ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web Site Our web site has new look and ... link on the ...
http://www.argosoft.com/RootPages/Default.aspx (Cached)

ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web Site Our web site has new look and ... link on the ...
http://www.argosoft.org/rootpages/Default.aspx (Cached)
As you can see, one result is shown multiple times.
Why is that? What is the difference between these links? I don't
see any.
So, how can I avoid this problem?
Thanks, Regards, Dima
__________ NOD32 1.1497 (20060419) Information __________
This message was checked by NOD32 antivirus system.
http://www.eset.com
--
Regards,
Dima mailto:[EMAIL PROTECTED]
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general