Hello friends: I'm crawling with Nutch, and I don't want to crawl images at all, nor URLs containing "?" or other strange characters. However, when I search Solr for *.gif, image URLs still show up. This is a fragment of my Solr search result:
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">73</int>
  <lst name="params"><str name="q">*.gif</str></lst>
</lst>
<result name="response" numFound="352" start="0" maxScore="1.0">
<doc>
  <str name="content"/>
  <str name="segment">20131114152100</str>
  <float name="boost">1.0</float>
  <str name="digest">85cb9286b70bdee25b40433645b9ff72</str>
  <date name="tstamp">2013-11-14T16:18:22.029Z</date>
  <str name="id">http://calorm.qf.uclv.edu.cu/Images1/BigPracBar.gif</str>
  <str name="url">http://calorm.qf.uclv.edu.cu/Images1/BigPracBar.gif</str>
  <long name="_version_">1451712741146361856</long>
</doc>
<doc>
  <str name="content"/>
  <str name="segment">20131114152100</str>
  <float name="boost">1.0</float>
  <str name="digest">292408955f4aae8eec90e0ce55fbd739</str>
  <date name="tstamp">2013-11-14T16:39:27.359Z</date>
  <str name="id">http://calorm.qf.uclv.edu.cu/Images1/Bigenlbar.gif</str>
  <str name="url">http://calorm.qf.uclv.edu.cu/Images1/Bigenlbar.gif</str>
  <long name="_version_">1451712741161041920</long>
</doc>
  </str>
  <str name="title">Forum UCLV • Preguntas Frecuentes</str>
  <str name="segment">20131114152100</str>
  <float name="boost">1.0</float>
  <str name="digest">a8c190fb3d22f71d47b67647bc814cba</str>
  <date name="tstamp">2013-11-14T16:41:55.548Z</date>
  <str name="id">http://forum.uclv.edu.cu/faq.php?sid=371ada5505649fe6c0155ef3d7bc261e</str>
  <str name="url">http://forum.uclv.edu.cu/faq.php?sid=371ada5505649fe6c0155ef3d7bc261e</str>
  <long name="_version_">1451712743156482048</long>
</doc>
<doc>
------------------------------------------------------------
nutch-site.xml:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-(http|ftp)|urlfilter-validator|parse-(html|tika)|index-(basic|anchor)|indexer-solr</value>
  </property>
</configuration>
------------------------------------------------------------
regex-urlfilter.txt:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+^http://([a-z0-9]*\.).uclv.edu.cu/
------------------------------------------------------------
I ran Nutch with this command:
bin/crawl urls/seed.txt Testcrawl/ http://solr1:8983/solr 2

What is wrong in my conf files?
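
One way to narrow this down is to test the rules in regex-urlfilter.txt on their own, outside Nutch. The sketch below is only an approximation: it assumes the filter tries the rules in file order, lets the first matching rule decide, and rejects a URL that matches no rule. It also uses Python's `re` module, not the Java regex engine Nutch runs on. The `accepts` helper, the `ALT_RULES` list, and the plain page URL `http://www.uclv.edu.cu/` are illustrative and not part of the original configuration.

```python
import re

# Rules copied verbatim from the regex-urlfilter.txt above, in file order.
RULES = [
    r"-^(file|ftp|mailto):",
    r"-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
    r"|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV"
    r"|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$",
    r"-[?*!@=]",
    r"-.*(/[^/]+)/[^/]+\1/[^/]+\1/",
    r"+^http://([a-z0-9]*\.).uclv.edu.cu/",
]

# Hypothetical alternative for the final accept rule (an assumption, not from
# the original post): any number of subdomain labels, with the dots escaped.
ALT_RULES = RULES[:-1] + [r"+^http://([a-z0-9-]+\.)*uclv\.edu\.cu/"]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches `url` is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False


URLS = [
    "http://calorm.qf.uclv.edu.cu/Images1/BigPracBar.gif",
    "http://forum.uclv.edu.cu/faq.php?sid=371ada5505649fe6c0155ef3d7bc261e",
    "http://www.uclv.edu.cu/",  # plain page URL, chosen for illustration
]

for url in URLS:
    print(accepts(url), accepts(url, ALT_RULES), url)
```

Under these assumptions the `.gif` URL and the `?sid=` URL are rejected by both rule sets, so the rules themselves do not look like the reason those URLs reached Solr. That points at the filter not being applied during the crawl. Note that `plugin.includes` in nutch-site.xml lists `urlfilter-validator` but not `urlfilter-regex`, which, as far as I know, is the plugin that reads regex-urlfilter.txt. The sketch also suggests that the original final rule does not accept `http://www.uclv.edu.cu/` either, because the unescaped `.` after the group consumes one extra character, so that rule may need revisiting once the filter is active.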

