Hello friends:
I'm crawling with Nutch, and I don't want to crawl images at all, nor URLs
containing "?" or other strange characters. But when I search Solr for *.gif,
I still get results. This is a fragment of my Solr search response:

<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">73</int>
  <lst name="params"><str name="q">*.gif</str></lst>
</lst>
<result name="response" numFound="352" start="0" maxScore="1.0">
<doc>
  <str name="content"/>
  <str name="segment">20131114152100</str>
  <float name="boost">1.0</float>
  <str name="digest">85cb9286b70bdee25b40433645b9ff72</str>
  <date name="tstamp">2013-11-14T16:18:22.029Z</date>
  <str name="id">http://calorm.qf.uclv.edu.cu/Images1/BigPracBar.gif</str>
  <str name="url">http://calorm.qf.uclv.edu.cu/Images1/BigPracBar.gif</str>
  <long name="_version_">1451712741146361856</long>
</doc>
<doc>
  <str name="content"/>
  <str name="segment">20131114152100</str>
  <float name="boost">1.0</float>
  <str name="digest">292408955f4aae8eec90e0ce55fbd739</str>
  <date name="tstamp">2013-11-14T16:39:27.359Z</date>
  <str name="id">http://calorm.qf.uclv.edu.cu/Images1/Bigenlbar.gif</str>
  <str name="url">http://calorm.qf.uclv.edu.cu/Images1/Bigenlbar.gif</str>
  <long name="_version_">1451712741161041920</long>
</doc>
</str>
  <str name="title">Forum UCLV • Preguntas Frecuentes</str>
  <str name="segment">20131114152100</str>
  <float name="boost">1.0</float>
  <str name="digest">a8c190fb3d22f71d47b67647bc814cba</str>
  <date name="tstamp">2013-11-14T16:41:55.548Z</date>
  <str name="id">http://forum.uclv.edu.cu/faq.php?sid=371ada5505649fe6c0155ef3d7bc261e</str>
  <str name="url">http://forum.uclv.edu.cu/faq.php?sid=371ada5505649fe6c0155ef3d7bc261e</str>
  <long name="_version_">1451712743156482048</long>
</doc>
<doc>



------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
nutch-site.xml:
<configuration>

    <property>
        <name>http.agent.name</name>
        <value>My Nutch Spider</value>
    </property>

    <property>
        <name>plugin.includes</name>
        <value>protocol-(http|ftp)|urlfilter-validator|parse-(html|tika)|index-(basic|anchor)|indexer-solr</value>
    </property>

</configuration>
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
regex-urlfilter.txt:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+^http://([a-z0-9]*\.).uclv.edu.cu/
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
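
To make it easier to reason about the rules above, here is a minimal Python sketch that replays them against the two kinds of URLs from the Solr output. It is not Nutch code. It assumes the usual regex-urlfilter semantics (rules are tried top to bottom, the first pattern found anywhere in the URL decides, and a URL matching no rule is rejected) and it assumes Python's `re` treats these particular patterns the same way Java's regex engine does. The names `RULES` and `accepts` are made up for this illustration.

```python
import re

# Rules copied verbatim from the regex-urlfilter.txt listing above,
# as (sign, pattern) pairs in file order.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
          r"|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV"
          r"|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"^http://([a-z0-9]*\.).uclv.edu.cu/"),
]


def accepts(url):
    """Return True if the first matching rule is a '+' rule (assumed semantics)."""
    for sign, pattern in RULES:
        if re.search(pattern, url):   # partial match, like a find() call
            return sign == "+"
    return False                      # assumed: no rule matched means rejected


if __name__ == "__main__":
    for url in (
        "http://calorm.qf.uclv.edu.cu/Images1/BigPracBar.gif",
        "http://forum.uclv.edu.cu/faq.php?sid=371ada5505649fe6c0155ef3d7bc261e",
    ):
        print(accepts(url), url)
```

Under these assumptions both URLs are rejected by the sketch (the first by the suffix rule, the second by the `[?*!@=]` rule), even though both show up in my Solr index.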

I ran Nutch with this command:
bin/crawl urls/seed.txt Testcrawl/ http://solr1:8983/solr 2

What is wrong in my configuration files?


