Hi all, 

I am trying to fix some very odd behavior of the file protocol in
Nutch 0.8.

While researching this topic I found
http://www.mail-archive.com/[email protected]/msg06536.html
and
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch

I am on Ubuntu, but I have the same problem: Nutch goes down the tree
(fetching the parent directories of the root URL) and not up (fetching
the children of the root URL).

I have in urls/nutch:
file:///home/thorsten/src/BOJA/repositories/boja/

and my crawl-urlfilter.txt looks like:
-^(http|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

[EMAIL PROTECTED]

-.*(/.+?)/.*?\1/.*?\1/

# accept filepath
+^file:///home/thorsten/src/BOJA(.*)
-^file:/(.*).svn
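
A note on rule order: Nutch's regex URL filter applies the first rule
whose pattern matches a URL, and ignores the URL if no rule matches. The
sketch below replays the rules above in Python to show what that means
for the .svn rule. It is only an approximation: Python's re.search
stands in for the Java regex matching Nutch uses, the accepts() helper
and every sample URL other than the seed are made up for illustration,
and the rule shown as [EMAIL PROTECTED] is left out because its content
is unknown.

```python
import re

# Rules copied from the crawl-urlfilter.txt in this post, in file order.
# The "[EMAIL PROTECTED]" line is omitted because its content is unknown.
RULES = [
    r"-^(http|ftp|mailto):",
    r"-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$",
    r"-.*(/.+?)/.*?\1/.*?\1/",
    r"+^file:///home/thorsten/src/BOJA(.*)",
    r"-^file:/(.*).svn",
]


def accepts(url, rules=RULES):
    """Hypothetical helper: first rule whose pattern is found in url decides."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched, so the URL is ignored


# The seed URL from urls/nutch, plus two illustrative URLs (assumptions).
seed = "file:///home/thorsten/src/BOJA/repositories/boja/"
parent = "file:///home/thorsten/src/"
svn = "file:///home/thorsten/src/BOJA/repositories/boja/.svn/entries"

# Same rules, but with the .svn exclusion moved in front of the accept rule.
reordered = RULES[:3] + [RULES[4], RULES[3]]

for label, url in [("seed", seed), ("parent", parent), ("svn", svn)]:
    print(label, accepts(url), accepts(url, reordered))
```

Under these assumptions, the accept rule wins for paths inside .svn
directories because it comes first; placing the .svn rule before the
accept rule would let it take effect.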

I patched org.apache.nutch.protocol.file.FileResponse as described on
the folge2 site and recompiled (ant clean; ant), but it is still
fetching down and not up.

Can somebody give me some hints on how to fix this?

Furthermore, I would vote to make fetching parents optional and
controlled by a property, so that one can decide whether one wants this
not very intuitive "feature".

TIA for any feedback.

salu2



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
