I am running Nutch 0.9 and I have a website where some of the URLs should be omitted.
I have added the following exceptions to regex-urlfilter.txt:

    -.*forside$
    -.*frontpage$
    -.*/js/.*
    -.*/resources/.*
    -.*/text/.*
    -.*sdc.arena.no.*
    -.*error.*
    -.*/framework/.*
    -.*/tridion.*
    -.*/sitemap.*
    -.*/nettsidekart.*
    -.*/airport/.*/airports.*
    -.*/lufthavn/.*/lufthavner.*
    -.*://$

I have tested the filter by running this command:

    $ cat /cygdrive/c/tmp/nutch_urls | bin/nutch org/apache/nutch/net/URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter | grep -e -http
    -http://sgm634.lv.no:13101/avinor/text/javascript
    -http://sgm634.lv.no:13101/avinor/sdc.arena.no
    -http://sgm634.lv.no:13101/framework/skins/avinor/js/showHide.js
    -http://sgm634.lv.no:13101/avinor/://
    -http://sgm634.lv.no:13101/framework/skins/avinor/js/dojo.js
    -http://sgm634.lv.no:13101/framework/skins/avinor/js/common.js
    -http://sgm634.lv.no:13101/avinor/text/css

As you can see, it filters out the URLs I do not want. The only problem is that whenever I run the nutch crawl command, or when I recrawl, those URLs show up after all.
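To cross-check the rules independently of Nutch, here is a minimal Python sketch of how a rule file like the one above is evaluated. It assumes the usual behaviour of the regex URL filter: rules are tried in order, the first pattern found anywhere in the URL decides, and a URL that matches no rule is rejected. The trailing "+." catch-all rule and the names parse_rules and accepts are assumptions added for illustration; they are not from the original file or from Nutch. Python's re module stands in for Java's regex engine, which behaves the same for these simple patterns.

```python
import re

# Rules copied from the message above. The final "+." (accept everything
# else) is an ASSUMPTION; without some "+" rule every URL would be rejected.
RULES = r"""
-.*forside$
-.*frontpage$
-.*/js/.*
-.*/resources/.*
-.*/text/.*
-.*sdc.arena.no.*
-.*error.*
-.*/framework/.*
-.*/tridion.*
-.*/sitemap.*
-.*/nettsidekart.*
-.*/airport/.*/airports.*
-.*/lufthavn/.*/lufthavner.*
-.*://$
+.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """First matching rule wins; a URL matching no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in (
        "http://sgm634.lv.no:13101/lufthavn/namsos",
        "http://sgm634.lv.no:13101/avinor/text/javascript",
        "http://sgm634.lv.no:13101/avinor/sdc.arena.no",
        "http://sgm634.lv.no:13101/avinor/://",
    ):
        # Same convention as URLFilterChecker: '+' accepted, '-' rejected.
        print(("+" if accepts(url, rules) else "-") + url)
```

Feeding the fetch log through accepts() is one way to confirm that the URLs being fetched are ones these rules would reject, which would suggest the crawl is not reading this rule file.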
Example snippet (the ones that should not be there are marked with a minus in front):

    fetching http://sgm634.lv.no:13101/avinor/trafikk
    fetching http://sgm634.lv.no:13101/lufthavn/gressholmen
    fetching http://sgm634.lv.no:13101/lufthavn/namsos
    fetching http://sgm634.lv.no:13101/lufthavn/haugesund
    fetching http://sgm634.lv.no:13101/lufthavn/rost
    fetching http://sgm634.lv.no:13101/lufthavn/rorvik
    fetching http://sgm634.lv.no:13101/lufthavn/lista
    fetching http://sgm634.lv.no:13101/lufthavn/kristiansand
    fetching http://sgm634.lv.no:13101/avinor/karriere
    fetching http://sgm634.lv.no:13101/lufthavn/bardufoss
    fetching http://sgm634.lv.no:13101/lufthavn/kirkenes
    fetching http://sgm634.lv.no:13101/lufthavn/harstad
    fetching http://sgm634.lv.no:13101/lufthavn/stokmarknes
    fetching http://sgm634.lv.no:13101/lufthavn/lakselv
    fetching http://sgm634.lv.no:13101/avinor/omavinor
    fetching http://sgm634.lv.no:13101/lufthavn/fagernes
    fetching http://sgm634.lv.no:13101/lufthavn/mehamn
    fetching http://sgm634.lv.no:13101/avinor/rapporter
    -fetching http://sgm634.lv.no:13101/avinor/text/javascript
    fetching http://sgm634.lv.no:13101/lufthavn/stavanger
    fetching http://sgm634.lv.no:13101/lufthavn/roros
    fetching http://sgm634.lv.no:13101/lufthavn/sorkjosen
    -fetching http://sgm634.lv.no:13101/avinor/sdc.arena.no
    fetching http://sgm634.lv.no:13101/lufthavn/vardo
    fetching http://sgm634.lv.no:13101/avinor/miljo
    fetching http://sgm634.lv.no:13101/lufthavn/bronnoysund
    fetching http://sgm634.lv.no:13101/avinor/sikkerhet
    fetching http://sgm634.lv.no:13101/avinor/omavinor/Kontakt oss
    fetching http://sgm634.lv.no:13101/lufthavn/hammerfest
    fetching http://sgm634.lv.no:13101/avinor/sporsmal
    fetching http://sgm634.lv.no:13101/lufthavn/sogndal
    -fetching http://sgm634.lv.no:13101/avinor/://
    fetching http://sgm634.lv.no:13101/lufthavn/bodo
    fetching http://sgm634.lv.no:13101/lufthavn/vadso
    fetching http://sgm634.lv.no:13101/lufthavn/sandnessjoen
    fetching http://sgm634.lv.no:13101/lufthavn/narvik
    fetching http://sgm634.lv.no:13101/lufthavn/honningsvag
    -fetching http://sgm634.lv.no:13101/avinor/text/css
    fetching http://sgm634.lv.no:13101/lufthavn/alesund
    fetching http://sgm634.lv.no:13101/lufthavn/varoy
    fetching http://sgm634.lv.no:13101/lufthavn/andoya
    fetching http://sgm634.lv.no:13101/lufthavn/trondheim
    fetching http://sgm634.lv.no:13101/avinor/forside
    fetching http://sgm634.lv.no:13101/lufthavn/tromso
    fetching http://sgm634.lv.no:13101/lufthavn/sandane
    fetching http://sgm634.lv.no:13101/lufthavn/kristiansund
    fetching http://sgm634.lv.no:13101/avinor/pressesenter
    fetching http://sgm634.lv.no:13101/lufthavn/leknes
    fetching http://sgm634.lv.no:13101/lufthavn/floro
    fetching http://sgm634.lv.no:13101/avinor/lufthavner
    fetching http://sgm634.lv.no:13101/lufthavn/moirana

Can anyone please tell me what I am doing wrong? It struck me that I might be using the wrong file and that all regex exceptions should be in crawl-urlfilter.txt, but I do not think that is correct.

Thanks,
Ronny

-----Original Message-----
From: Naess, Ronny [mailto:[EMAIL PROTECTED]
Sent: 16 May 2007 15:18
To: [EMAIL PROTECTED]
Subject: Re: Reindex and initialization

I found this script and modified it slightly.
http://wiki.apache.org/nutch/IntranetRecrawl#head-b16709cbbd77ae6c80d742ee69383142cefb8683

The script takes care of the reinitialization of the index by touching the web.xml file in the webapp. Doing this reloads the whole webapp, doesn't it? Is reloading the webapp the only way to reload the index?

-Ronny

-----Original Message-----
From: Naess, Ronny [mailto:[EMAIL PROTECTED]
Sent: 15 May 2007 12:13
To: [EMAIL PROTECTED]
Subject: Re: Reindex and initialization

It turned out that I had some issues with JDK versions after all. I added NUTCH_JAVA_HOME pointing at JDK 1.5 and that seemed to do the trick. Some other issues popped up, but I reckon they have to do with me using Nutch 0.9.

I am still wondering about the initialization of the index while the client is running.
I do not want to restart the web client.

-Ronny

-----Original Message-----
From: Naess, Ronny [mailto:[EMAIL PROTECTED]
Sent: 15 May 2007 10:26
To: [EMAIL PROTECTED]
Subject: Reindex and initialization

Hi.

I want to have a script for reindexing. I copied the recrawl script made by the author of this tutorial:
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

I am running into some trouble with UnsupportedClassVersionError (Unsupported major.minor version 49.0). I have tried both Java 1.5 and 1.4. I am using Nutch 0.9.

If there are other or better ways to reindex, I will be happy for any hints or help in that area.

Also, is reinitializing a new index still a problem, as it was earlier (0.7), where one solution was to reinitialize the web client? Restarting the web client is not an option for us since we must have high uptime/availability. Does anyone know if this is fixed, or if there is a solution for reinitialization in Nutch 0.9?

-Ronny
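On the UnsupportedClassVersionError quoted above: class-file major version 49 corresponds to Java 5, so "Unsupported major.minor version 49.0" generally means that classes compiled for Java 5 were loaded by an older JVM (for example 1.4). A minimal Python sketch for checking which version a given .class file targets follows. The function name class_file_version and the lookup table are illustrative and are not part of Nutch or the JDK.

```python
import struct

# Class-file major versions for a few JDK releases (illustrative subset).
MAJOR_TO_JDK = {45: "1.1", 46: "1.2", 47: "1.3", 48: "1.4", 49: "5.0", 50: "6"}


def class_file_version(data):
    """Return (major, minor) from the first 8 bytes of a .class file.

    Header layout: 4-byte magic 0xCAFEBABE, 2-byte minor version,
    2-byte major version, all big-endian.
    """
    if len(data) < 8:
        raise ValueError("need at least 8 bytes of class-file header")
    magic, minor, major = struct.unpack(">IHH", data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major, minor


if __name__ == "__main__":
    # Synthetic header for a Java 5 class (major 49, minor 0).
    header = bytes.fromhex("cafebabe00000031")
    major, minor = class_file_version(header)
    print(major, minor, MAJOR_TO_JDK.get(major, "unknown"))
```

To inspect a real class, extract it from the jar, read its first 8 bytes in binary mode, and pass them to class_file_version.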
