Lourival, I have typically seen the same issues on a Cygwin/Windows setup. The only thing that worked for me was shutting down and restarting Tomcat, instead of just reloading the context. On Linux I don't have these issues anymore.
Rgrds,
Thomas

On 7/21/06, Lourival Júnior <[EMAIL PROTECTED]> wrote:
Ok. However, a few minutes ago I ran the script exactly as you said and I still get this error:

Exception in thread "main" java.io.IOException: Cannot delete _0.f0
        at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
        at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
        at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
        at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
        at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)

I don't know, but I think it occurs because Nutch tries to delete a file that Tomcat has loaded into memory, giving a permission error. Any idea?

On 7/21/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>
> Lourival Júnior wrote:
> > I think it won't work for me because I'm using Nutch version 0.7.2.
> > Actually I use this script (comments translated from Portuguese):
> >
> > #!/bin/bash
> >
> > # A simple script to run a Nutch re-crawl
> > # Script source:
> > # http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
> >
> > #{
> >
> > if [ -n "$1" ]
> > then
> >   crawl_dir=$1
> > else
> >   echo "Usage: recrawl crawl_dir [depth] [adddays]"
> >   exit 1
> > fi
> >
> > if [ -n "$2" ]
> > then
> >   depth=$2
> > else
> >   depth=5
> > fi
> >
> > if [ -n "$3" ]
> > then
> >   adddays=$3
> > else
> >   adddays=0
> > fi
> >
> > webdb_dir=$crawl_dir/db
> > segments_dir=$crawl_dir/segments
> > index_dir=$crawl_dir/index
> >
> > # Stop the Tomcat service
> > #net stop "Apache Tomcat"
> >
> > # The generate/fetch/update cycle
> > for ((i=1; i <= depth ; i++))
> > do
> >   bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
> >   segment=`ls -d $segments_dir/* | tail -1`
> >   bin/nutch fetch $segment
> >   bin/nutch updatedb $webdb_dir $segment
> >   echo
> >   echo "End of cycle $i."
> >   echo
> > done
> >
> > # Update segments
> > echo
> > echo "Updating segments..."
> > echo
> > mkdir tmp
> > bin/nutch updatesegs $webdb_dir $segments_dir tmp
> > rm -R tmp
> >
> > # Index segments
> > echo "Indexing segments..."
> > echo
> > for segment in `ls -d $segments_dir/* | tail -$depth`
> > do
> >   bin/nutch index $segment
> > done
> >
> > # De-duplicate indexes
> > # "bogus" argument is ignored but needed due to
> > # a bug in the number of args expected
> > bin/nutch dedup $segments_dir bogus
> >
> > # Merge indexes
> > #echo "Merging indexes..."
> > #echo
> > ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
> >
> > chmod 777 -R $index_dir
> >
> > # Start the Tomcat service
> > #net start "Apache Tomcat"
> >
> > echo "Done."
> >
> > #} > recrawl.log 2>&1
> >
> > As you suggested, I used the touch command instead of stopping Tomcat.
> > However, I get the error posted in the previous message. I'm running
> > Nutch on the Windows platform with Cygwin. I only get no errors when I
> > stop Tomcat. I use this command to call the script:
> >
> > ./recrawl crawl-legislacao 1
> >
> > Could you give me more clarifications?
> >
> > Thanks a lot!
> >
> > On 7/21/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> >>
> >> Lourival Júnior wrote:
> >> > Hi Renaud!
> >> >
> >> > I'm a newbie with shell scripts and I know stopping the Tomcat
> >> > service is not the best way to do this. The problem is, when I run
> >> > the re-crawl script with Tomcat started I get this error:
> >> >
> >> > 060721 132224 merging segment indexes to: crawl-legislacao2\index
> >> > Exception in thread "main" java.io.IOException: Cannot delete _0.f0
> >> >         at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
> >> >         at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
> >> >         at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
> >> >         at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
> >> >         at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
> >> >         at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)
> >> >
> >> > So, I want another way to re-crawl my pages without this error and
> >> > without restarting Tomcat. Could you suggest one?
> >> >
> >> > Thanks a lot!
> >>
> >> Try this updated script and tell me what command exactly you run to
> >> call the script. Let me know the error message then.
> >>
> >> Matt
> >>
> >> #!/bin/bash
> >>
> >> # Nutch recrawl script.
> >> # Based on 0.7.2 script at
> >> # http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
> >> # Modified by Matthew Holt
> >>
> >> if [ -n "$1" ]
> >> then
> >>   nutch_dir=$1
> >> else
> >>   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
> >>   echo "servlet_path - Path of the nutch servlet (i.e. /usr/local/tomcat/webapps/ROOT)"
> >>   echo "crawl_dir - Name of the directory the crawl is located in."
> >>   echo "[depth] - The link depth from the root page that should be crawled."
> >>   echo "[adddays] - Advance the clock # of days for fetchlist generation."
> >>   exit 1
> >> fi
> >>
> >> if [ -n "$2" ]
> >> then
> >>   crawl_dir=$2
> >> else
> >>   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
> >>   echo "servlet_path - Path of the nutch servlet (i.e. /usr/local/tomcat/webapps/ROOT)"
> >>   echo "crawl_dir - Name of the directory the crawl is located in."
> >>   echo "[depth] - The link depth from the root page that should be crawled."
> >>   echo "[adddays] - Advance the clock # of days for fetchlist generation."
> >>   exit 1
> >> fi
> >>
> >> if [ -n "$3" ]
> >> then
> >>   depth=$3
> >> else
> >>   depth=5
> >> fi
> >>
> >> if [ -n "$4" ]
> >> then
> >>   adddays=$4
> >> else
> >>   adddays=0
> >> fi
> >>
> >> # Only change if your crawl subdirectories are named something different
> >> webdb_dir=$crawl_dir/crawldb
> >> segments_dir=$crawl_dir/segments
> >> linkdb_dir=$crawl_dir/linkdb
> >> index_dir=$crawl_dir/index
> >>
> >> # The generate/fetch/update cycle
> >> for ((i=1; i <= depth ; i++))
> >> do
> >>   bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
> >>   segment=`ls -d $segments_dir/* | tail -1`
> >>   bin/nutch fetch $segment
> >>   bin/nutch updatedb $webdb_dir $segment
> >> done
> >>
> >> # Update segments
> >> bin/nutch invertlinks $linkdb_dir -dir $segments_dir
> >>
> >> # Index segments
> >> new_indexes=$crawl_dir/newindexes
> >> #ls -d $segments_dir/* | tail -$depth | xargs
> >> bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*
> >>
> >> # De-duplicate indexes
> >> bin/nutch dedup $new_indexes
> >>
> >> # Merge indexes
> >> bin/nutch merge $index_dir $new_indexes
> >>
> >> # Tell Tomcat to reload index
> >> touch $nutch_dir/WEB-INF/web.xml
> >>
> >> # Clean up
> >> rm -rf $new_indexes
> >
>
> Oh yeah, you're right, the one I sent out was for 0.8. You should just
> be able to put this at the end of your script:
>
> # Tell Tomcat to reload index
> touch $nutch_dir/WEB-INF/web.xml
>
> and fill in the appropriate path, of course.
>
> Good luck,
> Matt

--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]
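[Editor's note: both recrawl scripts in this thread lean on two small shell idioms that are easy to misread when the script arrives mangled: defaulting optional positional arguments, and picking the newest segment, which works because Nutch names segment directories by timestamp, so the lexicographically last entry from `ls` is the most recent. A minimal standalone sketch of just those two idioms follows; the crawl-test directory and segment names are made up for illustration:]

```shell
# Sketch of two idioms from the recrawl scripts above.
# The directory and segment names here are invented examples.

# 1. Optional positional arguments with defaults, equivalent to the
#    scripts' if [ -n "$2" ] ... else depth=5 ... fi blocks:
depth=${2:-5}      # use $2 if given, otherwise 5
adddays=${3:-0}    # use $3 if given, otherwise 0

# 2. Picking the newest segment: timestamped names sort chronologically,
#    so the last line of a sorted listing is the most recent segment.
mkdir -p crawl-test/segments/20060721120000 crawl-test/segments/20060721130000
segment=$(ls -d crawl-test/segments/* | tail -1)

echo "depth=$depth adddays=$adddays"
echo "newest=$segment"

rm -rf crawl-test  # clean up the illustration directories
```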