-----Original Message-----
From: Renaud Richardet [mailto:[EMAIL PROTECTED]
Sent: Friday, 21 July 2006 22:24
To: nutch-user@lucene.apache.org
Subject: Re: Hadoop and Recrawl
Hi Roberto,

Did you try http://wiki.apache.org/nutch/IntranetRecrawl (thanks to
Matthew Holt)?

HTH,
Renaud

Info wrote:
> Hi List,
> I tried to use this script with Hadoop, but it doesn't work. I tried
> changing ls to bin/hadoop dfs -ls, but the script still fails because
> it uses ls -d, not plain ls. Can someone help me?
> Best Regards,
> Roberto Navoni
>
> -----Original Message-----
> From: Matthew Holt [mailto:[EMAIL PROTECTED]
> Sent: Friday, 21 July 2006 18:58
> To: nutch-user@lucene.apache.org
> Subject: Re: Recrawl script for 0.8.0 completed...
>
> Lourival Júnior wrote:
>
>> I think it won't work for me because I'm using Nutch version 0.7.2.
>> Actually I use this script:
>>
>> #!/bin/bash
>>
>> # A simple script to run a Nutch re-crawl
>> # Script source:
>> # http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
>>
>> #{
>>
>> if [ -n "$1" ]
>> then
>>   crawl_dir=$1
>> else
>>   echo "Usage: recrawl crawl_dir [depth] [adddays]"
>>   exit 1
>> fi
>>
>> if [ -n "$2" ]
>> then
>>   depth=$2
>> else
>>   depth=5
>> fi
>>
>> if [ -n "$3" ]
>> then
>>   adddays=$3
>> else
>>   adddays=0
>> fi
>>
>> webdb_dir=$crawl_dir/db
>> segments_dir=$crawl_dir/segments
>> index_dir=$crawl_dir/index
>>
>> # Stop the Tomcat service
>> #net stop "Apache Tomcat"
>>
>> # The generate/fetch/update cycle
>> for ((i=1; i <= depth ; i++))
>> do
>>   bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
>>   segment=`ls -d $segments_dir/* | tail -1`
>>   bin/nutch fetch $segment
>>   bin/nutch updatedb $webdb_dir $segment
>>   echo
>>   echo "End of cycle $i."
>>   echo
>> done
>>
>> # Update segments
>> echo
>> echo "Updating segments..."
>> echo
>> mkdir tmp
>> bin/nutch updatesegs $webdb_dir $segments_dir tmp
>> rm -R tmp
>>
>> # Index segments
>> echo "Indexing segments..."
>> echo
>> for segment in `ls -d $segments_dir/* | tail -$depth`
>> do
>>   bin/nutch index $segment
>> done
>>
>> # De-duplicate indexes
>> # "bogus" argument is ignored but needed due to
>> # a bug in the number of args expected
>> bin/nutch dedup $segments_dir bogus
>>
>> # Merge indexes
>> #echo "Merging segments..."
>> #echo
>> ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
>>
>> chmod 777 -R $index_dir
>>
>> # Start the Tomcat service
>> #net start "Apache Tomcat"
>>
>> echo "Done."
>>
>> #} > recrawl.log 2>&1
>>
>> As you suggested, I used the touch command instead of stopping
>> Tomcat. However, I get the error posted in my previous message. I'm
>> running Nutch on Windows with Cygwin, and I only get no errors when
>> I stop Tomcat. I use this command to call the script:
>>
>> ./recrawl crawl-legislacao 1
>>
>> Could you give me more clarification?
>>
>> Thanks a lot!
>>
>> On 7/21/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>>
>>> Lourival Júnior wrote:
>>>
>>>> Hi Renaud!
>>>>
>>>> I'm a newbie with shell scripts, and I know stopping the Tomcat
>>>> service is not the best way to do this.
>>>> The problem is that when I run the re-crawl script with Tomcat
>>>> started, I get this error:
>>>>
>>>> 060721 132224 merging segment indexes to: crawl-legislacao2\index
>>>> Exception in thread "main" java.io.IOException: Cannot delete _0.f0
>>>>     at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
>>>>     at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
>>>>     at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
>>>>     at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
>>>>     at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
>>>>     at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)
>>>>
>>>> So, I want another way to re-crawl my pages without this error and
>>>> without restarting Tomcat. Could you suggest one?
>>>>
>>>> Thanks a lot!
>>>>
>>> Try this updated script and tell me exactly what command you run to
>>> call it. Then let me know the error message.
>>>
>>> Matt
>>>
>>> #!/bin/bash
>>>
>>> # Nutch recrawl script.
>>> # Based on the 0.7.2 script at
>>> # http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
>>> # Modified by Matthew Holt
>>>
>>> if [ -n "$1" ]
>>> then
>>>   nutch_dir=$1
>>> else
>>>   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
>>>   echo "servlet_path - Path of the nutch servlet (i.e. /usr/local/tomcat/webapps/ROOT)"
>>>   echo "crawl_dir - Name of the directory the crawl is located in."
>>>   echo "[depth] - The link depth from the root page that should be crawled."
>>>   echo "[adddays] - Advance the clock # of days for fetchlist generation."
>>>   exit 1
>>> fi
>>>
>>> if [ -n "$2" ]
>>> then
>>>   crawl_dir=$2
>>> else
>>>   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
>>>   echo "servlet_path - Path of the nutch servlet (i.e. /usr/local/tomcat/webapps/ROOT)"
>>>   echo "crawl_dir - Name of the directory the crawl is located in."
>>>   echo "[depth] - The link depth from the root page that should be crawled."
>>>   echo "[adddays] - Advance the clock # of days for fetchlist generation."
>>>   exit 1
>>> fi
>>>
>>> if [ -n "$3" ]
>>> then
>>>   depth=$3
>>> else
>>>   depth=5
>>> fi
>>>
>>> if [ -n "$4" ]
>>> then
>>>   adddays=$4
>>> else
>>>   adddays=0
>>> fi
>>>
>>> # Only change if your crawl subdirectories are named something different
>>> webdb_dir=$crawl_dir/crawldb
>>> segments_dir=$crawl_dir/segments
>>> linkdb_dir=$crawl_dir/linkdb
>>> index_dir=$crawl_dir/index
>>>
>>> # The generate/fetch/update cycle
>>> for ((i=1; i <= depth ; i++))
>>> do
>>>   bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
>>>   segment=`ls -d $segments_dir/* | tail -1`
>>>   bin/nutch fetch $segment
>>>   bin/nutch updatedb $webdb_dir $segment
>>> done
>>>
>>> # Update segments
>>> bin/nutch invertlinks $linkdb_dir -dir $segments_dir
>>>
>>> # Index segments
>>> new_indexes=$crawl_dir/newindexes
>>> #ls -d $segments_dir/* | tail -$depth | xargs
>>> bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*
>>>
>>> # De-duplicate indexes
>>> bin/nutch dedup $new_indexes
>>>
>>> # Merge indexes
>>> bin/nutch merge $index_dir $new_indexes
>>>
>>> # Tell Tomcat to reload index
>>> touch $nutch_dir/WEB-INF/web.xml
>>>
>>> # Clean up
>>> rm -rf $new_indexes
>>>
>>
> Oh yeah, you're right, the one I sent out was for 0.8. You should just
> be able to put this at the end of your script:
>
> # Tell Tomcat to reload index
> touch $nutch_dir/WEB-INF/web.xml
>
> and fill in the appropriate path, of course.
> gluck
> matt

--
Renaud Richardet
COO America
Wyona Inc. - Open Source Content Management - Apache Lenya
office +1 857 776-3195
mobile +1 617 230 9112
renaud.richardet <at> wyona.com
http://www.wyona.com
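[Editor's note on Roberto's original question: the script finds the
newest segment with `segment=\`ls -d $segments_dir/* | tail -1\``, which
has no direct DFS equivalent. The sketch below is one hedged way to
adapt it, not tested against any particular Hadoop release: it assumes
`bin/hadoop dfs -ls` prints a "Found N items" header followed by one
entry per line with the path in the first field (field positions vary
between Hadoop versions), and it relies on segment names being
timestamps, so a lexical sort puts the newest last.]

```shell
#!/bin/bash
# Hypothetical helper: pick the newest segment directory out of a
# `bin/hadoop dfs -ls` listing read from stdin.
latest_segment() {
  grep -v '^Found' |   # drop the "Found N items" header line, if present
  awk '{print $1}' |   # keep only the path (assumed to be field 1)
  sort |               # segment names are timestamps, so lexical sort works
  tail -1              # newest segment comes last
}

# In the recrawl loop, the local-filesystem line
#   segment=`ls -d $segments_dir/* | tail -1`
# would then become:
#   segment=`bin/hadoop dfs -ls $segments_dir | latest_segment`
```

Verify the listing format of your own Hadoop build first; if the path is
not the first field, adjust the awk expression accordingly.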