Sorry, when I cut and pasted the script below it didn't come out properly, so I have re-aligned the lines within it. Please review it and let me know where I am going wrong.
Thanks a lot.

-----Original Message-----
From: Malaviya, Sanjay X
Sent: Thursday, May 28, 2009 1:57 PM
To: [email protected]
Subject: Recrawl not picking up changes to the web site. Please Help!!!

My script that is supposed to recrawl and pick up new and modified documents on the web site is not working. It does not throw any errors, but it is not picking up changes I make to existing files or any new documents I add. The script I have is:

---------------------------------
#!/bin/bash

# tomcat_dir=$1
crawl_dir=$2
depth=$3
adddays=$4
topn="-topN $5"

# Set JAVA_HOME to reflect your system's Java configuration
export JAVA_HOME='/cygdrive/c/Program Files/Java/jre1.6.0_01'

# Set the paths
nutch_dir='/cygdrive/d/inet/apps/nutch-0.9/bin'
crawl_dir='/cygdrive/d/inet/apps/nutch-0.9/crawl'
tomcat_dir='/cygdrive/d/inet/apps/Tomcat/webapps/nutch-0.9'
depth=10

# Only change if your crawl subdirectories are named something different
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/segments
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth; i++))
do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb crawl/segments
done

# Merge segments and clean up unused segments
mergesegs_dir=crawl/mergesegs_dir
bin/nutch mergesegs crawl/mergesegs_dir -dir crawl/segments
for segment in `ls -d crawl/segments/* | tail -10`
do
  echo "Removing Temporary Segment: $segment"
  rm -rf $segment
done
cp -R crawl/mergesegs_dir/* crawl/segments
rm -rf crawl/mergesegs_dir

# Update segments
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

# Index segments
new_indexes=crawl/newindexes
segment=`ls -d crawl/segments/* | tail -1`
bin/nutch index crawl/newindexes crawl/crawldb crawl/linkdb $segment

# De-duplicate indexes
bin/nutch dedup crawl/newindexes

# Merge indexes
bin/nutch merge crawl/index crawl/newindexes

# Tell Tomcat to reload the index
touch /cygdrive/d/inet/apps/Tomcat/webapps/nutch-0.9/WEB-INF/web.xml

# Clean up
rm -rf crawl/newindexes
---------------------------------

Sanjay

-----Original Message-----
From: Kenan Azam [mailto:[email protected]]
Sent: Tuesday, May 26, 2009 4:41 PM
To: [email protected]
Subject: Re: Shell Script to maintain Nutch index

Here is a URL to scripts for Nutch 0.8 and 0.9:
http://wiki.apache.org/nutch/IntranetRecrawl#head-93eea6620f57b24dbe3591c293aead539a017ec7

On Tue, May 26, 2009 at 2:07 PM, Malaviya, Sanjay X <[email protected]> wrote:

> I found a script for maintaining the Nutch index, but it seems to be
> quite old, maybe for version 0.7; if I run it I get a bunch of errors.
>
> Parameters like bin/nutch analyze are not there in version 0.9 or 1.0.
> Similarly, bin/nutch index requires a bunch of inputs, and there is no
> crawl/tmpfile.
>
> -----------------------
> #!/bin/bash
>
> # Set JAVA_HOME to reflect your system's Java configuration
> export JAVA_HOME=/usr/lib/j2sdk1.5-sun
>
> # Start index update
> bin/nutch generate crawl.virtusa/db crawl.virtusa/segments -topN 1000
> s=`ls -d crawl.virtusa/segments/2* | tail -1`
> echo Segment is $s
> bin/nutch fetch $s
> bin/nutch updatedb crawl.virtusa/db $s
> bin/nutch analyze crawl.virtusa/db 5
> bin/nutch index $s
> bin/nutch dedup crawl.virtusa/segments crawl.virtusa/tmpfile
>
> # Merge segments to prevent too many open files exception in Lucene
> bin/nutch mergesegs -dir crawl.virtusa/segments -i -ds
> s=`ls -d crawl.virtusa/segments/2* | tail -1`
> echo Merged Segment is $s
>
> rm -rf crawl.virtusa/index
>
> -----------------------
>
> Sanjay
>
> -----Original Message-----
> From: Malaviya, Sanjay X
> [mailto:[email protected]]
> Sent: Tuesday, May 26, 2009 3:11 PM
> To: [email protected]
> Subject: Shell Script to maintain Nutch index
>
> Hi,
> Does anyone have a shell script to maintain the Nutch index that can be
> scheduled to run every day? This would take care of the updates
> happening on the web sites. I need it for version 0.9 or 1.0.
>
> Thanks
> Sanjay
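
A likely culprit, given the script above: by default Nutch will not re-fetch a page until db.fetch.interval.default (30 days) has elapsed, so generate produces an empty fetch list and nothing gets updated. The recrawl scripts on the wiki page linked above work around this with generate's -adddays option, which the script reads into adddays=$4 but never passes on; the hard-coded assignments to crawl_dir and depth also overwrite the command-line arguments. A minimal sketch of the corrected generate/fetch/update cycle, assuming the same directory layout and that the rest of the script is unchanged:

---------------------------------
# Sketch only: drop the hard-coded crawl_dir/depth assignments further up
# so the command-line arguments ($2-$5) actually take effect.
for ((i=1; i <= depth; i++))
do
  # -adddays shifts "now" forward so pages fetched within the default
  # 30-day interval become eligible for re-fetching.
  $nutch_dir/nutch generate $webdb_dir $segments_dir $topn -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  $nutch_dir/nutch fetch $segment
  $nutch_dir/nutch updatedb $webdb_dir $segment
done
---------------------------------

Note that updatedb is given the single new segment here rather than the whole segments directory, and the defined variables ($nutch_dir, $webdb_dir, $segments_dir) are used instead of the relative bin/nutch and crawl/... paths, which only resolve if the script happens to be run from the Nutch install directory. The 0.7-era steps quoted above (analyze, index on a bare segment, dedup with a tmpfile) no longer exist in that form in 0.9; invertlinks followed by index, dedup, and merge, as in the script above, replaces them.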

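For the original question of scheduling this every day: assuming the argument handling above is restored, the script (saved here under the hypothetical name recrawl.sh) could be invoked and scheduled along these lines. With the default 30-day fetch interval, an adddays of 31 makes every page eligible for re-fetching on each run. This also presumes a working cron under Cygwin (the cron package running as a service), which is an assumption, not something stated in the thread.

---------------------------------
# Hypothetical manual run: tomcat dir, crawl dir, depth, adddays, topN
./recrawl.sh /cygdrive/d/inet/apps/Tomcat/webapps/nutch-0.9 \
             /cygdrive/d/inet/apps/nutch-0.9/crawl 10 31 1000

# Example crontab entry, nightly at 2:00 AM (a single line in the crontab)
0 2 * * * /cygdrive/d/inet/apps/nutch-0.9/recrawl.sh /cygdrive/d/inet/apps/Tomcat/webapps/nutch-0.9 /cygdrive/d/inet/apps/nutch-0.9/crawl 10 31 1000
---------------------------------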