Sorry... when I cut and pasted the script below, it didn't come out
properly, so I have just re-aligned the lines within it. Please review it
and let me know where I am going wrong.

Thanks a lot.

-----Original Message-----
From: Malaviya, Sanjay X 
Sent: Thursday, May 28, 2009 1:57 PM
To: [email protected]
Subject: Recrawl not picking up changes to the web site.

 
Please Help!!!
My script, which is supposed to recrawl and pick up new and modified
documents on the web site, is not working. It's not throwing any errors,
but it isn't picking up any changes I make to the files or any new
documents I add.
The script I have is --
---------------------------------
#!/bin/bash

# tomcat_dir=$1  crawl_dir=$2   depth=$3  adddays=$4  topn="-topN $5"
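# NOTE: those are the wiki script's command-line arguments; this copy
# hardcodes them below and never applies adddays (see the note after the
# script).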

# Set JAVA_HOME to reflect your system's Java configuration
export JAVA_HOME='/cygdrive/c/Program Files/Java/jre1.6.0_01'

# Set the paths
nutch_dir='/cygdrive/d/inet/apps/nutch-0.9/bin'
crawl_dir='/cygdrive/d/inet/apps/nutch-0.9/crawl'
tomcat_dir='/cygdrive/d/inet/apps/Tomcat/webapps/nutch-0.9'
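# Number of generate/fetch/update rounds per recrawl run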
depth=10

# Only change if your crawl subdirectories are named something different

webdb_dir=$crawl_dir/crawldb 
segments_dir=$crawl_dir/segments 
linkdb_dir=$crawl_dir/linkdb 
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  # Generate a fetch list, fetch it, then fold the results into the crawldb.
  # Note: updatedb takes the crawldb and the fetched segment, not the
  # whole segments directory.
  $nutch_dir/nutch generate $webdb_dir $segments_dir -topN 1000
  segment=`ls -d $segments_dir/* | tail -1`
  $nutch_dir/nutch fetch $segment
  $nutch_dir/nutch updatedb $webdb_dir $segment
done

# Merge segments and cleanup unused segments 
mergesegs_dir=$crawl_dir/mergesegs_dir
$nutch_dir/nutch mergesegs $mergesegs_dir -dir $segments_dir

# Remove the pre-merge segments; only the merged copy is kept
for segment in `ls -d $segments_dir/* | tail -$depth`
do
  echo "Removing Temporary Segment: $segment"
  rm -rf $segment
done

cp -R $mergesegs_dir/* $segments_dir
rm -rf $mergesegs_dir

# Invert links so the linkdb covers the merged segments
$nutch_dir/nutch invertlinks $linkdb_dir -dir $segments_dir

# Index the merged segment into a fresh set of indexes
new_indexes=$crawl_dir/newindexes
segment=`ls -d $segments_dir/* | tail -1`
$nutch_dir/nutch index $new_indexes $webdb_dir $linkdb_dir $segment

# De-duplicate the new indexes
$nutch_dir/nutch dedup $new_indexes

# Merge the new indexes into the main index
$nutch_dir/nutch merge $index_dir $new_indexes

# Touch web.xml so Tomcat reloads the webapp and reopens the index
touch $tomcat_dir/WEB-INF/web.xml

# Clean up
rm -rf $new_indexes

--------------------------------- 
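One likely culprit, offered only as a guess since there are no errors to
go on: Nutch will not re-fetch a URL until its fetch interval has expired
(30 days by default in 0.9), so a recrawl run sooner than that generates
an empty fetch list and nothing appears to change. That would also explain
the missing new documents, since the pages linking to them are never
re-fetched. The adddays argument noted in the comment at the top of the
script exists for exactly this; -adddays makes generate treat every URL as
that many days older. A minimal sketch against the generate line above,
assuming the default 30-day interval:

  # adddays=31 pushes URLs one day past the default 30-day fetch
  # interval, so generate considers them due for re-fetching
  adddays=31
  $nutch_dir/nutch generate $webdb_dir $segments_dir -topN 1000 -adddays $adddays
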

Sanjay
-----Original Message-----
From: Kenan Azam [mailto:[email protected]]
Sent: Tuesday, May 26, 2009 4:41 PM
To: [email protected]
Subject: Re: Shell Script to maintain Nutch index

Here is a URL to scripts for Nutch 0.8 and 0.9:
http://wiki.apache.org/nutch/IntranetRecrawl#head-93eea6620f57b24dbe3591c293aead539a017ec7
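
To run it every day, a cron sketch (recrawl.sh is an assumed name for
wherever you save the wiki script; the arguments follow its
tomcat_dir / crawl_dir / depth / adddays order):

  # run the recrawl daily at 2am from the Nutch home directory
  0 2 * * * cd /cygdrive/d/inet/apps/nutch-0.9 && ./recrawl.sh /cygdrive/d/inet/apps/Tomcat/webapps/nutch-0.9 crawl 10 31 >> recrawl.log 2>&1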




On Tue, May 26, 2009 at 2:07 PM, Malaviya, Sanjay X
<[email protected]> wrote:

> I found a script for maintaining the Nutch index, but it seems to be
> quite old, maybe for version 0.7. If I run it I get a bunch of errors.
>
> Commands like bin/nutch analyze are not there in version 0.9 or 1.0.
> Similarly, bin/nutch index requires a bunch of inputs, and there is no
> crawl/tmpfile.
>
> -----------------------
> #!/bin/bash
>
>  # Set JAVA_HOME to reflect your systems java configuration
>  export JAVA_HOME=/usr/lib/j2sdk1.5-sun
>
>  # Start index updation
>  bin/nutch generate crawl.virtusa/db crawl.virtusa/segments -topN 1000
>  s=`ls -d crawl.virtusa/segments/2* | tail -1`
>  echo Segment is $s
>  bin/nutch fetch $s
>  bin/nutch updatedb crawl.virtusa/db $s
>  bin/nutch analyze crawl.virtusa/db 5
>  bin/nutch index $s
>  bin/nutch dedup crawl.virtusa/segments crawl.virtusa/tmpfile
>
>  # Merge segments to prevent too many open files exception in Lucene
>  bin/nutch mergesegs -dir crawl.virtusa/segments -i -ds
>  s=`ls -d crawl.virtusa/segments/2* | tail -1`
>  echo Merged Segment is $s
>
>  rm -rf crawl.virtusa/index
>
> -----------------------
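>
> A rough 0.9 equivalent of the indexing steps above, as a sketch (keeping
> the crawl.virtusa layout; "analyze" simply has no 0.9 counterpart, and
> index/dedup now work on a separate indexes directory):
>
>  s=`ls -d crawl.virtusa/segments/2* | tail -1`
>  bin/nutch invertlinks crawl.virtusa/linkdb -dir crawl.virtusa/segments
>  bin/nutch index crawl.virtusa/newindexes crawl.virtusa/db crawl.virtusa/linkdb $s
>  bin/nutch dedup crawl.virtusa/newindexes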
>
>
> Sanjay
> -----Original Message-----
> From: Malaviya, Sanjay X
> [mailto:[email protected]]
> Sent: Tuesday, May 26, 2009 3:11 PM
> To: [email protected]
> Subject: Shell Script to maintain Nutch index
>
> Hi,
> Does anyone have a shell script to maintain the Nutch index that can
> be scheduled to run every day? It would take care of the updates
> happening on the web sites. I need it for version 0.9 or 1.0.
>
> Thanks
> Sanjay