Lourival,

I have typically seen the same issues on a Cygwin/Windows setup. The
only thing that worked for me was shutting down and restarting Tomcat,
instead of just reloading the context. Now that I'm on Linux, I don't
have these issues anymore.
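If a full restart is too disruptive, one workaround you could try (just a sketch, untested on Cygwin/Windows; the directory names like `index-new` are my own, not from the Nutch docs) is to merge into a fresh directory and then swap it into place, so the merge step never has to delete an index file Tomcat still has open. Be warned that Windows can also block the rename while files are open, so this may not be enough:

```shell
#!/bin/bash
# Sketch only: merge into a fresh index directory, then swap it into place.
# Directory names are illustrative; adjust them to your crawl layout.
crawl_dir=crawl-legislacao
index_dir=$crawl_dir/index
new_index=$crawl_dir/index-new

rm -rf "$new_index"
mkdir -p "$new_index"

# Run the merge against the fresh directory instead of the live one,
# e.g. on 0.7.x (left commented out here):
# ls -d "$crawl_dir"/segments/* | xargs bin/nutch merge "$new_index"

# Swap: move the live index aside, put the new one in place, then clean up.
if [ -d "$index_dir" ]; then
  mv "$index_dir" "$index_dir.old"
fi
mv "$new_index" "$index_dir"
rm -rf "$index_dir.old"

# Nudge Tomcat to reload the index (path is illustrative):
# touch /usr/local/tomcat/webapps/ROOT/WEB-INF/web.xml
```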

Rgrds, Thomas

On 7/21/06, Lourival Júnior <[EMAIL PROTECTED]> wrote:
Ok. However, a few minutes ago I ran the script exactly as you said, and I
still get this error:

Exception in thread "main" java.io.IOException: Cannot delete _0.f0
        at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
        at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
        at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java
:141)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
        at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
        at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)

I don't know for sure, but I think it occurs because Nutch tries to delete a
file that Tomcat has loaded into memory, causing a permission error. Any idea?

On 7/21/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>
> Lourival Júnior wrote:
> > I think it won't work for me because I'm using Nutch version 0.7.2.
> > Actually I use this script (some comments are in Portuguese):
> >
> > #!/bin/bash
> >
> > # A simple script to run a Nutch re-crawl
> > # Script source:
> >
> http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
> >
> > #{
> >
> > if [ -n "$1" ]
> > then
> >  crawl_dir=$1
> > else
> >  echo "Usage: recrawl crawl_dir [depth] [adddays]"
> >  exit 1
> > fi
> >
> > if [ -n "$2" ]
> > then
> >  depth=$2
> > else
> >  depth=5
> > fi
> >
> > if [ -n "$3" ]
> > then
> >  adddays=$3
> > else
> >  adddays=0
> > fi
> >
> > webdb_dir=$crawl_dir/db
> > segments_dir=$crawl_dir/segments
> > index_dir=$crawl_dir/index
> >
> > # Stop the Tomcat service
> > #net stop "Apache Tomcat"
> >
> > # The generate/fetch/update cycle
> > for ((i=1; i <= depth ; i++))
> > do
> >  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
> >  segment=`ls -d $segments_dir/* | tail -1`
> >  bin/nutch fetch $segment
> >  bin/nutch updatedb $webdb_dir $segment
> >  echo
> >  echo "End of cycle $i."
> >  echo
> > done
> >
> > # Update segments
> > echo
> > echo "Updating the segments..."
> > echo
> > mkdir tmp
> > bin/nutch updatesegs $webdb_dir $segments_dir tmp
> > rm -R tmp
> >
> > # Index segments
> > echo "Indexing the segments..."
> > echo
> > for segment in `ls -d $segments_dir/* | tail -$depth`
> > do
> >  bin/nutch index $segment
> > done
> >
> > # De-duplicate indexes
> > # "bogus" argument is ignored but needed due to
> > # a bug in the number of args expected
> > bin/nutch dedup $segments_dir bogus
> >
> > # Merge indexes
> > #echo "Merging the indexes..."
> > #echo
> > ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
> >
> > chmod 777 -R $index_dir
> >
> > # Start the Tomcat service
> > #net start "Apache Tomcat"
> >
> > echo "Done."
> >
> > #} > recrawl.log 2>&1
> >
> > As you suggested, I used the touch command instead of stopping Tomcat.
> > However, I get the error posted in the previous message. I'm running
> > Nutch on the Windows platform with Cygwin. The only time I get no errors
> > is when I stop Tomcat. I use this command to call the script:
> >
> > ./recrawl crawl-legislacao 1
> >
> > Could you clarify things further for me?
> >
> > Thanks a lot!
> >
> > On 7/21/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> >>
> >> Lourival Júnior wrote:
> >> > Hi Renaud!
> >> >
> >> > I'm a newbie with shell scripts, and I know stopping the Tomcat
> >> > service is not the best way to do this. The problem is, when I run
> >> > the re-crawl script with Tomcat started, I get this error:
> >> >
> >> > 060721 132224 merging segment indexes to: crawl-legislacao2\index
> >> > Exception in thread "main" java.io.IOException: Cannot delete _0.f0
> >> >        at
> >> > org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
> >> >        at
> >> org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
> >> >        at
> >> > org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java
> >> > :141)
> >> >        at
> >> > org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
> >> >        at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java
> >> :92)
> >> >        at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java
> >> :160)
> >> >
> >> > So, I want another way to re-crawl my pages, without this error and
> >> > without restarting Tomcat. Could you suggest one?
> >> >
> >> > Thanks a lot!
> >> >
> >> >
> >> Try this updated script, and tell me exactly what command you run to
> >> call the script. Then let me know the resulting error message.
> >>
> >> Matt
> >>
> >>
> >> #!/bin/bash
> >>
> >> # Nutch recrawl script.
> >> # Based on 0.7.2 script at
> >>
> http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
> >>
> >> # Modified by Matthew Holt
> >>
> >> if [ -n "$1" ]
> >> then
> >>   nutch_dir=$1
> >> else
> >>   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
> >>   echo "servlet_path - Path of the nutch servlet (e.g.
> >> /usr/local/tomcat/webapps/ROOT)"
> >>   echo "crawl_dir - Name of the directory the crawl is located in."
> >>   echo "[depth] - The link depth from the root page that should be
> >> crawled."
> >>   echo "[adddays] - Advance the clock # of days for fetchlist
> >> generation."
> >>   exit 1
> >> fi
> >>
> >> if [ -n "$2" ]
> >> then
> >>   crawl_dir=$2
> >> else
> >>   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
> >>   echo "servlet_path - Path of the nutch servlet (e.g.
> >> /usr/local/tomcat/webapps/ROOT)"
> >>   echo "crawl_dir - Name of the directory the crawl is located in."
> >>   echo "[depth] - The link depth from the root page that should be
> >> crawled."
> >>   echo "[adddays] - Advance the clock # of days for fetchlist
> >> generation."
> >>   exit 1
> >> fi
> >>
> >> if [ -n "$3" ]
> >> then
> >>   depth=$3
> >> else
> >>   depth=5
> >> fi
> >>
> >> if [ -n "$4" ]
> >> then
> >>   adddays=$4
> >> else
> >>   adddays=0
> >> fi
> >>
> >> # Only change if your crawl subdirectories are named something
> different
> >> webdb_dir=$crawl_dir/crawldb
> >> segments_dir=$crawl_dir/segments
> >> linkdb_dir=$crawl_dir/linkdb
> >> index_dir=$crawl_dir/index
> >>
> >> # The generate/fetch/update cycle
> >> for ((i=1; i <= depth ; i++))
> >> do
> >>   bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
> >>   segment=`ls -d $segments_dir/* | tail -1`
> >>   bin/nutch fetch $segment
> >>   bin/nutch updatedb $webdb_dir $segment
> >> done
> >>
> >> # Update segments
> >> bin/nutch invertlinks $linkdb_dir -dir $segments_dir
> >>
> >> # Index segments
> >> new_indexes=$crawl_dir/newindexes
> >> #ls -d $segments_dir/* | tail -$depth | xargs
> >> bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*
> >>
> >> # De-duplicate indexes
> >> bin/nutch dedup $new_indexes
> >>
> >> # Merge indexes
> >> bin/nutch merge $index_dir $new_indexes
> >>
> >> # Tell Tomcat to reload index
> >> touch $nutch_dir/WEB-INF/web.xml
> >>
> >> # Clean up
> >> rm -rf $new_indexes
> >>
> >>
> >
> >
> Oh yeah, you're right, the one I sent out was for 0.8. You should just
> be able to put this at the end of your script:
>
> # Tell Tomcat to reload index
> touch $nutch_dir/WEB-INF/web.xml
>
> and fill in the appropriate path, of course.
> Good luck,
> Matt
>



--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]