Adam,
I'm using cygwin to run the scripts.  I use EditPlus to edit the files.  But 
EditPlus won't allow me to edit the crc file.  I'll see if I can ftp the file 
to a unix machine.
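
A minimal sketch for reading the dump without a special editor. It assumes the whole_db directory holds a plain-text dump file next to the .crc file, and that the .crc file is only a checksum. The file name part-00000 and the helper names below are assumptions for illustration, not something confirmed in this thread. The parser relies only on the entry layout quoted further down (a URL line followed by "Key: value" lines).

```python
from pathlib import Path


def parse_crawldb_dump(text):
    """Parse readdb -dump style text into a list of dicts, one per URL."""
    records = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(("http://", "https://")):
            # A URL line starts a new entry; keep only the URL token.
            current = {"url": line.split()[0]}
            records.append(current)
        elif current is not None and ": " in line:
            # Lines such as "Fetch time: Thu Dec 10 09:19:18 EST 2009".
            key, value = line.split(": ", 1)
            current[key] = value
    return records


def print_fetch_times(dump_path):
    """Print URL, fetch time and retry interval for every entry.

    dump_path is hypothetical, e.g. "whole_db/part-00000".
    """
    text = Path(dump_path).read_text(errors="replace")
    for rec in parse_crawldb_dump(text):
        print(rec["url"], rec.get("Fetch time"), rec.get("Retry interval"), sep=" | ")
```

Calling print_fetch_times("whole_db/part-00000") from the Nutch directory would then list each URL with its next fetch time and retry interval, if the dump file is named that way.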


Vijaya Peters
SRA International, Inc.
12500 Fair Lakes Circle
Room 3507
Fairfax, VA 22033
Tel:  703-222-9207

www.sra.com
This electronic message transmission contains information from SRA 
International, Inc. which may be confidential, privileged or proprietary.  The 
information is intended for the use of the individual or entity named above.  
If you are not the intended recipient, be aware that any disclosure, copying, 
distribution, or use of the contents of this information is strictly 
prohibited.  If you have received this electronic information in error, please 
notify us immediately by telephone at 866-584-2143.



-----Original Message-----
From: BELLINI ADAM [mailto:[email protected]]
Sent: Thu 12/10/2009 6:43 PM
To: [email protected]
Subject: RE: how to force nutch to do a recrawl
 


but how are you running the sh scripts...
you have to use cygwin to be able to edit linux files




> Subject: RE: how to force nutch to do a recrawl
> Date: Thu, 10 Dec 2009 16:09:13 -0500
> From: [email protected]
> To: [email protected]
> 
> Adam,
> I'm on windows unfortunately!!  I'm using cygdrive, but it doesn't
> recognize vi.  Any idea for opening it in windows?  Notepad didn't work
> either.
> 
> Vijaya Peters
> SRA International, Inc.
> 4350 Fair Lakes Court North
> Room 4004
> Fairfax, VA  22033
> Tel:  703-502-1184
> 
> www.sra.com
> Named to FORTUNE's "100 Best Companies to Work For" list for 10
> consecutive years
> Please consider the environment before printing this e-mail
> This electronic message transmission contains information from SRA
> International, Inc. which may be confidential, privileged or
> proprietary.  The information is intended for the use of the individual
> or entity named above.  If you are not the intended recipient, be aware
> that any disclosure, copying, distribution, or use of the contents of
> this information is strictly prohibited.  If you have received this
> electronic information in error, please notify us immediately by
> telephone at 866-584-2143.
> 
> -----Original Message-----
> From: BELLINI ADAM [mailto:[email protected]] 
> Sent: Thursday, December 10, 2009 4:01 PM
> To: [email protected]
> Subject: RE: how to force nutch to do a recrawl
> 
> 
> just use vi or vim
> 
> 
> i use vi to edit the file
> 
> 
> 
> 
> 
> > Subject: RE: how to force nutch to do a recrawl
> > Date: Thu, 10 Dec 2009 15:58:24 -0500
> > From: [email protected]
> > To: [email protected]
> > 
> > Adam,
> > What do I use to open a CRC file? I tried QuickSFV.  Thanks in advance!
> > 
> > Vijaya Peters
> > 
> > -----Original Message-----
> > From: BELLINI ADAM [mailto:[email protected]] 
> > Sent: Thursday, December 10, 2009 3:48 PM
> > To: [email protected]
> > Subject: RE: how to force nutch to do a recrawl
> > 
> > 
> > it will not dump to the console !
> > whole_db is a folder and you have to edit the file you will find in
> > this folder
> > 
> > 
> > 
> > > Subject: RE: how to force nutch to do a recrawl
> > > Date: Thu, 10 Dec 2009 14:26:30 -0500
> > > From: [email protected]
> > > To: [email protected]
> > > 
> > > Adam,
> > > I tried running that command and get the following (it created a
> > > whole_db directory, but it's not dumping out the contents to the
> > > console):
> > > 
> > > $ bin/nutch readdb crawl/crawldb/ -dump whole_db
> > > CrawlDb dump: starting
> > > CrawlDb db: crawl/crawldb/
> > > CrawlDb dump: done
> > > 
> > > Vijaya Peters
> > > -----Original Message-----
> > > From: BELLINI ADAM [mailto:[email protected]] 
> > > Sent: Thursday, December 10, 2009 1:40 PM
> > > To: [email protected]
> > > Subject: RE: how to force nutch to do a recrawl
> > > 
> > > 
> > > hi,
> > > check the fetch time in your crawldb...you can dump all the crawldb
> > > like this:
> > > 
> > > ./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db
> > > 
> > > entries will look like this:
> > > 
> > > http://www.YOUR_URL_TO_FETCH
> > > Status: 2 (db_fetched)
> > > Fetch time: Thu Dec 10 09:19:18 EST 2009
> > > Modified time: Wed Dec 31 19:00:00 EST 1969
> > > Retries since fetch: 0
> > > Retry interval: 18000 seconds (0 days)
> > > Score: 0.0014977538
> > > Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
> > > Metadata: _pst_: success(1), lastModified=0
> > > 
> > > 
> > > as you see, the next time the page will be fetched is given by the
> > > fetch time:
> > > 'Fetch time: Thu Dec 10 09:19:18 EST 2009'
> > > and check the retry interval: it should be your 3600.
> > > 
> > > hope it will help
> > > 
> > > 
> > > > Subject: RE: how to force nutch to do a recrawl
> > > > Date: Wed, 9 Dec 2009 16:06:58 -0500
> > > > From: [email protected]
> > > > To: [email protected]
> > > > 
> > > > Okay.  I'll dig a little deeper.  I saw a few scripts that people
> > > > had created, but I couldn't get them to work.
> > > > 
> > > > Thanks much.
> > > > 
> > > > Vijaya Peters
> > > > 
> > > > -----Original Message-----
> > > > From: MilleBii [mailto:[email protected]] 
> > > > Sent: Wednesday, December 09, 2009 4:05 PM
> > > > To: [email protected]
> > > > Subject: Re: how to force nutch to do a recrawl
> > > > 
> > > > I don't think you can use the nutch crawl command to do that; it is
> > > > a one-stop-shop command.
> > > > You probably want to use the individual commands.
> > > > Type nutch generate to get the help and you will see the option
> > > > -adddays; read that page on the wiki to get a feel for how to do it:
> > > > http://wiki.apache.org/nutch/Crawl
> > > > 
> > > > 2009/12/9 Peters, Vijaya <[email protected]>
> > > > 
> > > > > I didn't see a setting to override in crawl-urlfilter.  How do I
> > > > > set numberDays? I have regular expressions to include/exclude
> > > > > certain extensions and certain urls, but that's all I have in there.
> > > > >
> > > > > Please send me an example and I'll give it a try.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Vijaya Peters
> > > > >
> > > > > -----Original Message-----
> > > > > From: xiao yang [mailto:[email protected]]
> > > > > Sent: Wednesday, December 09, 2009 1:41 PM
> > > > > To: [email protected]
> > > > > Subject: Re: how to force nutch to do a recrawl
> > > > >
> > > > > What about the configuration in crawl-urlfilter.txt?
> > > > >
> > > > > On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya <[email protected]> wrote:
> > > > > > I tried that too.
> > > > > > In nutch-site.xml, I added the below, but this had no effect.
> > > > > >
> > > > > > <property>
> > > > > >  <name>db.default.fetch.interval</name>
> > > > > >  <value>0</value>
> > > > > >  <description>(DEPRECATED) The default number of days between
> > > > > >  re-fetches of a page.  value was 30
> > > > > >  </description>
> > > > > > </property>
> > > > > >
> > > > > > <property>
> > > > > >  <name>db.fetch.interval.default</name>
> > > > > >  <value>3600</value>
> > > > > >  <description>The default number of seconds between re-fetches
> > > > > >  of a page (30 days). value was 2592000 (30 days)
> > > > > >  </description>
> > > > > > </property>
> > > > > >
> > > > > > <property>
> > > > > >  <name>db.fetch.interval.max</name>
> > > > > >  <value>3600</value>
> > > > > >  <description>The maximum number of seconds between re-fetches
> > > > > >  of a page (90 days). After this period every page in the db will
> > > > > >  be re-tried, no matter what is its status.  value was 7776000
> > > > > >  </description>
> > > > > > </property>
> > > > > >
> > > > > > Vijaya Peters
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: MilleBii [mailto:[email protected]]
> > > > > > Sent: Wednesday, December 09, 2009 1:27 PM
> > > > > > To: [email protected]
> > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > >
> > > > > > Nutch only recrawls every 30 days by default. So set numberDays
> > > > > > appropriately and it will recrawl; read nutch-default.xml to get
> > > > > > the details
> > > > > >
> > > > > > 2009/12/9, xiao yang <[email protected]>:
> > > > > >> What do you mean by "recrawl"?
> > > > > > >> Does the following command meet your needs?
> > > > > > >> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > > > > > >> Change the destination directory to one different from the
> > > > > > >> last crawl.
> > > > > >>
> > > > > > >> On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya <[email protected]> wrote:
> > > > > > >>> I'm running Nutch 1.0 in windows.  How do I force Nutch to
> > > > > > >>> do a complete recrawl?
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> thanks,
> > > > > >>>
> > > > > >>> - Vijaya
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> Vijaya Peters
> > > > > >>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > -MilleBii-
> > > > > >
> > > > >
> > > > 
> > > > 
> > > > 
> > > > -- 
> > > > -MilleBii-