Hi,

You shouldn't open the crc file. You have to open the other one, which is
part-00000.
Use vi to open part-00000.
If you can't find this file, your dump failed, so just check the
logs/hadoop.log file.
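
If you don't have an editor that works, something like the following should print the first dumped records straight to the console. This is only a rough sketch: show_dump is a made-up helper name, and whole_db is assumed to be the output directory you passed to readdb -dump.

```shell
# Hypothetical helper (not part of Nutch): print the start of a CrawlDb dump,
# or point to the log file when the dump file is missing.
# Assumes the dump directory is the one given to "readdb -dump"
# (whole_db in this thread).
show_dump() {
    dump_dir="${1:-whole_db}"
    if [ -f "$dump_dir/part-00000" ]; then
        # part-00000 is plain text; the crc file beside it is only a checksum
        head -n 20 "$dump_dir/part-00000"
    else
        echo "no part-00000 in $dump_dir; check logs/hadoop.log" >&2
        return 1
    fi
}
```

From the Nutch directory you would call it as: show_dump whole_db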






> Subject: RE: how to force nutch to do a recrawl
> Date: Fri, 11 Dec 2009 09:14:26 -0500
> From: [email protected]
> To: [email protected]
> 
> Adam,
> I'm using cygwin to run the scripts.  I use EditPlus to edit the files.  But 
> EditPlus won't allow me to edit the crc file.  I'll see if I can ftp the file 
> to a unix machine.
> 
> 
> Vijaya Peters
> SRA International, Inc.
> 12500 Fair Lakes Circle
> Room 3507
> Fairfax, VA 22033
> Tel:  703-222-9207
> 
> www.sra.com
> This electronic message transmission contains information from SRA 
> International, Inc. which may be confidential, privileged or proprietary.  
> The information is intended for the use of the individual or entity named 
> above.  If you are not the intended recipient, be aware that any disclosure, 
> copying, distribution, or use of the contents of this information is strictly 
> prohibited.  If you have received this electronic information in error, 
> please notify us immediately by telephone at 866-584-2143.
> 
> 
> 
> -----Original Message-----
> From: BELLINI ADAM [mailto:[email protected]]
> Sent: Thu 12/10/2009 6:43 PM
> To: [email protected]
> Subject: RE: how to force nutch to do a recrawl
>  
> 
> 
> But how are you running the sh scripts?
> You have to use Cygwin to be able to edit Linux files.
> 
> 
> 
> 
> > Subject: RE: how to force nutch to do a recrawl
> > Date: Thu, 10 Dec 2009 16:09:13 -0500
> > From: [email protected]
> > To: [email protected]
> > 
> > Adam,
> > I'm on windows unfortunately!!  I'm using cygdrive, but it doesn't
> > recognize vi.  Any idea for opening it in windows?  Notepad didn't work
> > either.
> > 
> > Vijaya Peters
> > 
> > -----Original Message-----
> > From: BELLINI ADAM [mailto:[email protected]] 
> > Sent: Thursday, December 10, 2009 4:01 PM
> > To: [email protected]
> > Subject: RE: how to force nutch to do a recrawl
> > 
> > 
> > Just use vi or vim.
> > 
> > I use vi to edit the file.
> > 
> > 
> > 
> > 
> > 
> > > Subject: RE: how to force nutch to do a recrawl
> > > Date: Thu, 10 Dec 2009 15:58:24 -0500
> > > From: [email protected]
> > > To: [email protected]
> > > 
> > > Adam,
> > > What do I use to open a CRC file? I tried QuickSFV.  Thanks in advance!
> > > 
> > > Vijaya Peters
> > > 
> > > -----Original Message-----
> > > From: BELLINI ADAM [mailto:[email protected]] 
> > > Sent: Thursday, December 10, 2009 3:48 PM
> > > To: [email protected]
> > > Subject: RE: how to force nutch to do a recrawl
> > > 
> > > 
> > > It will not dump to the console!
> > > whole_db is a folder, and you have to open the file you will find in
> > > this folder.
> > > 
> > > 
> > > 
> > > > Subject: RE: how to force nutch to do a recrawl
> > > > Date: Thu, 10 Dec 2009 14:26:30 -0500
> > > > From: [email protected]
> > > > To: [email protected]
> > > > 
> > > > Adam,
> > > > I tried running that command and get the following (it created a
> > > > whole_db directory, but it's not dumping out the contents to the
> > > > console):
> > > > 
> > > > $ bin/nutch readdb crawl/crawldb/ -dump whole_db
> > > > CrawlDb dump: starting
> > > > CrawlDb db: crawl/crawldb/
> > > > CrawlDb dump: done
> > > > 
> > > > Vijaya Peters
> > > >
> > > > -----Original Message-----
> > > > From: BELLINI ADAM [mailto:[email protected]] 
> > > > Sent: Thursday, December 10, 2009 1:40 PM
> > > > To: [email protected]
> > > > Subject: RE: how to force nutch to do a recrawl
> > > > 
> > > > 
> > > > Hi,
> > > > Check the fetch time in your crawldb. You can dump the whole crawldb
> > > > like this:
> > > > 
> > > > ./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db
> > > > 
> > > > entries will look like this:
> > > > 
> > > > http://www.YOUR_URL_TO_FETCH
> > > > Status: 2 (db_fetched)
> > > > Fetch time: Thu Dec 10 09:19:18 EST 2009
> > > > Modified time: Wed Dec 31 19:00:00 EST 1969
> > > > Retries since fetch: 0
> > > > Retry interval: 18000 seconds (0 days)
> > > > Score: 0.0014977538
> > > > Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
> > > > Metadata: _pst_: success(1), lastModified=0
> > > > 
> > > > 
> > > > As you can see, the next time the page will be fetched is given by the
> > > > fetch time:
> > > > 'Fetch time: Thu Dec 10 09:19:18 EST 2009'
> > > > Also check the retry interval: it should be your 3600.
> > > > 
> > > > Hope it helps.
> > > > 
> > > > 
> > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > Date: Wed, 9 Dec 2009 16:06:58 -0500
> > > > > From: [email protected]
> > > > > To: [email protected]
> > > > > 
> > > > > Okay.  I'll dig a little deeper.  I saw a few scripts that people had
> > > > > created, but I couldn't get them to work.
> > > > > 
> > > > > Thanks much.
> > > > > 
> > > > > Vijaya Peters
> > > > > 
> > > > > -----Original Message-----
> > > > > From: MilleBii [mailto:[email protected]] 
> > > > > Sent: Wednesday, December 09, 2009 4:05 PM
> > > > > To: [email protected]
> > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > 
> > > > > I don't think you can use the nutch crawl command to do that; it is a
> > > > > one-stop-shop command.
> > > > > You probably want to use the individual commands.
> > > > > Type nutch generate to get the help and you will see the option
> > > > > -adddays. Read this page on the wiki to get a feel for how to do it:
> > > > > http://wiki.apache.org/nutch/Crawl
> > > > > 
> > > > > 2009/12/9 Peters, Vijaya <[email protected]>
> > > > > 
> > > > > > > I didn't see a setting to override in crawl-urlfilter.  How do I set
> > > > > > > numberDays? I have regular expressions to include/exclude certain
> > > > > > > extensions and certain urls, but that's all I have in there.
> > > > > >
> > > > > > Please send me an example and I'll give it a try.
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > Vijaya Peters
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: xiao yang [mailto:[email protected]]
> > > > > > Sent: Wednesday, December 09, 2009 1:41 PM
> > > > > > To: [email protected]
> > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > >
> > > > > > What about the configuration in crawl-urlfilter.txt?
> > > > > >
> > > > > > On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya
> > > > > <[email protected]>
> > > > > > wrote:
> > > > > > > > I tried that too.
> > > > > > > > In nutch-site.xml, I added the properties below, but this had no
> > > > > > > > effect.
> > > > > > > >
> > > > > > > > <property>
> > > > > > > >  <name>db.default.fetch.interval</name>
> > > > > > > >  <value>0</value>
> > > > > > > >  <description>(DEPRECATED) The default number of days between
> > > > > > > >  re-fetches of a page.  value was 30
> > > > > > > >  </description>
> > > > > > > > </property>
> > > > > > > >
> > > > > > > > <property>
> > > > > > > >  <name>db.fetch.interval.default</name>
> > > > > > > >  <value>3600</value>
> > > > > > > >  <description>The default number of seconds between re-fetches
> > > > > > > >  of a page (30 days). value was 2592000 (30 days)
> > > > > > > >  </description>
> > > > > > > > </property>
> > > > > > > >
> > > > > > > > <property>
> > > > > > > >  <name>db.fetch.interval.max</name>
> > > > > > > >  <value>3600</value>
> > > > > > > >  <description>The maximum number of seconds between re-fetches
> > > > > > > >  of a page (90 days). After this period every page in the db
> > > > > > > >  will be re-tried, no matter what is its status.
> > > > > > > >  value was 7776000
> > > > > > > >  </description>
> > > > > > > > </property>
> > > > > > >
> > > > > > > Vijaya Peters
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: MilleBii [mailto:[email protected]]
> > > > > > > Sent: Wednesday, December 09, 2009 1:27 PM
> > > > > > > To: [email protected]
> > > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > > >
> > > > > > > > Nutch only recrawls every 30 days by default. So set the
> > > > > > > > numberDays appropriately and it will recrawl. Read
> > > > > > > > nutch-default.xml to get the details.
> > > > > > >
> > > > > > > 2009/12/9, xiao yang <[email protected]>:
> > > > > > > >> What do you mean by "recrawl"?
> > > > > > > >> Does the following command do what you need?
> > > > > > > >> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > > > > > > >> Change the destination directory to a different one from the
> > > > > > > >> last crawl.
> > > > > > >>
> > > > > > >> On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya
> > > > > <[email protected]>
> > > > > > >> wrote:
> > > > > > > >>> I'm running Nutch 1.0 in windows.  How do I force Nutch to do a
> > > > > > > >>> complete recrawl?
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> thanks,
> > > > > > >>>
> > > > > > >>> - Vijaya
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> Vijaya Peters
> > > > > > >>> SRA International, Inc.
> > > > > > >>> 4350 Fair Lakes Court North
> > > > > > >>> Room 4004
> > > > > > >>> Fairfax, VA  22033
> > > > > > >>> Tel:  703-502-1184
> > > > > > >>>
> > > > > > >>> www.sra.com <http://www.sra.com/>
> > > > > > >>> Named to FORTUNE's "100 Best Companies to Work For" list for
> > > 10
> > > > > > >>> consecutive years
> > > > > > >>>
> > > > > > >>> P Please consider the environment before printing this
> > e-mail
> > > > > > >>>
> > > > > > >>> This electronic message transmission contains information
> > from
> > > > SRA
> > > > > > >>> International, Inc. which may be confidential, privileged or
> > > > > > >>> proprietary.  The information is intended for the use of the
> > > > > individual
> > > > > > >>> or entity named above.  If you are not the intended
> > recipient,
> > > > be
> > > > > aware
> > > > > > >>> that any disclosure, copying, distribution, or use of the
> > > > contents
> > > > > of
> > > > > > >>> this information is strictly prohibited.  If you have
> > received
> > > > > this
> > > > > > >>> electronic information in error, please notify us
> > immediately
> > > by
> > > > > > >>> telephone at 866-584-2143.
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > -MilleBii-
> > > > > > >
> > > > > >
> > > > > 
> > > > > 
> > > > > 
> > > > > -- 
> > > > > -MilleBii-