Hi, you shouldn't open the .crc file; you have to open the other one, which is part-00000. Use vi to edit part-00000. If you can't find this file, then your dump failed; just check the logs/hadoop.log file.
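If it helps, here is a rough sketch of how the dump could be inspected from a cygwin or unix shell without opening an editor at all. The function name dump_fetch_times is made up for illustration, and it assumes the dump directory (whole_db in this thread) holds plain-text part-* files next to checksum files ending in .crc; adjust it if your layout differs.

```shell
# Hypothetical helper (not part of Nutch): print the URL, fetch time and
# retry interval lines from a "readdb -dump" output directory.
# Assumption: the directory holds plain-text part-* files (e.g. part-00000)
# plus checksum files ending in .crc, which are skipped.
dump_fetch_times() {
    dump_dir="$1"
    found=0
    for f in "$dump_dir"/part-*; do
        # Skip the unexpanded glob and anything that is not a regular file.
        [ -f "$f" ] || continue
        # Never read checksum files; they are binary.
        case "$f" in
            *.crc) continue ;;
        esac
        found=1
        # "|| true" keeps going when a part file has no matching lines.
        grep -E '^(https?://|[[:space:]]*(Fetch time|Retry interval):)' "$f" || true
    done
    if [ "$found" -eq 0 ]; then
        echo "no part-* file in $dump_dir; check logs/hadoop.log" >&2
        return 1
    fi
}
```

It would be called as `dump_fetch_times whole_db` after running the readdb dump. A non-zero return means no part file was found, which matches the "dump failed" case above.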
> Subject: RE: how to force nutch to do a recrawl
> Date: Fri, 11 Dec 2009 09:14:26 -0500
> From: [email protected]
> To: [email protected]
>
> Adam,
> I'm using cygwin to run the scripts. I use EditPlus to edit the files. But
> EditPlus won't allow me to edit the crc file. I'll see if I can ftp the file
> to a unix machine.
>
> Vijaya Peters
> SRA International, Inc.
> 12500 Fair Lakes Circle
> Room 3507
> Fairfax, VA 22033
> Tel: 703-222-9207
>
> www.sra.com
> This electronic message transmission contains information from SRA
> International, Inc. which may be confidential, privileged or proprietary.
> The information is intended for the use of the individual or entity named
> above. If you are not the intended recipient, be aware that any disclosure,
> copying, distribution, or use of the contents of this information is strictly
> prohibited. If you have received this electronic information in error,
> please notify us immediately by telephone at 866-584-2143.
>
> -----Original Message-----
> From: BELLINI ADAM [mailto:[email protected]]
> Sent: Thu 12/10/2009 6:43 PM
> To: [email protected]
> Subject: RE: how to force nutch to do a recrawl
>
> but how are you running sh scripts...
> you have to use cygwin to be able to edit linux files
>
> > Subject: RE: how to force nutch to do a recrawl
> > Date: Thu, 10 Dec 2009 16:09:13 -0500
> > From: [email protected]
> > To: [email protected]
> >
> > Adam,
> > I'm on windows unfortunately!! I'm using cygdrive, but it doesn't
> > recognize vi. Any idea for opening it in windows? Notepad didn't work
> > either.
> >
> > Vijaya Peters
> > SRA International, Inc.
> > 4350 Fair Lakes Court North
> > Room 4004
> > Fairfax, VA 22033
> > Tel: 703-502-1184
> >
> > www.sra.com
> > Named to FORTUNE's "100 Best Companies to Work For" list for 10
> > consecutive years
> > Please consider the environment before printing this e-mail
> >
> > -----Original Message-----
> > From: BELLINI ADAM [mailto:[email protected]]
> > Sent: Thursday, December 10, 2009 4:01 PM
> > To: [email protected]
> > Subject: RE: how to force nutch to do a recrawl
> >
> > just use vi or vim
> >
> > i use vi to edit the file
> >
> > > Subject: RE: how to force nutch to do a recrawl
> > > Date: Thu, 10 Dec 2009 15:58:24 -0500
> > > From: [email protected]
> > > To: [email protected]
> > >
> > > Adam,
> > > What do I use to open a CRC file? I tried QuickSFV. Thanks in
> > > advance!
> > >
> > > Vijaya Peters
> > >
> > > -----Original Message-----
> > > From: BELLINI ADAM [mailto:[email protected]]
> > > Sent: Thursday, December 10, 2009 3:48 PM
> > > To: [email protected]
> > > Subject: RE: how to force nutch to do a recrawl
> > >
> > > it will not dump to the console!
> > > whole_db is a folder and you have to edit the file you will find in
> > > this folder
> > >
> > > > Subject: RE: how to force nutch to do a recrawl
> > > > Date: Thu, 10 Dec 2009 14:26:30 -0500
> > > > From: [email protected]
> > > > To: [email protected]
> > > >
> > > > Adam,
> > > > I tried running that command and get the following (it created a
> > > > whole_db directory, but it's not dumping out the contents to the
> > > > console):
> > > >
> > > > $ bin/nutch readdb crawl/crawldb/ -dump whole_db
> > > > CrawlDb dump: starting
> > > > CrawlDb db: crawl/crawldb/
> > > > CrawlDb dump: done
> > > >
> > > > Vijaya Peters
> > > >
> > > > -----Original Message-----
> > > > From: BELLINI ADAM [mailto:[email protected]]
> > > > Sent: Thursday, December 10, 2009 1:40 PM
> > > > To: [email protected]
> > > > Subject: RE: how to force nutch to do a recrawl
> > > >
> > > > hi,
> > > > check the fetch time in your crawldb...you can dump all the crawldb
> > > > like this:
> > > >
> > > > ./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db
> > > >
> > > > entries will look like this:
> > > >
> > > > http://www.YOUR_URL_TO_FETCH
> > > > Status: 2 (db_fetched)
> > > > Fetch time: Thu Dec 10 09:19:18 EST 2009
> > > > Modified time: Wed Dec 31 19:00:00 EST 1969
> > > > Retries since fetch: 0
> > > > Retry interval: 18000 seconds (0 days)
> > > > Score: 0.0014977538
> > > > Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
> > > > Metadata: _pst_: success(1), lastModified=0
> > > >
> > > > as you see, the next time the page will be fetched is in the fetch
> > > > time: 'Fetch time: Thu Dec 10 09:19:18 EST 2009'
> > > > and check the retry interval: it should be your 3600.
> > > >
> > > > hope it will help
> > > >
> > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > Date: Wed, 9 Dec 2009 16:06:58 -0500
> > > > > From: [email protected]
> > > > > To: [email protected]
> > > > >
> > > > > Okay. I'll dig a little deeper. I saw a few scripts that people
> > > > > had created, but I couldn't get them to work.
> > > > >
> > > > > Thanks much.
> > > > >
> > > > > Vijaya Peters
> > > > >
> > > > > -----Original Message-----
> > > > > From: MilleBii [mailto:[email protected]]
> > > > > Sent: Wednesday, December 09, 2009 4:05 PM
> > > > > To: [email protected]
> > > > > Subject: Re: how to force nutch to do a recrawl
> > > > >
> > > > > I don't think you can use the nutch crawl command to do that; this
> > > > > is a one-stop-shop command.
> > > > > You probably want to use individual commands.
> > > > > Type nutch generate to get the help and you will see the option
> > > > > -adddays. Read this page on the wiki to get a feel for how you
> > > > > should do it:
> > > > > http://wiki.apache.org/nutch/Crawl
> > > > >
> > > > > 2009/12/9 Peters, Vijaya <[email protected]>
> > > > >
> > > > > > I didn't see a setting to override in crawl-urlfilter. How do I
> > > > > > set numberDays? I have regular expressions to include/exclude
> > > > > > certain extensions and certain urls, but that's all I have in
> > > > > > there.
> > > > > >
> > > > > > Please send me an example and I'll give it a try.
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > Vijaya Peters
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: xiao yang [mailto:[email protected]]
> > > > > > Sent: Wednesday, December 09, 2009 1:41 PM
> > > > > > To: [email protected]
> > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > >
> > > > > > What about the configuration in crawl-urlfilter.txt?
> > > > > >
> > > > > > On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya
> > > > > > <[email protected]> wrote:
> > > > > > > I tried that too.
> > > > > > > In nutch-site.xml, I added the below, but this had no effect.
> > > > > > >
> > > > > > > <property>
> > > > > > >   <name>db.default.fetch.interval</name>
> > > > > > >   <value>0</value>
> > > > > > >   <description>(DEPRECATED) The default number of days between
> > > > > > >   re-fetches of a page. value was 30
> > > > > > >   </description>
> > > > > > > </property>
> > > > > > >
> > > > > > > <property>
> > > > > > >   <name>db.fetch.interval.default</name>
> > > > > > >   <value>3600</value>
> > > > > > >   <description>The default number of seconds between re-fetches
> > > > > > >   of a page (30 days). value was 2592000 (30 days)
> > > > > > >   </description>
> > > > > > > </property>
> > > > > > >
> > > > > > > <property>
> > > > > > >   <name>db.fetch.interval.max</name>
> > > > > > >   <value>3600</value>
> > > > > > >   <description>The maximum number of seconds between re-fetches
> > > > > > >   of a page (90 days). After this period every page in the db
> > > > > > >   will be re-tried, no matter what is its status. value was
> > > > > > >   7776000
> > > > > > >   </description>
> > > > > > > </property>
> > > > > > >
> > > > > > > Vijaya Peters
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: MilleBii [mailto:[email protected]]
> > > > > > > Sent: Wednesday, December 09, 2009 1:27 PM
> > > > > > > To: [email protected]
> > > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > > >
> > > > > > > Nutch only recrawls every 30 days by default. So set the
> > > > > > > numberDays accordingly and it will recrawl. Read
> > > > > > > nutch-default.xml to get the details.
> > > > > > >
> > > > > > > 2009/12/9, xiao yang <[email protected]>:
> > > > > > >> What do you mean by "recrawl"?
> > > > > > >> Does the following command meet your needs?
> > > > > > >> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > > > > > >> Change the destination directory to a different one from the
> > > > > > >> last crawl.
> > > > > > >>
> > > > > > >> On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya
> > > > > > >> <[email protected]> wrote:
> > > > > > >>> I'm running Nutch 1.0 in windows. How do I force Nutch to do
> > > > > > >>> a complete recrawl?
> > > > > > >>>
> > > > > > >>> thanks,
> > > > > > >>>
> > > > > > >>> - Vijaya
> > > > > > >>>
> > > > > > >>> Vijaya Peters
> > > > > > >>
> > > > > > >
> > > > > > > --
> > > > > > > -MilleBii-
> > > > >
> > > > > --
> > > > > -MilleBii-
