Nutch 0.8.1: pages seem not to be kept, but no error message found

2007-01-03 Thread Chee Wu

Hi all,
  I am using the crawl tool of Nutch 0.8.1 under Cygwin, trying to retrieve
pages from about two thousand websites, and the crawl process has been
running for nearly 20 hours.
   But during the past 10 hours, the fetch status has remained the same,
as shown below:
   TOTAL urls:  165212
   retry 0: 164110
   retry 1: 814
   retry 2: 288
   min score:   0.0
   avg score:   0.029228665
   max score:   2.333
   status 1 (DB_unfetched): 134960
   status 2 (DB_fetched):   27812
   status 3 (DB_gone):  2440
All the numbers in the status stay the same; the DB_fetched count is always
27812. From the console output and hadoop.log I can see that the
page-fetching process is running without any errors.

The size of the crawldb also does not change; it stays at 328 MB.
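
(For reference: statistics like the above are printed by the crawldb reader;
the crawldb path below is an assumption about the layout the crawl tool
creates.)

    bin/nutch readdb crawl/crawldb -stats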

I have tried to solve this problem during all the last week. any hints
for this problem is appreciated. Thanks and bow~~~


Re: Fetcher2

2007-01-22 Thread chee wu
Fetcher2 should be a great help for me, but it seems it can't be integrated
with Nutch 0.8.1.
Any advice on how to use it with 0.8.1?
- Original Message - 
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
To: 
Sent: Thursday, January 18, 2007 5:18 AM
Subject: Fetcher2


> Hi all,
> 
> I just committed a new implementation of the venerable fetcher, called 
> Fetcher2. It uses a producer/consumer model with a set of per-host 
> queues. Theoretically it should be able to achieve a much higher 
> throughput, especially for fetchlists with a lot of contention (many 
> urls from the same hosts).
> 
> It should be possible to achieve the same fetching rate with a smaller 
> number of threads, and most importantly to avoid the dreaded "Exceeded 
> http.max.delays: retry later" error.
> 
> It is available through "bin/nutch fetch2".
> 
> From the javadoc:
> 
> "A queue-based fetcher.
> 
> This fetcher uses a well-known model of one producer (a QueueFeeder) and 
> many consumers (FetcherThread-s).
> 
> QueueFeeder reads input fetchlists and populates a set of 
> FetchItemQueue-s, which hold FetchItem-s that describe the items to be 
> fetched. There are as many queues as there are unique hosts, but at any 
> given time the total number of fetch items in all queues is less than a 
> fixed number (currently set to a multiple of the number of threads).
> 
> As items are consumed from the queues, the QueueFeeder continues to add 
> new input items, so that their total count stays fixed (FetcherThread-s 
> may also add new items to the queues, e.g. as a result of redirection) - 
> until all input items are exhausted, at which point the number of items 
> in the queues begins to decrease. When this number reaches 0 the fetcher 
> will finish.
> 
> This fetcher implementation handles per-host blocking itself, instead of 
> delegating this work to protocol-specific plugins. Each per-host queue 
> handles its own "politeness" settings, such as the maximum number of 
> concurrent requests and crawl delay between consecutive requests - and 
> also a list of requests in progress, and the time the last request was 
> finished. As FetcherThread-s ask for new items to be fetched, queues may 
> return eligible items or null if for "politeness" reasons this host's 
> queue is not yet ready.
> 
> If there are still unfetched items on the queues, but none of the items 
> are ready, FetcherThread-s will spin-wait until either some items become 
> available, or a timeout is reached (at which point the Fetcher will 
> abort, assuming the task is hung)."
> 
> 
> -- 
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
>
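
A minimal sketch of the per-host queue described in the javadoc above,
written for illustration only (this is not the actual Fetcher2 source; the
class shape, names, and delay policy are assumptions):

    import java.util.LinkedList;

    // One queue per host; politeness decisions are made locally by the queue.
    public class FetchItemQueue {

      private final LinkedList<String> urls = new LinkedList<String>();
      private final long crawlDelayMs;  // minimum gap between requests to this host
      private final int maxConcurrent;  // max simultaneous requests to this host

      private int inProgress = 0;
      private long lastRequestDone = 0;

      public FetchItemQueue(long crawlDelayMs, int maxConcurrent) {
        this.crawlDelayMs = crawlDelayMs;
        this.maxConcurrent = maxConcurrent;
      }

      // Called by the QueueFeeder (and by threads following redirects).
      public synchronized void addFetchItem(String url) {
        urls.add(url);
      }

      // Returns the next eligible URL, or null if politeness says "not yet";
      // FetcherThread-s that get null from every queue spin-wait, as described.
      public synchronized String getFetchItem(long now) {
        if (urls.isEmpty()) return null;
        if (inProgress >= maxConcurrent) return null;           // too many in flight
        if (now - lastRequestDone < crawlDelayMs) return null;  // delay not yet met
        inProgress++;
        return urls.removeFirst();
      }

      // Called by a FetcherThread when its request completes.
      public synchronized void finishFetchItem(long now) {
        inProgress--;
        lastRequestDone = now;
      }
    }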

Re: Fetcher2

2007-01-23 Thread chee wu
Thanks! I successfully ported Fetcher2 to Nutch 0.8.1; it was pretty easy... I 
can share the code if anyone wants to use it.
- Original Message - 
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
To: 
Sent: Tuesday, January 23, 2007 12:09 AM
Subject: Re: Fetcher2


> chee wu wrote:
>> Fetcher2 should be a great help for me, but it seems it can't be 
>> integrated with Nutch 0.8.1.
>> Any advice on how to use it with 0.8.1? 
>>   
> 
> You would have to port it to Nutch 0.8.1 - e.g. change all Text 
> occurrences to UTF8, and most likely make other changes too ...
> 
> -- 
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
>
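
For the record, the port Andrzej describes is mostly mechanical. A
hypothetical fragment showing the kind of change involved (the class and
method here are invented for illustration):

    import org.apache.hadoop.io.UTF8;

    public class PortedFragment {
      // The trunk signature would be: void logUrl(org.apache.hadoop.io.Text url)
      void logUrl(UTF8 url) {  // Nutch 0.8.1 still keys records by UTF8
        System.out.println(url.toString());
      }
    }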

Re: Fetcher2

2007-01-24 Thread chee wu
Just added the port for 0.8.1 to NUTCH-339

- Original Message - 
From: "Armel T. Nene" <[EMAIL PROTECTED]>
To: 
Sent: Thursday, January 25, 2007 8:06 AM
Subject: RE: Fetcher2


> Chee,
> 
> Can you make the code available through Jira?
> 
> Thanks,
> 
> Armel
> 
> -
> Armel T. Nene
> iDNA Solutions
> Tel: +44 (207) 257 6124
> Mobile: +44 (788) 695 0483 
> http://blog.idna-solutions.com
> 

Re: Modified date in crawldb

2007-01-25 Thread chee wu
I also had this question a few days ago, and I am using Nutch 0.8.1. It seems 
the modified date will be used by NUTCH-61; you can find details at the link 
below: 
 http://issues.apache.org/jira/browse/NUTCH-61

I haven't studied this JIRA, and just wrote a simple function to accomplish 
this (a sketch follows the list below):
1. Retrieve all the date information contained in the page content, using a 
regular expression to identify the dates.
2. Choose the newest date found as the page's modified date.
3. Call the setModifiedTime() method of the CrawlDatum object in 
FetcherThread.output().
Maybe you can use a parse filter to separate this function from the core code.
I am also new to Nutch, so if anything is wrong, please feel free to point it 
out.
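
A rough sketch of steps 1 and 2 (illustrative only; the class name and the
single date format are assumptions, and real pages would need more patterns):

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ContentDateExtractor {

      // Matches dates like "2007-01-25"; real content needs more formats.
      private static final Pattern DATE =
          Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

      // Returns the newest date found in the content, or 0 if none was found.
      public static long newestDate(String content) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        long newest = 0L;
        Matcher m = DATE.matcher(content);
        while (m.find()) {
          try {
            Date d = fmt.parse(m.group());
            if (d.getTime() > newest) newest = d.getTime();
          } catch (ParseException e) {
            // ignore strings that merely look like dates
          }
        }
        return newest;
      }
    }

Step 3 would then be something like datum.setModifiedTime(newestDate(text))
at the point in FetcherThread.output() where the CrawlDatum is written.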


- Original Message - 
From: "Armel T. Nene" <[EMAIL PROTECTED]>
To: 
Sent: Thursday, January 25, 2007 7:52 PM
Subject: Modified date in crawldb


> Hi guys,
> 
> 
> 
> I am using Nutch 0.8.2-dev. I have noticed that the crawldb does not actually
> save the last modified date of files. I have run a crawl on my local file
> system and on the web. When I dumped the content of the crawldb for both
> crawls, the modified dates of the files were set to 01-Jan-1970 01:00:00. I
> don't know if it's intended to be this way or if it's a bug. Therefore my
> question is:
> 
> 
> 
> * How does the generator know which file to crawl again?
>   o Is it looking at the fetch time?
>   o Or at the modified date, which can be misleading?
> 
> 
> 
> Most HTTP responses include a last-modified date in their headers, and files
> on the file system all have a last-modified date. How come it's not stored
> in the crawldb?
> 
> 
> 
> Here is an extract from my 2 crawls:
> 
> http://dmoz.org/Arts/   Version: 4
> Status: 2 (DB_fetched)
> Fetch time: Thu Feb 22 12:45:43 GMT 2007
> Modified time: Thu Jan 01 01:00:00 GMT 1970
> Retries since fetch: 0
> Retry interval: 30.0 days
> Score: 0.013471641
> Signature: fe52a0bcb1071070689d0f661c168648
> Metadata: null
> 
> file:/C:/TeamBinder/AddressBook/GLOBAL/GLOBAL_fAdrBook_0121.xml   Version: 4
> Status: 2 (DB_fetched)
> Fetch time: Sat Feb 24 10:31:44 GMT 2007
> Modified time: Thu Jan 01 01:00:00 GMT 1970
> Retries since fetch: 0
> Retry interval: 30.0 days
> Score: 1.1035091E-4
> Signature: 57254d9ca2988ce1bf7f92b6239d6ebc
> Metadata: null
> 
> 
> 
> Looking forward to your reply.
> 
> Regards,
> 
> Armel
> 
> -
> Armel T. Nene
> iDNA Solutions
> Tel: +44 (207) 257 6124
> Mobile: +44 (788) 695 0483
> http://blog.idna-solutions.com

Re: Modified date in crawldb

2007-01-25 Thread chee wu
Armel,
   Sorry, I haven't tried this patch yet.

- Original Message - 
From: "Armel T. Nene" <[EMAIL PROTECTED]>
To: 
Sent: Thursday, January 25, 2007 11:07 PM
Subject: RE: Modified date in crawldb


> Chee,
> 
> Have you successfully applied NUTCH-61 to Nutch 0.8.1? I worked on that
> version and was able to apply it fully, but I was not entirely successful in
> running it with the XML parser plugin. If you have applied it successfully,
> let me know.
> 
> Regards,
> 
> Armel 
> -
> Armel T. Nene
> iDNA Solutions
> Tel: +44 (207) 257 6124
> Mobile: +44 (788) 695 0483 
> http://blog.idna-solutions.com
> 

Re: 'RegexIndexingFilter'

2007-01-29 Thread chee wu
I have had the same question, and I think there should be a field in the 
"Document" object to tell the indexer to skip indexing, but I didn't find 
one. So I used a rather crude workaround; hopefully others can suggest a 
better method.
1. Return null from the filter(Document doc, ...) method in your own 
IndexingFilter.
2. In Indexer.reduce(), add some statements to handle a null doc right after 
the statements where the filters are called. The modified code fragment 
might look like this:
try {
  // run indexing filters
  doc = this.filters.filter(doc, parse, (UTF8) key, fetchDatum, inlinks);
} catch (IndexingException e) {
  if (LOG.isWarnEnabled()) {
    LOG.warn("Error indexing " + key + ": " + e);
  }
  return;
}
if (doc == null) {
  if (LOG.isWarnEnabled()) {
    LOG.warn("Skip indexing: " + key);
  }
  return;
}
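
For step 1, a minimal sketch of such a filter (my own illustration, assuming
the 0.8-era IndexingFilter interface visible in the reduce() fragment above;
the class name and the regex are made up):

    import java.util.regex.Pattern;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.UTF8;
    import org.apache.lucene.document.Document;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.parse.Parse;

    public class RegexIndexingFilter implements IndexingFilter {

      // Only URLs matching this pattern are indexed; the pattern is illustrative.
      private static final Pattern ALLOWED = Pattern.compile(".*\\.(html?|pdf)$");

      private Configuration conf;

      public Document filter(Document doc, Parse parse, UTF8 url,
                             CrawlDatum datum, Inlinks inlinks)
          throws IndexingException {
        if (ALLOWED.matcher(url.toString()).matches()) {
          return doc;   // index as usual
        }
        return null;    // tells the patched Indexer.reduce() to skip this page
      }

      public void setConf(Configuration conf) { this.conf = conf; }

      public Configuration getConf() { return conf; }
    }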


- Original Message - 
From: "Tobias Zahn" <[EMAIL PROTECTED]>
To: 
Sent: Tuesday, January 30, 2007 2:57 AM
Subject: 'RegexIndexingFilter'


> Good evening!
> I have found out that it is impossible to index only some specific file
> types with nutch. Needing this feature, I thought of implementing an
> 'RegexIndexingFilter', if that would be the right thing to do so.
> I have read some sourcecode, but I couldn't find out how to tell the
> indexer that he shouldn't index a file.
> 
> Hoping that I am on the right way I hope for your opinions, ideas and
> your help.
> 
> TIA,
> Tobias Zahn
>

Re: log4j problem

2007-01-31 Thread chee wu
Setting the two Java arguments "-Dhadoop.log.file" and "-Dhadoop.log.dir" 
should fix your problem.
By the way, please don't put too many Chinese characters in your mail.
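
For example (a sketch only; the log directory path and the class being run
are assumptions about your setup):

    java -Dhadoop.log.dir=/path/to/nutch/logs \
         -Dhadoop.log.file=hadoop.log \
         org.apache.nutch.crawl.Crawl urls -dir crawl

The bin/nutch script normally sets these properties for you, which is why
running the same classes directly with a bare java command or from an IDE
often triggers the setFile(null,true) error shown below.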
 

- Original Message - 
From: "kauu" <[EMAIL PROTECTED]>
To: 
Sent: Wednesday, January 31, 2007 1:45 PM
Subject: log4j problem


Why does the following happen when I change nutch/conf/log4j.properties?

I just changed the first line from log4j.rootLogger=INFO,DRFA to
log4j.rootLogger=DEBUG,DRFA, like this:

**********

# RootLogger - DailyRollingFileAppender
#log4j.rootLogger=INFO,DRFA
log4j.rootLogger=DEBUG,DRFA

# Logging Threshold
log4j.threshhold=ALL

#special logging requirements for some commandline tools
log4j.logger.org.apache.nutch.crawl.Crawl=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.Injector=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.Generator=INFO,cmdstdout
log4j.logger.org.apache.nutch.fetcher.Fetcher=INFO,cmdstdout
log4j.logger.org.apache.nutch.parse.ParseSegment=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.CrawlDbReader=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.CrawlDbMerger=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.LinkDbReader=INFO,cmdstdout
log4j.logger.org.apache.nutch.segment.SegmentReader=INFO,cmdstdout
log4j.logger.org.apache.nutch.segment.SegmentMerger=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.CrawlDb=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.LinkDb=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.LinkDbMerger=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.Indexer=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.DeleteDuplicates=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.IndexMerger=INFO,cmdstdout
...

**********

In the console, it shows me the error below:

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: \ (???)
    at java.io.FileOutputStream.openAppend(Native Method)
    at java.io.FileOutputStream.<init>(Unknown Source)
    at java.io.FileOutputStream.<init>(Unknown Source)
    at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
    at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
    at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
    at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
    at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
    at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
    at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
    at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
    at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
    at org.apache.log4j.Logger.getLogger(Logger.java:104)
    at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
    at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)