Thanks, I checked: it was parsed. Still no answer as to why it was not indexed.

reinhard schwab wrote:
>
> yes, it's permanently redirected.
> you can also check the segment status of this url.
> here is an example:
>
> reinh...@thord:>bin/nutch readseg -get crawl/segments/20091028122455
> "http://www.krems.at/fotoalbum/fotoalbum.asp?albumid=37&big=1&seitenid=20"
>
> it will show you whether it is parsed and the extracted outlinks.
> it will show any data related to this url stored in the segment.
>
> regards
>
> caezar wrote:
>> Thanks, that was really helpful. I've moved forward but still have not
>> found the solution.
>> So the status of the initial URL
>> (http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm)
>> is:
>> Status: 5 (db_redir_perm)
>> Metadata: _pst_: moved(12), lastModified=0:
>> http://www.1stdirectory.com/Companies/1627406_Darwins_Catering_Limited.htm
>>
>> So that answers the question of why the initial page was not indexed: it
>> was redirected.
>> Now checking the status of the redirect target:
>> Status: 2 (db_fetched)
>>
>> So it was successfully fetched. But, according to the indexing log, it
>> still was not sent to the indexer!
>>
>> reinhard schwab wrote:
>>
>>> what is the db status of this url in your crawl db?
>>> if it is STATUS_DB_NOTMODIFIED,
>>> then that may be the reason.
>>> you can check it by looking the url up in your crawl db with
>>> reinh...@thord:>bin/nutch readdb <crawldb> -url <url>
>>>
>>> it has this status if it is recrawled and the signature does not
>>> change.
>>> the signature is the MD5 hash of the content.
>>>
>>> another reason may be that you have some indexing filters.
>>> i don't believe that's the reason here.
>>>
>>> regards
>>>
>>> kevin chen wrote:
>>>
>>>> I have had a similar experience.
>>>>
>>>> Reinhard Schwab responded with a possible fix. See the mail in this
>>>> group from Reinhard Schwab at
>>>> Sun, 25 Oct 2009 10:03:41 +0100 (05:03 EDT)
>>>>
>>>> I haven't had a chance to try it out yet.
>>>>
>>>> On Tue, 2009-10-27 at 07:34 -0700, caezar wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I've got a strange problem: nutch indexes far fewer URLs than it
>>>>> fetches. For example, this URL:
>>>>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
>>>>> I assume that it was fetched successfully because it is mentioned
>>>>> only once in the fetch logs:
>>>>> 2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher:
>>>>> fetching
>>>>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm
>>>>>
>>>>> But it was not sent to the indexer in the indexing phase (I'm using a
>>>>> custom NutchIndexWriter and it logs every page for which its write
>>>>> method is executed). What could be the possible reason? Is there a
>>>>> way to browse the crawldb to ensure that the page was really fetched?
>>>>> What else could I check?
>>>>>
>>>>> Thanks
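For anyone repeating the checks above on many URLs, here is a rough Python sketch of the same reasoning. It is not part of Nutch: the function names (`parse_readdb_record`, `content_signature`, `next_step`) are made up for illustration, the regular expressions assume the `Status:` / `Metadata: _pst_: moved(...)` layout exactly as pasted in this thread, and the MD5 helper only mirrors the description "the signature is the MD5 hash of the content" rather than calling Nutch's own signature code.

```python
import hashlib
import re


def parse_readdb_record(text):
    """Pull the status and any redirect target out of readdb-style text.

    Assumes the layout quoted in this thread, e.g.
    'Status: 5 (db_redir_perm)' followed by a '_pst_: moved(12), ...: <url>' line.
    """
    status = re.search(r"Status:\s*(\d+)\s*\((\w+)\)", text)
    if status is None:
        return None  # not a record in the assumed layout
    target = re.search(r"_pst_:\s*moved\(\d+\).*?:\s*(http\S+)", text, re.S)
    return {
        "code": int(status.group(1)),
        "name": status.group(2),
        "redirect_target": target.group(1) if target else None,
    }


def content_signature(content: bytes) -> str:
    """MD5 of the raw content, as a stand-in for the signature described above."""
    return hashlib.md5(content).hexdigest()


def next_step(record):
    """Suggest what to look at next; the hints paraphrase the thread."""
    if record is None:
        return "no status found; check that the url is in the crawl db"
    if record["name"] == "db_redir_perm":
        return "redirected; look up the target: %s" % record["redirect_target"]
    if record["name"] == "db_notmodified":
        return "signature unchanged since the last fetch"
    if record["name"] == "db_fetched":
        return "fetched; check the segment with readseg -get"
    return "status %s: not covered by this sketch" % record["name"]


if __name__ == "__main__":
    sample = (
        "Status: 5 (db_redir_perm)\n"
        "Metadata: _pst_: moved(12), lastModified=0: "
        "http://www.1stdirectory.com/Companies/1627406_Darwins_Catering_Limited.htm\n"
    )
    print(next_step(parse_readdb_record(sample)))
```

The idea is to save the text printed by `bin/nutch readdb <crawldb> -url <url>` and feed it to `parse_readdb_record`; when a redirect target comes back, repeat the lookup for that target, as was done by hand above.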
--
View this message in context: http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26093230.html
Sent from the Nutch - User mailing list archive at Nabble.com.