nutch fetch issue - empty content

Viral Shah Fri, 12 Sep 2008 15:32:37 -0700

Hello --

We are using Nutch to crawl HTML content for Wikipedia articles. We use a static list of URLs as input. To do this we've injected our list of URLs, set db.update.additions.allowed to false, and set the crawl depth to 1.

- We iterate over the output segment files using 'SequenceFile.Reader' and pull out the 'string' as well as the 'binary' form of the content.

        
        reader = SequenceFile.Reader(filesystem, Path(sys.argv[1]), job)
        key = reader.getKeyClass()()
        content = reader.getValueClass()()
        while reader.next(key, content):
            content_text = String(content.getContent(), "UTF-8").toString()
            content_binary = content.getContent()

- I get empty content for some URLs, but their status in the crawldb is set to 'db_fetched'. The value of content_text is "" and that of content_binary is array('b', []).

- This is completely random in terms of when it happens and the URLs involved.

- The failure is completely silent as far as I can tell: nothing about it appears in the logs.
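
One way to list the affected URLs is sketched below. This is only an illustration in plain Python, not code from our crawl: the helper name find_empty_records and the sample URLs are made up, and it assumes the (url, content) pairs have already been collected, for example as (key.toString(), content.getContent()) inside the reader loop above.

```python
def find_empty_records(records):
    """Return the URLs whose fetched content is empty.

    `records` is any iterable of (url, content_bytes) pairs.
    The helper name and the input shape are assumptions for illustration.
    """
    empty = []
    for url, content in records:
        # A missing value and a zero-length byte array both count as empty.
        if content is None or len(content) == 0:
            empty.append(url)
    return empty


if __name__ == "__main__":
    # Hypothetical sample data standing in for the segment records.
    sample = [
        ("http://en.wikipedia.org/wiki/A", b"<html>...</html>"),
        ("http://en.wikipedia.org/wiki/B", b""),
    ]
    for url in find_empty_records(sample):
        print(url)
```

The resulting list can then be compared against the crawldb entries marked 'db_fetched' to see which URLs are affected in a given run.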

Again, we are crawling Wikipedia, where it is easy to verify both the content and whether that content is accessible. We have tried manually fetching the problem URLs and everything looked fine.


Thank you,
Viral Shah
