Re: Nutch not recognizing html pages/images retrieved via php

2015-10-05 Thread Julien Nioche
Hi, What happens is that parse-tika is used by default but doesn't know what to do with that mime type. You can edit parse-plugins.xml and add an entry to map the mime type to the html parser. Obviously you'll need parse-html to be a
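
A rough sketch of what such a mapping in conf/parse-plugins.xml could look like. The mime type name below is only a placeholder assumption, since the thread does not state which type the server returns; use the one actually reported in hadoop.log. The plugin id and alias follow the stock parse-plugins.xml layout, where the alias for parse-html normally already exists.

```xml
<parse-plugins>
  <!-- Placeholder mime type (an assumption, not taken from the thread):
       replace it with the type reported in hadoop.log. -->
  <mimeType name="application/x-httpd-php">
    <plugin id="parse-html" />
  </mimeType>

  <aliases>
    <!-- Maps the plugin id to its parser extension; usually already
         present in the default file, so do not add it twice. -->
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
  </aliases>
</parse-plugins>
```

If the file already has a mimeType element for the type in question, adding the plugin line to that element should be enough.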

Nutch not recognizing html pages/images retrieved via php

2015-10-03 Thread Girish Rao
Hi, I am running a crawl on a website that serves pages and images via php. Nutch doesn’t seem to crawl these pages. I see the below in the hadoop.log: 2015-10-03 12:48:31,091 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.incl