Hi Mohammad,

I think that's a very good idea! Any hints on how to change the Selenium plugin? I have been thinking about the same thing but am struggling with how to do it.
Best,
Jiaxin

On Sat, Feb 21, 2015 at 6:03 AM, Mohammad Al-Mohsin <m...@mem9.net> wrote:

> I am using the nutch-selenium <https://github.com/momer/nutch-selenium>
> plugin, and I also have Tesseract <https://wiki.apache.org/tika/TikaOCR>
> installed for parsing text out of images.
>
> While crawling with Nutch and Selenium, I noticed that binary data (e.g.
> images, PDFs) are always truncated and therefore skip or fail parsing.
> Here is a sample of the log:
>
> *Content of size 800750 was truncated to 368. Content is truncated, parse
> may fail!*
>
> When I turn Selenium off, parsing works fine and the content is not
> truncated.
>
> I found that nutch-selenium returns the HTML body of whatever Firefox
> displays. So even when you fetch an image, Selenium gives you the image's
> HTML tag instead of the image itself, e.g.:
>
> <img src='xyz.png' height="400" width="600">
>
> To get around this, I modified the Selenium plugin to handle the fetch
> only if the Content-Type header starts with 'text' (i.e. to catch
> 'text/html'). Otherwise, if the content is not textual, it returns the
> content the same way protocol-httpclient does.
>
> Now binary data is parsed properly, and Selenium still handles pages
> that render via JavaScript.
>
> Is this the proper way to tackle this? What do you think?
>
> Best regards,
> Mohammad Al-Mohsin
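For anyone following along, the Content-Type dispatch Mohammad describes can be sketched roughly as below. This is a minimal illustration, not the actual nutch-selenium patch: the class and method names here are hypothetical, and in the real plugin the check would sit inside the protocol implementation before handing the URL to the WebDriver.

```java
// Hypothetical sketch of the dispatch rule described above:
// render through Selenium only when the response is textual; binary
// content (images, PDFs) must be returned raw, as protocol-httpclient
// does, so parsers such as Tika/Tesseract receive the full bytes.
public class ContentTypeDispatch {

    /** Returns true when the Content-Type header indicates textual
     *  content (e.g. "text/html"), so browser rendering is worthwhile. */
    public static boolean shouldUseSelenium(String contentType) {
        return contentType != null
            && contentType.toLowerCase().startsWith("text");
    }

    public static void main(String[] args) {
        System.out.println(shouldUseSelenium("text/html; charset=utf-8")); // true
        System.out.println(shouldUseSelenium("image/png"));                // false
        System.out.println(shouldUseSelenium("application/pdf"));          // false
    }
}
```

A null check matters in practice, since not every server sends a Content-Type header; falling back to the raw fetch in that case is the safe default.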