I am using the nutch-selenium plugin <https://github.com/momer/nutch-selenium>
and I also have Tesseract <https://wiki.apache.org/tika/TikaOCR> installed
for extracting text from images.

While crawling with Nutch and Selenium, I noticed that binary content (e.g.
images, PDFs) is always truncated, so parsing is skipped or fails. Here is a
sample from the log:

*Content of size 800750 was truncated to 368. Content is truncated, parse
may fail!*

When I turn Selenium off, parsing works fine and the content is not
truncated.

I found that nutch-selenium returns the HTML body of whatever Firefox
displays. So even when you fetch an image, Selenium just gives you the
image's HTML tag instead of the image itself,
e.g. <img src='xyz.png' height="400" width="600">

To get around this, I modified the selenium plugin to handle the fetch only
when the Content-Type header starts with 'text' (i.e. to catch 'text/html').
Otherwise, if the content is not textual, it simply returns the content
as-is, the same way protocol-httpclient does.
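
For clarity, here is a minimal sketch of that gating logic. The class and
the fetchWithSelenium/fetchWithHttpClient methods are hypothetical
placeholders I made up for illustration, not the plugin's actual API:

    // Illustrative sketch only; fetchWithSelenium and fetchWithHttpClient
    // are hypothetical placeholders, not the plugin's real methods.
    public class ContentTypeGate {

        public byte[] fetch(String url, String contentType) {
            // Selenium can only hand back the rendered DOM, so use it
            // only for textual responses such as 'text/html'.
            if (contentType != null && contentType.startsWith("text")) {
                return fetchWithSelenium(url);
            }
            // For binary responses (images, PDFs, ...) fall back to a
            // plain HTTP fetch so the raw bytes reach the parser
            // untruncated, the same way protocol-httpclient behaves.
            return fetchWithHttpClient(url);
        }

        // Placeholders standing in for the plugin's real fetch paths.
        private byte[] fetchWithSelenium(String url) { return new byte[0]; }
        private byte[] fetchWithHttpClient(String url) { return new byte[0]; }
    }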

Now binary data is parsed properly, and Selenium still handles page
rendering with JavaScript.

Is this the proper way to tackle this? What do you think?


Best regards,
Mohammad Al-Mohsin
