Hi Mohammad,

I think that's a very good idea! Any hints on how to change the Selenium plugin? I have been thinking about the same thing but am struggling with how to do it.
Best,
Jiaxin

On Sat, Feb 21, 2015 at 6:03 AM, Mohammad Al-Mohsin <m...@mem9.net> wrote:

> I am using the nutch-selenium <https://github.com/momer/nutch-selenium>
> plugin, and I also have Tesseract <https://wiki.apache.org/tika/TikaOCR>
> installed for parsing text out of images.
>
> While crawling with Nutch and Selenium, I noticed that binary data (e.g.
> images, PDFs) are always truncated and therefore skip or fail parsing.
> Here is a sample of the log:
>
> *Content of size 800750 was truncated to 368. Content is truncated, parse
> may fail!*
>
> When I turn Selenium off, parsing works fine and the content is not
> truncated.
>
> I found that nutch-selenium returns the HTML body of whatever Firefox
> displays. So even when you fetch an image, Selenium gives you the image's
> HTML tag instead of the image itself, e.g.:
>
> <img src='xyz.png' height="400" width="600">
>
> To get around this, I modified the Selenium plugin to handle the fetch
> only if the Content-Type header starts with 'text' (i.e. to catch
> 'text/html'). Otherwise, if the content is not textual, it returns the
> content the same way protocol-httpclient does.
>
> Now binary data is parsed properly, and Selenium still handles pages
> that render via JavaScript.
>
> Is this the proper way to tackle this? What do you think?
>
> Best regards,
> Mohammad Al-Mohsin
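For anyone following along, the Content-Type dispatch Mohammad describes can be sketched roughly as below. This is a minimal illustration, not the actual nutch-selenium patch: the class and method names here are hypothetical, and in the real plugin the check would sit inside the protocol implementation before handing the URL to the WebDriver.

```java
// Hypothetical sketch of the dispatch rule described above:
// render through Selenium only when the response is textual; binary
// content (images, PDFs) must be returned raw, as protocol-httpclient
// does, so parsers such as Tika/Tesseract receive the full bytes.
public class ContentTypeDispatch {

    /** Returns true when the Content-Type header indicates textual
     *  content (e.g. "text/html"), so browser rendering is worthwhile. */
    public static boolean shouldUseSelenium(String contentType) {
        return contentType != null
            && contentType.toLowerCase().startsWith("text");
    }

    public static void main(String[] args) {
        System.out.println(shouldUseSelenium("text/html; charset=utf-8")); // true
        System.out.println(shouldUseSelenium("image/png"));                // false
        System.out.println(shouldUseSelenium("application/pdf"));          // false
    }
}
```

A null check matters in practice, since not every server sends a Content-Type header; falling back to the raw fetch in that case is the safe default.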