Re: Nutch-Selenium Plugin Truncates Binary Data

Mohammad Al-Mohsin Sat, 21 Feb 2015 19:59:23 -0800

Hi Jiaxin,

In *HttpResponse.java*, you can check the 'Content-Type' header and then
decide whether to:

- Set the response content to the raw binary HTTP response (check out
protocol-httpclient's source code for hints),
or
- Continue executing *readPlainContent(url)*, which in turn sets the
'content' from the HTML body returned by the Selenium Firefox driver.

By the way, since nutch-selenium will be looking for the HTML body, I think
we should check specifically for the 'text/html' and 'application/xhtml+xml'
content types, rather than anything that starts with 'text/'.
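
To make the check concrete, here is a rough sketch of what I have in mind.
The class and method names (ContentTypeCheck, isHtmlLike) are made up for
illustration and are not part of nutch-selenium or Nutch. How to treat a
missing Content-Type header is also my assumption: here it falls back to the
raw HTTP bytes.

```java
import java.util.Locale;

class ContentTypeCheck {

    // Hypothetical helper (not part of nutch-selenium): returns true only for
    // the media types whose body Selenium can meaningfully render.
    static boolean isHtmlLike(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            // Assumption: with no Content-Type, keep the raw HTTP response.
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" and normalize case.
        int semicolon = contentTypeHeader.indexOf(';');
        String mediaType = (semicolon >= 0
                ? contentTypeHeader.substring(0, semicolon)
                : contentTypeHeader).trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html")
                || mediaType.equals("application/xhtml+xml");
    }

    public static void main(String[] args) {
        String[] samples = {
            "text/html; charset=UTF-8",
            "application/xhtml+xml",
            "text/plain",
            "image/png",
            "application/pdf"
        };
        for (String sample : samples) {
            String decision = isHtmlLike(sample)
                    ? "render with Selenium"
                    : "keep raw HTTP bytes";
            System.out.println(sample + " -> " + decision);
        }
    }
}
```

In *HttpResponse.java* the result of such a check would pick between the two
branches listed above: *readPlainContent(url)* when it returns true, and the
plain HTTP body otherwise.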


Best regards,
Mohammad Al-Mohsin

On Sat, Feb 21, 2015 at 12:05 PM, Jiaxin Ye <jiaxi...@usc.edu> wrote:

> Hi Mohammad,
>
> Hey, I think that's a very good idea! Any hints about how to change the
> selenium plugin? I am thinking about the same thing but struggling with
> how to do it.
>
> Best,
> Jiaxin
>
> On Sat, Feb 21, 2015 at 6:03 AM, Mohammad Al-Mohsin <m...@mem9.net> wrote:
>
>> I am using nutch-selenium <https://github.com/momer/nutch-selenium>
>> plugin and I also have Tesseract <https://wiki.apache.org/tika/TikaOCR>
>> installed for parsing text off images.
>>
>> While crawling with Nutch & selenium, I noticed that binary data (e.g.
>> images, PDFs) is always truncated, so parsing is skipped or fails. Here
>> is a sample of the log:
>>
>> *Content of size 800750 was truncated to 368. Content is truncated, parse
>> may fail!*
>> When I turn selenium off, parsing works fine and the content is not
>> truncated.
>>
>> I found that nutch-selenium gets the html body of whatever Firefox
>> displays. So even though you're fetching an image, selenium will just give
>> you the image html tag instead of the image itself.
>> e.g. <img src='xyz.png' height="400" width="600">
>>
>> To get around this, I modified the selenium plugin to handle the fetch
>> only if the Content-Type header starts with 'text', i.e. to catch
>> 'text/html'. Otherwise, if the content is not textual, it just returns the
>> content as protocol-httpclient does.
>>
>> Now binary data is parsed properly, and selenium still handles page
>> rendering with JavaScript.
>>
>> Is this the proper way to tackle this? What do you think?
>>
>>
>> Best regards,
>> Mohammad Al-Mohsin
>>
>
>
