Hi Jiaxin,

In *HttpResponse.java*, you can check the 'Content-Type' header and then decide whether to:
- Set the response content to be the binary HTTP response (check out protocol-httpclient's source code for hints), or
- Continue executing *readPlainContent(url)*, which in turn will set the 'content' from the HTML body via the Selenium Firefox driver.

By the way, since nutch-selenium will be looking for the HTML body, I think we should check for the 'text/html' and 'application/xhtml+xml' content types, not just anything that starts with 'text/...'.

Best regards,
Mohammad Al-Mohsin

On Sat, Feb 21, 2015 at 12:05 PM, Jiaxin Ye <jiaxi...@usc.edu> wrote:

> Hi Mohammad,
>
> Hey, I think that's a very good idea! Any hints about how to change the
> selenium plugin? I am thinking about the same thing but struggling with
> how to do it.
>
> Best,
> Jiaxin
>
> On Sat, Feb 21, 2015 at 6:03 AM, Mohammad Al-Mohsin <m...@mem9.net> wrote:
>
>> I am using the nutch-selenium <https://github.com/momer/nutch-selenium>
>> plugin and I also have Tesseract <https://wiki.apache.org/tika/TikaOCR>
>> installed for parsing text off images.
>>
>> While crawling with Nutch & selenium, I noticed that binary data (e.g.
>> images, pdf) is always truncated and thus skips or fails parsing. Here is
>> a sample of the log:
>>
>> *Content of size 800750 was truncated to 368. Content is truncated, parse
>> may fail!*
>>
>> When I turn selenium off, parsing works fine and the content is not
>> truncated.
>>
>> I found that nutch-selenium gets the html body of whatever Firefox
>> displays. So even though you're fetching an image, selenium will just give
>> you the image html tag instead of the image itself,
>> e.g. <img src='xyz.png' height="400" width="600">
>>
>> To get around this, I modified the selenium plugin to handle the fetch
>> only if the Content-Type header starts with 'text', i.e. to catch
>> 'text/html'. Otherwise, if the content is not textual, it just returns the
>> content as protocol-httpclient does.
>>
>> Now, I am getting binary data properly parsed and also getting selenium
>> to handle page rendering with javascript.
>>
>> Is this the proper way to tackle this? What do you think?
>>
>>
>> Best regards,
>> Mohammad Al-Mohsin
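
P.S. Here is a rough sketch of the Content-Type check discussed in this thread, in plain Java. The class and method names (ContentTypeCheck, isHtml) are made up for illustration and are not part of nutch-selenium or Nutch; it only shows one way to match 'text/html' and 'application/xhtml+xml' while ignoring parameters such as a charset. How the header value is obtained inside HttpResponse.java is not shown here.

```java
import java.util.Locale;

// Hypothetical helper, not part of nutch-selenium or Nutch.
final class ContentTypeCheck {

    private ContentTypeCheck() {}

    /** Returns true when the Content-Type value names an HTML or XHTML document. */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // no header: treat as non-HTML
        }
        // Drop parameters such as "; charset=UTF-8" and normalize case.
        int semicolon = contentType.indexOf(';');
        String mediaType = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html")
                || mediaType.equals("application/xhtml+xml");
    }

    public static void main(String[] args) {
        // HTML would go to the Selenium path, anything else to the binary path.
        System.out.println(isHtml("text/html; charset=UTF-8"));
        System.out.println(isHtml("image/png"));
    }
}
```

With a check like this, the fetch would go through readPlainContent(url) only when isHtml returns true, and the raw HTTP response bytes would be kept otherwise.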