Hi, I'm developing a spider that will crawl some 320k different URLs, and I'm running into various edge cases along the way. Right now I have some cases where Scrapy doesn't seem to detect the correct response type and returns a plain Response object instead of an HtmlResponse (I've already read http://doc.scrapy.org/en/latest/topics/request-response.html?#response-objects ).
In my parse method I'm not actually selecting anything from the page. The purpose of this spider is to send the whole body of the response to the Wappalyzer library (https://github.com/scrapinghub/wappalyzer-python/) to detect the apps and technologies in use. I'm just taking advantage of the Scrapy architecture for the crawling part instead of building my own.

Here's an example of a website where this happens: http://boucheriesaintroch.webs.com

If you run scrapy shell http://boucheriesaintroch.webs.com you get:

    In [1]: type(response)
    Out[1]: scrapy.http.response.Response

    In [2]: response.headers
    Out[2]:
    {'Date': 'Tue, 11 Aug 2015 16:57:27 GMT',
     'Server': 'Webs.com/1.0',
     'Set-Cookie': 'fwww=b4e6b552bf12b31f11fd753117ad163ea80e738c7fe8587bfd2eebc489eb9921; Path=/',
     'X-Robots-Tag': 'nofollow'}

The response headers above don't include a Content-Type, but in Chrome dev tools you can see that the page reports itself as text/html with ISO-8859-1 encoding through a meta tag:

    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

So my question is: why doesn't Scrapy give me an HtmlResponse object back, and how can I fix this?

Thanks for any tips!
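As a rough illustration of the mismatch described above (no Content-Type in the headers, but a content type declared inside the page), here is a minimal sketch that looks for the meta declaration in the raw body bytes. It uses only the standard library; the helper name sniff_meta_content_type and the 5000-byte limit are made up for this example, and this is not how Scrapy itself decides on the response class.

```python
import re

# Hypothetical helper, not part of Scrapy. It looks for a
# <meta http-equiv="Content-Type" content="..."> declaration near the
# start of a body. It only handles the attribute order shown in the
# post (http-equiv first, then content).
_META_RE = re.compile(
    rb'<meta[^>]+http-equiv=["\']?content-type["\']?[^>]*'
    rb'content=["\']([^"\';\s]+)(?:\s*;\s*charset=([^"\'\s>]+))?',
    re.IGNORECASE,
)


def sniff_meta_content_type(body, limit=5000):
    """Return (mimetype, charset) declared in a meta tag, or (None, None)."""
    match = _META_RE.search(body[:limit])
    if match is None:
        return None, None
    mimetype = match.group(1).decode("ascii").lower()
    charset = match.group(2).decode("ascii").lower() if match.group(2) else None
    return mimetype, charset


if __name__ == "__main__":
    sample = (
        b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=ISO-8859-1"></head></html>'
    )
    print(sniff_meta_content_type(sample))
```

In a spider callback, a check like this could be applied to response.body when the response is not a TextResponse. If it reports text/html, one option that should work is rebuilding the object as an HtmlResponse, for example with response.replace(cls=HtmlResponse); I have not tested that against this particular site, so treat it as a suggestion rather than a confirmed fix.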
