Hi,

I'm developing a spider that will crawl some 320k different URLs, and 
I'm running into various edge cases along the way. Right now I have some 
cases where Scrapy doesn't seem to detect the correct type of response 
and returns a plain Response object instead of an HtmlResponse one (I've 
already read 
http://doc.scrapy.org/en/latest/topics/request-response.html?#response-objects
).

In my parse method I'm actually not selecting anything from the page. The 
purpose of this spider is to send the whole body of the response to the 
Wappalyzer library (https://github.com/scrapinghub/wappalyzer-python/) to 
detect apps and technologies used. I'm just taking advantage of the Scrapy 
architecture for the crawling part, instead of building my own.

Here's an example of a website where this happens: 
http://boucheriesaintroch.webs.com. If you run scrapy shell 
http://boucheriesaintroch.webs.com, you get:

In [1]: type(response)
Out[1]: scrapy.http.response.Response
In [2]: response.headers
Out[2]:
{'Date': 'Tue, 11 Aug 2015 16:57:27 GMT',
 'Server': 'Webs.com/1.0',
 'Set-Cookie': 'fwww=b4e6b552bf12b31f11fd753117ad163ea80e738c7fe8587bfd2eebc489eb9921; Path=/',
 'X-Robots-Tag': 'nofollow'}


The response headers above contain no Content-Type, but in Chrome dev 
tools you can see that the page declares itself as text/html with 
ISO-8859-1 encoding in a meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

So my question is: why doesn't Scrapy give me an HtmlResponse object back, 
and how can I fix this?
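
One thing that can be checked independently of Scrapy is whether the body 
itself declares a content type when the header is missing. The following is 
a minimal sketch, assuming a plain dict of headers and the raw body bytes. 
sniff_content_type and its regex are hypothetical helpers for illustration, 
not part of Scrapy, and the regex assumes http-equiv appears before content 
in the meta tag, as on the page above:

```python
import re

# Matches <meta http-equiv="Content-Type" content="..."> in raw bytes.
# Assumption: http-equiv comes before content, as in the page above.
_META_RE = re.compile(
    rb'<meta[^>]+http-equiv=["\']?content-type["\']?[^>]+content=["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def sniff_content_type(headers, body):
    """Return (mime_type, charset); either value may be None.

    The Content-Type header wins if present. Otherwise the first 4096
    bytes of the body are searched for a Content-Type meta tag.
    """
    value = None
    for name, header_value in headers.items():
        if name.lower() == "content-type":
            value = header_value
            break
    if value is None:
        match = _META_RE.search(body[:4096])
        if match is None:
            return None, None
        value = match.group(1).decode("ascii", "replace")

    # Split "text/html; charset=ISO-8859-1" into its MIME type and parameters.
    mime, _, params = value.partition(";")
    charset = None
    for param in params.split(";"):
        key, _, val = param.strip().partition("=")
        if key.lower() == "charset" and val:
            charset = val.strip("\"'")
    return mime.strip().lower(), charset
```

If this reports text/html for a plain Response, one possible workaround 
(untested on this site) would be to rebuild the object in the parse 
callback with response.replace(cls=HtmlResponse, encoding=charset). That 
still would not explain why the detection fails in the first place.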

Thanks for any tips!

