Just for the record, for anyone who runs into this in the future, I found a
solution. By the way, I forgot to mention that I'm using version 0.24.6.
Here's how I forced the response into being an HtmlResponse (seen here -
http://git.io/v3zoP - and very slightly adapted):
# imports needed at the top of the spider module
from scrapy import log
from scrapy.http import HtmlResponse

def parse(self, response):
    # Scrapy doesn't return an HtmlResponse for some sites, which makes
    # loading items fail. This forces the response to be HtmlResponse type.
    # As seen here: http://git.io/v3zoP
    if response.status == 200 and not isinstance(response, HtmlResponse):
        try:
            flags = response.flags
            if 'partial' in flags:
                flags.remove('partial')
            flags.append('fixed')
            response = HtmlResponse(response.url,
                                    headers=response.headers,
                                    body=response.body,
                                    flags=flags,
                                    request=response.request)
            log.msg('Response transformed into HtmlResponse for %s'
                    % response.url, level=log.WARNING)
        except Exception:
            pass
    l = WaLoader(item=WaItem(), response=response)
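
Side note: I think the same transformation could be done more concisely with
response.replace(), which as far as I can tell accepts a cls keyword argument
to rebuild the response as a different class. I haven't tested this on
0.24.6, so treat it as a sketch rather than a confirmed fix:

    if response.status == 200 and not isinstance(response, HtmlResponse):
        # replace() carries over url, status, headers, body, request and
        # flags, so only the response class needs to be overridden
        response = response.replace(cls=HtmlResponse)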
I traced it as far as
https://github.com/scrapy/scrapy/blob/master/scrapy/responsetypes.py, which
is where the response class gets decided, but I wasn't able to figure out
why it doesn't come back as an HtmlResponse in this case.
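
If anyone wants to dig further, the decision can be reproduced by hand in
scrapy shell through the responsetypes singleton from that file. My reading
of it (an assumption on my part, not a confirmed diagnosis) is that when
there's no Content-Type header, as with this site, Scrapy falls back to
sniffing the body and only picks HtmlResponse if it spots something like
"<html>" in the first few KB:

    from scrapy.responsetypes import responsetypes

    # Mirrors the order Scrapy uses when building the response object:
    # Content-Type/Content-Disposition headers first, then the URL, then the body
    cls = responsetypes.from_args(headers=response.headers,
                                  url=response.url,
                                  body=response.body)
    print(cls)  # for this site I'd expect plain scrapy.http.response.Response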
Cheers
soundjack
On Tuesday, August 11, 2015 at 7:40:37 PM UTC+1, soundjack wrote:
>
> Hi,
>
> I'm in the process of developing a spider that will run through some 320k
> different URLs, and while doing so I'm running into all sorts of situations. Right
> now I have some cases where Scrapy doesn't seem to detect the correct type
> of response and returns a Response object instead of an HtmlResponse one
> (I've been here:
> http://doc.scrapy.org/en/latest/topics/request-response.html?#response-objects
> ).
>
> In my parse method I'm actually not selecting anything from the page. The
> purpose of this spider is to send the whole body of the response to the
> Wappalyzer library (https://github.com/scrapinghub/wappalyzer-python/) to
> detect apps and technologies used. I'm just taking advantage of the Scrapy
> architecture for the crawling part, instead of building my own.
>
> Here's an example of a website where this happens:
> http://boucheriesaintroch.webs.com. If you do scrapy shell
> http://boucheriesaintroch.webs.com:
>
> In [1]: type(response)
> Out[1]: scrapy.http.response.Response
>
> In [2]: response.headers
> Out[2]:
> {'Date': 'Tue, 11 Aug 2015 16:57:27 GMT',
>  'Server': 'Webs.com/1.0',
>  'Set-Cookie': 'fwww=b4e6b552bf12b31f11fd753117ad163ea80e738c7fe8587bfd2eebc489eb9921; Path=/',
>  'X-Robots-Tag': 'nofollow'}
>
>
> In Chrome dev tools you can see that the page reports itself as text/html
> with ISO-8859-1 encoding:
>
> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
>
> So my question is: why doesn't Scrapy give me an HtmlResponse object back
> and how can I fix this?
>
> Thanks for any tips!
>
>
>