Just for the record, for anyone who runs into this in the future: I found a 
solution. By the way, I forgot to mention that I'm using version 0.24.6. 
Here's how I forced the response into being an HtmlResponse (seen here - 
http://git.io/v3zoP - and very slightly adapted):

    # Needs these imports at the top of the spider module:
    #   from scrapy import log
    #   from scrapy.http import HtmlResponse

    def parse(self, response):
        # Scrapy doesn't return an HtmlResponse for some sites, which makes
        # loading items fail. This forces the response to be HtmlResponse type.
        # As seen here: http://git.io/v3zoP
        if response.status == 200 and not isinstance(response, HtmlResponse):
            try:
                flags = response.flags
                if 'partial' in flags:
                    flags.remove('partial')
                flags.append('fixed')
                response = HtmlResponse(response.url,
                                        headers=response.headers,
                                        body=response.body,
                                        flags=flags,
                                        request=response.request)
                log.msg('Response transformed into HtmlResponse for %s'
                        % response.url, level=log.WARNING)
            except Exception:
                pass

        l = WaLoader(item=WaItem(), response=response)

I was able to trace this as far as 
https://github.com/scrapy/scrapy/blob/master/scrapy/responsetypes.py - 
which is where the response type is decided - but I wasn't able to figure 
out why in this case it didn't come back as an HtmlResponse.
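
One thing worth noticing: the headers in the shell session quoted below don't 
include a Content-Type entry at all. Here's a simplified, self-contained 
sketch (my own illustration, not Scrapy's actual code - the table and 
function name are invented) of the kind of header-based lookup that 
responsetypes.py performs, showing how a missing Content-Type can fall 
back to the plain Response class:

```python
# Map of MIME types to response class names. This table is a tiny
# illustrative subset; Scrapy's real mapping lives in responsetypes.py.
CLASS_BY_MIMETYPE = {
    'text/html': 'HtmlResponse',
    'application/xml': 'XmlResponse',
    'text/plain': 'TextResponse',
}

def response_class_from_headers(headers):
    """Pick a response class name from the Content-Type header, if any."""
    content_type = headers.get('Content-Type', '')
    # Strip parameters such as '; charset=ISO-8859-1' before the lookup.
    mimetype = content_type.split(';')[0].strip().lower()
    # No (or unknown) Content-Type -> fall back to the base class.
    return CLASS_BY_MIMETYPE.get(mimetype, 'Response')

# A text/html Content-Type maps to HtmlResponse:
print(response_class_from_headers(
    {'Content-Type': 'text/html; charset=ISO-8859-1'}))  # HtmlResponse

# But headers like the ones in the shell session below carry no
# Content-Type at all, so the lookup falls back to plain Response:
print(response_class_from_headers({'Server': 'Webs.com/1.0'}))  # Response
```

This would at least be consistent with what the shell shows: the meta tag 
inside the body says text/html, but a header-based lookup never sees it.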

Cheers
soundjack

On Tuesday, August 11, 2015 at 7:40:37 PM UTC+1, soundjack wrote:
>
> Hi,
>
> I'm in the process of developing a spider that will run through some 320k 
> different URLs, and along the way I'm running into various situations. Right 
> now I have some cases where Scrapy doesn't seem to detect the correct type 
> of response and returns a Response object instead of an HtmlResponse one 
> (I've been here: 
> http://doc.scrapy.org/en/latest/topics/request-response.html?#response-objects
> ).
>
> In my parse method I'm actually not selecting anything from the page. The 
> purpose of this spider is to send the whole body of the response to the 
> Wappalyzer library (https://github.com/scrapinghub/wappalyzer-python/) to 
> detect the apps and technologies in use. I'm just taking advantage of the 
> Scrapy architecture for the crawling part, instead of building my own.
>
> Here's an example of a website where this happens: 
> http://boucheriesaintroch.webs.com. If you do scrapy shell 
> http://boucheriesaintroch.webs.com:
>
> In [1]: type(response)
> Out[1]: scrapy.http.response.Response
>
> In [2]: response.headers
> Out[2]:
> {'Date': 'Tue, 11 Aug 2015 16:57:27 GMT',
>  'Server': 'Webs.com/1.0',
>  'Set-Cookie': 'fwww=b4e6b552bf12b31f11fd753117ad163ea80e738c7fe8587bfd2eebc489eb9921; Path=/',
>  'X-Robots-Tag': 'nofollow'}
>
> In Chrome dev tools you can see that the page reports itself as text/html 
> with ISO-8859-1 encoding:
>
> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
>
> So my question is: why doesn't Scrapy give me an HtmlResponse object back, 
> and how can I fix this?
>
> Thanks for any tips!
