> Try a real USER_AGENT setting.

I added the following to settings.py, using the User-Agent string I captured
with Fiddler on my Windows desktop:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
Same result:
2016-06-03 20:37:19.777710 [-] Splash version: 2.1
2016-06-03 20:37:19.781558 [-] Qt 5.5.1, PyQt 5.5.1, WebKit 538.1, sip 4.17,
Twisted 16.1.1, Lua 5.2
2016-06-03 20:37:19.781864 [-] Python 3.4.3 (default, Oct 14 2015, 20:28:29)
[GCC 4.8.4]
2016-06-03 20:37:19.782499 [-] Open files limit: 1048576
2016-06-03 20:37:19.782676 [-] Can't bump open files limit
2016-06-03 20:37:19.903300 [-] Xvfb is started: ['Xvfb', ':1', '-screen', '0',
'1024x768x24']
2016-06-03 20:37:20.115657 [-] proxy profiles support is enabled, proxy
profiles path: /etc/splash/proxy-profiles
2016-06-03 20:37:20.319444 [-] verbosity=1
2016-06-03 20:37:20.319719 [-] slots=50
2016-06-03 20:37:20.320095 [-] argument_cache_max_entries=500
2016-06-03 20:37:20.320618 [-] Web UI: enabled, Lua: enabled (sandbox:
enabled)
2016-06-03 20:37:20.323905 [-] Site starting on 8050
2016-06-03 20:37:20.324129 [-] Starting factory <twisted.web.server.Site
object at 0x7f279ce6fe48>
2016-06-03 20:37:24.992726 [-] "172.17.0.1" - - [03/Jun/2016:20:37:24
+0000] "GET /robots.txt HTTP/1.1" 404 153 "-" "Mozilla/5.0 (Windows NT
10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/50.0.2661.102 Safari/537.36"
process 1: D-Bus library appears to be incorrectly set up; failed to read
machine uuid: Failed to open "/etc/machine-id": No such file or directory
See the manual page for dbus-uuidgen to correct this issue.
2016-06-03 20:37:35.964753 [events] {"timestamp": 1464986255, "_id":
139808112914160, "active": 0, "args": {"iframes": true, "headers":
{"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
"Referer": "https://sapui5.hana.ondemand.com/", "Accept-Language": "en",
"Accept":
"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding": "gzip,deflate"}, "uid": 139808112914160, "wait": 10.0,
"html": true, "url":
"https://sapui5.hana.ondemand.com/sdk/#docs/api/symbols/sap.html"},
"maxrss": 79608, "fds": 19, "client_ip": "172.17.0.1", "path":
"/render.json", "status_code": 200, "rendertime": 10.95194411277771,
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36", "method": "POST",
"qsize": 0, "load": [0.09, 0.08, 0.06]}
2016-06-03 20:37:35.967636 [-] "172.17.0.1" - - [03/Jun/2016:20:37:35
+0000] "POST /render.json HTTP/1.1" 200 13662 "-" "Mozilla/5.0 (Windows NT
10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/50.0.2661.102 Safari/537.36"
Rather than just letting it fail with the index-out-of-bounds error, I changed
parse_page to check each step:
    def parse_page(self, response):
        if 'response' in locals():
            print('response is defined')
        else:
            print('Ooops: response is not defined')
            return
        print response
        if 'response.data' in locals():
            print('response.data is defined')
        else:
            print('Ooops: response.data is not defined')
            return
        print response.data
        print('Len response.data:'.len(response.data))
        if 'childFrames' in response.data.keys():
            print('There is a childFrame')
        else:
            print('Ooops: no childFrames')
            return
        if len(response.data['childFrames']) > 0:
            print('There is childFrame 0')
        else:
            print('Ooops: no childFrame 0')
            return
        print('Len first child:'.len(response.data['childFrames'][0]))
        print('Len html:'.len(response.data['childFrames'][0]['html']))
        iframe_html = response.data['childFrames'][0]['html']
And I get "response.data is not defined":
2016-06-03 13:37:24 [scrapy] DEBUG: Crawled (200) <GET https://sapui5.hana.ondemand.com/> (referer: None)
2016-06-03 13:37:24 [scrapy] DEBUG: Crawled (404) <GET http://localhost:8050/robots.txt> (referer: None)
2016-06-03 13:37:36 [scrapy] DEBUG: Crawled (200) <GET https://sapui5.hana.ondemand.com/sdk/#docs/api/symbols/sap.html via http://localhost:8050/render.json> (referer: None)
response is defined
<200 https://sapui5.hana.ondemand.com/sdk/#docs/api/symbols/sap.html>
Ooops: response.data is not defined
2016-06-03 13:37:36 [scrapy] INFO: Closing spider (finished)
I'm not really sure what that indicates.
Does it point to a Splash problem?
The HTML should have been returned, just as it was with curl.
I'm just trying to figure out which of the moving parts to drill into.
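
One thing that may be worth ruling out first: locals() only holds plain
variable names, so 'response.data' can never be one of its keys, and that
check would print the "not defined" message no matter what Splash returned.
The 'Len ...'.len(...) lines would also fail if they were ever reached, since
str has no len method. Below is a minimal sketch of an attribute-based check
instead. FakeResponse and first_iframe_html are hypothetical names used only
for illustration, and the sketch assumes the response exposes a data dict
shaped the way parse_page expects.

```python
class FakeResponse(object):
    """Hypothetical stand-in for a response; not a Scrapy or Splash class."""

    def __init__(self, data=None):
        # Only set the attribute when data is supplied, to mimic a
        # response type that has no .data at all.
        if data is not None:
            self.data = data


def first_iframe_html(response):
    """Return the first child frame's HTML, or None if a step is missing."""
    # getattr checks for the attribute itself, unlike a locals() lookup.
    data = getattr(response, 'data', None)
    if data is None:
        print('Ooops: response has no data attribute')
        return None

    # Assumes data is a dict; a missing or empty list both count as "none".
    frames = data.get('childFrames') or []
    if not frames:
        print('Ooops: no childFrames')
        return None

    html = frames[0]['html']
    print('Len html: %d' % len(html))
    return html
```

Calling first_iframe_html(response) from parse_page would then show which
step is actually missing, rather than stopping at the locals() check.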
Thanks,
David