I am trying to write a crawler for Google Images. Normally, in the browser, 
a search by image URL (via the Paste Image URL box) starts with a form 
GET request to the /searchbyimage endpoint, like this: 
<https://www.google.com/searchbyimage?image_url=http%3A%2F%2Fimages.doba.com%2Fproducts%2F1440%2Fimages_LUM101040.jpg&image_content=&filename=&hl=en>
which causes a 302 redirect to this URL: 
<https://www.google.com/search?tbs=sbi:AMhZZiuLE8z8qpT5adkxbYKWPOC4NcmZNgGtmQ0lXwkmqJaqsj_1GxeJPcCrUhYF77a6REEFyDZOYUEroBRLPaXhxnUnQg3w5GPo6k67-yhtLzfW855nCmvqtP-5E-Lg8Rm_1omIfz6ia6eCYn7hft-9jrEuyVWBM5ohPEN81NJD4iveDOI0Kc73UVgZ_1oLa2rX4HQdXMeyb5vJkyIoENjxEAXBtmiT5UlaPX7Wd0mnv43TePOVAKmtxD6M_1yDx453nVcLZNynUON20deX_1GPXbgS1uZdEyC7Y1vrge6hqipSvzV7Ku7AXjIxkQlOSoEubfn4bnoSJHaMIAxpzAj_1B8jBJu68vGa9L_1dQfx5qv0Ds8UfPeV7dtBQqDx360dJjaXpLaGz-GW2ajuVnjf1gJsmaG6ZBgRih6MXCrsjY8EOfLkSUxalmwQV2f_1qkVZ5eKtKTb0GpRjkPCB_1r0GWOfR4XDGBI5dcunZ-X1WubB4nDUYgS7uzfxAheGhhIgguRG5ZlYb1AQF6G9mgSJe2QR00s_1lNkX75CuOpfORnhdMZWxHrWoAS4cRCARahP520M3MQdj_1RYRXy6scQAsuZ6MCHF38-SAnrOYv38K1LwP00kY_16IAzdO6ODrzAGWNEGE-x5_1zhTq2iGM9zlPsIRDMm8YDM_1FogixK9LA_1HA3lDYbg1T_1pAHznd-WOKcz3k8sHX8uzJ8W2q7SpleC_1i8EpiNiX-Bg-2ZxsWdmuPC4JZcTaJplT91xup433JhCUeqmcdKtMcEe66X6GbnDHAXZVUkn2nvDH9X7ejOVMqL5aZhhnPHBmnN3LeHE-vlvHuKrv9pBCb1yQjHyGEVNo14R5C_18r_1kPzK8YWhXUsEfAPdrBddMXtPhbTwchFN-1jyq3uV05kzMa4WOMP_1ZrzN7ZgyLjNjNB_1xu7jF3Rd4FQTOaS8pmwLvFOJ1zn5H2HrLa5C5o0B5vJ9muYsK4YgDiNrPK7dTyb_10DhIwsx0pg1rdcyFah3p0SbHBQrrA8pttAsQK_10RuDXJUE3Ado13wWV7c_1BQKpYo1B6qL26hc0WhwGnWJcRMwyxUFDBRdp8ZZMM9H6uQOGSdvvyrREjTxfl8AYuYeNWn6Cgn-rp_1sf2IvFZ_1r97j0HNfTTWcsO3QuKyTyVBFUtHeWcgS2yskpd71V4_10-kXNf0gfvccsoXWzur1C_15Mr-5eCRq3v_1amw0Iy80ouAiHH0S1JPRUOqXYC9ZRRIEfwdtqeNZ9G5TX2ts53RFFZTRzy1TjIRrGFjQIkKYUUFdvJtnTbmLf-YJHD9opgp0feBAfc6kw&hl=en>
That page is fine and contains the results.
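
For reference, the first request URL can be rebuilt from the image URL roughly 
like this. It is only a sketch: build_search_url and SEARCH_BY_IMAGE are names 
made up for this post, and the parameter list is simply what I see in the 
browser request above.

```python
from urllib.parse import urlencode

# Endpoint taken from the browser request quoted above.
SEARCH_BY_IMAGE = "https://www.google.com/searchbyimage"


def build_search_url(image_url, lang="en"):
    """Return the /searchbyimage GET URL for a given image URL."""
    # Parameter names and order are copied from the browser request;
    # the empty fields are sent as-is, the way the form sends them.
    params = [
        ("image_url", image_url),
        ("image_content", ""),
        ("filename", ""),
        ("hl", lang),
    ]
    return SEARCH_BY_IMAGE + "?" + urlencode(params)


url = build_search_url(
    "http://images.doba.com/products/1440/images_LUM101040.jpg"
)
print(url)
```
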

In my Scrapy spider, for some reason, I get an extra redirect to a /webhp 
endpoint: 
<https://www.google.com/webhp?tbs=sbi:AMhZZivlWcp92ov7lZEOCEYbctkXlRLSRfKQCevH9sU6W5caCzZpbhf7zE8GXpEyIu2k7ms9ffF6GkaVdCD-G-ExEB0efy2OO5wdWYFk-J3iwGROEhnBKqhuKw2PBhk16LVZ5gdheCJZwphx7-l--Em74wPKve-IXKYlpOYTNf9Ao3Gq00jFnPfd4eA8ZufDLxH4Ck4ZTWsAEKCojloRHLRdlKfafpJq7GHmDL2VlP1XMV13P9C3lTMn3eqAIPRSSpRzFCxwTMNKl8L0EOe4xlx3CwDUMtXTD7xoosOiWSQpLNnpQL4SEVpBmaOdWid2gsiVPZHZWw4VIsFbWOgyGCGBi6K3jVoC3WlcnUNgui-lEC2H9QgPrBpSOd9pOel_1wvMLCtKBxtSF9May_1Y1Psai9tEG-FKcnALUFyF1FBB06Iu8k1NsU4212XVDB96r6ekRuu0YTpHtbw-Kpl6uSFyKCljNQo6UQYVxRLUgkK_1kKYnPNEiE0nNlp-e5awfN4eTq8VSoIRLg5FnDBRUg7y7jxpunJERmJq-WjfG6p-aMyudegw8BDdtuoBmG-J-CNiJAS-5s2k6l_1xbTKwnVBm3c04oIEi9xlQBd86TKw2jFdl53yyq6sHyZQOXRlAPviSQ2p6G-Ec-_1Szli4HdMcsS13iX7viPvHeNzEH4Y11cgzTlh4h-4kVecrZNFNQIo9LDHkXGWH2EpNCy0j0N2wdL1mQnbUyXRoX1LxneIYzc9yEv3CODmHWUah9EWJSBjFGyaA6VdySPu_13ZICdpyiGjlDvdIRKkRPZuQYXkhxu_1MsZgYr1OOamfCROavB0Eg8BYPqPJH2Q0ds7Be79gHbXV0IRpFZQiZVCCi6wstjPpYNFTN5qhXnPR34PJzqia-EME5bym68vAyOVZzD2i_17KjJLEwlnSsrQmGuwyIgfgwt8PIYglKFQIMpApGybCE2pN2waVtLIXFxNGh8ONvs65Sb2OFm0slt9NscBRDmYh2JSj31-VaxiG3MMz2qWll8Igg9EXsa1e7lhpSSSK3eoPnl3BzpV0NGGEcT1tNHITHm7ZQLBvSk53LeJxcxOFe_1SYCmrXP2lcvvpyt5HU2obnrYerpbt1Hzh4vdxoequZVMyNzKi61x7BVPlJ-jDe36JtDATgQKBsdzlCymi1J47t9GiWTaszljarBbAv7RxATowQ-Ygu-chSdGqGj_1ToITxwApw0852rPC1XZf3RHHahjIeRdMY7wpT-s028TYxTohBdT6N6LbP7p-EBQoQfzK36lx2R922UtHybflno4TBGbbmL0e3uBTD7IiEOBuxOyCNiIkZHIKgf4WmtVRgvCruzmxBqWiCvVQoAroiD_1fFPFhVvtAWABI3G-DhIpuFiPCmJp5RoAyf0XodHnopqeeg6YxRls7Mda8jB9KqgCW7NzBfNRkiqQZAuQ&hl=en>
This breaks the whole thing, because that page contains no results.

I tried to limit the number of redirects via the REDIRECT_MAX_TIMES = 1 
setting, but this blocks even the first redirect, so no redirect is 
followed at all:
[scrapy] DEBUG: Discarding <GET 
https://www.google.com/search?tbs=sbi:AMhZZitIqvtVdD_1vRxr0X5roQxKdG_1m4s20s2oRQ4EJ27UR9zPid0Kw4Y5tTDCHIBJKPK9BR4a5QWppMkuAp1Uf6CC
UmAY0HD3hLkNyXfbRfd6TYMhAKeKUVQOZpF9aOG1yOfQXerVkex9N6iS4pcDeFkdnzxxO8Mu3kMAP_1EZ2_1VfjWl_1EIvbQ4P6SWmOMOV9_1-I9UfhX_1MSpQekZncaq9MKNRMGNNZjbyK5oba54q0_14qQ6qZ4AJPbLPTbTg8NlyuCX4wuOp
goHUwu4ZeepnKRhdGuozDdwID25Yf7Pw0PxYNKpwkBbDYSnJW9FLrkEbOUvkW7xxSBlTqALzuxmq808VktBh0DDu0UEOktvkQJw708HEhjlcceLV0GnwiMqngjips-OWs6j9qeHhZESRmbz9G5fI_1bk9T7XLfqrO_1vsnPbflExCNfj-LUdDc
cAXhV5tnTbi7U4X6scB1mDZpK9kQLy9u7jNQXEwPCL8Dyc6Z6wE5_1GEYF9E_1A3gJKHuNX-WDhXwlMcNmvQxQridGLWbx5cJW0tSm0HQDwABFA45eMIQYBfzKMKQ8HSIfjYTAnqSfCeBgeVmcdoNDLwSAoID1XFcvjhaW6QTWQ50H9nKKqO9n
-_17Ewb4YeKc16oo9NzjHIDTw9KjWPAbKwHn322y3c9a-_1UDy0_1syqObu95txeWl0zNU7tP9kaYhxjHMkKXladjbgBoqfIgpRh4l2WXzjxPOAiwvFkdcP-AMoMOVV0v3doUHPTOV3zQ6iwiLe3JWBdJp0d6dOi4rfSmUsIh1KoUZA9f7hhgQ
-QzRsOJ5Hrz5mbYL4cVi6MRrPJbLQAykeDZz5FNyFVaFksApLOfj0dR15YV4Lxv7F_1MkI20CgC44l2Em6yLCoAlvcuIJOQMcke1iC-M8mkE99jgZtLqiBvycHKwIxkrtcUXHcNduIyGZaBZWCSID1VbhSXNd5BuNC33fXsL0fxwzPSK1j1Psu
sxKTxWHTR5igyY4R7Id8UbE8T60Yvf1pXRlSWTwYJdcA1Yzef3-mXBk1WVh7LM7IMFZz79M6YleiURw8TE6mkwRXfhRYgNX6y7sMIQqtgP9flaijT3XWkWLk1skP3eNEdjser98aX6xYyJX-77mx7daRONRFl7x6niFmJl9zkZzIXICV6uGF0J
ZJ-vN_1vqhcDkIL3u7etbbBl6e9vnegvYdjVPsFrDyW2DjNCHYVWuImYYOVpqBtMWSTWe_1XzmAOM9zj2tWT1_15Gj1RgDzyJUTpkkH9M02IyGomrXMe5IcjenAi2qxvYON44G0-TwVyQtCyLzJUBqXKBHy0nskiAQVzUl4arFzsLbZE-MmuzO
V1Hvp0oiAfny2aoAOGBJh-GY4RuIzYgePKrpEpzkXSfz68ZuuzZo&hl=en>: max 
redirections reached

Setting it to 2 lets the /webhp redirect happen again. (Does anyone 
see a bug here?)
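
One workaround I am considering is to turn off automatic redirect handling 
for this request (Scrapy's dont_redirect and handle_httpstatus_list request 
meta keys) and decide in the callback whether to follow the Location header. 
Below is a rough sketch of just that decision, in plain Python so it can be 
tried without running a crawl. next_hop and ALLOWED_PATHS are hypothetical 
names, and I am assuming /search is the only path worth following.

```python
from urllib.parse import urljoin, urlparse

# Assumption: only the results page is worth following; /webhp is not.
ALLOWED_PATHS = ("/search",)
REDIRECT_CODES = (301, 302, 303, 307, 308)


def next_hop(current_url, status, location, allowed_paths=ALLOWED_PATHS):
    """Return the absolute URL to follow, or None to stop here."""
    if status not in REDIRECT_CODES or not location:
        return None
    # Location may be relative, so resolve it against the current URL.
    target = urljoin(current_url, location)
    if urlparse(target).path in allowed_paths:
        return target
    return None


start = "https://www.google.com/searchbyimage?image_url=x"
print(next_hop(start, 302, "/search?tbs=sbi:abc&hl=en"))
print(next_hop(start, 302, "https://www.google.com/webhp?tbs=sbi:abc&hl=en"))
```

In a spider, the callback would call this with response.url, response.status 
and the decoded Location header, and only yield a new request when the result 
is not None.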

I also tried using Splash with scrapyjs, but the Splash browser seems to 
miss the point entirely. This URL 
<https://www.google.com/search?tbs=sbi:AMhZZivafWV8IiIUYyJ5FYnYsp2tbeu99uRC5lAdhhNRs4v8S8CM7dHuk9tVk4A3U2Lr_1wmjutPR_1GwdGgFNJh-Uw4DdFlBCjii1SCA89eBwbtToL-6-3rnk4sFtyUetfyf3afEs5avaWx0nUlaZcNkGRMKi75sK4KQpQ3G8TV-0gJqMJGVTf4oP7Fwzs8PmIA7by8lUt-R7bCbg8zAkDCj0AHOW_1nxKy8Cw7mAvx122SUhDlTCySNR6tFtGBSr5rAVOsZxjpHZ-YndU-tkPqm8bWs4XaumDe2kwSFgV6BWZqohBw5qba8LZOCb-k_1AGb2Jj0eG7S2OvEdk8kdHJnEz6q2SVPgevKRXqkumuW9WdbhrVELCPfXRI_1DdpPowOYvi40txgY4so6u17L6aOO8dM_1B2Wd_1v97qEUzkOa5MAhODSqnqPXqCU8FwXoeG4883U73xkd3t3H0kVBx8UoovQ9jAbxzVTPJC6aNoDylM9DTbjOjZ4UP-1WYB-z5v-AcZRTwDuhOT4wJTqMPjqPAZmd5E-GISb6eV91s0YnBQ9UXZQYPmTP-NlcdDG8XYOvjetZa8gdgdfhd1MxZB2uArRcPi1ZRLArgvn2VeRd2hi81S3PkbTwh74RnXVwygKka8axj9t29c55gSdKdlUn68TVK-t6yyq4zQlTw2ytkAClZJRAuauE_1XzsImnBzGX18ewA6PaCNOa5AaAS6JDWNZquc6c1MvISsU1zuNk6Wa3LEZhtfbKELzJh2ujwgsQp9_1cOrTa9KT76KF_15xOdEOJ5QIY27EAKaNaUjvuW3Lh7wB0JW66dS2DwICDylFVIcygs6lQFzIFXiy9rErPZZ7WHHttYswB97BzlOmHwX1hqqzFUHhZ-lWjeslAMJQBMvquhoHao3wyfO94NkKbmeGRu0ysKbbGKtGwLZ_1LpTXPoBXpIDXeJzZCdaU9cbaJOBa0J09y9mvs9C5ccuRHhY_10Q6nMxqVjZa3hhQV9MuqD5jKOAsipLw4drhB4S8xitogEis9sV-siEcKrLBjmX663EcqDp0FTy6hQZi1omal-5KcboVMKq5Ww29DmJNg6HPKDMITHcxte81IG3CikjGKGHaeGMlrZJXD96sl8VYTooxSGAOKSYGQu9vdXijVSAQguj7xBC72kfvy3AMpoD_1nRKP_1_1sAP7PcYDX6vgb3eHbv56ELc4ZNAlUcDBqIHPegNaRFfpVDK87E37c1K5D2pDQ5VrLQjZHsoO7ofDgtgtwtBaGf4ddfaCQ7mS3dqocn9OD0xYUYU0lxuBH5lqklbjs3Pgg9al38ZQ&hl=en>
which works fine in a normal browser, fails the *Render!* button test.

Any help with limiting redirects or making Splash work is much 
appreciated.
Thank you.
