Piyush,

I took a look, and it looks like you can use these APIs. I'm providing the curl requests below. You can copy-paste them into https://curlconverter.com/ to convert them into the language of your choice :)

1. Get the blocks of a district

curl 'https://missionantyodaya.nic.in/getPreLoginAnalyticsData.html?stateCode=6&districtCode=61' \
  -X 'POST' \
  -H 'Connection: keep-alive' \
  -H 'Content-Length: 0' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  -H 'sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"' \
  -H 'Accept: */*' \
  -H 'Content-Type: application/json' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36' \
  -H 'sec-ch-ua-platform: "Linux"' \
  -H 'Origin: https://missionantyodaya.nic.in' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Referer: https://missionantyodaya.nic.in/preloginAnalytics2020.html' \
  -H 'Accept-Language: en-US,en;q=0.9' \
  -H 'Cookie: JSESSIONID=obT6zCBsqbClJdpkAhrHxIbVaNog5IcQNt1WerzF.nqj1p-lxapp8-001' \
  --compressed

2. Get all the metrics for a block

curl 'https://missionantyodaya.nic.in/getPreLoginAnalyticsData.html?stateCode=6&districtCode=61&blockCode=469' \
  -X 'POST' \
  -H 'Connection: keep-alive' \
  -H 'Content-Length: 0' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  -H 'sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"' \
  -H 'Accept: */*' \
  -H 'Content-Type: application/json' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36' \
  -H 'sec-ch-ua-platform: "Linux"' \
  -H 'Origin: https://missionantyodaya.nic.in' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Referer: https://missionantyodaya.nic.in/preloginAnalytics2020.html' \
  -H 'Accept-Language: en-US,en;q=0.9' \
  -H 'Cookie: JSESSIONID=obT6zCBsqbClJdpkAhrHxIbVaNog5IcQNt1WerzF.nqj1p-lxapp8-001' \
  --compressed

I basically went to the analytics tab and looked for the APIs being called.
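
In case it saves a trip to the converter, here is a rough Python sketch of the same two requests using only the standard library. It is untested against the live site and rests on a few assumptions: that the endpoint accepts an empty POST body, that only a handful of the browser headers matter, and that the JSESSIONID cookie is optional or can be supplied from your own browser session. The helper names (build_url, fetch) are mine, not anything the site defines, and the response is returned as raw text because I have not confirmed its format.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the curl requests above.
BASE_URL = "https://missionantyodaya.nic.in/getPreLoginAnalyticsData.html"


def build_url(state_code, district_code, block_code=None):
    """Build the request URL; leave block_code out to list a district's blocks."""
    params = {"stateCode": state_code, "districtCode": district_code}
    if block_code is not None:
        params["blockCode"] = block_code
    return f"{BASE_URL}?{urlencode(params)}"


def fetch(state_code, district_code, block_code=None, session_cookie=None):
    """POST with an empty body and return the response body as text.

    Assumption: a reduced header set is enough. If the server rejects the
    request, copy more headers over from the curl command.
    """
    headers = {
        "Accept": "*/*",
        "Content-Type": "application/json",
        "Origin": "https://missionantyodaya.nic.in",
        "Referer": "https://missionantyodaya.nic.in/preloginAnalytics2020.html",
    }
    if session_cookie:
        # Hypothetical parameter: pass the JSESSIONID value from your browser.
        headers["Cookie"] = f"JSESSIONID={session_cookie}"
    request = Request(
        build_url(state_code, district_code, block_code),
        data=b"",  # empty body, mirroring Content-Length: 0 in the curl call
        headers=headers,
        method="POST",
    )
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch(6, 61) would correspond to request 1 and fetch(6, 61, 469) to request 2. To cover a whole district you would presumably loop over the block codes returned by the first call, but that depends on the response structure, which you will need to inspect yourself.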

On Fri, Feb 4, 2022 at 12:12 AM Sanjay Bhangar <sanjaybhan...@gmail.com> wrote:

> Piyush -
>
> You could write a Python (or your preferred language) script that just
> requests the HTML, parses it, and follows the hierarchy, without using
> Selenium. This could be a bunch of work, as the site doesn't use regular
> links with GET requests. Instead, when you click on a state in the table,
> it uses JavaScript to fill in hidden form fields with the state code, etc.
> and then submits the form, causing a POST request to be made with those
> values.
>
> For example, you can see that the links in the table have an onClick
> handler like
> "selectState(2,'HIMACHAL PRADESH','preloginDistrictInfrastructureReports2020.html')".
>
> Then, in the JavaScript, you can see the selectState function defined like
> so:
>
> function selectState(stateCode,stateName,action){
>     $("#stateCode").val(stateCode);
>     $("#stateName").val(stateName);
>     $("#reportForm").attr('action', action);
>     $("#reportForm").submit();
>
> }
>
> In this JS file:
> https://missionantyodaya.nic.in/resources/antyodaya/js/custom/prelogin/reports/preloginReport.js
>
> So this will make a POST request to
> preloginDistrictInfrastructureReports2020.html
> with stateCode=2, stateName=HIMACHAL PRADESH
>
> Similarly, there are different onClick handlers defined for selecting
> districts, etc. that you can follow down to see what URLs they are calling
> with what parameters. And in theory, you could write some HTML parsing code
> and some regex to go through the items in each table, parse out the
> parameters and URLs to call, and follow things down.
>
> So, in theory you could write this without mucking around with Selenium,
> but it also seems like a lot more work than if the site was structured
> "normally" with unique URLs and GET requests.
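
For the state level, the regex idea described above might look roughly like the sketch below. It assumes the handlers appear in the HTML exactly in the selectState(code,'name','action') form quoted above, and that stateCode and stateName are the only fields the form needs; the real form may carry other hidden fields, so check the page source. The function names are made up for illustration.

```python
import re
from urllib.parse import urlencode, urljoin

# Page whose table contains the selectState(...) handlers.
PAGE_URL = "https://missionantyodaya.nic.in/preloginStateInfrastructureReports2020.html"

# Matches selectState(2,'HIMACHAL PRADESH','some-action.html')
SELECT_STATE_RE = re.compile(
    r"selectState\(\s*(\d+)\s*,\s*'([^']*)'\s*,\s*'([^']*)'\s*\)"
)


def extract_state_links(html):
    """Return one dict per selectState(...) call found in the HTML."""
    links = []
    for code, name, action in SELECT_STATE_RE.findall(html):
        links.append({
            "stateCode": int(code),
            # Collapse any line breaks or repeated spaces inside the name.
            "stateName": " ".join(name.split()),
            "action": action,
        })
    return links


def build_form_post(link):
    """Return (url, body) for the POST that the form submit would make.

    Assumption: only stateCode and stateName are submitted.
    """
    url = urljoin(PAGE_URL, link["action"])
    body = urlencode({
        "stateCode": link["stateCode"],
        "stateName": link["stateName"],
    }).encode("ascii")
    return url, body
```

The district, block, and lower levels would need their own patterns, since they use different handlers with different parameters.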
>
> For the page numbering, this seems okay: the HTML outputs all the items
> across all the pages, and the actual pagination on the page is purely
> client-side JavaScript - so if you were to read the HTML on the page via
> Python or so, you would just get all the items in the table without having
> to worry about pagination.
>
> Unfortunately, this does seem like a lot of work and I don't really have
> the time to do anything, but it seemed like an interesting problem and I
> was curious so I took a look. Hope it could help a bit.
>
> All the best,
> Sanjay
>
> On Fri, Feb 4, 2022 at 1:03 PM Piyush Kumar <psh.kumar1...@gmail.com>
> wrote:
>
>> Could folks here suggest how to go about this?
>>
>> https://missionantyodaya.nic.in/preloginStateInfrastructureReports2020.html
>>
>> When we click this link, we get data on village-level infrastructure put
>> within multiple HTML tables across many pages (separated into state,
>> district, block, etc.)
>>
>> Suppose I want to scrape data up to the village level for a particular
>> state, is there any way I can get it done without too much back and forth
>> over Selenium webdriver? Please note that to access village-level data you
>> have to go through a nested hierarchy of links (gram panchayat within
>> block, which is within a district, and so on). To make matters more
>> complicated, the pages have also not been numbered.
>>
>> Can someone in the know help me figure this out?
>>
>> Thanks in advance
>> Piyush
>>
>> --
>> Datameet is a community of Data Science enthusiasts in India. Know more
>> about us by visiting http://datameet.org
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "datameet" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to datameet+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/datameet/CAFtOtdujRhq36O4SW%3Dtie%2BSDH_6Pq1R87B6nVerzU4giQVka%3Dw%40mail.gmail.com