Hi,

I don't think Selenium is required. This looks like it can be done by just varying the request payload of a single POST API call.

POST to URL: https://missionantyodaya.nic.in/preloginVillageInfrastructureReports2020.html
The request content type is application/x-www-form-urlencoded.
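A minimal sketch of that single POST call, using only the Python standard library. The URL is the one given in this message and the field names are the ones in the payloads listed just below; `build_payload` and `fetch_report` are hypothetical helper names, and whether the site also expects cookies or extra headers is an open assumption that this thread does not confirm.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# URL as given in this message; everything else here is illustrative.
REPORT_URL = "https://missionantyodaya.nic.in/preloginVillageInfrastructureReports2020.html"

# Form field names, in the order they appear in the payloads listed below.
FIELDS = (
    "stateCode", "stateName",
    "districtCode", "districtName",
    "blockCode", "blockName",
    "gpCode", "gpName",
)


def build_payload(**known):
    """Return the full form payload, leaving unspecified levels empty."""
    unknown = set(known) - set(FIELDS)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    return {name: str(known.get(name, "")) for name in FIELDS}


def fetch_report(payload, url=REPORT_URL, timeout=30):
    """POST the payload as application/x-www-form-urlencoded and return the HTML.

    Assumption: the response decodes as UTF-8 and no session cookie is needed.
    """
    body = urlencode(payload).encode("ascii")
    request = Request(
        url,
        data=body,  # supplying data makes urllib send a POST
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, `fetch_report(build_payload(stateCode=27, stateName="MAHARASHTRA"))` would request the state-level page, and the returned HTML could then be handed to an HTML parser.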
At *state level*, the request payload looks like:

stateCode: 27
stateName: MAHARASHTRA
districtCode:
districtName:
blockCode:
blockName:
gpCode:
gpName:

At *district level* it becomes:

stateCode: 27
stateName: MAHARASHTRA
districtCode: 469
districtName: AURANGABAD
blockCode:
blockName:
gpCode:
gpName:

Then *block level*:

stateCode: 27
stateName: MAHARASHTRA
districtCode: 469
districtName: AURANGABAD
blockCode: 4315
blockName: KHULTABAD
gpCode:
gpName:

Then *GP level*:

stateCode: 27
stateName: MAHARASHTRA
districtCode: 469
districtName: AURANGABAD
blockCode: 4315
blockName: KHULTABAD
gpCode: 170584
gpName: BODKHA

In Python, one can use BeautifulSoup to capture the table data as well as get the (code + name) pairs for the next level.

--
Cheers,
Nikhil VJ
https://nikhilvj.co.in

On Fri, Feb 4, 2022 at 1:42 PM Sanjay Bhangar <sanjaybhan...@gmail.com> wrote:

> Piyush -
>
> You could write a Python (or your preferred language) script that just
> requests the HTML, parses it, and follows the hierarchy, without using
> Selenium. This could be a bunch of work, as the site doesn't use regular
> links with GET requests. Instead, when you click on a state in the table,
> it uses JavaScript to fill hidden form fields with the state code, etc.,
> and then does a form submit, causing a POST request to be made with those
> values.
>
> For example, you can see that the links in the table have an onClick handler like
> "selectState(2,'HIMACHAL PRADESH','preloginDistrictInfrastructureReports2020.html')".
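As a rough illustration, the arguments of a handler like the one quoted just above can be pulled out of the page text with a regular expression. The pattern assumes every call has the shape selectState(number, 'name', 'action') with single-quoted strings; the sample anchor tag is made up for the demonstration, and `extract_states` is a hypothetical helper name.

```python
import re

# Matches calls shaped like the quoted example:
#   selectState(2,'HIMACHAL PRADESH','preloginDistrictInfrastructureReports2020.html')
SELECT_STATE_RE = re.compile(
    r"selectState\(\s*(\d+)\s*,\s*'([^']*)'\s*,\s*'([^']*)'\s*\)"
)


def extract_states(html):
    """Return (state_code, state_name, action) tuples found in the HTML text."""
    return [
        (int(code), name, action)
        for code, name, action in SELECT_STATE_RE.findall(html)
    ]


# Hypothetical markup, modelled on the handler quoted in this thread.
sample = """<a href="#" onclick="selectState(2,'HIMACHAL PRADESH','preloginDistrictInfrastructureReports2020.html')">HIMACHAL PRADESH</a>"""

print(extract_states(sample))
# [(2, 'HIMACHAL PRADESH', 'preloginDistrictInfrastructureReports2020.html')]
```

Each tuple supplies the stateCode and stateName form values plus the action to POST to; handlers for the lower levels would need their own patterns, since their exact signatures are not shown in this thread.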
>
> Then, in the JavaScript, you can see the selectState function defined like
> so:
>
> function selectState(stateCode,stateName,action){
>     $("#stateCode").val(stateCode);
>     $("#stateName").val(stateName);
>     $("#reportForm").attr('action', action);
>     $("#reportForm").submit();
> }
>
> In this JS file:
> https://missionantyodaya.nic.in/resources/antyodaya/js/custom/prelogin/reports/preloginReport.js
>
> So this will make a POST request to
> preloginDistrictInfrastructureReports2020.html
> with stateCode=2, stateName=HIMACHAL PRADESH
>
> Similarly, there are different onClick handlers defined for selecting
> districts, etc. that you can follow down to see what URLs they are calling
> with what parameters. And in theory, you could write some HTML parsing code
> and some regex to go through the items in each table, parse out the
> parameters and URLs to call, and follow things down.
>
> So, in theory you could write this without mucking around with Selenium,
> but it also seems like a lot more work than if the site were structured
> "normally" with unique URLs and GET requests.
>
> For the page numbering, this seems okay: the HTML outputs all the items
> across all the pages, and the actual pagination on the page is purely
> client-side JavaScript. So if you were to read the HTML of the page via
> Python or similar, you would just get all the items in the table without having
> to worry about pagination.
>
> Unfortunately, this does seem like a lot of work and I don't really have
> the time to do anything, but it seemed like an interesting problem and I
> was curious, so I took a look. Hope it helps a bit.
>
> All the best,
> Sanjay
>
> On Fri, Feb 4, 2022 at 1:03 PM Piyush Kumar <psh.kumar1...@gmail.com> wrote:
>
>> Could folks here suggest how to go about this?
>>
>> https://missionantyodaya.nic.in/preloginStateInfrastructureReports2020.html
>>
>> When we click this link, we get data on village-level infrastructure spread
>> across multiple HTML tables over many pages (separated into state, district,
>> block, etc.).
>>
>> Suppose I want to scrape data up to the village level for a particular
>> state. Is there any way I can get it done without too much back and forth
>> over Selenium WebDriver? Please note that to access village-level data you
>> have to go through a nested hierarchy of links (gram panchayat within block,
>> which is within a district, and so on). To make matters more complicated,
>> the pages have also not been numbered.
>>
>> Can someone in the know help me figure this out?
>>
>> Thanks in advance
>> Piyush
>>
>> --
>> Datameet is a community of Data Science enthusiasts in India. Know more
>> about us by visiting http://datameet.org
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "datameet" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to datameet+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/datameet/CAFtOtdujRhq36O4SW%3Dtie%2BSDH_6Pq1R87B6nVerzU4giQVka%3Dw%40mail.gmail.com