Piyush - You could write a Python (or your preferred language) script that just requests the HTML, parses it, and follows the hierarchy, without using Selenium. This could be a fair amount of work, as the site doesn't use regular links with GET requests. Instead, when you click on a state in the table, it uses JavaScript to fill hidden form fields with the state code, etc., and then submits the form, which makes a POST request with those values.
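
To make that concrete, here is a rough sketch in Python using only the standard library. It assumes the onClick handlers look like the selectState(...) example quoted in this message, that the form fields are posted under the names stateCode and stateName, and that the action is relative to the site root. The helper names (extract_state_links, build_state_request, fetch_state_page) are made up for illustration. The real form may also carry other hidden fields or need session cookies, which this sketch does not handle.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import Request, urlopen

# Assumption: form actions are relative to the site root.
BASE_URL = "https://missionantyodaya.nic.in/"

# Matches handlers such as:
# selectState(2,'HIMACHAL PRADESH','preloginDistrictInfrastructureReports2020.html')
SELECT_STATE_RE = re.compile(
    r"selectState\(\s*(\d+)\s*,\s*'([^']*)'\s*,\s*'([^']*)'\s*\)"
)


class OnClickCollector(HTMLParser):
    """Collects the onclick attribute of every tag that has one."""

    def __init__(self):
        super().__init__()
        self.handlers = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names, so onClick arrives as onclick.
        for name, value in attrs:
            if name == "onclick" and value:
                self.handlers.append(value)


def extract_state_links(html):
    """Return (state_code, state_name, action) tuples found in the HTML."""
    collector = OnClickCollector()
    collector.feed(html)
    links = []
    for handler in collector.handlers:
        match = SELECT_STATE_RE.search(handler)
        if match:
            code, name, action = match.groups()
            links.append((int(code), name, action))
    return links


def build_state_request(state_code, state_name, action):
    """Build (but do not send) the POST that the form submit would make."""
    # Assumption: the hidden fields are submitted as stateCode and stateName.
    body = urlencode({"stateCode": state_code, "stateName": state_name})
    return Request(urljoin(BASE_URL, action), data=body.encode("ascii"), method="POST")


def fetch_state_page(state_code, state_name, action):
    """Send the POST and return the response HTML (this hits the live site)."""
    request = build_state_request(state_code, state_name, action)
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

The same pattern would repeat one level down: feed the HTML returned by fetch_state_page to a collector, match whatever handler the district links use, and build the next POST from its arguments.
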

For example, you can see that the links in the table have an onClick handler like:

    selectState(2,'HIMACHAL PRADESH','preloginDistrictInfrastructureReports2020.html')

Then, in the JavaScript, you can see the selectState function defined like so:

    function selectState(stateCode,stateName,action){
        $("#stateCode").val(stateCode);
        $("#stateName").val(stateName);
        $("#reportForm").attr('action', action);
        $("#reportForm").submit();
    }

This is in the JS file: https://missionantyodaya.nic.in/resources/antyodaya/js/custom/prelogin/reports/preloginReport.js

So this will make a POST request to preloginDistrictInfrastructureReports2020.html with stateCode=2 and stateName=HIMACHAL PRADESH.

Similarly, there are different onClick handlers defined for selecting districts, etc., that you can follow down to see what URLs they call with what parameters. You could write some HTML parsing code and some regex to go through the items in each table, parse out the parameters and URLs to call, and follow things down. So, in theory, you could write this without mucking around with Selenium, but it also seems like a lot more work than if the site were structured "normally" with unique URLs and GET requests.

The page numbering seems okay: the HTML outputs all the items across all the pages, and the actual pagination on the page is purely client-side JavaScript. So if you were to read the HTML via Python or similar, you would get all the items in the table without having to worry about pagination.

Unfortunately, this does seem like a lot of work and I don't really have the time to do anything, but it seemed like an interesting problem and I was curious, so I took a look. Hope it helps a bit.

All the best,
Sanjay

On Fri, Feb 4, 2022 at 1:03 PM Piyush Kumar <psh.kumar1...@gmail.com> wrote:
> Could folks here suggest how to go about this?
>
> https://missionantyodaya.nic.in/preloginStateInfrastructureReports2020.html
>
> When we click this link, we get data on village-level infrastructure put
> within multiple HTML tables across many pages (separated into state, dist.,
> block etc.)
>
> Suppose I want to scrape data up to the village level for a particular
> state, is there any way I can get it done without too much back and forth
> over Selenium webdriver? Please note that to access village level data you
> have to go through a nested hierarchy of links (gram panchayat within block,
> which is within a district and so on). To make matters more complicated,
> the pages have also not been numbered.
>
> Can someone in the know help me figure this out?
>
> Thanks in advance
> Piyush
>
> --
> Datameet is a community of Data Science enthusiasts in India. Know more
> about us by visiting http://datameet.org
> ---
> You received this message because you are subscribed to the Google Groups
> "datameet" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to datameet+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/datameet/CAFtOtdujRhq36O4SW%3Dtie%2BSDH_6Pq1R87B6nVerzU4giQVka%3Dw%40mail.gmail.com