Piyush -

You could write a script in Python (or your preferred language) that just
requests the HTML, parses it, and follows the hierarchy, without using
Selenium. This could be a fair bit of work, though, because the site doesn't
use regular links with GET requests. Instead, when you click on a state in
the table, it uses JavaScript to fill hidden form fields with the state
code, etc., and then submits the form, which makes a POST request with
those values.

For example, you can see that the links in the table have an onClick handler
like
"selectState(2,'HIMACHAL PRADESH','preloginDistrictInfrastructureReports2020.html')".

Then, in the javascript, you can see the selectState function defined like
so:

function selectState(stateCode, stateName, action) {
    $("#stateCode").val(stateCode);
    $("#stateName").val(stateName);
    $("#reportForm").attr('action', action);
    $("#reportForm").submit();
}

It is defined in this JS file:
https://missionantyodaya.nic.in/resources/antyodaya/js/custom/prelogin/reports/preloginReport.js

So clicking that link makes a POST request to
preloginDistrictInfrastructureReports2020.html
with stateCode=2 and stateName=HIMACHAL PRADESH.
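
To make that concrete, here is a minimal Python sketch that builds the same
POST using only the standard library. It only constructs the request and
does not send it. I am assuming that the form field names match the element
ids used in the JS (stateCode, stateName) and that the relative action
resolves against the state page URL; build_state_request is just a name I
made up. I have not tested this against the live site, which may also expect
cookies or other hidden fields.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request

# Page that contains the state table (URL from the original question).
STATE_PAGE = "https://missionantyodaya.nic.in/preloginStateInfrastructureReports2020.html"


def build_state_request(state_code, state_name, action):
    """Build (but do not send) the POST that selectState() would trigger."""
    # Assumption: the form field names match the element ids used in the JS.
    fields = {"stateCode": state_code, "stateName": state_name}
    body = urlencode(fields).encode("ascii")
    # The form action is relative, so resolve it against the page URL.
    url = urljoin(STATE_PAGE, action)
    # Passing a body and method="POST" gives a form-style POST request.
    return Request(url, data=body, method="POST")


req = build_state_request(
    2, "HIMACHAL PRADESH", "preloginDistrictInfrastructureReports2020.html"
)
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

To actually fetch the page you would pass req to urllib.request.urlopen()
and read the response.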

Similarly, there are different onClick handlers defined for selecting
districts, etc., that you can follow down to see which URLs they call and
with which parameters. In theory, you could write some HTML parsing code
and some regex to go through the items in each table, parse out the
parameters and URLs to call, and work your way down the hierarchy.
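
A rough sketch of that parsing step, again with only the Python standard
library: collect every onclick attribute and pull the selectState arguments
out with a regex. The sample markup, the second row (code 99, 'EXAMPLE
STATE') and the OnClickCollector name are made up for illustration; the real
page's markup may differ, and the handlers for districts and lower levels
would need their own patterns.

```python
import re
from html.parser import HTMLParser

# Matches selectState(2,'HIMACHAL PRADESH','some.html'), tolerating
# whitespace and line breaks between the arguments.
SELECT_STATE = re.compile(
    r"selectState\(\s*(\d+)\s*,\s*'([^']*)'\s*,\s*'([^']*)'\s*\)"
)


class OnClickCollector(HTMLParser):
    """Collect the selectState(...) arguments from every onclick attribute."""

    def __init__(self):
        super().__init__()
        self.calls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names, so onClick arrives as onclick.
        for name, value in attrs:
            if name == "onclick" and value:
                match = SELECT_STATE.search(value)
                if match:
                    code, state, action = match.groups()
                    self.calls.append((int(code), state, action))


# Hypothetical markup, made up for illustration only.
SAMPLE = """
<table>
  <tr><td><a href="#" onClick="selectState(2,'HIMACHAL PRADESH','preloginDistrictInfrastructureReports2020.html')">HIMACHAL PRADESH</a></td></tr>
  <tr><td><a href="#" onClick="selectState(99,'EXAMPLE STATE','preloginDistrictInfrastructureReports2020.html')">EXAMPLE STATE</a></td></tr>
</table>
"""

collector = OnClickCollector()
collector.feed(SAMPLE)
for call in collector.calls:
    print(call)
```

Each tuple is (state code, state name, action), which is everything needed
to build the next POST in the hierarchy.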

So, in theory you could write this without mucking around with Selenium,
but it also seems like a lot more work than if the site were structured
"normally" with unique URLs and GET requests.

The page numbering seems okay, though: the HTML contains all the items
across all the pages, and the pagination on the page is done purely in
client-side JavaScript. So if you read the page's HTML with Python or
similar, you would get all the items in the table without having to worry
about pagination.
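
As an illustration, a small standard-library parser like the one below would
return every row of a table in one pass, whatever the browser shows per
page. The TableRows class and the sample table are hypothetical; the real
tables will have different columns and may need extra handling (nested
tables, links inside cells, and so on).

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, None outside a row
        self._cell = None  # text pieces of the cell being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical table, made up for illustration only.
SAMPLE = """
<table>
  <tr><th>Sr. No.</th><th>Name</th></tr>
  <tr><td>1</td><td>Village A</td></tr>
  <tr><td>2</td><td>Village B</td></tr>
  <tr><td>3</td><td>Village C</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE)
for row in parser.rows:
    print(row)
```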

Unfortunately, this does seem like a lot of work and I don't really have
the time to do anything, but it seemed like an interesting problem and I
was curious, so I took a look. I hope it helps a bit.

All the best,
Sanjay

On Fri, Feb 4, 2022 at 1:03 PM Piyush Kumar <psh.kumar1...@gmail.com> wrote:

> Could folks here suggest how to go about this?
>
> https://missionantyodaya.nic.in/preloginStateInfrastructureReports2020.html
>
> When we click this link, we get data on village-level infrastructure spread
> across multiple HTML tables on many pages (separated into state, district,
> block etc.)
>
> Suppose I want to scrape data up to the village level for a particular
> state, is there any way I can get it done without too much back and forth
> over Selenium webdriver? Please note that to access village level data you
> have to go through a nested hierarchy of links (gram panchayat within block,
> which is within a district and so on). To make matters more complicated,
> the pages have also not been numbered.
>
> Can someone in the know help me figure this out?
>
> Thanks in advance
> Piyush
>
> --
> Datameet is a community of Data Science enthusiasts in India. Know more
> about us by visiting http://datameet.org
> ---
> You received this message because you are subscribed to the Google Groups
> "datameet" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to datameet+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/datameet/CAFtOtdujRhq36O4SW%3Dtie%2BSDH_6Pq1R87B6nVerzU4giQVka%3Dw%40mail.gmail.com
> <https://groups.google.com/d/msgid/datameet/CAFtOtdujRhq36O4SW%3Dtie%2BSDH_6Pq1R87B6nVerzU4giQVka%3Dw%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
>

