This is very interesting. Can this be used for commercial purposes? Where can I read about the data policy on this?
Regards,
Pradeep

On Mon, Nov 14, 2016 at 9:21 AM, Nikhil VJ <nikhil...@gmail.com> wrote:
> Hi friends,
>
> I've created some shell scripts to aggregate the data from downloaded
> 7/12 records (html files) into two csv's. Sharing a github link having
> the code and instructions:
> https://github.com/answerquest/mahabhulekh-7-12-aggregating
>
> Still no luck on automated scraping from the site, but this
> aggregation was the next step and has really simplified the process
> of inspecting multiple records at once.
>
> -Nikhil
>
> On 10/27/16, Nikhil VJ <nikhil...@gmail.com> wrote:
> > Hi Ankit,
> >
> > Thanks for the R lead! I checked it out. I'm already doing something
> > like it using some quick shell/bash commands and a python script
> > that converts any html table to csv
> > (http://stackoverflow.com/a/16697784). Once we have the data down in
> > HTMLs it's fairly straightforward. This part comes after the
> > scraping.
> >
> > The data in this case is not in permanent HTMLs that we can just
> > save in batch. It is generated server-side on the Mahabhulekh server
> > depending on form inputs in an authenticated user session, and then
> > rendered as html at one constant URL. So what I'm looking for is
> > something that would simulate / automate the calls to the
> > mahabhulekh server (with due time intervals between each call, of
> > course; we must not overload the server) and capture the output it
> > returns.
> >
> > So far I'm not able to programmatically capture the HTML coming in
> > the popup window it generates. The POST request returns a generic
> > null response or the site's main webpage in all the wget and curl
> > commands I've tried. Folks who have done some scraping earlier might
> > be able to help.
> >
> > Another track worth exploring might be iMacros or other ways to
> > automate browser sessions.
> > Folks working in the testing departments of ticketing / booking
> > sites etc. might know and could help, so please share this with your
> > friends working on such projects!
> >
> > I've read in some places that R can be used to simulate this, so
> > yes, it'll be worth continuing to explore, but I know shell
> > scripting better, so I'm hoping something comes through there.
> >
> > --
> > Cheers,
> > Nikhil
> > +91-966-583-1250
> > Pune, India
> > Self-designed learner at Swaraj University
> > <http://www.swarajuniversity.org>
> > Blog <http://nikhilsheth.blogspot.in> | Contribute
> > <https://www.payumoney.com/webfronts/#/index/NikhilVJ>
> >
> > On 10/25/16, Ankit Gaur <gauran...@gmail.com> wrote:
> >> Though I am not very well conversant with Data Sciences and web
> >> scraping, we had a recent DataKind meetup
> >> https://www.meetup.com/DataKind-Bangalore/events/234855978/ in
> >> Bangalore, where Bargava talked about using R's rvest library
> >> <https://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/>.
> >> We were able to do some basic scraping on goodreads with this. See
> >> if this fits your needs.
> >>
> >> Thanks,
> >> Ankit
> >>
> >> On Mon, Oct 24, 2016 at 10:09 PM, Nikhil VJ <nikhil...@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> I'm looking at Maharashtra's land records portal:
> >>> https://mahabhulekh.maharashtra.gov.in
> >>>
> >>> ...and wondering if it's possible to scrape data from here?
> >>>
> >>> Here's the workflow:
> >>> choose 7/12 (७/१२) > select any जिल्हा > तालुका > गाव
> >>> select शोध : सर्वे नंबर / गट नंबर (first option)
> >>> type 1 in the text box and press the "शोधा" button
> >>> Then we get a dropdown with options like 1/1, 1/2, 1/3 etc.
> >>>
> >>> On selecting any and clicking "७/१२ पहा",
> >>> a new window/tab opens up (you have to enable popups), having
> >>> static HTML content (some tables). I need to capture this content.
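[Editor's aside: the "any html table to csv" step mentioned earlier in the thread (the linked stackoverflow answer) can be sketched with the Python standard library alone. This is a generic illustration, not the script from Nikhil's repo; `TableExtractor` and `html_table_to_csv` are names invented here.]

```python
# Minimal sketch: extract every <tr>/<td|th> from an HTML page and emit
# CSV, using only the standard library (no BeautifulSoup dependency).
import csv
import io
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collects the text of each table cell, grouped by row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # current row's cells, while inside <tr>
        self._cell = None  # current cell's text pieces, while inside <td>/<th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def html_table_to_csv(html_text):
    """Return the rows of all tables in html_text as one CSV string."""
    parser = TableExtractor()
    parser.feed(html_text)
    out = io.StringIO()
    csv.writer(out).writerows(parser.rows)
    return out.getvalue()
```

Once the per-record HTMLs are saved, looping this over a directory of files and concatenating the output would reproduce the aggregation step described above.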
> >>> The URL is always the same:
> >>> https://mahabhulekh.maharashtra.gov.in/Konkan/pg712.aspx
> >>> ...but the content changes depending on the options chosen.
> >>>
> >>> On using the browser's "Inspect Element" > Network and clicking
> >>> the final button, there is a request to this URL:
> >>>
> >>> https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712
> >>>
> >>> and the request params / payload is like:
> >>>
> >>> {'sno':'1','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड',
> >>> 'vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
> >>>
> >>> When you change the survey/gat number to 1/10, the params change
> >>> like so:
> >>> {'sno':'1#10','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड',
> >>> 'vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
> >>>
> >>> For 1/1अ:
> >>> {'sno':'1#1अ','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड',
> >>> 'vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
> >>>
> >>> I tried some wget and curl commands but no luck so far. Do let me
> >>> know if you can make some headway.
> >>>
> >>> Also, it would be great to learn how to extract the list of
> >>> districts, talukas (subdistricts) in each district, and villages
> >>> in each taluka.
> >>>
> >>> Dumping other info at the bottom in case it helps.
> >>>
> >>> Why do this:
> >>> At present it's just an exploration following on from our work on
> >>> village shapefiles.
> >>> The district > taluka > village mapping data from official Land
> >>> Records data could serve as a good source for triangulation.
> >>> Then, while I don't see myself going deeper into this right now,
> >>> I am aware that land records / ownership have major corruption,
> >>> entanglements and other issues precisely because of the lack of
> >>> transparency.
> >>> The mahabhulekh website itself is a significant step forward in
> >>> making this sector a little more transparent, and more push in
> >>> this direction would probably do more good IMHO. At some point
> >>> GIS/lat-long info might come in, and it would be good to bring the
> >>> data to a level that is ready for it.
> >>>
> >>> Data dump:
> >>> When we press the button to fetch the 7/12 (saatbarah) record, the
> >>> console records a POST with these parameters:
> >>>
> >>> Copy as cURL:
> >>> curl 'https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712' \
> >>>   -H 'Host: mahabhulekh.maharashtra.gov.in' \
> >>>   -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:42.0) Gecko/20100101 Firefox/42.0' \
> >>>   -H 'Accept: application/json, text/plain, */*' \
> >>>   -H 'Accept-Language: en-US,en;q=0.5' \
> >>>   --compressed \
> >>>   -H 'Content-Type: application/json;charset=utf-8' \
> >>>   -H 'Referer: https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx' \
> >>>   -H 'Content-Length: 170' \
> >>>   -H 'Cookie: ASP.NET_SessionId=3ozsnwd3nhh4py4hmiqcjeoc' \
> >>>   -H 'Connection: keep-alive' \
> >>>   -H 'Pragma: no-cache' \
> >>>   -H 'Cache-Control: no-cache'
> >>>
> >>> Copy POST data:
> >>> {'sno':'1#1अ','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड',
> >>> 'vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
> >>>
> >>> Request headers:
> >>> POST /Konkan/Home.aspx/call712 HTTP/1.1
> >>> Host: mahabhulekh.maharashtra.gov.in
> >>> User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:42.0) Gecko/20100101 Firefox/42.0
> >>> Accept: application/json, text/plain, */*
> >>> Accept-Language: en-US,en;q=0.5
> >>> Accept-Encoding: gzip, deflate
> >>> Content-Type: application/json;charset=utf-8
> >>> Referer: https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx
> >>> Content-Length: 170
> >>> Cookie: ASP.NET_SessionId=3ozsnwd3nhh4py4hmiqcjeoc
> >>> Connection: keep-alive
> >>> Pragma: no-cache
> >>> Cache-Control: no-cache
> >>>
> >>> Response headers:
> >>> HTTP/1.1 200 OK
> >>> Cache-Control: private, max-age=0
> >>> Content-Type: application/json; charset=utf-8
> >>> Server: Microsoft-IIS/8.0
> >>> X-Powered-By: ASP.NET
> >>> Date: Mon, 24 Oct 2016 15:31:40 GMT
> >>> Content-Length: 10
> >>>
> >>> Copy Response:
> >>> {"d":null}
> >>>
> >>> --
> >>> Cheers,
> >>> Nikhil
> >>> +91-966-583-1250
> >>> Pune, India
> >>> Self-designed learner at Swaraj University
> >>> <http://www.swarajuniversity.org>
> >>> Blog <http://nikhilsheth.blogspot.in> | Contribute
> >>> <https://www.payumoney.com/webfronts/#/index/NikhilVJ>
> >>>
> >>> --
> >>> Datameet is a community of Data Science enthusiasts in India. Know
> >>> more about us by visiting http://datameet.org
> >>> ---
> >>> You received this message because you are subscribed to the Google
> >>> Groups "datameet" group.
> >>> To unsubscribe from this group and stop receiving emails from it,
> >>> send an email to datameet+unsubscr...@googlegroups.com.
> >>> For more options, visit https://groups.google.com/d/optout.
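[Editor's aside: the captured POST above can be replicated from Python with the standard library. This is an untested sketch against the live site; `build_payload` and `fetch_712` are names invented here. The '/' to '#' substitution follows the 1/10 to '1#10' pattern in the captured params, and the {"d":null} responses suggest the server may also require session state from the preceding form steps, so fetching Home.aspx first (to obtain the ASP.NET_SessionId cookie) may or may not be enough.]

```python
# Sketch: reproduce the call712 POST with a cookie-carrying urllib opener.
import json
import urllib.request
from http.cookiejar import CookieJar

BASE = "https://mahabhulekh.maharashtra.gov.in/Konkan"

def build_payload(sno, vid, dn, tn, vn, tc, dc, did, tid):
    """Assemble the call712 JSON payload. '/' in the survey/gat number
    becomes '#', matching the captured params (1/10 -> '1#10')."""
    return {"sno": sno.replace("/", "#"), "vid": vid, "dn": dn, "tn": tn,
            "vn": vn, "tc": tc, "dc": dc, "did": did, "tid": tid}

def fetch_712(payload):
    """GET Home.aspx to pick up a session cookie, then POST the payload
    as JSON and return the decoded response."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))
    opener.open(BASE + "/Home.aspx")  # sets ASP.NET_SessionId cookie
    req = urllib.request.Request(
        BASE + "/Home.aspx/call712",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json;charset=utf-8",
                 "Referer": BASE + "/Home.aspx"})
    with opener.open(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example call (requires network; may still return {"d": None} if more
# session state is needed):
# fetch_712(build_payload("1/1अ", "273200030398260000", "रत्नागिरी",
#                         "खेड", "वाळंजवाडी", "3", "32", "32", "3"))
```

If this keeps returning null, a browser-automation route (iMacros, or Selenium driving the real form) would sidestep the session-state question entirely, at the cost of speed.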