Hello,

First of all, I would like to congratulate the team behind data.gov.in for
their effort in creating this website. I hope it will create a bloom of
data-driven research in governance. It will also serve well as a starting
point for Computer Science or Statistics graduate students looking for
project material.

Since the site disseminates data that is certainly valuable, I would like
to offer my inputs after walking through the website and attempting to
process some data-sets. I should add that my point of view is entirely that
of an engineer and software integrator.

1) Clear explanation of fields (Issue of lexicon)
In several files, the data contains abbreviations which have not been fully
explained. MGNREGA is quite a popular scheme, but it stands as an exception.
Compressing terminology is easy, even essential, among people you work with
day to day. But an outsider who plans to use this data will need to visit
the concerned department or research on the Internet to understand what the
data means. She might as well go and collect the data from the concerned
department in the first place.

2) Question of relations
Several data-sets are related to each other. To illustrate, "Summary Of
Railway Statistics From 2002-03 To 2010-11" is a kind of aggregate parent of
"Number Of Persons Killed And Injured In Railway Related Accidents From
2002-03 To 2010-11". But this relation has to be figured out by the consumer
of the website herself. The site does not help establish that relation by
default. You could certainly filter data-sets by ministry, but data-sets
from the same ministry are not necessarily related data.
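
One way the relation could be published alongside the data is sketched
below in R. This is only an illustration: the `relations` table and the
`related_to` helper are hypothetical names of mine, and only the two
data-set titles come from the site.

```r
# Hypothetical catalogue of parent/child links between data-sets.
# Only the two titles are taken from the site; the structure is assumed.
relations <- data.frame(
  parent = "Summary Of Railway Statistics From 2002-03 To 2010-11",
  child = paste("Number Of Persons Killed And Injured In Railway",
                "Related Accidents From 2002-03 To 2010-11"),
  stringsAsFactors = FALSE
)

# Return every data-set recorded as a child of the given title.
related_to <- function(title, links = relations) {
  links$child[links$parent == title]
}

print(related_to("Summary Of Railway Statistics From 2002-03 To 2010-11"))
```

With such a table published per ministry, a consumer could look up related
data-sets instead of guessing from titles.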

3) Question of data dimensions
Excel is a great data tool. In fact, I would go further and say that for
several people it may be their first introduction to programming. Sadly,
Excel does not help one think through data dimensions. That is something
the user has to do herself. To illustrate with a (fictitious) example:

                             Companies registered
Delhi, 2011                  40
Arunachal Pradesh, 2011      10
Arunachal Pradesh, 2012      20

It is clear that the third dimension, the year, has been compressed into the
row label alongside the state. The data consumer then has to spend extra
effort cleaning the data. It would be helpful if the team ran some
preliminary checks on the data for such logical follies.
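
To show the kind of cleaning this forces on the consumer, here is a minimal
sketch in R using the fictitious rows from the example above. The column
names (`label`, `state`, `year`, `companies_registered`) are my own
assumptions.

```r
# Fictitious rows from the example: state and year share one label.
raw <- data.frame(
  label = c("Delhi, 2011", "Arunachal Pradesh, 2011",
            "Arunachal Pradesh, 2012"),
  companies_registered = c(40, 10, 20),
  stringsAsFactors = FALSE
)

# Split each label at its last comma into a state part and a year part.
tidy <- data.frame(
  state = sub("\\s*,[^,]*$", "", raw$label),
  year = as.integer(sub("^.*,\\s*", "", raw$label)),
  companies_registered = raw$companies_registered,
  stringsAsFactors = FALSE
)

# With year as its own column, the data pivots into a state x year grid.
# Combinations with no row (Delhi, 2012) come out as NA, not as zero.
wide <- tapply(tidy$companies_registered,
               list(state = tidy$state, year = tidy$year), sum)
print(wide)
```

If state and year were published as separate columns in the first place,
none of the string splitting would be needed.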

4) Non availability of data
Not applicable is different from not available, which is different from
zero, which in turn is different from empty. In several cases the data
points have been marked as NA. What does NA mean in this context? We could
assume several things, namely:
      - Data is not available
      - Data is not applicable
      - Data is zero
      - It is empty data
These differ from each other in sometimes subtle and sometimes not so
subtle ways. I think the data should label these four cases clearly.
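
A minimal sketch in R of how the four cases could be kept apart, since R
itself has only a single NA: carry an explicit status next to each value.
The status names and sample values are my assumptions, not an existing
data.gov.in convention.

```r
# Hypothetical status codes; the names are assumed for illustration.
status_levels <- c("reported", "not_available", "not_applicable", "empty")

# Made-up sample cells: a real zero is a reported value, while the three
# kinds of missing data all carry NA but differ in their status.
cells <- data.frame(
  value = c(12, 0, NA, NA, NA),
  status = factor(c("reported", "reported", "not_available",
                    "not_applicable", "empty"),
                  levels = status_levels)
)

# Only reported values, including the true zero, enter the arithmetic.
reported_total <- sum(cells$value[cells$status == "reported"])
print(reported_total)

# The missing cells stay distinguishable instead of collapsing into "NA".
print(table(cells$status))
```

A second column is only one option; the point is that the distinction has
to be stored somewhere in the published file.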

5) Certificate issue (important)
There are several file format options on each data-set. Except for Excel,
no other format is usable. I would go further and say that people in the
data.gov.in team have not tested the other formats at all. The certificate
for https is not valid. The NIC root certificate is not recognized by any
browser. However hard anyone insists, I will not install a certificate just
because people at NIC have not bothered to get their certificate included
in all the browsers. In several cases it is not even possible to install a
certificate, for example when someone visits the site on a tablet.
Additionally, browsers are not the only http clients. I use R, and I could
import data into R directly if a valid certificate existed. As an example,
here is how my R session went with data.gov.in data:


R version 2.15.3 (2013-03-01) -- "Security Blanket"
Copyright (C) 2013 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: i686-pc-linux-gnu (32-bit)
> require("RCurl")
Loading required package: RCurl
Loading required package: bitops
> read.table(textConnection(getURL("https://datacms.nic.in/datatool/?url=http://www.data.gov.in//sites/default/files/DETAILS_OF_GROSS_TRAFFIC_EARNINGS_1.XLS&format=jsonp")))
Error in function (type, msg, asError = TRUE)  :
  SSL certificate problem, verify that the CA cert is OK. Details:
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
>


I assume R, along with SAS and SPSS, will be among the primary tools used
to process this data. I think comprehensive testing should be done to
ensure that the data can be fetched through these tools without problems.


I am sure the team at data.gov.in will address at least some of these
issues immediately. I wish them the best in their endeavour.

-- 
Supreet Sethi
Ph IN: +919811143517
Ph Skype: d_j_i_n_n
Profile: http://www.google.com/profiles/supreet.sethi
Twt: http://twitter.com/djinn
