Brian,

Thanks. My response to David follows. I should add that this problem has
never occurred previously as far as I know (I have now checked the
previous report I was sent):

Hello David,

Thanks for your e-mail. The data came from an automated report, run every
three months against a statewide database and saved in EXCEL format, so
the usual vagaries of human data entry were not the cause. I would not
have noticed the problem at all if I hadn't been double-checking the
number of people per district. Visual inspection didn't reveal it: no
white space was obvious and the spelling looked identical. Tabulating in
R alone wouldn't have detected it either; I only found it by comparing
the R output with totals obtained in EXCEL via a filter. I'm hoping I can
skip this step in future with Jim's suggestion.
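
A minimal sketch of Jim's suggestion, assuming the stray characters are
ordinary leading spaces. The rows below are made-up stand-ins for the
report (the real file is not reproduced here), and read_lines_csv is a
hypothetical helper used only so the example runs without a file on disk.

```r
# Made-up rows imitating the report; two "Open" values carry leading spaces.
csv_lines <- c("Date,Status,AMHS",
               "10/02/2010, Open,WM",
               "22/08/2007,  Open,WM",
               "01/03/2009,Open,WM",
               "15/06/2008,Closed,WM")

# Hypothetical helper: read CSV text from a character vector of lines.
read_lines_csv <- function(lines, ...) {
  con <- textConnection(lines)
  on.exit(close(con))
  read.csv(con, header = TRUE, stringsAsFactors = FALSE, ...)
}

raw     <- read_lines_csv(csv_lines)                      # default: white space kept
cleaned <- read_lines_csv(csv_lines, strip.white = TRUE)  # leading/trailing space removed

# Tabulating both versions shows the padded values as separate categories.
table(raw$Status)
table(cleaned$Status)

# Row counts returned by the original style of subset.
n_raw   <- nrow(subset(raw, Status == "Open"))
n_clean <- nrow(subset(cleaned, Status == "Open"))
```

Comparing n_raw with n_clean gives a quick indication of how many rows
were being lost to padding.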

regards

Bob

> On Fri, 14 Jan 2011, David Scott wrote:
>
>> As a further note, this is a reminder that whenever you get data via
>> a spreadsheet the first thing to do is examine it and clean up any
>> problems. A basic requirement is to tabulate any categorical
>> variable. Spreadsheets allow any sort of data to be entered, with no
>> controls. My experience is that those who enter data into
>> spreadsheets enter all sorts of variations of what a human would
>> wish to treat as the same ("Open", "Open ", "open", etc.), even when
>> told not to.
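
One way to sketch the tabulation check described above: tabulate the raw
values, then group them by a normalised form (trimmed and lower-cased) and
list any group that contains more than one raw spelling. The status vector
is invented for illustration, and the normalisation rule is an assumption
about which variants should count as the same value.

```r
# Invented example values, including the variants mentioned above.
status <- c("Open", "Open ", "open", " Open", "Closed", "Open")

# Step 1: tabulate the raw values; useNA = "ifany" also reports missing values.
table(status, useNA = "ifany")

# Step 2: normalise by trimming leading/trailing white space and lower-casing.
normalised <- tolower(gsub("^[[:space:]]+|[[:space:]]+$", "", status))

# Step 3: collect the distinct raw spellings behind each normalised value
# and keep only those with more than one spelling.
variants <- lapply(split(status, normalised), unique)
suspect  <- variants[sapply(variants, length) > 1]
suspect
```

An empty suspect list suggests the column has no variants of this kind;
it does not rule out characters that the [[:space:]] class fails to match.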
>
> Another common problem is that they enter characters such as
> non-breaking space or zero-width characters: we added support for
> known encodings of NBSP to strip.white about five years ago.
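
For values that have already been read into R, a rough check for such
characters can be sketched by inspecting Unicode code points. The list of
code points below is an assumption (no-break space, three zero-width
characters and the byte-order mark), not an exhaustive set, and the status
vector is invented for illustration.

```r
# Invented values: one with a leading no-break space, one with a trailing
# zero-width space.
status <- c("Open", "\u00a0Open", "Open\u200b", "Closed")

# Code points treated as invisible here (an assumed, non-exhaustive list):
# U+00A0 NBSP, U+200B/U+200C/U+200D zero-width characters, U+FEFF BOM.
suspect_cp <- c(0x00A0, 0x200B, 0x200C, 0x200D, 0xFEFF)

# Convert each string to its vector of code points.
code_points <- lapply(enc2utf8(status), utf8ToInt)

# Flag the strings that contain any of the suspect code points.
has_invisible <- sapply(code_points, function(cp) any(cp %in% suspect_cp))
status[has_invisible]

# Rebuild each string without the suspect code points.
cleaned <- sapply(code_points,
                  function(cp) intToUtf8(cp[!(cp %in% suspect_cp)]))
table(cleaned)
```

Comparing nchar(status) with nchar(cleaned) shows which values changed.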
>
>>
>> David Scott
>>
>> On 14/01/2011 4:03 p.m., Jim Holtman wrote:
>>> try strip.white=TRUE to strip out white space
>>>
>>> Sent from my iPad
>>>
>>> On Jan 13, 2011, at 21:44, bgr...@dyson.brisnet.org.au wrote:
>>>
>>>>
>>>> I have a frustrating issue which I am hoping someone may have a
>>>> suggestion about.
>>>>
>>>> I am running XP and R 2.12.0 and saved an EXCEL file that I was sent
>>>> as a csv file.
>>>>
>>>> The initial code I ran follows.
>>>>
>>>> # Read the CSV exported from EXCEL, keep the open cases, count by AMHS
>>>> dec <- read.csv("g://FMH/FO30122010.csv", header = TRUE)
>>>> dec.open <- subset(dec, Status == "Open")
>>>> table(dec.open$AMHS)
>>>>
>>>> I was checking the output and noticed a difference between my manual
>>>> count and the R output. Two subjects' rows were not being detected by
>>>> the subset command.
>>>>
>>>> For the AMHS where there was a discrepancy I then ran:
>>>> wm <- subset(dec, AMHS == "WM")
>>>>
>>>> The problem appears to be that there is a space before the "Open"
>>>> value for two individuals, as per the example below.
>>>>
>>>> 10/02/2010  Open
>>>> 22/08/2007   Open
>>>>
>>>> Checking in EXCEL, there does not appear to be a space and the cell
>>>> format is the same (e.g. 'General'). I resolved the problem by copying
>>>> over the values for the two individuals where I identified a problem.
>>>>
>>>> Given this problem was not detected by visual scanning, I would
>>>> appreciate advice on how it can be detected in future without my
>>>> having to manually check raw data against R output.
>>>>
>>>> Any assistance is appreciated,
>>>>
>>>> Bob
>>>>
>>>> ______________________________________________
>>>> R-help@r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>> --
>> _________________________________________________________________
>> David Scott  Department of Statistics
>>              The University of Auckland, PB 92019
>>              Auckland 1142,    NEW ZEALAND
>> Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055
>> Email:       d.sc...@auckland.ac.nz,  Fax: +64 9 373 7018
>>
>> Director of Consulting, Department of Statistics
>>
>
> --
> Brian D. Ripley,                  rip...@stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>
