On Mar 4, 2010, at 9:47 PM, jonas garcia wrote:

When I opened the file with a hex editor, the problematic character turned out to be “1a”. I am attaching a sample DAT file with 3 lines (the second line is the one with the undesirable character).

The furthest I could get was through readBin:

> tmp<- readBin("new.dat", what = "raw", n=100000000)
  [1] 30 32 3a 33 35 3a 33 32 2c 20 34 34 30 33 2c 20 33 37 2e 31 31 34 2c 2d 32 30 2e 38 33 36 2c 31
 [33] 35 35 2e 39 2c 30 30 2e 37 36 2c 31 31 35 36 0d 0a 30 32 3a 33 35 3a 33 35 2c 20 34 34 33 32 2c
 [65] 20 33 37 2e 31 31 34 2c 2d 32 30 2e 38 33 36 2c 31 35 35 2e 38 2c 1a 30 2e 38 31 2c 31 31 35 37
 [97] 0d 0a 30 32 3a 33 35 3a 33 39 2c 20 34 34 36 37 2c 20 33 37 2e 31 31 34 2c 2d 32 30 2e 38 33 36
[129] 2c 31 35 35 2e 38 2c 30 30 2e 38 31 2c 31 31 35 38


> tmp[87]
[1] 1a
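
For readers following along, the lookup above can also be done programmatically rather than by eye. The sketch below rebuilds the three sample lines from this thread in memory (assuming CR LF line endings, as the hex dump shows) and uses which() to locate every 0x1a byte. The names sample_lines, bytes and sub_pos are chosen here for illustration only.

```r
# The three sample lines from the thread; "\032" is the octal escape for byte 0x1a.
sample_lines <- c(
  "02:35:32, 4403, 37.114,-20.836,155.9,00.76,1156",
  "02:35:35, 4432, 37.114,-20.836,155.8,\0320.81,1157",
  "02:35:39, 4467, 37.114,-20.836,155.8,00.81,1158"
)

# Join with CR LF and convert to a raw vector, mimicking what readBin() returns.
bytes <- charToRaw(paste(sample_lines, collapse = "\r\n"))

# Positions of every 0x1a (SUB / Ctrl-Z) byte.
sub_pos <- which(bytes == as.raw(0x1a))

length(bytes)   # 145
sub_pos         # 87
```

On a real file, bytes would come from readBin() as in the call above.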

I got a different "interpretation" of that character when I let R look at it (\032 is the octal escape for the same byte, hex 1a), and I cannot figure out why \032 should be causing problems:

> tmporg <- readLines(con="/Users/davidwinsemius/Library/Mail Downloads/new.dat")
Warning message:
In readLines(con = "/Users/davidwinsemius/Library/Mail Downloads/new.dat") :
  incomplete final line found on '/Users/davidwinsemius/Library/Mail Downloads/new.dat'

> tmporg
[1] "02:35:32, 4403, 37.114,-20.836,155.9,00.76,1156"
[2] "02:35:35, 4432, 37.114,-20.836,155.8,\0320.81,1157"
[3] "02:35:39, 4467, 37.114,-20.836,155.8,00.81,1158"
> gsub("\\\032", ' ', tmporg)
[1] "02:35:32, 4403, 37.114,-20.836,155.9,00.76,1156" "02:35:35, 4432, 37.114,-20.836,155.8, 0.81,1157"
[3] "02:35:39, 4467, 37.114,-20.836,155.8,00.81,1158"


> read.table(textConnection(gsub("\\\032", ' ', tmporg) ) ,sep=",")
        V1   V2     V3      V4    V5   V6   V7
1 02:35:32 4403 37.114 -20.836 155.9 0.76 1156
2 02:35:35 4432 37.114 -20.836 155.8 0.81 1157
3 02:35:39 4467 37.114 -20.836 155.8 0.81 1158

Looks like gsub might work well .... as long as you can get agreement on what the character really is.
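
Since the original poster mentions thousands of files, here is a rough sketch of how the gsub idea might be wrapped in a helper and applied to a whole directory. Several things are assumptions rather than facts from this thread: the helper name read_dat_clean, the demo directory, the assumption that the files are plain ASCII, and the guess that opening the connection in binary mode ("rb") avoids the Windows text-mode treatment of 0x1a (Ctrl-Z) as end of file. The text= argument of read.table needs a reasonably recent R (2.14.0 or later).

```r
# Hypothetical helper: read one file, replace any 0x1a byte with a space, parse it.
read_dat_clean <- function(path) {
  con <- file(path, open = "rb")          # binary mode (assumed to sidestep Ctrl-Z as EOF)
  on.exit(close(con))
  lines <- readLines(con, warn = FALSE)   # readLines accepts LF, CRLF or CR line endings
  lines <- gsub("\x1a", " ", lines, fixed = TRUE)
  read.table(text = lines, sep = ",")
}

# Demo: write the thread's sample data to two temporary files.
demo_dir <- file.path(tempdir(), "sub_demo")
dir.create(demo_dir, showWarnings = FALSE)
sample_lines <- c(
  "02:35:32, 4403, 37.114,-20.836,155.9,00.76,1156",
  "02:35:35, 4432, 37.114,-20.836,155.8,\0320.81,1157",
  "02:35:39, 4467, 37.114,-20.836,155.8,00.81,1158"
)
for (f in c("a.dat", "b.dat")) {
  writeBin(charToRaw(paste(sample_lines, collapse = "\r\n")),
           file.path(demo_dir, f))
}

# Read every .dat file in the directory and stack the results.
files <- list.files(demo_dir, pattern = "\\.dat$", full.names = TRUE)
all_data <- do.call(rbind, lapply(files, read_dat_clean))
```

Whether a space or an empty string is the right replacement depends on what the stray byte overwrote in each file, so it may be worth checking a few cases by hand first.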

> sessionInfo()
R version 2.10.1 RC (2009-12-09 r50695)
x86_64-apple-darwin9.8.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] splines stats graphics grDevices utils datasets methods base

other attached packages:
[1] Design_2.3-0    Hmisc_3.7-0     survival_2.35-7

loaded via a namespace (and not attached):
[1] cluster_1.12.1  grid_2.10.1     lattice_0.17-26 tools_2.10.1

--
David.

The idea now is, as Jim suggested, to replace “1a” with (for example) “20” in the raw format and write the file back with
writeBin(tmp, "new2.dat")

Can I use gsub? How can I perform this operation without messing around with the raw format?
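
One possible way to do the byte replacement described here is sketched below. It compares the raw vector against as.raw(0x1a), overwrites the matches with as.raw(0x20) (a space), and writes the result with writeBin. The function name replace_sub_bytes and the temporary file names are hypothetical; the demo writes the thread's three sample lines to a temporary file so that it runs on its own.

```r
# Hypothetical helper: copy in_file to out_file with every 0x1a byte replaced.
replace_sub_bytes <- function(in_file, out_file, replacement = as.raw(0x20)) {
  bytes <- readBin(in_file, what = "raw", n = file.info(in_file)$size)
  bytes[bytes == as.raw(0x1a)] <- replacement   # logical indexing on the raw vector
  writeBin(bytes, out_file)
  invisible(out_file)
}

# Demo with the sample data from the thread, written to a temporary file.
sample_lines <- c(
  "02:35:32, 4403, 37.114,-20.836,155.9,00.76,1156",
  "02:35:35, 4432, 37.114,-20.836,155.8,\0320.81,1157",
  "02:35:39, 4467, 37.114,-20.836,155.8,00.81,1158"
)
in_file  <- tempfile(fileext = ".dat")
out_file <- tempfile(fileext = ".dat")
writeBin(charToRaw(paste(sample_lines, collapse = "\r\n")), in_file)

replace_sub_bytes(in_file, out_file)

# The cleaned file no longer contains 0x1a and parses as ordinary CSV.
cleaned <- readBin(out_file, what = "raw", n = file.info(out_file)$size)
dat <- read.csv(out_file, header = FALSE)
```

Working on raw bytes avoids any dependence on how a given platform reads the byte in text mode, at the cost of one extra file per input.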

Thanks
J




On Thu, Mar 4, 2010 at 8:35 PM, jim holtman <jholt...@gmail.com> wrote:
Have you considered reading the file in as binary/raw, finding the
offending character and replacing it with a blank (or whatever), and
then writing the file back out?  You can then probably process it
using read.table.

On Thu, Mar 4, 2010 at 12:50 PM, jonas garcia
<garcia.jona...@googlemail.com> wrote:
> Thank you so much for your reply.
>
>
>
> I can identify the characters very easily in a couple of files. The reason I
> am worried is that I have thousands of files to read in. The files were
> produced by very old MS-DOS software that records information on
> oceanographic data and geographic position during a survey.
>
>
>
> My main goal is to read all these files into R for further analysis. Most of
> the files are cleared of these EOL markers but some are not. I only noticed
> the problem by chance when I was looking at and comparing one of them. I wonder
> if I can solve this problem using R, without having to go for text editors
> separately.
>
>
>
> Help on this would be much appreciated.
>
> Thanks again
>
>
>
> J
>
>
> On 3/4/10, David Winsemius <dwinsem...@comcast.net> wrote:
>>
>>
>> On Mar 3, 2010, at 2:22 PM, jonas garcia wrote:
>>
>>> Dear R users,
>>>
>>> I am trying to read a huge file in R. For some reason, only a part of the
>>> file is read. When I investigated further, I found that in one of my
>>> non-numeric columns, there is one odd character responsible for this,
>>> which I reproduce below:
>>> In case you cannot see it, it looks like a right arrow, but it is not the
>>> one you get from Microsoft Word in the menu "insert symbol".
>>>
>>> I think my dat file is broken and that funny character is an EOL marker
>>> that makes R not read the rest of the file. I am sure the character is there by
>>> chance but I fear that it might be present in some other big files I have
>>> to work with as well. So, is there any clever way to remove this inconvenient
>>> character in R, avoiding having to edit the file in Notepad and remove it
>>> manually?
>>>
>>> Code I am using:
>>>
>>> read.csv("new3.dat", header=F)
>>>
>>> Warning message:
>>> In read.table(file = file, header = header, sep = sep, quote = quote, :
>>>  incomplete final line found by readTableHeader on 'new3.dat'
>>>
>>
>> I think you should identify the offending line by using the count.fields
>> function and fix it with an editor.
>>
>>
>> --
>> David
>>
>>>
>>> I am working with R 2.10.1 in windows XP.
>>>
>>> Thanks in advance
>>>
>>> Jonas
>>>
>>>        [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>> David Winsemius, MD
>> Heritage Laboratories
>> West Hartford, CT
>>
>>
>
>        [[alternative HTML version deleted]]
>
>



--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

<new.dat>

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
