When I opened the file with a hex editor, the problematic character turned out to be the byte “1a”.
I am attaching a sample DAT file with 3 lines (the second line is the one with the undesirable character).

The furthest I could get was through readBin:

> tmp<- readBin("new.dat", what = "raw", n=100000000)
  [1] 30 32 3a 33 35 3a 33 32 2c 20 34 34 30 33 2c 20 33 37 2e 31 31 34 2c 2d 32 30 2e 38 33 36 2c 31
 [33] 35 35 2e 39 2c 30 30 2e 37 36 2c 31 31 35 36 0d 0a 30 32 3a 33 35 3a 33 35 2c 20 34 34 33 32 2c
 [65] 20 33 37 2e 31 31 34 2c 2d 32 30 2e 38 33 36 2c 31 35 35 2e 38 2c 1a 30 2e 38 31 2c 31 31 35 37
 [97] 0d 0a 30 32 3a 33 35 3a 33 39 2c 20 34 34 36 37 2c 20 33 37 2e 31 31 34 2c 2d 32 30 2e 38 33 36
[129] 2c 31 35 35 2e 38 2c 30 30 2e 38 31 2c 31 31 35 38
> tmp[87]
[1] 1a

The idea now is, as Jim suggested, to replace “1a” by (for example) “20” in the raw format and write the file back with

writeBin(tmp, "new2.dat")

Can I use gsub? How can I perform this operation without messing around with the raw format?

Thanks

J

On Thu, Mar 4, 2010 at 8:35 PM, jim holtman <jholt...@gmail.com> wrote:

> Have you considered reading the file in a binary/raw, finding the
> offending character and replacing it with a blank (or whatever and
> then writing the file back out).  You can then probably process it
> using read.table.;
>
> On Thu, Mar 4, 2010 at 12:50 PM, jonas garcia
> <garcia.jona...@googlemail.com> wrote:
> > Thank you so much for your reply.
> >
> > I can identify the characters very easily in a couple of files. The reason I
> > am worried is that I have thousands of files to read in. The files were
> > produced in a very old MS-DOS software that records information on
> > oceanographic data and geographic position during a survey.
> >
> > My main goal is read all these files into R for further analysis. Most of
> > the files are cleared of these EOL markers but some are not. I only noticed
> > the problem by chance when I was looking and comparing one of them.
> > I wonder if I can solve this problem using R, without having to go for
> > text editors separately.
> >
> > Help on this would be much appreciated.
> >
> > Thanks again
> >
> > J
> >
> > On 3/4/10, David Winsemius <dwinsem...@comcast.net> wrote:
> >>
> >> On Mar 3, 2010, at 2:22 PM, jonas garcia wrote:
> >>
> >>> Dear R users,
> >>>
> >>> I am trying to read a huge file in R. For some reason, only a part of the
> >>> file is read. When I further investigated, I found that in one of my
> >>> non-numeric columns, there is one odd character responsible for this, which
> >>> I reproduce bellow:
> >>> In case you cannot see it, it looks like a right arrow, but it is not the
> >>> one you get from microsoft word in menu "insert symbol".
> >>>
> >>> I think my dat file is broken and that funny character is an EOL marker that
> >>> makes R not read the rest of the file. I am sure the character is there by
> >>> chance but I fear that it might be present in some other big files I have to
> >>> work with as well. So, is there any clever way to remove this inconvenient
> >>> character in R avoiding having to edit the file in notepad and remove it
> >>> manually?
> >>>
> >>> Code I am using:
> >>>
> >>> read.csv("new3.dat", header=F)
> >>>
> >>> Warning message:
> >>> In read.table(file = file, header = header, sep = sep, quote = quote,  :
> >>>  incomplete final line found by readTableHeader on 'new3.dat'
> >>>
> >>
> >> I think you should identify the offending line by using the count.fields
> >> function and fix it with an editor.
> >>
> >> --
> >> David
> >>
> >>>
> >>> I am working with R 2.10.1 in windows XP.
> >>>
> >>> Thanks in advance
> >>>
> >>> Jonas
> >>>
> >>>        [[alternative HTML version deleted]]
> >>>
> >>> ______________________________________________
> >>> R-help@r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>
> >> David Winsemius, MD
> >> Heritage Laboratories
> >> West Hartford, CT
> >
> >        [[alternative HTML version deleted]]
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
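
For illustration, here is a minimal sketch of the raw-byte replacement discussed in this thread. gsub works on character strings rather than raw vectors, so the sketch uses plain logical comparison and indexing on the raw vector instead. The helper name replace_byte, its arguments, and the temporary stand-in file are assumptions made for the example and are not from the thread; the stand-in reproduces only the second line of the sample file, rebuilt from the hex dump above. Reading probably stops at this byte because 0x1a (Ctrl-Z) is treated as an end-of-file marker in text mode on Windows.

```r
# Hypothetical helper (not from the thread): replace every occurrence of one
# byte in a file with another byte and write the result to a new file.
replace_byte <- function(infile, outfile,
                         bad = as.raw(0x1a), good = as.raw(0x20)) {
  # file.info()$size gives the exact byte count, so no oversized n is needed
  bytes <- readBin(infile, what = "raw", n = file.info(infile)$size)
  hits <- which(bytes == bad)   # positions of the offending byte
  bytes[hits] <- good           # ordinary vector assignment on the raw vector
  writeBin(bytes, outfile)
  invisible(hits)               # return the positions for inspection
}

# Stand-in for the second line of the sample file (assumed test data)
sample_line <- c(charToRaw("02:35:35, 4432, 37.114,-20.836,155.8,"),
                 as.raw(0x1a),
                 charToRaw("0.81,1157\r\n"))
infile  <- tempfile(fileext = ".dat")
outfile <- tempfile(fileext = ".dat")
writeBin(sample_line, infile)

hits <- replace_byte(infile, outfile)      # byte positions that were replaced
cleaned <- read.csv(outfile, header = FALSE)
print(hits)
print(cleaned)
```

For many files, the same helper could be called in a loop over list.files(pattern = "\\.dat$"), writing each cleaned copy under a new name. In the sample data the stray byte sits where the neighbouring rows have a "0" (00.76, 00.81), so good = as.raw(0x30) may be a better replacement than a blank, but that is a guess from three lines only.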