Re: [R] Reading very large text files into R

Ebert,Timothy Aaron Fri, 30 Sep 2022 14:59:13 -0700

The point was more to figure out why most lines have 15 values and some give an 
error indicating that there are 16. Are there notes, or an extra comma? Some 
weather stations fail and give interesting data at, before, or after failure. 
Are the problem lines indicating machine failure? Typically code does not 
randomly enter extra data. Most answers appear to assume that the extra value 
sits at the end of the row, but no evidence indicates this is true. If the 
extra value is instead at the beginning of the row, then deleting the 16th 
value leaves every remaining value in that row one column off. I am just 
paranoid enough to suggest looking at one case to make sure all is as assumed.
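
Looking at one case could go roughly as follows. This is only a sketch: the 
file is a small made-up stand-in (3 fields per row instead of 15), and where 
the extra value sits is an assumption chosen to show the worrying case.

```r
# Hypothetical stand-in for the real file, written to a temporary path.
# One row carries an extra value, deliberately placed at the START.
path <- tempfile(fileext = ".csv")
writeLines(c("1,10.5,80",
             "2,11.0,78",
             "X,3,11.2,77",
             "4,10.9,79"), path)

n_fields <- count.fields(path, sep = ",")                  # fields on each line
expected <- as.integer(names(which.max(table(n_fields))))  # most common count
bad      <- which(n_fields != expected)                    # lines that differ

# Print each odd line next to the line just before it for comparison.
lines <- readLines(path)
for (i in bad) {
  print(lines[max(i - 1, 1)])
  print(lines[i])
}
```

Note that readLines() pulls the whole file into memory here; for a very large 
file it may be better to read only the neighbourhood of the flagged lines.
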
   Another way to address the problem is to test the data. Are there 
temperatures less than -100 C or greater than 60 C? Why would one ever get such 
a thing? Machine error, or a column misaligned so that humidity values are in 
the temperature column.
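
A range test of that kind might look like this in R. The data frame, its 
column names, and its values are all made up for illustration; only the 
-100 C and 60 C limits come from the text above.

```r
# Hypothetical data; in row 3 the two values look swapped.
wx <- data.frame(
  temp_c   = c(12.1, 13.4, 85.0, 11.8),
  humidity = c(80.0, 78.0, 12.9, 82.0)
)

# Row numbers whose temperature falls outside the plausible range.
suspect <- which(wx$temp_c < -100 | wx$temp_c > 60)

# Show the flagged rows in full so neighbouring columns can be inspected.
wx[suspect, ]
```
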

Tim 

-----Original Message-----
From: R-help <r-help-boun...@r-project.org> On Behalf Of avi.e.gr...@gmail.com
Sent: Friday, September 30, 2022 3:16 PM
Cc: r-help@r-project.org
Subject: Re: [R] Reading very large text files into R

Tim and others,

A point to consider is that the functions used to read formatted data into 
data.frame form use different algorithms. Some do a look-ahead of some size to 
determine things, and if they find a column that LOOKS LIKE all integers for, 
say, the first thousand lines, they go and read in that column as integer. If 
the first floating point value is thousands of lines further along, things may 
go wrong.
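
One way to sidestep type guessing is to state the column types up front. A 
minimal sketch using the colClasses argument of base R's read.csv(); the data 
and column names are made up. (Readers that sample a limited number of rows 
usually let you change that number as well; readr's read_csv(), for one, has 
a guess_max argument.)

```r
# Made-up data: the first non-integer value in 'temp' is on the last row.
txt <- "id,temp\n1,10\n2,12\n3,11\n4,12.5"

# Declare the types instead of letting the reader infer them.
dat <- read.csv(text = txt,
                colClasses = c(id = "integer", temp = "numeric"))
str(dat)
```
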

So a file in which line/row 16 has an extra 16th entry/column may work fine 
with an algorithm that looks ahead and concludes there are 16 columns 
throughout. Yet a file where the first sixteenth entry appears at line/row 
31,459 may well lead the algorithm to expect exactly 15 columns and then be 
surprised as noted above.
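
For base R in particular, the read.table() help page says the number of 
columns is determined from the first five lines of input, or from the length 
of col.names if that is supplied and longer. A sketch of using that, on a 
made-up file with 3 or 4 fields rather than 15 or 16:

```r
# Made-up file: the first row with a 4th field is line 7, past the first five.
path <- tempfile(fileext = ".csv")
writeLines(c(rep("1,2,3", 6), "1,2,3,4"), path)

# Scan the whole file once for the widest row, then tell read.table() up front.
width <- max(count.fields(path, sep = ","))
dat <- read.table(path, sep = ",", fill = TRUE,
                  col.names = paste0("V", seq_len(width)))
dim(dat)
```

With fill = TRUE the short rows get NA in the extra column, so nothing is 
dropped and the odd rows can be examined afterwards.
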

I have stayed out of this discussion and others have supplied pretty much what 
I would have said. I also see the data as flawed and ask which rows are the 
valid ones. If a sixteenth column is allowed, it would be better if all other 
rows had an empty sixteenth column. If not allowed, none should have it.

The approach I might take, again as others have noted, is to preprocess the 
data file using some form of stream editor such as AWK that automagically reads 
in a line at a time and parses lines into a collection of tokens based on what 
separates them such as a comma. You can then either write out just the first 15 
to the output stream if your choice is to ignore a spurious sixteenth, or write 
out all sixteen for every line, with the last being some form of null most of 
the time. And, of course, to be more general, you could make two passes through 
the file with the first one determining the maximum number of entries as well 
as what the most common number of entries is, and a second pass using that info 
to normalize the file the way you want. And note some of what was mentioned 
could often be done in this preprocessing such as removing any columns you do 
not want to read into R later. Do note such filters may need to handle edge 
cases like skipping comment lines or treating the row of headers differently.
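
The count-then-normalize idea can be sketched as below, in R rather than AWK. 
Everything here is an assumption for illustration: the file contents, the "#" 
comment convention, and the choice to truncate to the most common width. The 
split is naive (no quoted commas), and truncating is only right if the 
spurious value really is the last one. The two passes are over lines already 
held in memory, not two reads of the file.

```r
# Made-up input: a comment line, a header, and one row with a spurious field.
infile  <- tempfile(fileext = ".csv")
outfile <- tempfile(fileext = ".csv")
writeLines(c("# station log", "a,b,c", "1,2,3", "4,5,6,7", "8,9,10"), infile)

lines  <- readLines(infile)
lines  <- lines[!grepl("^#", lines)]           # skip comment lines
tokens <- strsplit(lines, ",", fixed = TRUE)   # naive split on commas

# Pass 1: tally the number of entries per line.
counts <- lengths(tokens)
common <- as.integer(names(which.max(table(counts))))  # most common width

# Pass 2: truncate or pad every row to that width.
fix_row <- function(x, n) {
  length(x) <- n          # drops extras, pads short rows with NA
  x[is.na(x)] <- ""       # write padding as empty fields
  paste(x, collapse = ",")
}
writeLines(vapply(tokens, fix_row, character(1), n = common), outfile)
readLines(outfile)
```
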

As some have shown, you can create your own filters within a language like R 
too and either read in lines and pre-process them as discussed or continue on 
to making your own data.frame and skip the read.table() type of functionality. 
For very large files, though, having multiple variations in memory at once may 
be an issue, especially if they are not removed and further processing and 
analysis continues.
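
One way to limit memory use is to filter the file in chunks through 
connections, so only a few lines are held at a time. A rough sketch, with a 
made-up file and made-up values for the number of columns to keep and the 
chunk size; it assumes the unwanted values are at the end of the row and that 
no field contains a quoted comma.

```r
infile  <- tempfile(fileext = ".csv")
outfile <- tempfile(fileext = ".csv")
writeLines(c("1,2,3", "4,5,6,7", "8,9,10", "11,12,13"), infile)

keep  <- 3   # assumed number of wanted columns (15 in this thread)
chunk <- 2   # lines per read; tiny here, far larger in practice

con_in  <- file(infile, "r")
con_out <- file(outfile, "w")
repeat {
  lines <- readLines(con_in, n = chunk)
  if (length(lines) == 0) break                  # end of file
  tokens  <- strsplit(lines, ",", fixed = TRUE)
  trimmed <- vapply(tokens,
                    function(x) paste(head(x, keep), collapse = ","),
                    character(1))
  writeLines(trimmed, con_out)                   # append cleaned chunk
}
close(con_in)
close(con_out)

dat <- read.csv(outfile, header = FALSE)
dim(dat)
```

If intermediate objects do pile up in a longer session, rm() followed by gc() 
releases them before further analysis.
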

Perhaps it would be sensible to contact those maintaining the data, point out 
the anomaly, and ask whether their files could instead be saved in a format 
that can be used without anomalies.

Avi

-----Original Message-----
From: R-help <r-help-boun...@r-project.org> On Behalf Of Ebert,Timothy Aaron
Sent: Friday, September 30, 2022 7:27 AM
To: Richard O'Keefe <rao...@gmail.com>; Nick Wray <nickmw...@gmail.com>
Cc: r-help@r-project.org
Subject: Re: [R] Reading very large text files into R

Hi Nick,
   Can you post one line of data with 15 entries followed by the next line of 
data with 16 entries?

Tim

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.r-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.