Re: [R] Reading very large text files into R

2022-09-30 Thread Ebert,Timothy Aaron
Truth

Tim

-----Original Message-----
From: R-help  On Behalf Of Avi Gross
Sent: Friday, September 30, 2022 7:01 PM
Cc: R help Mailing list 
Subject: Re: [R] Reading very large text files into R

[External Email]

Those are valid reasons, as examining data and cleaning or fixing it is a major 
thing to do before making an analysis or plots. Indeed, an extra column caused 
by something in an earlier column may have messed up all columns to the right.

My point was that replicating a problem like this may require many more lines 
from the file.

On Fri, Sep 30, 2022, 5:58 PM Ebert,Timothy Aaron  wrote:

> The point was more to figure out why most lines have 15 values and 
> some give an error indicating that there are 16. Are there notes, or 
> an extra comma? Some weather stations fail and give interesting data 
> at, before, or after failure. Are the problem lines indicating machine 
> failure? Typically code does not randomly enter extra data. Most 
> answers appear to assume that the 16th column has been entered at the 
> end of the data, but no evidence indicates this is true. If there is 
> an initial value at the beginning of the row, then all of the data for that 
> row will be in error if the "16"
> value is deleted. I am just paranoid enough to suggest looking at one 
> case to make sure all is as assumed.
> Another way to address the problem is to test the data. Are there 
> temperatures less than -100 C or greater than 60 C? Why would one ever 
> get such a thing? Machine error, or a column misaligned so that 
> humidity values are in the temperature column.
>
> Tim
>
> -----Original Message-----
> From: R-help  On Behalf Of 
> avi.e.gr...@gmail.com
> Sent: Friday, September 30, 2022 3:16 PM
> Cc: r-help@r-project.org
> Subject: Re: [R] Reading very large text files into R
>
> [External Email]
>
> Tim and others,
>
> A point to consider is that the functions used to read formatted data 
> into data.frame form rely on a variety of algorithms. Some do a 
> look-ahead of some size to determine things, and if they find a column 
> that LOOKS LIKE all integers for, say, the first thousand lines, they 
> go and read in that column as integer. If the first floating-point 
> value is thousands of lines further along, things may go wrong.
>
> So asking for line/row 16 to have an extra 16th entry/column may work 
> fine for an algorithm that looks ahead and concludes there are 16 
> columns throughout. Yet a file where the first time a sixteenth entry 
> is seen is at line/row 31,459 may well just set the algorithm to 
> expect exactly 15 columns and then be surprised as noted above.
>
> I have stayed out of this discussion and others have supplied pretty 
> much what I would have said. I also see the data as flawed and ask 
> which rows are the valid ones. If a sixteenth column is allowed, it 
> would be better if all other rows had an empty sixteenth column. If 
> not allowed, none should have it.
>
> The approach I might take, again as others have noted, is to 
> preprocess the data file using some form of stream editor such as AWK 
> that automagically reads in a line at a time and parses lines into a 
> collection of tokens based on what separates them such as a comma. You 
> can then either write out just the first 15 to the output stream if 
> your choice is to ignore a spurious sixteenth, or write out all 
> sixteen for every line, with the last being some form of null most of 
> the time. And, of course, to be more general, you could make two 
> passes through the file with the first one determining the maximum 
> number of entries as well as what the most common number of entries 
> is, and a second pass using that info to normalize the file the way 
> you want. And note some of what was mentioned could often be done in 
> this preprocessing such as removing any columns you do not want to 
> read into R later. Do note such filters may need to handle edge cases like 
> skipping comment lines or treating the row of headers differently.
>
> As some have shown, you can create your own filters within a language 
> like R too and either read in lines and pre-process them as discussed 
> or continue on to making your own data.frame and skip the read.table() 
> type of functionality. For very large files, though, having multiple 
> variations in memory at once may be an issue, especially if they are 
> not removed and further processing and analysis continues.
>
> Perhaps it might be sensible to contact those maintaining the data and 
> point out the anomaly and ask if their files might be saved 
> alternately in a format that can be used without anomalies.
>
> Avi
>
> -----Original Message-----
> From: R-help  On Behalf Of Ebert,Tim

Re: [R] Reading very large text files into R

2022-09-30 Thread Avi Gross
Those are valid reasons, as examining data and cleaning or fixing it is a
major thing to do before making an analysis or plots. Indeed, an extra
column caused by something in an earlier column may have messed up all
columns to the right.

My point was that replicating a problem like this may require many more
lines from the file.

On Fri, Sep 30, 2022, 5:58 PM Ebert,Timothy Aaron  wrote:

> The point was more to figure out why most lines have 15 values and some
> give an error indicating that there are 16. Are there notes, or an extra
> comma? Some weather stations fail and give interesting data at, before, or
> after failure. Are the problem lines indicating machine failure? Typically
> code does not randomly enter extra data. Most answers appear to assume that
> the 16th column has been entered at the end of the data, but no evidence
> indicates this is true. If there is an initial value at the beginning of
> the row, then all of the data for that row will be in error if the "16"
> value is deleted. I am just paranoid enough to suggest looking at one case
> to make sure all is as assumed.
> Another way to address the problem is to test the data. Are there
> temperatures less than -100 C or greater than 60 C? Why would one ever get
> such a thing? Machine error, or a column misaligned so that humidity values
> are in the temperature column.
>
> Tim
>
> -----Original Message-----
> From: R-help  On Behalf Of
> avi.e.gr...@gmail.com
> Sent: Friday, September 30, 2022 3:16 PM
> Cc: r-help@r-project.org
> Subject: Re: [R] Reading very large text files into R
>
> [External Email]
>
> Tim and others,
>
> A point to consider is that the functions used to read formatted data into
> data.frame form rely on a variety of algorithms. Some do a look-ahead of
> some size to determine things, and if they find a column that LOOKS LIKE
> all integers for, say, the first thousand lines, they go and read in that
> column as integer. If the first floating-point value is thousands of lines
> further along, things may go wrong.
>
> So asking for line/row 16 to have an extra 16th entry/column may work fine
> for an algorithm that looks ahead and concludes there are 16 columns
> throughout. Yet a file where the first time a sixteenth entry is seen is at
> line/row 31,459 may well just set the algorithm to expect exactly 15
> columns and then be surprised as noted above.
>
> I have stayed out of this discussion and others have supplied pretty much
> what I would have said. I also see the data as flawed and ask which rows
> are the valid ones. If a sixteenth column is allowed, it would be better if
> all other rows had an empty sixteenth column. If not allowed, none should
> have it.
>
> The approach I might take, again as others have noted, is to preprocess
> the data file using some form of stream editor such as AWK that
> automagically reads in a line at a time and parses lines into a collection
> of tokens based on what separates them such as a comma. You can then either
> write out just the first 15 to the output stream if your choice is to
> ignore a spurious sixteenth, or write out all sixteen for every line, with
> the last being some form of null most of the time. And, of course, to be
> more general, you could make two passes through the file with the first one
> determining the maximum number of entries as well as what the most common
> number of entries is, and a second pass using that info to normalize the
> file the way you want. And note some of what was mentioned could often be
> done in this preprocessing such as removing any columns you do not want to
> read into R later. Do note such filters may need to handle edge cases like
> skipping comment lines or treating the row of headers differently.
>
> As some have shown, you can create your own filters within a language like
> R too and either read in lines and pre-process them as discussed or
> continue on to making your own data.frame and skip the read.table() type of
> functionality. For very large files, though, having multiple variations in
> memory at once may be an issue, especially if they are not removed and
> further processing and analysis continues.
>
> Perhaps it might be sensible to contact those maintaining the data and
> point out the anomaly and ask if their files might be saved alternately in
> a format that can be used without anomalies.
>
> Avi
>
> -----Original Message-----
> From: R-help  On Behalf Of Ebert,Timothy
> Aaron
> Sent: Friday, September 30, 2022 7:27 AM
> To: Richard O'Keefe ; Nick Wray 
> Cc: r-help@r-project.org
> Subject: Re: [R] Reading very large text files into R
>
> Hi Nick,
> Can you post one line of data with 15 entries followed by the next line of data with 16 entries?

Re: [R] Reading very large text files into R

2022-09-30 Thread Ebert,Timothy Aaron
The point was more to figure out why most lines have 15 values and some give an 
error indicating that there are 16. Are there notes, or an extra comma? Some 
weather stations fail and give interesting data at, before, or after failure. 
Are the problem lines indicating machine failure? Typically code does not 
randomly enter extra data. Most answers appear to assume that the 16th column 
has been entered at the end of the data, but no evidence indicates this is 
true. If there is an initial value at the beginning of the row, then all of the 
data for that row will be in error if the "16" value is deleted. I am just 
paranoid enough to suggest looking at one case to make sure all is as assumed.
   Another way to address the problem is to test the data. Are there 
temperatures less than -100 C or greater than 60 C? Why would one ever get such 
a thing? Machine error, or a column misaligned so that humidity values are in 
the temperature column. 
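
A range check of that sort might look like this in R (the toy data and the column names `temp` and `rh` are invented for illustration):

```r
# Toy data frame standing in for a weather table; `temp` in deg C, `rh` in %.
wx <- data.frame(temp = c(12.5, -412.3, 23.1), rh = c(55, 101, 48))

bad_temp <- wx$temp < -100 | wx$temp > 60   # physically implausible temperatures
bad_rh   <- wx$rh < 0 | wx$rh > 100         # humidity must lie between 0 and 100

wx[bad_temp | bad_rh, ]                     # inspect offending rows before deciding
```

Flagging rows for inspection, rather than deleting them outright, preserves the chance to spot a systematic column misalignment.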

Tim 

-----Original Message-----
From: R-help  On Behalf Of avi.e.gr...@gmail.com
Sent: Friday, September 30, 2022 3:16 PM
Cc: r-help@r-project.org
Subject: Re: [R] Reading very large text files into R

[External Email]

Tim and others,

A point to consider is that the functions used to read formatted data into 
data.frame form rely on a variety of algorithms. Some do a look-ahead of some 
size to determine things, and if they find a column that LOOKS LIKE all 
integers for, say, the first thousand lines, they go and read in that column 
as integer. If the first floating-point value is thousands of lines further 
along, things may go wrong.

So asking for line/row 16 to have an extra 16th entry/column may work fine for 
an algorithm that looks ahead and concludes there are 16 columns throughout. 
Yet a file where the first time a sixteenth entry is seen is at line/row 31,459 
may well just set the algorithm to expect exactly 15 columns and then be 
surprised as noted above.

I have stayed out of this discussion and others have supplied pretty much what 
I would have said. I also see the data as flawed and ask which rows are the 
valid ones. If a sixteenth column is allowed, it would be better if all other 
rows had an empty sixteenth column. If not allowed, none should have it.

The approach I might take, again as others have noted, is to preprocess the 
data file using some form of stream editor such as AWK that automagically reads 
in a line at a time and parses lines into a collection of tokens based on what 
separates them such as a comma. You can then either write out just the first 15 
to the output stream if your choice is to ignore a spurious sixteenth, or write 
out all sixteen for every line, with the last being some form of null most of 
the time. And, of course, to be more general, you could make two passes through 
the file with the first one determining the maximum number of entries as well 
as what the most common number of entries is, and a second pass using that info 
to normalize the file the way you want. And note some of what was mentioned 
could often be done in this preprocessing such as removing any columns you do 
not want to read into R later. Do note such filters may need to handle edge 
cases like skipping comment lines or treating the row of headers differently.

As some have shown, you can create your own filters within a language like R 
too and either read in lines and pre-process them as discussed or continue on 
to making your own data.frame and skip the read.table() type of functionality. 
For very large files, though, having multiple variations in memory at once may 
be an issue, especially if they are not removed and further processing and 
analysis continues.

Perhaps it might be sensible to contact those maintaining the data and point 
out the anomaly and ask if their files might be saved alternately in a format 
that can be used without anomalies.

Avi

-----Original Message-----
From: R-help  On Behalf Of Ebert,Timothy Aaron
Sent: Friday, September 30, 2022 7:27 AM
To: Richard O'Keefe ; Nick Wray 
Cc: r-help@r-project.org
Subject: Re: [R] Reading very large text files into R

Hi Nick,
   Can you post one line of data with 15 entries followed by the next line of 
data with 16 entries?

Tim

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide 
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Reading very large text files into R

2022-09-30 Thread Nick Wray
Hello. Thanks again for all the suggestions.  The irony is that for the
datasets I'm using, the fill=T suggested by Ivan in the first instance seems
to work fine.  They're not particularly sophisticated datasets, and although
I don't know what the extra Bs (the first of which, as Avi says, occurs
quite late on) actually mean, I don't really need to know: all I need is the
date/time/station id/rainfall accumulation, and that's obvious once I've
read the dataset in.  It has been interesting seeing the takes of people who
have a far deeper and wider understanding of R than I do, however, and an
education in itself... Nick

On Fri, 30 Sept 2022 at 20:16,  wrote:

> Tim and others,
>
> A point to consider is that the functions used to read formatted data into
> data.frame form rely on a variety of algorithms. Some do a look-ahead of
> some size to determine things, and if they find a column that LOOKS LIKE
> all integers for, say, the first thousand lines, they go and read in that
> column as integer. If the first floating-point value is thousands
> of lines further along, things may go wrong.
>
> So asking for line/row 16 to have an extra 16th entry/column may work fine
> for an algorithm that looks ahead and concludes there are 16 columns
> throughout. Yet a file where the first time a sixteenth entry is seen is at
> line/row 31,459 may well just set the algorithm to expect exactly 15
> columns
> and then be surprised as noted above.
>
> I have stayed out of this discussion and others have supplied pretty much
> what I would have said. I also see the data as flawed and ask which rows
> are
> the valid ones. If a sixteenth column is allowed, it would be better if all
> other rows had an empty sixteenth column. If not allowed, none should have
> it.
>
> The approach I might take, again as others have noted, is to preprocess the
> data file using some form of stream editor such as AWK that automagically
> reads in a line at a time and parses lines into a collection of tokens
> based
> on what separates them such as a comma. You can then either write out just
> the first 15 to the output stream if your choice is to ignore a spurious
> sixteenth, or write out all sixteen for every line, with the last being
> some
> form of null most of the time. And, of course, to be more general, you
> could
> make two passes through the file with the first one determining the maximum
> number of entries as well as what the most common number of entries is, and
> a second pass using that info to normalize the file the way you want. And
> note some of what was mentioned could often be done in this preprocessing
> such as removing any columns you do not want to read into R later. Do note
> such filters may need to handle edge cases like skipping comment lines or
> treating the row of headers differently.
>
> As some have shown, you can create your own filters within a language like
> R
> too and either read in lines and pre-process them as discussed or continue
> on to making your own data.frame and skip the read.table() type of
> functionality. For very large files, though, having multiple variations in
> memory at once may be an issue, especially if they are not removed and
> further processing and analysis continues.
>
> Perhaps it might be sensible to contact those maintaining the data and
> point
> out the anomaly and ask if their files might be saved alternately in a
> format that can be used without anomalies.
>
> Avi
>
> -----Original Message-----
> From: R-help  On Behalf Of Ebert,Timothy
> Aaron
> Sent: Friday, September 30, 2022 7:27 AM
> To: Richard O'Keefe ; Nick Wray 
> Cc: r-help@r-project.org
> Subject: Re: [R] Reading very large text files into R
>
> Hi Nick,
>Can you post one line of data with 15 entries followed by the next line
> of data with 16 entries?
>
> Tim
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Reading very large text files into R

2022-09-30 Thread avi.e.gross
Tim and others,

A point to consider is that the functions used to read formatted data into
data.frame form rely on a variety of algorithms. Some do a look-ahead of some
size to determine things, and if they find a column that LOOKS LIKE all
integers for, say, the first thousand lines, they go and read in that column
as integer. If the first floating-point value is thousands of lines further
along, things may go wrong.

So asking for line/row 16 to have an extra 16th entry/column may work fine
for an algorithm that looks ahead and concludes there are 16 columns
throughout. Yet a file where the first time a sixteenth entry is seen is at
line/row 31,459 may well just set the algorithm to expect exactly 15 columns
and then be surprised as noted above.
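
One way to take the guesswork out of that look-ahead, sketched here on toy inline text, is to declare the column types yourself via `colClasses`:

```r
# Without colClasses, read.table() guesses each column's type from the
# values it sees; declaring the classes makes the result predictable.
txt <- "a,1,2\nb,3,4"   # two-line stand-in for a much larger file
dat <- read.table(text = txt, sep = ",",
                  colClasses = c("character", "numeric", "numeric"))
```

Here columns 2 and 3 come back as numeric (double) even though every value happens to look like an integer, so a floating-point value appearing millions of lines in cannot change the column's type mid-read.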

I have stayed out of this discussion and others have supplied pretty much
what I would have said. I also see the data as flawed and ask which rows are
the valid ones. If a sixteenth column is allowed, it would be better if all
other rows had an empty sixteenth column. If not allowed, none should have
it.

The approach I might take, again as others have noted, is to preprocess the
data file using some form of stream editor such as AWK that automagically
reads in a line at a time and parses lines into a collection of tokens based
on what separates them such as a comma. You can then either write out just
the first 15 to the output stream if your choice is to ignore a spurious
sixteenth, or write out all sixteen for every line, with the last being some
form of null most of the time. And, of course, to be more general, you could
make two passes through the file with the first one determining the maximum
number of entries as well as what the most common number of entries is, and
a second pass using that info to normalize the file the way you want. And
note some of what was mentioned could often be done in this preprocessing
such as removing any columns you do not want to read into R later. Do note
such filters may need to handle edge cases like skipping comment lines or
treating the row of headers differently.
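
A minimal sketch of that two-pass idea in R, using `count.fields()` for the first pass (the inline toy text stands in for the real file):

```r
# Pass 1: tally how many fields each line has.
txt <- "1,2,3\n4,5,6\n7,8,9,B"           # mostly 3 fields, one line with 4
tc  <- textConnection(txt)
nf  <- count.fields(tc, sep = ",")
close(tc)
table(nf)                                 # distribution of field counts

# Pass 2: read with enough columns for the widest line; short rows get NA.
dat <- read.table(text = txt, sep = ",", fill = TRUE,
                  col.names = paste0("V", seq_len(max(nf))))
```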

As some have shown, you can create your own filters within a language like R
too and either read in lines and pre-process them as discussed or continue
on to making your own data.frame and skip the read.table() type of
functionality. For very large files, though, having multiple variations in
memory at once may be an issue, especially if they are not removed and
further processing and analysis continues.
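
A bare-bones version of such a hand-rolled filter (the two literal lines stand in for `readLines()` on the real file, and 3 fields stand in for the real 15):

```r
# Split each raw line on commas and force every row to exactly 3 fields:
# extras are dropped, short rows are padded with NA by out-of-range indexing.
lines  <- c("a,1,2", "b,3,4,EXTRA")      # stand-in for readLines("bigfile.txt")
fields <- strsplit(lines, ",", fixed = TRUE)
keep   <- lapply(fields, `[`, 1:3)       # indexing past the end yields NA
dat    <- as.data.frame(do.call(rbind, keep), stringsAsFactors = FALSE)
```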

Perhaps it might be sensible to contact those maintaining the data and point
out the anomaly and ask if their files might be saved alternately in a
format that can be used without anomalies.

Avi

-----Original Message-----
From: R-help  On Behalf Of Ebert,Timothy Aaron
Sent: Friday, September 30, 2022 7:27 AM
To: Richard O'Keefe ; Nick Wray 
Cc: r-help@r-project.org
Subject: Re: [R] Reading very large text files into R

Hi Nick,
   Can you post one line of data with 15 entries followed by the next line
of data with 16 entries? 

Tim

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Reading very large text files into R

2022-09-30 Thread Ebert,Timothy Aaron
Hi Nick,
   Can you post one line of data with 15 entries followed by the next line of 
data with 16 entries? 

Tim

-----Original Message-----
From: R-help  On Behalf Of Richard O'Keefe
Sent: Friday, September 30, 2022 12:08 AM
To: Nick Wray 
Cc: r-help@r-project.org
Subject: Re: [R] Reading very large text files into R

[External Email]

If I had this problem, in the old days I'd've whipped up a tiny AWK script.  
These days I might use xsv or qsv.
BUT
first I would want to know why these extra fields are present and what they 
signify.  Are they good data that happen not to be described in the 
documentation?  Do they represent a defect in the generation process?  What 
other discrepancies are there?  If the data *format* cannot be fully trusted, 
what does that say about the data *content*?  Do other data sets from the same 
source have the same issue?  Is it possible to compare this version of the data 
with an earlier version?

On Fri, 30 Sept 2022 at 02:54, Nick Wray  wrote:

> Hello   I may be offending the R purists with this question but it is
> linked to R, as will become clear.  I have very large data sets from 
> the UK Met Office in notepad form.  Unfortunately,  I can't read them 
> directly into R because, for some reason, although most lines in the 
> text doc consist of 15 elements, every so often there is a sixteenth 
> one and R doesn't like this and gives me an error message because it 
> has assumed that every line has 15 elements and doesn't like finding 
> one with more.  I have tried playing around with the text document, 
> inserting an extra element into the top line etc, but to no avail.
>
> Also unfortunately you need access permission from the Met Office to 
> get the files in question so this link probably won't work:
>
> https://catalogue.ceda.ac.uk/uuid/bbd6916225e7475514e17fdbf11141c1
>
> So what I have done is simply to copy and paste the text docs into 
> excel csv and then read them in, which is time-consuming but works.  
> However the later datasets are over the excel limit of 1048576 lines.  
> I can paste in the first 1048576 lines but then trying to isolate the 
> remainder of the text doc to paste it into a second csv doc is proving 
> v difficult - the only way I have found is to scroll down by hand and 
> that's taking ages.  I cannot find another way of editing the notepad 
> text doc to get rid of the part which I have already copied and pasted.
>
> Can anyone help with a)ideally being able to simply read the text 
> tables into R  or b)suggest a way of editing out the bits of the text 
> file I have already pasted in without laborious scrolling?
>
> Thanks Nick Wray
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide 
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Reading very large text files into R

2022-09-29 Thread Richard O'Keefe
If I had this problem, in the old days I'd've whipped up
a tiny AWK script.  These days I might use xsv or qsv.
BUT
first I would want to know why these extra fields are
present and what they signify.  Are they good data that
happen not to be described in the documentation?  Do
they represent a defect in the generation process?  What
other discrepancies are there?  If the data *format*
cannot be fully trusted, what does that say about the
data *content*?  Do other data sets from the same source
have the same issue?  Is it possible to compare this
version of the data with an earlier version?

On Fri, 30 Sept 2022 at 02:54, Nick Wray  wrote:

> Hello   I may be offending the R purists with this question but it is
> linked to R, as will become clear.  I have very large data sets from the UK
> Met Office in notepad form.  Unfortunately,  I can’t read them directly
> into R because, for some reason, although most lines in the text doc
> consist of 15 elements, every so often there is a sixteenth one and R
> doesn’t like this and gives me an error message because it has assumed that
> every line has 15 elements and doesn’t like finding one with more.  I have
> tried playing around with the text document, inserting an extra element
> into the top line etc, but to no avail.
>
> Also unfortunately you need access permission from the Met Office to get
> the files in question so this link probably won’t work:
>
> https://catalogue.ceda.ac.uk/uuid/bbd6916225e7475514e17fdbf11141c1
>
> So what I have done is simply to copy and paste the text docs into excel
> csv and then read them in, which is time-consuming but works.  However the
> later datasets are over the excel limit of 1048576 lines.  I can paste in
> the first 1048576 lines but then trying to isolate the remainder of the
> text doc to paste it into a second csv doc is proving v difficult – the
> only way I have found is to scroll down by hand and that’s taking ages.  I
> cannot find another way of editing the notepad text doc to get rid of the
> part which I have already copied and pasted.
>
> Can anyone help with a)ideally being able to simply read the text tables
> into R  or b)suggest a way of editing out the bits of the text file I have
> already pasted in without laborious scrolling?
>
> Thanks Nick Wray
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Reading very large text files into R

2022-09-29 Thread Dr Eberhard W Lisse
To me this file looks like a CSV with 15 fields (on each line) not 16,
the last field being empty with the exception of the one which has the
'B'.  The 14th is always empty.

I also note that it does not seem to have a new line at the end.


I can strongly recommend QSV to manipulate CSV files and CSVIEW to look
at them.

After renaming the file for convenience you can do something like

qsv input --trim-fields --trim-headers sample.csv \
| qsv select -n "1,2,6,7,8,9,10" \
| qsv rename "date,c2,type,c4,c5,c6,c7" \
| csview -i5 -np0

and get something like

┌──┬┬──┬───┬┬┬──┬──┐
│# │  date  │  c2  │ type  │ c4 │ c5 │c6│c7│
├──┼┼──┼───┼┼┼──┼──┤
│1 │1980-01-01 10:00│226918│WAHRAIN│5124│1001│0 │  │
│2 │1980-01-01 10:00│228562│WAHRAIN│491 │1001│0 │  │
│3 │1980-01-01 10:00│231581│WAHRAIN│5213│1001│0 │  │
│4 │1980-01-01 10:00│232671│WAHRAIN│487 │1001│0 │  │
│5 │1980-01-01 10:00│232913│WAHRAIN│5243│1001│0 │  │
│6 │1980-01-01 10:00│234362│WAHRAIN│5265│1001│0 │  │
│7 │1980-01-01 10:00│234682│WAHRAIN│5271│1001│0 │  │
│8 │1980-01-01 10:00│235389│WAHRAIN│5279│1001│0 │  │
│9 │1980-01-01 10:00│236466│WAHRAIN│497 │1001│0 │  │
│10│1980-01-01 10:00│243350│SREW   │484 │1001│0 │  │
│11│1980-01-01 10:00│243350│WAHRAIN│484 │1001│0 │0 │
└──┴┴──┴───┴┴┴──┴──┘

As the files do not have headers, you could, if you have multiple files,
even do something like

qsv cat rows s*.csv \
| qsv input --trim-fields --trim-headers \
| qsv select -n "1,2,6,7,8,9,10" \
| qsv rename "date,c2,type,c4,c5,c6,c7" \
| qsv dedup 2>/dev/null -o readmeintoR.csv


If it was REALLY a file with different numbers of fields you can use
CSVQ and do something like

cat s*csv \
| csvq --format CSV --no-header --allow-uneven-fields \
"SELECT c1 as date, c2, c6 as type, c7 as c4,
  c8 as c5, c9 as c6, c10 as c7
FROM stdin" \
| qsv input --trim-fields --trim-headers \
| qsv dedup 2>/dev/null -o readmeintoR.csv

And, finally, depending on how long the reading of the CSV takes, I
would save it into an RDS file, which loads very quickly.
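
For instance (a minimal sketch; the file names are just placeholders):

```r
## Slow step: parse the cleaned CSV once (the files have no header row)
d <- read.csv("readmeintoR.csv", header = FALSE)

## Cache it as RDS; later sessions reload it in a fraction of the time
saveRDS(d, "readmeintoR.rds")
d <- readRDS("readmeintoR.rds")
```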


greetings, el

On 2022-09-29 17:26 , Nick Wray wrote:
> Hi Bert   
> 
> Right Thing is, I didn't know that there even was an instruction like
> read.csv(text = "...  your text...  ") so at any rate I can paste the
> original text files in by hand if there's no shorter cut
> Thanks v much Nick
[...]



Re: [R] Reading very large text files into R

2022-09-29 Thread Nick Wray
Hello Ivan's suggestion of fill=T seems to do the trick.  Thanks to
everyone who piled in - I'm rather touched by the support seeing as this
was causing me a big headache with furthering my project.  I also feel
humbled by realising how little I know about the R-universe... Nick

On Thu, 29 Sept 2022 at 15:09, Ivan Krylov  wrote:

> On Thu, 29 Sep 2022 14:54:10 +0100
> Nick Wray wrote:
>
> > although most lines in the text doc consist of 15 elements, every so
> > often there is a sixteenth one and R doesn’t like this and gives me
> > an error message
>
> Does the fill = TRUE argument of read.table() help?
>
> If not, could you construct and share a small file with the same kind
> of problem (16th field) but without the data one has to apply for
> access to? (E.g. cut out a few lines from the original file, then
> replace all digits.)
>
> --
> Best regards,
> Ivan
>



Re: [R] Reading very large text files into R

2022-09-29 Thread Jeff Newmiller
"Confusion" is the size of the file. Try specifying the colClasses argument to 
nail down the number and type of the columns.
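
A sketch of that idea (the file name and the choice of 16 columns are assumptions here, not taken from the actual MIDAS layout):

```r
## Name 16 columns up front so the occasional 16-field line is accepted,
## and read everything as character so no type guessing is done;
## fill = TRUE pads the normal 15-field lines
d <- read.table("midas_rain.txt", sep = ",", header = FALSE,
                col.names = paste0("V", 1:16),
                colClasses = rep("character", 16),
                fill = TRUE)
```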

On September 29, 2022 8:16:34 AM PDT, Bert Gunter  
wrote:
>I had no trouble reading your text snippet with
>read.csv(text =
>"... your text... ")
>
>There were 15 columns. The last column was all empty except for the row
>containing the "B".
>
>So there seems to be some confusion here.
>
>-- Bert
>
>
>
>
>
>
>On Thu, Sep 29, 2022 at 6:54 AM Nick Wray  wrote:
>
>> [...]

-- 
Sent from my phone. Please excuse my brevity.



Re: [R] Reading very large text files into R

2022-09-29 Thread Nick Wray
Hi Bert   Right. Thing is, I didn't know that there even was an instruction
like read.csv(text =
"... your text... "), so at any rate I can paste the original text files in
by hand if there's no shorter cut.
Thanks v much, Nick

On Thu, 29 Sept 2022 at 16:16, Bert Gunter  wrote:

> I had no trouble reading your text snippet with
> read.csv(text =
> "... your text... ")
>
> There were 15 columns. The last column was all empty except for the row
> containing the "B".
>
> So there seems to be some confusion here.
>
> -- Bert
>
>
>
>
>
>
> On Thu, Sep 29, 2022 at 6:54 AM Nick Wray  wrote:
>
>> [...]
>



Re: [R] Reading very large text files into R

2022-09-29 Thread Bert Gunter
I had no trouble reading your text snippet with
read.csv(text =
"... your text... ")

There were 15 columns. The last column was all empty except for the row
containing the "B".

So there seems to be some confusion here.

-- Bert






On Thu, Sep 29, 2022 at 6:54 AM Nick Wray  wrote:

> [...]



Re: [R] Reading very large text files into R

2022-09-29 Thread Jan van der Laan
Are you sure the extra column is indeed an extra column? According to the 
documentation 
(https://artefacts.ceda.ac.uk/badc_datadocs/ukmo-midas/RH_Table.html) 
there should be 15 columns.


Could it, for example, be that one of the columns contains records with 
commas?
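
One quick way to check which rows really do have 16 fields (an untested sketch; the file name is hypothetical):

```r
## Count comma-separated fields on every line of the raw file
n <- count.fields("midas_rain.txt", sep = ",")
table(n)                  # distribution: how many 15- vs 16-field lines?
bad <- which(n != 15)     # positions of the odd rows
readLines("midas_rain.txt")[bad[1:5]]   # eyeball a few offenders
```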


Jan



On 29-09-2022 15:54, Nick Wray wrote:

[...]




Re: [R] Reading very large text files into R

2022-09-29 Thread Ben Tupper
Hi Nick,

It's hard to know without seeing at least a snippet of the data.
Could you do the following and paste the result into a plain text
email?  If you don't set your email client to plain text (from rich
text or html) then we are apt to see a jumble of output on our email
clients.


## start
x <- readLines(filename, n = 20)
cat(x, sep = "\n")
## end

Cheers,
Ben


On Thu, Sep 29, 2022 at 9:54 AM Nick Wray  wrote:
>
> [...]



-- 
Ben Tupper (he/him)
Bigelow Laboratory for Ocean Science
East Boothbay, Maine
http://www.bigelow.org/
https://eco.bigelow.org



Re: [R] Reading very large text files into R

2022-09-29 Thread Ivan Krylov
On Thu, 29 Sep 2022 14:54:10 +0100
Nick Wray wrote:

> although most lines in the text doc consist of 15 elements, every so
> often there is a sixteenth one and R doesn’t like this and gives me
> an error message

Does the fill = TRUE argument of read.table() help?

If not, could you construct and share a small file with the same kind
of problem (16th field) but without the data one has to apply for
access to? (E.g. cut out a few lines from the original file, then
replace all digits.)
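
A self-contained illustration of both points, with digits replaced by letters as suggested:

```r
## Two normal 15-field lines around one rogue 16-field line
txt <- "a,b,c,d,e,f,g,h,i,j,k,l,m,n,o
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,EXTRA
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o"

## Naming 16 columns plus fill = TRUE makes both line shapes readable
d <- read.table(text = txt, sep = ",", fill = TRUE,
                col.names = paste0("V", 1:16))
dim(d)   # 3 rows, 16 columns; V16 is blank except on the rogue line
```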

-- 
Best regards,
Ivan



[R] Reading very large text files into R

2022-09-29 Thread Nick Wray
Hello   I may be offending the R purists with this question but it is
linked to R, as will become clear.  I have very large data sets from the UK
Met Office in notepad form.  Unfortunately,  I can’t read them directly
into R because, for some reason, although most lines in the text doc
consist of 15 elements, every so often there is a sixteenth one and R
doesn’t like this and gives me an error message because it has assumed that
every line has 15 elements and doesn’t like finding one with more.  I have
tried playing around with the text document, inserting an extra element
into the top line etc, but to no avail.

Also unfortunately you need access permission from the Met Office to get
the files in question so this link probably won’t work:

https://catalogue.ceda.ac.uk/uuid/bbd6916225e7475514e17fdbf11141c1

So what I have done is simply to copy and paste the text docs into excel
csv and then read them in, which is time-consuming but works.  However the
later datasets are over the excel limit of 1048576 lines.  I can paste in
the first 1048576 lines but then trying to isolate the remainder of the
text doc to paste it into a second csv doc is proving v difficult – the
only way I have found is to scroll down by hand and that’s taking ages.  I
cannot find another way of editing the notepad text doc to get rid of the
part which I have already copied and pasted.

Can anyone help with a) ideally being able to simply read the text tables
into R, or b) suggest a way of editing out the bits of the text file I have
already pasted in without laborious scrolling?

Thanks Nick Wray
