Re: [R] Exceptional slowness with read.csv

2024-04-10 Thread avi.e.gross
Dave,

Your method works for you and seems to be a one-time fix for a corrupted data 
file, so please take what I write not as criticism but as an explanation of my 
alternate reasoning, which I suspect may work faster in some situations.

Here is my understanding of what you are doing:

You have a file in CSV format containing N rows, with commas delimiting M columns. 
A few rows have a glitch: a double-quote character at the beginning or end of a 
field (meaning between commas and adjacent to one, or perhaps at the very beginning 
or end of the line of text) that messes things up. This may be in a specific known 
column or in several.

So your algorithm is to read the entire file in, or alternatively one line at a 
time. Note that the types of the columns may not be apparent to you when you start, 
as you are not letting read.csv() see what it needs to see or perform all kinds of 
processing, such as dealing with comments.
You then call functions such as read.csv() millions of times (N). Argh!

You do that by setting up an error-catching environment N times. Of course, most 
lines are fine and raise no error.

Only on the error lines do you apply a regular expression that checks for quotes 
not immediately adjacent to a comma. I am not sure exactly what you used, although 
I imagine spaces could sometimes intervene. You fix any such lines and re-parse 
them.

It seems your goal was to write out a corrected file, so you are appending to it 
a row/line at a time as you go.

My strategy was a bit different.

- Call read.csv() just once, with no error checking but with an option to not treat 
the quote character specially. Note that if a quoted region may contain commas, this 
is a bad strategy. If all it contains is spaces or other non-comma content, it may 
be fine. 

If that works, there is now a data.frame or similar data structure in memory with 
N rows and M columns.

- Pick only the columns that may have this issue, meaning the ones containing, say, 
text as compared to numbers or logical values.
- Using those columns, perhaps one at a time, evaluate them all at once with a 
regular expression that tests each entry for the presence of exactly one quote, 
either at the start or the end (the commas you used as anchors are not present in 
this version). So you are looking for something like:

"words perhaps including, commas
Or
words perhaps including, commas"

but not for:

words perhaps including, commas
"words perhaps including, commas"

You can save the result of that query as a Boolean vector of TRUE/FALSE to mark 
which rows need fixing. Or you might use ifelse() or the equivalent to selectively 
apply a fix to those rows. One method is to use something like sub() to match all 
the text except an initial or terminal quote and replace it with a quote, followed 
by the match, followed by a quote, wherever a stray quote was found.

Whatever you choose can be done in a vectorized manner, which may be more 
efficient. You do not need to check for failures, let alone N times, and you only 
need to process the columns that need it.
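
A minimal sketch of that idea, assuming (as above) that the stray quotes never 
enclose commas so the file can be read with quoting disabled; file_name, the 
column selection and the fix-up helper are hypothetical and untested on real data:

fix_stray_quotes <- function(x) {
  # TRUE where an entry contains exactly one double quote, at its start or its end
  stray <- !is.na(x) & grepl('^"[^"]*$|^[^"]*"$', x)
  # rewrap those entries so the quotes come out balanced
  x[stray] <- sub('^"?([^"]*)"?$', '"\\1"', x[stray])
  x
}

df1 <- read.csv(file_name, quote = "")              # read once, quoting disabled
char_cols <- vapply(df1, is.character, logical(1))  # only text columns can be affected
df1[char_cols] <- lapply(df1[char_cols], fix_stray_quotes)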

When done, you may want to make sure all the columns are of the type you want, 
since who knows whether read.csv() made a bad choice on those columns, or others.

Note again, this is only a suggestion, and it fails if commas can be part of the 
quoted parts, or even of the misquoted parts.

-Original Message-
From: R-help  On Behalf Of Dave Dixon
Sent: Wednesday, April 10, 2024 12:20 PM
To: Rui Barradas ; r-help@r-project.org
Subject: Re: [R] Exceptional slowness with read.csv

That's basically what I did:

1. Get the text lines using readLines.
2. Use tryCatch to parse each line using read.csv(text = ...).
3. In the catch, use gregexpr to find any quotes not adjacent to a comma 
(gregexpr("[^,]\"[^,]", ...)).
4. Escape any quotes found by adding a second quote (using str_sub from 
stringr).
5. Parse the patched text using read.csv(text = ...).
6. Write out the parsed fields as I go along using write.table(..., 
append = TRUE) so I'm not keeping too much in memory.

I went directly to tryCatch because there were 3.5 million records, and 
I only expected a few to have errors.

I found only 6 bad records, but it had to be done to make the datafile 
usable with read.csv(), for the benefit of other researchers using these 
data.


On 4/10/24 07:46, Rui Barradas wrote:
> At 06:47 on 08/04/2024, Dave Dixon wrote:
>> Greetings,
>>
>> I have a csv file of 76 fields and about 4 million records. I know 
>> that some of the records have errors - unmatched quotes, 
>> specifically. Reading the file with readLines and parsing the lines 
>> with read.csv(text = ...) is really slow. I know that the first 
>> 2459465 records are good. So I try this:
>>
>>  > startTime <- Sys.time()
>>  > first_records <- read.csv(file_name, nrows = 2459465)
>>  > endTime <- Sys.time()
>>  > cat("elapsed time = ", endTime - startTime, "\n")
>>
>> elapsed time 

Re: [R] Exceptional slowness with read.csv

2024-04-10 Thread Dave Dixon

That's basically what I did:

1. Get the text lines using readLines.
2. Use tryCatch to parse each line using read.csv(text = ...).
3. In the catch, use gregexpr to find any quotes not adjacent to a comma 
(gregexpr("[^,]\"[^,]", ...)).
4. Escape any quotes found by adding a second quote (using str_sub from 
stringr).
5. Parse the patched text using read.csv(text = ...).
6. Write out the parsed fields as I go along using write.table(..., 
append = TRUE) so I'm not keeping too much in memory.
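
A rough sketch of those steps (not the actual script; file names are hypothetical 
and details such as the tryCatch conditions and write.table arguments are guesses):

library(stringr)

in_file  <- "data.csv"        # hypothetical paths
out_file <- "data_fixed.csv"

lines  <- readLines(in_file)
header <- lines[1]

for (i in seq_along(lines)[-1]) {
  rec <- tryCatch(
    read.csv(text = c(header, lines[i])),
    error = function(e) NULL, warning = function(w) NULL
  )
  if (is.null(rec)) {
    # quotes not adjacent to a comma: the quote is the middle character of each match
    pos <- gregexpr('[^,]"[^,]', lines[i])[[1]]
    if (pos[1] != -1) {
      for (p in rev(pos + 1)) {           # work right to left so positions stay valid
        str_sub(lines[i], p, p) <- '""'   # escape the quote by doubling it
      }
    }
    rec <- read.csv(text = c(header, lines[i]))
  }
  write.table(rec, out_file, sep = ",", row.names = FALSE,
              col.names = (i == 2), append = (i > 2))
}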


I went directly to tryCatch because there were 3.5 million records, and 
I only expected a few to have errors.


I found only 6 bad records, but it had to be done to make the datafile 
usable with read.csv(), for the benefit of other researchers using these 
data.



On 4/10/24 07:46, Rui Barradas wrote:

At 06:47 on 08/04/2024, Dave Dixon wrote:

Greetings,

I have a csv file of 76 fields and about 4 million records. I know 
that some of the records have errors - unmatched quotes, 
specifically. Reading the file with readLines and parsing the lines 
with read.csv(text = ...) is really slow. I know that the first 
2459465 records are good. So I try this:


 > startTime <- Sys.time()
 > first_records <- read.csv(file_name, nrows = 2459465)
 > endTime <- Sys.time()
 > cat("elapsed time = ", endTime - startTime, "\n")

elapsed time =   24.12598

 > startTime <- Sys.time()
 > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
 > endTime <- Sys.time()
 > cat("elapsed time = ", endTime - startTime, "\n")

This appears to never finish. I have been waiting over 20 minutes.

So why would (skip = 2459465, nrows = 5) take orders of magnitude 
longer than (nrows = 2459465) ?


Thanks!

-dave

PS: readLines(n=2459470) takes 10.42731 seconds.


Hello,

Can the following function be of help?
After reading the data with the argument quote = "", call a function that 
applies gregexpr to its character columns, then transforms the output into a 
two-column data.frame with columns

 Col - the column processed;
 Unbalanced - the rows with unbalanced double quotes.

I am assuming the quotes are double quotes. It shouldn't be difficult to adapt 
it to other cases: single quotes, or both.





unbalanced_dquotes <- function(x) {
  char_cols <- sapply(x, is.character) |> which()
  lapply(char_cols, \(i) {
    y <- x[[i]]
    # count the double quotes in each entry (gregexpr() returns -1 when there are none)
    Unbalanced <- gregexpr('"', y) |>
      sapply(\(m) sum(m > 0)) |>
      {\(n) (n %% 2L) == 1L}() |>
      which()
    data.frame(Col = i, Unbalanced = Unbalanced)
  }) |>
    do.call(rbind, args = _)
}

# read the data disregarding quoted strings
df1 <- read.csv(fl, quote = "")
# determine which strings have unbalanced quotes and where
unbalanced_dquotes(df1)


Hope this helps,

Rui Barradas






Re: [R] Exceptional slowness with read.csv

2024-04-10 Thread avi.e.gross
It sounds like the discussion is now on how to clean your data, with a twist. 
You want to clean it before you can properly read it in using standard methods.

Some of those standard methods already do quite a bit as they parse the data, 
such as looking ahead to determine the data type for a column.

The specific problem being discussed seems to be an imbalance of double quotes in 
individual lines of a CSV file, which then messes up that row and a number of the 
following rows. I am not clear on what the quotes mean to the user, but I wonder 
whether they can simply not be treated as quotes. Functions like read.csv() or the 
tidyverse variant read_csv() allow you to specify the quote character or disable it.

So what would happen to the damaged line/row in your case, or to any row with both 
quotes intact, if you tried reading it in with an argument that disables processing 
of quoted regions? It may cause problems, but in your case, maybe it won't.

If so, after reading in the file, you can march through it and make fixes, such as 
discussed. The other alternative is to read the lines in the old-fashioned way, do 
some surgery on whole lines rather than on individual row/column entries, and then 
either feed the (admittedly huge) result to read.csv as text=TEXT or write it out 
to another file and read that in again.
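
A rough sketch of that whole-line alternative (hypothetical file names; the 
pattern simply doubles any quote not adjacent to a comma, mirroring the fix 
discussed earlier, and is untested on the real file):

lines <- readLines(file_name)
# double any quote not adjacent to a comma so it parses as an escaped quote
fixed <- gsub('([^,])"([^,])', '\\1""\\2', lines)
df <- read.csv(text = fixed)
# or write the repaired lines out and read that file instead:
# writeLines(fixed, "data_fixed.csv")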

And, of course, if there is just one bad line, then you might just open the file 
once with a program such as Excel, or anything else that lets you edit it, ...




-Original Message-
From: R-help  On Behalf Of Rui Barradas
Sent: Wednesday, April 10, 2024 9:46 AM
To: Dave Dixon ; r-help@r-project.org
Subject: Re: [R] Exceptional slowness with read.csv

At 06:47 on 08/04/2024, Dave Dixon wrote:
> Greetings,
> 
> I have a csv file of 76 fields and about 4 million records. I know that 
> some of the records have errors - unmatched quotes, specifically. 
> Reading the file with readLines and parsing the lines with read.csv(text 
> = ...) is really slow. I know that the first 2459465 records are good. 
> So I try this:
> 
>  > startTime <- Sys.time()
>  > first_records <- read.csv(file_name, nrows = 2459465)
>  > endTime <- Sys.time()
>  > cat("elapsed time = ", endTime - startTime, "\n")
> 
> elapsed time =   24.12598
> 
>  > startTime <- Sys.time()
>  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
>  > endTime <- Sys.time()
>  > cat("elapsed time = ", endTime - startTime, "\n")
> 
> This appears to never finish. I have been waiting over 20 minutes.
> 
> So why would (skip = 2459465, nrows = 5) take orders of magnitude longer 
> than (nrows = 2459465) ?
> 
> Thanks!
> 
> -dave
> 
> PS: readLines(n=2459470) takes 10.42731 seconds.
> 
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
Hello,

Can the following function be of help?
After reading the data with the argument quote = "", call a function that 
applies gregexpr to its character columns, then transforms the output into a 
two-column data.frame with columns

 Col - the column processed;
 Unbalanced - the rows with unbalanced double quotes.

I am assuming the quotes are double quotes. It shouldn't be difficult to adapt 
it to other cases: single quotes, or both.




unbalanced_dquotes <- function(x) {
  char_cols <- sapply(x, is.character) |> which()
  lapply(char_cols, \(i) {
    y <- x[[i]]
    # count the double quotes in each entry (gregexpr() returns -1 when there are none)
    Unbalanced <- gregexpr('"', y) |>
      sapply(\(m) sum(m > 0)) |>
      {\(n) (n %% 2L) == 1L}() |>
      which()
    data.frame(Col = i, Unbalanced = Unbalanced)
  }) |>
    do.call(rbind, args = _)
}

# read the data disregarding quoted strings
df1 <- read.csv(fl, quote = "")
# determine which strings have unbalanced quotes and where
unbalanced_dquotes(df1)


Hope this helps,

Rui Barradas




Re: [R] Exceptional slowness with read.csv

2024-04-10 Thread Rui Barradas

At 06:47 on 08/04/2024, Dave Dixon wrote:

Greetings,

I have a csv file of 76 fields and about 4 million records. I know that 
some of the records have errors - unmatched quotes, specifically. 
Reading the file with readLines and parsing the lines with read.csv(text 
= ...) is really slow. I know that the first 2459465 records are good. 
So I try this:


 > startTime <- Sys.time()
 > first_records <- read.csv(file_name, nrows = 2459465)
 > endTime <- Sys.time()
 > cat("elapsed time = ", endTime - startTime, "\n")

elapsed time =   24.12598

 > startTime <- Sys.time()
 > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
 > endTime <- Sys.time()
 > cat("elapsed time = ", endTime - startTime, "\n")

This appears to never finish. I have been waiting over 20 minutes.

So why would (skip = 2459465, nrows = 5) take orders of magnitude longer 
than (nrows = 2459465) ?


Thanks!

-dave

PS: readLines(n=2459470) takes 10.42731 seconds.


Hello,

Can the following function be of help?
After reading the data with the argument quote = "", call a function that 
applies gregexpr to its character columns, then transforms the output into a 
two-column data.frame with columns

 Col - the column processed;
 Unbalanced - the rows with unbalanced double quotes.

I am assuming the quotes are double quotes. It shouldn't be difficult to adapt 
it to other cases: single quotes, or both.





unbalanced_dquotes <- function(x) {
  char_cols <- sapply(x, is.character) |> which()
  lapply(char_cols, \(i) {
    y <- x[[i]]
    # count the double quotes in each entry (gregexpr() returns -1 when there are none)
    Unbalanced <- gregexpr('"', y) |>
      sapply(\(m) sum(m > 0)) |>
      {\(n) (n %% 2L) == 1L}() |>
      which()
    data.frame(Col = i, Unbalanced = Unbalanced)
  }) |>
    do.call(rbind, args = _)
}

# read the data disregarding quoted strings
df1 <- read.csv(fl, quote = "")
# determine which strings have unbalanced quotes and where
unbalanced_dquotes(df1)


Hope this helps,

Rui Barradas




Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread jim holtman
Try reading the lines in (readLines) and counting the number of quotes of both
types in each line. Find out which counts are not even and investigate.
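
For example, something along these lines (hypothetical file name; base R, one
count per line for each quote character):

lines <- readLines(file_name)
n_dq <- nchar(lines) - nchar(gsub('"', "", lines, fixed = TRUE))  # double quotes per line
n_sq <- nchar(lines) - nchar(gsub("'", "", lines, fixed = TRUE))  # single quotes per line
which(n_dq %% 2L == 1L | n_sq %% 2L == 1L)                        # lines with an odd count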

On Mon, Apr 8, 2024, 15:24 Dave Dixon  wrote:

> I solved the mystery, but not the problem. The problem is that there's
> an unclosed quote somewhere in those 5 additional records I'm trying to
> access. So read.csv is reading million-character fields. It's slow at
> that. That mystery solved.
>
> However, the problem persists: how to fix what is obvious to the
> naked eye - a quote not adjacent to a comma - but that read.csv can't
> handle. readLines followed by read.csv(text= ) works great because, in
> that case, read.csv knows where the record terminates. Meaning, read.csv
> throws an exception that I can catch and handle with a quick and clean
> regex expression.
>
> Thanks, I'll take a look at vroom.
>
> -dave
>
> On 4/8/24 09:18, Stevie Pederson wrote:
> > Hi Dave,
> >
> > That's rather frustrating. I've found vroom (from the package vroom)
> > to be helpful with large files like this.
> >
> > Does the following give you any better luck?
> >
> > vroom(file_name, delim = ",", skip = 2459465, n_max = 5)
> >
> > Of course, when you know you've got errors & the files are big like
> > that it can take a bit of work resolving things. The command line
> > tools awk & sed might even be a good plan for finding lines that have
> > errors & figuring out a fix, but I certainly don't envy you.
> >
> > All the best
> >
> > Stevie
> >
> > On Tue, 9 Apr 2024 at 00:36, Dave Dixon  wrote:
> >
> > Greetings,
> >
> > I have a csv file of 76 fields and about 4 million records. I know
> > that
> > some of the records have errors - unmatched quotes, specifically.
> > Reading the file with readLines and parsing the lines with
> > read.csv(text
> > = ...) is really slow. I know that the first 2459465 records are
> > good.
> > So I try this:
> >
> >  > startTime <- Sys.time()
> >  > first_records <- read.csv(file_name, nrows = 2459465)
> >  > endTime <- Sys.time()
> >  > cat("elapsed time = ", endTime - startTime, "\n")
> >
> > elapsed time =   24.12598
> >
> >  > startTime <- Sys.time()
> >  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
> >  > endTime <- Sys.time()
> >  > cat("elapsed time = ", endTime - startTime, "\n")
> >
> > This appears to never finish. I have been waiting over 20 minutes.
> >
> > So why would (skip = 2459465, nrows = 5) take orders of magnitude
> > longer
> > than (nrows = 2459465) ?
> >
> > Thanks!
> >
> > -dave
> >
> > PS: readLines(n=2459470) takes 10.42731 seconds.
> >
> > __
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > 
> > and provide commented, minimal, self-contained, reproducible code.
> >
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread Eberhard W Lisse
I find QSV very helpful.

el

On 08/04/2024 22:21, Dave Dixon wrote:
> I solved the mystery, but not the problem. The problem is that
> there's an unclosed quote somewhere in those 5 additional records I'm
> trying to access. So read.csv is reading million-character fields.
> It's slow at that. That mystery solved.
> 
> However, the problem persists: how to fix what is obvious to the
> naked eye - a quote not adjacent to a comma - but that read.csv
> can't handle. readLines followed by read.csv(text= ) works great
> because, in that case, read.csv knows where the record terminates.
> Meaning, read.csv throws an exception that I can catch and handle
> with a quick and clean regex expression.
> 
> Thanks, I'll take a look at vroom.
[...]



Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread Dave Dixon
Right, I meant to add header=FALSE. And it now looks like the next line 
is the one with the unclosed quote, so read.csv is trying to read 
million-character headers!


On 4/8/24 12:42, Ivan Krylov wrote:

On Sun, 7 Apr 2024 23:47:52 -0600, Dave Dixon wrote:


  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)

It may or may not be important that read.csv defaults to header =
TRUE. Having skipped 2459465 lines, it may attempt to parse the next
one as a header, so the second call read.csv() should probably include
header = FALSE.

Bert's advice to try scan() is on point, though. It's likely that the
default-enabled header is not the most serious problem here.





Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread Dave Dixon
Good suggestion - I'll look into data.table.

On 4/8/24 12:14, CALUM POLWART wrote:
> data.table's fread is also fast. Not sure about error handling. But I 
> can merge 300 csvs with a total of 0.5m lines and 50 columns in a 
> couple of minutes versus a lifetime with read.csv or readr::read_csv
>
>
>
> On Mon, 8 Apr 2024, 16:19 Stevie Pederson, 
>  wrote:
>
> Hi Dave,
>
> That's rather frustrating. I've found vroom (from the package
> vroom) to be
> helpful with large files like this.
>
> Does the following give you any better luck?
>
> vroom(file_name, delim = ",", skip = 2459465, n_max = 5)
>
> Of course, when you know you've got errors & the files are big
> like that it
> can take a bit of work resolving things. The command line tools
> awk & sed
> might even be a good plan for finding lines that have errors &
> figuring out
> a fix, but I certainly don't envy you.
>
> All the best
>
> Stevie
>
> On Tue, 9 Apr 2024 at 00:36, Dave Dixon  wrote:
>
> > Greetings,
> >
> > I have a csv file of 76 fields and about 4 million records. I
> know that
> > some of the records have errors - unmatched quotes, specifically.
> > Reading the file with readLines and parsing the lines with
> read.csv(text
> > = ...) is really slow. I know that the first 2459465 records are
> good.
> > So I try this:
> >
> >  > startTime <- Sys.time()
> >  > first_records <- read.csv(file_name, nrows = 2459465)
> >  > endTime <- Sys.time()
> >  > cat("elapsed time = ", endTime - startTime, "\n")
> >
> > elapsed time =   24.12598
> >
> >  > startTime <- Sys.time()
> >  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
> >  > endTime <- Sys.time()
> >  > cat("elapsed time = ", endTime - startTime, "\n")
> >
> > This appears to never finish. I have been waiting over 20 minutes.
> >
> > So why would (skip = 2459465, nrows = 5) take orders of
> magnitude longer
> > than (nrows = 2459465) ?
> >
> > Thanks!
> >
> > -dave
> >
> > PS: readLines(n=2459470) takes 10.42731 seconds.
> >
> > __
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> 
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>         [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> 
> and provide commented, minimal, self-contained, reproducible code.
>


Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread Dave Dixon
Thanks, yeah, I think scan is more promising. I'll check it out.

On 4/8/24 11:49, Bert Gunter wrote:
> No idea, but have you tried using ?scan to read those next 5 rows? It 
> might give you a better idea of the pathologies that are causing 
> problems. For example, an unmatched quote might result in some huge 
> number of characters trying to be read into a single element of a 
> character variable. As your previous respondent said, resolving such 
> problems can be a challenge.
>
> Cheers,
> Bert
>
>
>
> On Mon, Apr 8, 2024 at 8:06 AM Dave Dixon  wrote:
>
> Greetings,
>
> I have a csv file of 76 fields and about 4 million records. I know
> that
> some of the records have errors - unmatched quotes, specifically.
> Reading the file with readLines and parsing the lines with
> read.csv(text
> = ...) is really slow. I know that the first 2459465 records are
> good.
> So I try this:
>
>  > startTime <- Sys.time()
>  > first_records <- read.csv(file_name, nrows = 2459465)
>  > endTime <- Sys.time()
>  > cat("elapsed time = ", endTime - startTime, "\n")
>
> elapsed time =   24.12598
>
>  > startTime <- Sys.time()
>  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
>  > endTime <- Sys.time()
>  > cat("elapsed time = ", endTime - startTime, "\n")
>
> This appears to never finish. I have been waiting over 20 minutes.
>
> So why would (skip = 2459465, nrows = 5) take orders of magnitude
> longer
> than (nrows = 2459465) ?
>
> Thanks!
>
> -dave
>
> PS: readLines(n=2459470) takes 10.42731 seconds.
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> 
> and provide commented, minimal, self-contained, reproducible code.
>


Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread Dave Dixon
I solved the mystery, but not the problem. The problem is that there's 
an unclosed quote somewhere in those 5 additional records I'm trying to 
access. So read.csv is reading million-character fields. It's slow at 
that. That mystery solved.

However, the problem persists: how to fix what is obvious to the 
naked eye - a quote not adjacent to a comma - but that read.csv can't 
handle. readLines followed by read.csv(text = ...) works great because, in 
that case, read.csv knows where the record terminates. Meaning, read.csv 
throws an exception that I can catch and handle with a quick and clean 
regular expression.

Thanks, I'll take a look at vroom.

-dave

On 4/8/24 09:18, Stevie Pederson wrote:
> Hi Dave,
>
> That's rather frustrating. I've found vroom (from the package vroom) 
> to be helpful with large files like this.
>
> Does the following give you any better luck?
>
> vroom(file_name, delim = ",", skip = 2459465, n_max = 5)
>
> Of course, when you know you've got errors & the files are big like 
> that it can take a bit of work resolving things. The command line 
> tools awk & sed might even be a good plan for finding lines that have 
> errors & figuring out a fix, but I certainly don't envy you.
>
> All the best
>
> Stevie
>
> On Tue, 9 Apr 2024 at 00:36, Dave Dixon  wrote:
>
> Greetings,
>
> I have a csv file of 76 fields and about 4 million records. I know
> that
> some of the records have errors - unmatched quotes, specifically.
> Reading the file with readLines and parsing the lines with
> read.csv(text
> = ...) is really slow. I know that the first 2459465 records are
> good.
> So I try this:
>
>  > startTime <- Sys.time()
>  > first_records <- read.csv(file_name, nrows = 2459465)
>  > endTime <- Sys.time()
>  > cat("elapsed time = ", endTime - startTime, "\n")
>
> elapsed time =   24.12598
>
>  > startTime <- Sys.time()
>  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
>  > endTime <- Sys.time()
>  > cat("elapsed time = ", endTime - startTime, "\n")
>
> This appears to never finish. I have been waiting over 20 minutes.
>
> So why would (skip = 2459465, nrows = 5) take orders of magnitude
> longer
> than (nrows = 2459465) ?
>
> Thanks!
>
> -dave
>
> PS: readLines(n=2459470) takes 10.42731 seconds.
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> 
> and provide commented, minimal, self-contained, reproducible code.
>


Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread Rui Barradas

At 19:42 on 08/04/2024, Ivan Krylov via R-help wrote:

On Sun, 7 Apr 2024 23:47:52 -0600, Dave Dixon wrote:


  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)


It may or may not be important that read.csv defaults to header =
TRUE. Having skipped 2459465 lines, it may attempt to parse the next
one as a header, so the second call read.csv() should probably include
header = FALSE.



This will throw an error; call read.table with sep = "," instead.




Bert's advice to try scan() is on point, though. It's likely that the
default-enabled header is not the most serious problem here.



Hope this helps,

Rui Barradas




Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread Ivan Krylov via R-help
On Sun, 7 Apr 2024 23:47:52 -0600, Dave Dixon wrote:

>  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)

It may or may not be important that read.csv defaults to header =
TRUE. Having skipped 2459465 lines, it may attempt to parse the next
one as a header, so the second call read.csv() should probably include
header = FALSE.
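
That is, something along the lines of (reusing the file_name from the original 
post):

second_records <- read.csv(file_name, skip = 2459465, nrows = 5, header = FALSE)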

Bert's advice to try scan() is on point, though. It's likely that the
default-enabled header is not the most serious problem here.

-- 
Best regards,
Ivan



Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread CALUM POLWART
data.table's fread is also fast. Not sure about error handling. But I can
merge 300 csvs with a total of 0.5m lines and 50 columns in a couple of
minutes versus a lifetime with read.csv or readr::read_csv
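
For what it's worth, a hedged sketch of the equivalent fread call (not tested 
against the damaged file; how fread copes with the unbalanced quotes would still 
need checking):

library(data.table)
second_records <- fread(file_name, skip = 2459465, nrows = 5, header = FALSE)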



On Mon, 8 Apr 2024, 16:19 Stevie Pederson, 
wrote:

> Hi Dave,
>
> That's rather frustrating. I've found vroom (from the package vroom) to be
> helpful with large files like this.
>
> Does the following give you any better luck?
>
> vroom(file_name, delim = ",", skip = 2459465, n_max = 5)
>
> Of course, when you know you've got errors & the files are big like that it
> can take a bit of work resolving things. The command line tools awk & sed
> might even be a good plan for finding lines that have errors & figuring out
> a fix, but I certainly don't envy you.
>
> All the best
>
> Stevie
>
> On Tue, 9 Apr 2024 at 00:36, Dave Dixon  wrote:
>
> > Greetings,
> >
> > I have a csv file of 76 fields and about 4 million records. I know that
> > some of the records have errors - unmatched quotes, specifically.
> > Reading the file with readLines and parsing the lines with read.csv(text
> > = ...) is really slow. I know that the first 2459465 records are good.
> > So I try this:
> >
> >  > startTime <- Sys.time()
> >  > first_records <- read.csv(file_name, nrows = 2459465)
> >  > endTime <- Sys.time()
> >  > cat("elapsed time = ", endTime - startTime, "\n")
> >
> > elapsed time =   24.12598
> >
> >  > startTime <- Sys.time()
> >  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
> >  > endTime <- Sys.time()
> >  > cat("elapsed time = ", endTime - startTime, "\n")
> >
> > This appears to never finish. I have been waiting over 20 minutes.
> >
> > So why would (skip = 2459465, nrows = 5) take orders of magnitude longer
> > than (nrows = 2459465) ?
> >
> > Thanks!
> >
> > -dave
> >
> > PS: readLines(n=2459470) takes 10.42731 seconds.
> >
> > __
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread Bert Gunter
No idea, but have you tried using ?scan to read those next 5 rows? It might
give you a better idea of the pathologies that are causing problems. For
example, an unmatched quote might result in some huge number of characters
trying to be read into a single element of a character variable. As your
previous respondent said, resolving such problems can be a challenge.
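
A rough sketch of that check (it reuses the file_name from the original post; with 
quoting disabled, the five rows come back as raw, unparsed lines for inspection):

raw5 <- scan(file_name, what = character(), sep = "\n",
             skip = 2459465, n = 5, quote = "", quiet = TRUE)
nchar(raw5)                 # length of each raw record
grepl('[^,]"[^,]', raw5)    # any quote not adjacent to a comma?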

Cheers,
Bert



On Mon, Apr 8, 2024 at 8:06 AM Dave Dixon  wrote:

> Greetings,
>
> I have a csv file of 76 fields and about 4 million records. I know that
> some of the records have errors - unmatched quotes, specifically.
> Reading the file with readLines and parsing the lines with read.csv(text
> = ...) is really slow. I know that the first 2459465 records are good.
> So I try this:
>
>  > startTime <- Sys.time()
>  > first_records <- read.csv(file_name, nrows = 2459465)
>  > endTime <- Sys.time()
>  > cat("elapsed time = ", endTime - startTime, "\n")
>
> elapsed time =   24.12598
>
>  > startTime <- Sys.time()
>  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
>  > endTime <- Sys.time()
>  > cat("elapsed time = ", endTime - startTime, "\n")
>
> This appears to never finish. I have been waiting over 20 minutes.
>
> So why would (skip = 2459465, nrows = 5) take orders of magnitude longer
> than (nrows = 2459465) ?
>
> Thanks!
>
> -dave
>
> PS: readLines(n=2459470) takes 10.42731 seconds.
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread Stevie Pederson
Hi Dave,

That's rather frustrating. I've found vroom (from the package vroom) to be
helpful with large files like this.

Does the following give you any better luck?

vroom(file_name, delim = ",", skip = 2459465, n_max = 5)

Of course, when you know you've got errors & the files are big like that it
can take a bit of work resolving things. The command line tools awk & sed
might even be a good plan for finding lines that have errors & figuring out
a fix, but I certainly don't envy you.

All the best

Stevie

On Tue, 9 Apr 2024 at 00:36, Dave Dixon  wrote:

> Greetings,
>
> I have a csv file of 76 fields and about 4 million records. I know that
> some of the records have errors - unmatched quotes, specifically.
> Reading the file with readLines and parsing the lines with read.csv(text
> = ...) is really slow. I know that the first 2459465 records are good.
> So I try this:
>
>  > startTime <- Sys.time()
>  > first_records <- read.csv(file_name, nrows = 2459465)
>  > endTime <- Sys.time()
>  > cat("elapsed time = ", endTime - startTime, "\n")
>
> elapsed time =   24.12598
>
>  > startTime <- Sys.time()
>  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
>  > endTime <- Sys.time()
>  > cat("elapsed time = ", endTime - startTime, "\n")
>
> This appears to never finish. I have been waiting over 20 minutes.
>
> So why would (skip = 2459465, nrows = 5) take orders of magnitude longer
> than (nrows = 2459465) ?
>
> Thanks!
>
> -dave
>
> PS: readLines(n=2459470) takes 10.42731 seconds.
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



[R] Exceptional slowness with read.csv

2024-04-08 Thread Dave Dixon

Greetings,

I have a csv file of 76 fields and about 4 million records. I know that 
some of the records have errors - unmatched quotes, specifically.  
Reading the file with readLines and parsing the lines with read.csv(text 
= ...) is really slow. I know that the first 2459465 records are good. 
So I try this:


> startTime <- Sys.time()
> first_records <- read.csv(file_name, nrows = 2459465)
> endTime <- Sys.time()
> cat("elapsed time = ", endTime - startTime, "\n")

elapsed time =   24.12598

> startTime <- Sys.time()
> second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
> endTime <- Sys.time()
> cat("elapsed time = ", endTime - startTime, "\n")

This appears to never finish. I have been waiting over 20 minutes.

So why would (skip = 2459465, nrows = 5) take orders of magnitude longer 
than (nrows = 2459465) ?


Thanks!

-dave

PS: readLines(n=2459470) takes 10.42731 seconds.
