Thank you Bert, Jeff and David for great answers.
Let me provide more context to clarify the question:
- I am running this on a large server (512GB), so the data still fits
into memory (and I also know how to process in chunks if necessary)
- I agree that a DBMS or other purpose-built software would be better
Hi Lucas,
This is a rough outline of something I programmed in C years ago for
data cleaning. The basic idea is to read the
file line by line and check for a problem (in the initial application
this was a discrepancy between two lines that were supposed to be
identical).
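In R rather than C, that line-by-line scheme might look like the sketch below; the function name and the record-width check are hypothetical stand-ins for whatever test fits your data:

```r
## Sketch: read a file one line at a time and record the numbers of
## lines that fail some check (here, an unexpected record width).
find_bad_lines <- function(path, width) {
  con <- file(path, open = "r")
  on.exit(close(con))
  bad <- integer(0)   # line numbers that fail the check
  i <- 0L
  while (length(line <- readLines(con, n = 1L, warn = FALSE)) > 0L) {
    i <- i + 1L
    if (nchar(line) != width) bad <- c(bad, i)  # wrong record length
  }
  bad
}
```

For a 100GB file you would replace the width test with whatever discrepancy check applies, and perhaps read more than one line per call (see the "n" argument of readLines).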
> On Nov 6, 2016, at 5:36 AM, Lucas Ferreira Mation
> wrote:
>
> I have some large .txt files (~100GB) containing a dataset in fixed-width
> format. These contain some errors:
> - characters in columns that are supposed to be numeric,
> - invalid characters
>
?readLines ... given the large size of the file you may need to process it in
chunks by specifying a file connection rather than a character-string file name
and using the "n" argument.
?grepl
?Extract
?tools::showNonASCII
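Putting those help pages together, a chunked scan over a connection might look like this sketch; the function name, the chunk size, and the assumption that the first 8 characters of each record should be numeric are mine, for illustration only:

```r
## Sketch: count records whose (assumed) numeric field contains
## non-digit characters, reading the file in chunks via a connection.
count_bad_ids <- function(path, chunk_size = 1e6L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n_bad <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0L) break           # end of file
    ids <- substr(chunk, 1L, 8L)             # hypothetical numeric field
    n_bad <- n_bad + sum(grepl("[^0-9 ]", ids))
  }
  n_bad
}
```

Because only one chunk is held in memory at a time, this works the same on a 100GB file as on a small one.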
There are many ways for data to be corrupted, in particular when invalid
characters are involved.
I have some large .txt files (~100GB) containing a dataset in fixed-width
format. These contain some errors:
- characters in columns that are supposed to be numeric,
- invalid characters,
- rows with too many characters, possibly due to invalid characters or a
missing end-of-line marker.
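All three error types above can be flagged with vectorized checks on a chunk of lines; in this sketch the 80-character record width, the field positions, and the function name are my assumptions, not the actual layout of Lucas's files:

```r
## Sketch: classify each line against the three reported error types.
## Assumes an 80-character fixed-width record whose first 8 characters
## should be numeric -- adjust both to the real layout.
classify_lines <- function(lines, width = 80L) {
  data.frame(
    # rows with too many characters (e.g. a missing end of line)
    too_long    = nchar(lines, type = "bytes") > width,
    # bytes outside printable ASCII (see also ?tools::showNonASCII)
    non_ascii   = grepl("[^\x20-\x7E]", lines, useBytes = TRUE),
    # non-digits in a column that is supposed to be numeric
    bad_numeric = grepl("[^0-9 ]", substr(lines, 1L, 8L))
  )
}
```

Running this per chunk and writing out the flagged line numbers gives a cleaning log without ever loading the whole file.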