Re: [R] Controlling number of numbers before R rewrites to "+e18" etc
You can always read a portion of the file and then write it out. For large
files, I read in 10,000 lines, fix them up, write them out, and then go
back and process the next batch of lines.

You haven't shown us a sample of your input/output, or how you are
processing it. Depending on what type of preprocessing needs to be done to
the data, Perl is also an option. But most things I used to use Perl for, I
can do within R these days.

Here is an example of reading in your IDs:

    > x <- read.table(textConnection("1234567890123456789012 987654321234567898765432 98765432123456789876543
    + 1234567890123456789012 987654321234567898765432 98765432123456789876543
    + 1234567890123456789012 987654321234567898765432 98765432123456789876543
    + 1234567890123456789012 987654321234567898765432 98765432123456789876543
    + 1234567890123456789012 987654321234567898765432 98765432123456789876543
    + 1234567890123456789012 987654321234567898765432 98765432123456789876543
    + 1234567890123456789012 987654321234567898765432 98765432123456789876543")
    + , colClasses = rep('character', 3))
    > closeAllConnections()
    > str(x)
    'data.frame':   7 obs. of  3 variables:
     $ V1: chr  "1234567890123456789012" "1234567890123456789012" "1234567890123456789012" "1234567890123456789012" ...
     $ V2: chr  "987654321234567898765432" "987654321234567898765432" "987654321234567898765432" "987654321234567898765432" ...
     $ V3: chr  "98765432123456789876543" "98765432123456789876543" "98765432123456789876543" "98765432123456789876543" ...
    > x
                          V1                       V2                      V3
    1 1234567890123456789012 987654321234567898765432 98765432123456789876543
    2 1234567890123456789012 987654321234567898765432 98765432123456789876543
    3 1234567890123456789012 987654321234567898765432 98765432123456789876543
    4 1234567890123456789012 987654321234567898765432 98765432123456789876543
    5 1234567890123456789012 987654321234567898765432 98765432123456789876543
    6 1234567890123456789012 987654321234567898765432 98765432123456789876543
    7 1234567890123456789012 987654321234567898765432 98765432123456789876543

On Mon, Oct 25, 2010 at 4:41 AM, ZeMajik wrote:
> Thanks Jim, but I still have the problem that the pre-processing becomes
> way too computationally expensive. R seems to handle characters and
> factors much, much worse than numeric IDs. I don't have enough RAM to even
> write the file when they are viewed as chars instead of numeric values!
>
> Anyone have any other ideas? Is it not possible to tell R not to rewrite
> upon import? It wouldn't matter if it would only write the correct IDs to
> the exported csv file, but it exports the abbreviated version, which is of
> no use.
>
> Mike

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
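The batch approach described in this reply (read 10,000 lines, fix them up, write them out, repeat) might look roughly like the sketch below. It is only an illustration under stated assumptions: the function name `process_in_chunks`, its arguments, and the headerless whitespace-delimited input layout are all hypothetical, since the thread never shows the real file or the real clean-up step.

```r
# Hypothetical helper (not from the thread): copy a headerless,
# whitespace-delimited file to CSV in batches, keeping every column as
# character so that long IDs are never converted to doubles.
process_in_chunks <- function(infile, outfile, n_cols, chunk_size = 10000L) {
  con <- file(infile, open = "r")
  on.exit(close(con))
  first <- TRUE
  repeat {
    # Read the next batch of raw lines; an empty result means end of file.
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    chunk <- read.table(text = lines,
                        colClasses = rep("character", n_cols),
                        comment.char = "")
    # ... the real "fix them up" step would go here (assumption) ...
    # Write the header only once, then append the later batches.
    write.table(chunk, outfile, sep = ",", quote = FALSE,
                row.names = FALSE, col.names = first, append = !first)
    first <- FALSE
  }
  invisible(NULL)
}
```

Only one batch is held in memory at a time, which is the point of the approach when the full character data does not fit in RAM. A call might be `process_in_chunks("ids.txt", "ids.csv", n_cols = 3)`, where both file names are placeholders.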
Re: [R] Controlling number of numbers before R rewrites to "+e18" etc
Thanks Jim, but I still have the problem that the pre-processing becomes
way too computationally expensive. R seems to handle characters and factors
much, much worse than numeric IDs. I don't have enough RAM to even write
the file when they are viewed as chars instead of numeric values!

Anyone have any other ideas? Is it not possible to tell R not to rewrite
upon import? It wouldn't matter if it would only write the correct IDs to
the exported csv file, but it exports the abbreviated version, which is of
no use.

Mike

On Sat, Oct 23, 2010 at 3:56 AM, jim holtman wrote:
> Your best bet is to make sure that you read the IDs in as characters.
> If they are being read in as floating point numbers, then there are
> only about 15 digits of accuracy, so if you have IDs of 18-22 digits,
> you will be missing data. So if you are using read.table, then look at
> colClasses to see how to do this.
>
> Provide a subset of your data and the statements that you are using to
> read in the data.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
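On the question of whether R can simply be told not to abbreviate: the small sketch below assumes an ID that has already been stored as a number, and uses an illustrative 22-digit value. It suggests that switching off scientific notation changes only how the number is displayed; the digits that were rounded away when the value became a double do not come back. The variable names are made up for the example.

```r
id_text <- "1234567890123456789012"   # illustrative 22-digit ID
id_num  <- as.numeric(id_text)        # stored as a double from here on

# The default conversion back to text typically uses scientific notation.
default_text <- as.character(id_num)

# Fixed notation removes the exponent, but not the rounding error.
fixed_text <- format(id_num, scientific = FALSE)

print(default_text)
print(fixed_text)
print(identical(fixed_text, id_text))   # FALSE: the trailing digits differ
```

As far as I know, `write.table` and `write.csv` follow the `scipen` option for numeric columns, so raising it would likely remove the "e+" form from the exported file in the same way, with the same damaged digits.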
Re: [R] Controlling number of numbers before R rewrites to "+e18" etc
Your best bet is to make sure that you read the IDs in as characters. If
they are being read in as floating point numbers, then there are only about
15 digits of accuracy, so if you have IDs of 18-22 digits, you will be
missing data. So if you are using read.table, then look at colClasses to
see how to do this.

Provide a subset of your data and the statements that you are using to read
in the data.

On Fri, Oct 22, 2010 at 1:15 PM, ZeMajik wrote:
> Hey,
>
> I'm using R as a pre-processor for a large dataset with IDs which are
> numeric (but have no numeric meaning, so they can be seen as factors).
> I do some data formatting and then write it out to a csv file.
>
> However, the problem is that the IDs are very long, 18-22 chars long more
> precisely. R is constantly rewriting these IDs to the abbreviated +eX
> form, which hinders me from exporting the data to the csv since the IDs
> are no longer intact.
> I've tried telling R that the ID column is a factor, but this results in
> two problems: 1) Since I have millions of rows and R is slower handling
> factors than numbers, my comp can't run the process in any kind of
> reasonable time, and 2) some IDs STILL seem to be rewritten somehow. The
> second point made me believe that perhaps R is rewriting upon import?
>
> Does anyone have any tips on how to solve this problem?
>
> Thanks,
> Mike

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
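To make the colClasses advice concrete, here is a minimal sketch that reads the same data twice, once with the default type guessing and once with the ID column forced to character. The two-row sample, the column names `id` and `value`, and the ID values are invented for illustration; the original poster's file was never shown. The underlying reason for the loss is that a double represents integers exactly only up to 2^53 (about 9e15), so 18-22 digit IDs cannot survive a numeric import.

```r
# Hypothetical sample data; the ID values are made up for illustration.
csv_text <- "id,value
1234567890123456789012,10
987654321234567898,20"

# Default import: the id column is guessed to be numeric (double),
# so the trailing digits are rounded away at this point.
as_num <- read.csv(text = csv_text)

# Import with the id column forced to character: the digits stay intact.
as_chr <- read.csv(text = csv_text,
                   colClasses = c(id = "character", value = "numeric"))

print(sprintf("%.0f", as_num$id))  # full decimal expansion of the doubles
print(as_chr$id)                   # the original strings
```

Comparing the two printed vectors shows where the numeric version starts to disagree with the input, which is the "missing data" this reply warns about.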
[R] Controlling number of numbers before R rewrites to "+e18" etc
Hey,

I'm using R as a pre-processor for a large dataset with IDs which are
numeric (but have no numeric meaning, so they can be seen as factors).
I do some data formatting and then write it out to a csv file.

However, the problem is that the IDs are very long, 18-22 chars long more
precisely. R is constantly rewriting these IDs to the abbreviated +eX form,
which hinders me from exporting the data to the csv since the IDs are no
longer intact.

I've tried telling R that the ID column is a factor, but this results in
two problems: 1) Since I have millions of rows and R is slower handling
factors than numbers, my comp can't run the process in any kind of
reasonable time, and 2) some IDs STILL seem to be rewritten somehow. The
second point made me believe that perhaps R is rewriting upon import?

Does anyone have any tips on how to solve this problem?

Thanks,
Mike