Re: [R] read.delim skips first column (why?)
This should work:

    junk <- read.table("fd0edfab.txt", sep="", header=T, fill=F, quote="")

HIH

Paolo Sonego

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] read.delim skips first column (why?)
Hi,
I have uploaded a copy of the file here:
- http://pastebin.com/fd0edfab

The file has also been passed through the Unix command-line tool unexpand, but that doesn't solve the problem. Using head=TRUE instead of head=T has the same effect.

The output of print(names) is:

    > print(names(ngly), quote=TRUE)
     [1] "snp"                       "gene"
     [3] "chromosome"                "distance_from_gene_center"
     [5] "position"                  "ame"
     [7] "csasia"                    "easia"
     [9] "eur"                       "mena"
    [11] "oce"                       "ssafr"
    [13] "X"                         "X.1"
    [15] "X.2"

Thank you to everyone who replied to my e-mail address, but I haven't been able to solve the problem yet.

On Tue, Jul 14, 2009 at 12:36 AM, jim holtman jholt...@gmail.com wrote:
> Can you send your file as an attachment since it is impossible to see
> where the separator characters are.
>
> On Mon, Jul 13, 2009 at 1:27 PM, Giovanni Marco Dall'Olio dalloli...@gmail.com wrote:
> > When I use read.delim (or any read function) on it, R skips the
> > first column, and I don't understand why.
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?

--
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)
My blog on bioinformatics: http://bioinfoblog.it
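The X, X.1 and X.2 entries in the names() output above are the names R generates for header fields that are empty, which is consistent with extra separators at the end of the header line. A minimal sketch of that renaming, using a made-up vector of header fields (header_fields is hypothetical, not taken from the real file):

```r
# Hypothetical header fields: two real names followed by three empty
# fields, as would result from extra separators at the end of the header.
header_fields <- c("snp", "gene", "", "", "")

# read.table() and read.delim() pass the header through make.names()
# when check.names = TRUE (their default).
fixed_names <- make.names(header_fields, unique = TRUE)
print(fixed_names)
# [1] "snp"  "gene" "X"    "X.1"  "X.2"
```

Seeing columns called X, X.1, ... after an import is therefore a hint that the header line has more fields than expected.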
Re: [R] read.delim skips first column (why?)
Hi

    > str(read.table("test.txt", header=T))
    'data.frame':   9 obs. of  12 variables:
     $ snp       : Factor w/ 9 levels "rs1113188","rs1113397",..: 9 5 7 8 3 4 6 1 2
     $ gene      : Factor w/ 1 level "TRP2": 1 1 1 1 1 1 1 1 1
     $ chromosome: int  3 3 3 3 3 3 3 3 3

It can sometimes be tricky to read files into R. If read.delim fails, I would recommend trying read.table, which makes fewer assumptions, and setting its parameters (header, sep, dec) until your file is read correctly.

Regards
Petr

r-help-boun...@r-project.org wrote on 14.07.2009 11:11:10:

> Hi,
> I have uploaded a copy of the file here:
> - http://pastebin.com/fd0edfab
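As a rough illustration of the suggestion above to fall back on read.table, the sketch below writes a small made-up file (the path, the three columns and the stray tabs are all assumptions, not the poster's data) and reads it with the default sep = "", which splits on runs of whitespace and therefore ignores tabs at the end of a line:

```r
# Hypothetical three-column file; the header and the second data line
# carry stray tabs at the end of the line.
path <- tempfile(fileext = ".txt")
writeLines(c("snp_id\tgene\tchromosome\t",
             "rs1\tRAPT2\t3",
             "rs2\tRAPT2\t3\t\t"), path)

# With the default sep = "", any run of spaces or tabs is one separator,
# so the extra tabs at the end of a line do not create extra fields.
ok <- read.table(path, header = TRUE)
str(ok)
```

One caveat with this approach: because runs of whitespace are collapsed, it only works when missing values are written out explicitly (for example as NA) rather than left as empty fields.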
Re: [R] read.delim skips first column (why?)
Try

    count.fields("myfile.txt", sep = "\t")

read.delim uses sep = "\t", but there are trailing tabs on some lines. The first line, i.e. the one with the headers, has three trailing tabs, so read.delim thinks that there are 15 columns rather than 12. The 5th line of the file (the 4th line of data) has four trailing tabs, so it thinks that there are up to 16 fields in each data line. Since it now believes that there are 16 fields of data and 15 fields of headers, it assumes that the extra field, i.e. the first one, holds the row names.

On Tue, Jul 14, 2009 at 5:11 AM, Giovanni Marco Dall'Olio dalloli...@gmail.com wrote:
> Hi,
> I have uploaded a copy of the file here:
> - http://pastebin.com/fd0edfab
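The diagnosis above can be tried out on a small made-up file. In this sketch the file contents and column names are assumptions chosen so that one data line has exactly one more field than the header, which is the situation described for the real file (16 fields versus 15):

```r
# Hypothetical file: the header has 1 trailing tab (4 fields), the
# second data line has 2 trailing tabs (5 fields).
path <- tempfile(fileext = ".txt")
writeLines(c("snp_id\tgene\tchromosome\t",
             "rs1\tRAPT2\t3",
             "rs2\tRAPT2\t3\t\t"), path)

# Number of tab-separated fields on each line, trailing empties included.
n_fields <- count.fields(path, sep = "\t")
print(n_fields)

# One data line has one more field than the header, so read.delim()
# treats the first field of every data line as row names.
bad <- read.delim(path)
print(rownames(bad))   # the SNP ids end up here
print(bad$snp_id)      # and the gene names are shifted into snp_id
```

Comparing the first element of count.fields() with the maximum of the rest is a quick way to spot this kind of mismatch before importing.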
[R] read.delim skips first column (why?)
Hi people,
I have a text file like this one posted:

    snp_id     gene   chromosome  distance_from_gene_center  position  pop1       pop2       pop3       pop4       pop5       pop6       pop7
    rs2129081  RAPT2  3           -129993                    upstream  0.439009   1.169210   NA         0.233020   0.093042   NA         -0.902596
    rs1202698  RAPT2  3           -128695                    upstream  NA         1.815000   NA         0.399079   1.814270   1.382950   NA
    rs1163207  RAPT2  3           -128224                    upstream  NA         NA         NA         NA         NA         NA         NA
    rs1834127  RAPT2  3           -128106                    upstream  NA         NA         NA         NA         NA         NA         2.180670
    rs2114211  RAPT2  3           -126738                    upstream  -0.468279  -1.447620  NA         0.010616   -0.414581  NA         0.550447
    rs2113151  RAPT2  3           -124620                    upstream  -0.897660  -1.971020  NA         -0.920327  -0.764658  NA         0.337127
    rs2524130  RAPT2  3           -123029                    upstream  -0.109795  -0.004646  -0.412059  1.116740   0.667567   -0.924529  0.962841
    rs1381318  RAPT2  3           -12818                     upstream  -0.911662  -1.791580  NA         -0.945716  -1.239640  NA         0.004876
    rs2113319  RAPT2  3           -122028                    upstream  -0.911662  -1.738610  NA         -0.945716  -1.240950  NA         -0.005318

When I use read.delim (or any read function) on it, R skips the first column, and I don't understand why.

For example:

    $: R
    > data = read.delim('snp_file.txt', head=T, sep='\t')

Now, I would expect data$snp_id to contain SNP ids and data$gene to contain gene names, but that is not what I get:

    > data$snp_id
    [1] RAPT2 RAPT2 RAPT2 RAPT2 RAPT2 RAPT2 RAPT2 RAPT2 RAPT2
    Levels: RAPT2
    > data$gene
    [1] 3 3 3 3 3 3 3 3 3
    > summary(data)
       snp_id       gene     chromosome      distance_from_gene_center
     RAPT2:9   Min.   :3   Min.   :-129993   upstream:9
               1st Qu.:3   1st Qu.:-128224
               Median :3   Median :-126738
               Mean   :3   Mean   :-113806
               3rd Qu.:3   3rd Qu.:-123029
               Max.   :3   Max.   : -12818
    > data$pop7
    [1] NA NA NA NA NA NA NA NA NA

Notice that it did use snp_id as the header for the first column, but it completely skips all the data from that column, and all the fields are shifted, so the last column is filled with NA values.

What am I doing wrong? Could it be a problem with my data files? I have tried to modify them a bit (adding new columns, etc.) but it didn't work.

I am running R from an Ubuntu system:

    > sessionInfo()
    R version 2.9.1 (2009-06-26)
    i486-pc-linux-gnu

    locale:
    LC_CTYPE=it_IT.UTF-8;LC_NUMERIC=C;LC_TIME=it_IT.UTF-8;LC_COLLATE=it_IT.UTF-8;LC_MONETARY=C;LC_MESSAGES=it_IT.UTF-8;LC_PAPER=it_IT.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=it_IT.UTF-8;LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

--
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)
My blog on bioinformatics: http://bioinfoblog.it
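If the shift described in this post does come from trailing tabs, as suggested elsewhere in the thread, one possible workaround is to strip them before parsing. This is only a sketch on a small made-up file (the path and its three columns are assumptions), and it is not something proposed by the posters themselves:

```r
# Hypothetical file with stray tabs at the end of some lines.
path <- tempfile(fileext = ".txt")
writeLines(c("snp_id\tgene\tchromosome\t",
             "rs1\tRAPT2\t3",
             "rs2\tRAPT2\t3\t\t"), path)

# Read the raw lines and remove any run of tabs at the end of each line.
raw_lines <- readLines(path)
clean_lines <- sub("\t+$", "", raw_lines)

# Parse the cleaned lines as tab-separated text.
fixed <- read.delim(text = clean_lines)
print(names(fixed))
print(fixed$snp_id)
```

Note that this also removes trailing fields that are genuinely empty; with read.delim that is usually harmless because fill = TRUE pads short lines with NA, but it is worth checking against count.fields() on the original file.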