Re: [R] Fwd: Re: transpose and split dataframe
Thank you very much Jim and David for your scripts and accompanying explanations. I was intrigued at the results that came from David's script. As seen below where I have taken a small piece of his DataTable: AT1G69490 AT1G29860 AT4G18170 *AT5G46350* AT1G01560 0 0 0 1 *AT1G02920* 1 2 2 4 AT1G02930 1 2 2 4 AT1G05675 1 1 1 2 There are numbers other than 1 or 0, which was not what I was expecting. The data I am working with come from downloading results of an analysis done at a particular web site. I looked at Jim's solution, and the equivalent of the above would be: AT1G69490 _AT1G29860_ _AT1G29860_ AT4G18170 AT4G18170 *AT5G46350 AT5G46350 AT5G46350 AT5G46350 AT5G46350* AT1G01560 NA NA NA NA NA NA NA NA AT1G01560 NA *AT1G02920* AT1G02920 AT1G02920 AT1G02920 AT1G02920 AT1G02920 AT1G02920 AT1G02920 AT1G02920 AT1G02920 NA AT1G02930 AT1G02930 AT1G02930 AT1G02930 AT1G02930 AT1G02930 AT1G02930 AT1G02930 AT1G02930 AT1G02930 NA AT1G05675 AT1G05675 AT1G05675 NA AT1G05675 NA AT1G05675 AT1G05675 NA NA NA The above is the format that I was desiring, but I was not expecting that a single ATG number would be the name of multiple columns. As shown above, _AT1G2960_ is the name of two columns and *AT5G46350* is the name of 5 columns (You may have to widen the e-mail across the screen to see it clearly). When a single ATG number, such as AT5G46350, names multiple columns, then the contents of each of those columns may or may not be the same. For example, going across a single row looking at *AT1G02920*, it occurs in the first column, hence the 1 in David's DataTable. It occurs in both AT1G29860 columns, hence the 2 in the DataTable. It again occurs in both AT4G18170 columns, so another 2 in the DataTable, and finally it occurs in only 4 of the 5 AT5G46350 columns, so the 4 in the DataTable. When the same ATG number names multiple columns it is because different methods were used to determine the content of each column. So, if an ATG number such as AT1G05675 occurs in all columns with the same name, I then know that it was by multiple methods that this has been shown, and if it only occurs in some of the columns, I know that all methods did not associate it with the column name ATG. David's result complements Jim's, and both end up being very helpful to me. Thanks again to both of you for your time and help. Matthew On 5/2/2019 8:40 PM, Jim Lemon wrote: > External Email - Use Caution > > Hi again, > Just noticed that the NA fill in the original solution is unnecessary, thus: > > # split the second column at the commas > hitsplit<-strsplit(mmdf$hits,",") > # get all the sorted hits > allhits<-sort(unique(unlist(hitsplit))) > tmmdf<-as.data.frame(matrix(NA,ncol=length(hitsplit),nrow=length(allhits))) > # change the names of the list > names(tmmdf)<-mmdf$Regulator > for(column in 1:length(hitsplit)) { > hitmatches<-match(hitsplit[[column]],allhits) > hitmatches<-hitmatches[!is.na(hitmatches)] > tmmdf[hitmatches,column]<-allhits[hitmatches] > } > > Jim > > On Fri, May 3, 2019 at 10:32 AM Jim Lemon wrote: >> Hi Matthew, >> I'm not sure whether you want something like your initial request or >> David's solution. The result of this can be transformed into the >> latter: >> >> mmdf<-read.table(text="Regulator hits >> AT1G69490 >> AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830 >> AT1G29860 >> AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135 >> AT1G2986 >> AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830", >> header=TRUE,stringsAsFactors=
Re: [R] Fwd: Re: transpose and split dataframe
Hi again, Just noticed that the NA fill in the original solution is unnecessary, thus: # split the second column at the commas hitsplit<-strsplit(mmdf$hits,",") # get all the sorted hits allhits<-sort(unique(unlist(hitsplit))) tmmdf<-as.data.frame(matrix(NA,ncol=length(hitsplit),nrow=length(allhits))) # change the names of the list names(tmmdf)<-mmdf$Regulator for(column in 1:length(hitsplit)) { hitmatches<-match(hitsplit[[column]],allhits) hitmatches<-hitmatches[!is.na(hitmatches)] tmmdf[hitmatches,column]<-allhits[hitmatches] } Jim On Fri, May 3, 2019 at 10:32 AM Jim Lemon wrote: > > Hi Matthew, > I'm not sure whether you want something like your initial request or > David's solution. The result of this can be transformed into the > latter: > > mmdf<-read.table(text="Regulator hits > AT1G69490 > AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830 > AT1G29860 > AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135 > AT1G2986 > AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830", > header=TRUE,stringsAsFactors=FALSE) > # split the second column at the commas > hitsplit<-strsplit(mmdf$hits,",") > # define a function that will fill with NAs > NAfill<-function(x,n) return(x[1:n]) > # get the maximum length of hits > maxlen<-max(unlist(lapply(hitsplit,length))) > # fill the list with NAs > hitsplit<-lapply(hitsplit,NAfill,maxlen) > # get all the sorted hits > allhits<-sort(unique(unlist(hitsplit))) > tmmdf<-as.data.frame(matrix(NA,ncol=length(hitsplit),nrow=length(allhits))) > # change the names of the list > names(tmmdf)<-mmdf$Regulator > # replace all NA values in tmmdf where they appear in hitsplit > for(column in 1:length(hitsplit)) { > hitmatches<-match(hitsplit[[column]],allhits) > hitmatches<-hitmatches[!is.na(hitmatches)] > tmmdf[hitmatches,column]<-allhits[hitmatches] > } > > Jim > > On Fri, May 3, 2019 at 12:43 AM David L Carlson wrote: > > > > We still have only the toy version of your data from your first email. The > > second email used dput() as I suggested, but you truncated the results so > > it is useless for testing purposes. > > > > Use the following code after creating DataList (up to mx <- ... ) in my > > earlier answer: > > > > n <- sapply(DataList, length) > > hits <- unname(unlist(DataList)) > > Regulator <- unname(unlist(mapply(rep, names(DataList), times=n))) > > DataTable <- table(hits, Regulator) > > > > #Regulator > > # hitsAT1G69490 AT2G55980 > > # AT1G05675 1 0 > > # AT1G26380 1 0 > > # AT2G85403 0 1 > > # AT4G31950 1 0 > > # AT4G89223 0 1 > > # AT5G24110 1 0 > > > > Now the Regulators and the hits will be listed in alphabetical order. The > > table has 0's for Regulators that do not have a particular hit. If you want > > NAs: > > > > DataTable[DataTable==0] <- NA > > print(DataTable, na.print="NA") > > #Regulator > > # hitsAT1G69490 AT2G55980 > > # AT1G05675 1NA > > # AT1G26380 1NA > > # AT2G85403NA 1 > > # AT4G31950 1NA > > # AT4G89223NA 1 > > # AT5G24110 1NA > > > > If you need a data frame instead of a table: > > > > as.data.frame.matrix(DataTable) > > > > > > David L Carlson > > Department of Anthropology > > Texas A&M University > > College Station, TX 77843-4352 > > > > -Original Message- > > From: R-help On Behalf Of Matthew > > Sent: Tuesday, April 30, 2019 4:31 PM > > To: r-help@r-project.org > > Subject: [R] Fwd: Re: transpose and split dataframe > > > > Thanks for your reply. I was trying to simplify it a l
Re: [R] Fwd: Re: transpose and split dataframe
Hi Matthew, I'm not sure whether you want something like your initial request or David's solution. The result of this can be transformed into the latter: mmdf<-read.table(text="Regulator hits AT1G69490 AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830 AT1G29860 AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135 AT1G2986 AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830", header=TRUE,stringsAsFactors=FALSE) # split the second column at the commas hitsplit<-strsplit(mmdf$hits,",") # define a function that will fill with NAs NAfill<-function(x,n) return(x[1:n]) # get the maximum length of hits maxlen<-max(unlist(lapply(hitsplit,length))) # fill the list with NAs hitsplit<-lapply(hitsplit,NAfill,maxlen) # get all the sorted hits allhits<-sort(unique(unlist(hitsplit))) tmmdf<-as.data.frame(matrix(NA,ncol=length(hitsplit),nrow=length(allhits))) # change the names of the list names(tmmdf)<-mmdf$Regulator # replace all NA values in tmmdf where they appear in hitsplit for(column in 1:length(hitsplit)) { hitmatches<-match(hitsplit[[column]],allhits) hitmatches<-hitmatches[!is.na(hitmatches)] tmmdf[hitmatches,column]<-allhits[hitmatches] } Jim On Fri, May 3, 2019 at 12:43 AM David L Carlson wrote: > > We still have only the toy version of your data from your first email. The > second email used dput() as I suggested, but you truncated the results so it > is useless for testing purposes. > > Use the following code after creating DataList (up to mx <- ... ) in my > earlier answer: > > n <- sapply(DataList, length) > hits <- unname(unlist(DataList)) > Regulator <- unname(unlist(mapply(rep, names(DataList), times=n))) > DataTable <- table(hits, Regulator) > > #Regulator > # hitsAT1G69490 AT2G55980 > # AT1G05675 1 0 > # AT1G26380 1 0 > # AT2G85403 0 1 > # AT4G31950 1 0 > # AT4G89223 0 1 > # AT5G24110 1 0 > > Now the Regulators and the hits will be listed in alphabetical order. The > table has 0's for Regulators that do not have a particular hit. If you want > NAs: > > DataTable[DataTable==0] <- NA > print(DataTable, na.print="NA") > #Regulator > # hitsAT1G69490 AT2G55980 > # AT1G05675 1NA > # AT1G26380 1NA > # AT2G85403NA 1 > # AT4G31950 1NA > # AT4G89223NA 1 > # AT5G24110 1NA > > If you need a data frame instead of a table: > > as.data.frame.matrix(DataTable) > > > David L Carlson > Department of Anthropology > Texas A&M University > College Station, TX 77843-4352 > > -Original Message- > From: R-help On Behalf Of Matthew > Sent: Tuesday, April 30, 2019 4:31 PM > To: r-help@r-project.org > Subject: [R] Fwd: Re: transpose and split dataframe > > Thanks for your reply. I was trying to simplify it a little, but must > have got it wrong. Here is the real dataframe, TF2list: > > str(TF2list) > 'data.frame':152 obs. of 2 variables: > $ Regulator: Factor w/ 87 levels "AT1G02065","AT1G13960",..: 17 6 6 54 > 54 82 82 82 82 82 ... > $ hits : Factor w/ 97 levels > "AT1G05675,AT3G12910,AT1G22810,AT1G14540,AT1G21120,AT1G07160,AT5G22520,AT1G56250,AT2G31345,AT5G22530,AT4G11170,A"| > __truncated__,..: 65 57 90 57 87 57 56 91 31 17 ... > > And the first few lines resulting from dput(head(TF2list)): > > dput(head(TF2list)) > structure(list(Regulator = structure(c(17L, 6L, 6L, 54L, 54L, > 82L), .Label = c("AT1G02065", "AT1G13960", "AT1G18860", "AT1G23380", > "AT1G29280", "AT1G29860", "AT1G30650", "AT1G55600", "AT1G62300", > "AT1G62990", "AT1G64000", "AT1G665
Re: [R] Fwd: Re: transpose and split dataframe
We still have only the toy version of your data from your first email. The second email used dput() as I suggested, but you truncated the results so it is useless for testing purposes. Use the following code after creating DataList (up to mx <- ... ) in my earlier answer: n <- sapply(DataList, length) hits <- unname(unlist(DataList)) Regulator <- unname(unlist(mapply(rep, names(DataList), times=n))) DataTable <- table(hits, Regulator) #Regulator # hitsAT1G69490 AT2G55980 # AT1G05675 1 0 # AT1G26380 1 0 # AT2G85403 0 1 # AT4G31950 1 0 # AT4G89223 0 1 # AT5G24110 1 0 Now the Regulators and the hits will be listed in alphabetical order. The table has 0's for Regulators that do not have a particular hit. If you want NAs: DataTable[DataTable==0] <- NA print(DataTable, na.print="NA") #Regulator # hitsAT1G69490 AT2G55980 # AT1G05675 1NA # AT1G26380 1NA # AT2G85403NA 1 # AT4G31950 1NA # AT4G89223NA 1 # AT5G24110 1NA If you need a data frame instead of a table: as.data.frame.matrix(DataTable) David L Carlson Department of Anthropology Texas A&M University College Station, TX 77843-4352 -Original Message- From: R-help On Behalf Of Matthew Sent: Tuesday, April 30, 2019 4:31 PM To: r-help@r-project.org Subject: [R] Fwd: Re: transpose and split dataframe Thanks for your reply. I was trying to simplify it a little, but must have got it wrong. Here is the real dataframe, TF2list: str(TF2list) 'data.frame': 152 obs. of 2 variables: $ Regulator: Factor w/ 87 levels "AT1G02065","AT1G13960",..: 17 6 6 54 54 82 82 82 82 82 ... $ hits : Factor w/ 97 levels "AT1G05675,AT3G12910,AT1G22810,AT1G14540,AT1G21120,AT1G07160,AT5G22520,AT1G56250,AT2G31345,AT5G22530,AT4G11170,A"| __truncated__,..: 65 57 90 57 87 57 56 91 31 17 ... And the first few lines resulting from dput(head(TF2list)): dput(head(TF2list)) structure(list(Regulator = structure(c(17L, 6L, 6L, 54L, 54L, 82L), .Label = c("AT1G02065", "AT1G13960", "AT1G18860", "AT1G23380", "AT1G29280", "AT1G29860", "AT1G30650", "AT1G55600", "AT1G62300", "AT1G62990", "AT1G64000", "AT1G66550", "AT1G66560", "AT1G66600", "AT1G68150", "AT1G69310", "AT1G69490", "AT1G69810", "AT1G70510", ... This is another way of looking at the first 4 entries (Regulator is tab-separated from hits): Regulator hits 1 AT1G69490 AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830 2 AT1G29860 AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135 3 AT1G2986 AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830 So, the goal would be to first: Transpose the existing dataframe so that the factor Regulator becomes a column name (column 1 name = AT1G69490, column2 name AT1G29860, etc.) and the hits associated with each Regulator become rows. Hits is a comma separated 'list' ( I do not not know if technically it is an R list.), so it would have to be comma 'unseparated' with each entry becoming a row (col 1 row 1 = AT4G31950, col 1 row 2 - AT5G24410, etc); like this : AT1G69490 AT4G31950 AT5G24110 AT1G05675 AT5G64905 ... I did not include all the rows) I think it would be best to actually make the first entry a separate dataframe ( 1 column with name = AT1G69490 and number of rows depending on the number of hits), then make the second column (column name = AT1G29860, and number of rows depending on the number of hits) into a new dataframe and do a full join of of the two dataframes; continue by making the thi
Re: [R] Fwd: Re: transpose and split dataframe
Thank you very much, David and Jim for your work and solutions. I have been working through both of them to better learn R. They both proceed through a similar logic except David's starts with a character matrix and Jim's with a dataframe, and both end with equivalent dataframes ( identical(tmmdf, TF2list2)) returns TRUE ). They have both been very helpful. However, there is one attribute of my intended final dataframe that is missing. Looking at part of the final dataframe: head(tmmdf) AT1G69490 AT1G29860 AT1G29860.1 AT4G18170 AT4G18170.1 AT5G46350 1 *AT4G31950* *AT4G31950* AT5G64905 *AT4G31950* AT5G64905 *AT4G31950* 2 AT5G24110 AT5G24110 AT1G21120 AT5G24110 AT1G14540 AT5G24110 3 AT1G26380 AT1G05675 AT1G07160 AT1G05675 AT1G21120 AT1G05675 Row 1 has *AT4G31950* in columns 1,2,4 and 6, but AT4G31950 in columns 3 and 5. What I was aiming at would be that each row would have a unique entry so that AT4G31950 is row 1 columns 1,2,4 and 6, and NA is row 1 columns 3 and 5. AT4G31950 is row 2 columns 3 and 5 and NA is row 2 columns 1,2,4 and 6. So, it would look like this: head(intended_df) AT1G69490 AT1G29860 AT1G29860.1 AT4G18170 AT4G18170.1 AT5G46350 1 AT4G31950 AT4G31950 NA AT4G31950 NA AT4G31950 2 NA NA AT4G31950 NA AT4G31950 NA I have been trying to adjust the code to get my intended result basically by trying to build a dataframe one column at a time from each entry in the character matrix, but have not got anything near working yet. Matthew On 4/30/2019 6:29 PM, David L Carlson wrote > If you read the data frame with read.csv() or one of the other read() > functions, use the asis=TRUE argument to prevent conversion to factors. If > not do the conversion first: > > # Convert factors to characters > DataMatrix <- sapply(TF2list, as.character) > # Split the vector of hits > DataList <- sapply(DataMatrix[, 2], strsplit, split=",") > # Use the values in Regulator to name the parts of the list > names(DataList) <- DataMatrix[,"Regulator"] > > # Now create a data frame > # How long is the longest list of hits? > mx <- max(sapply(DataList, length)) > # Now add NAs to vectors shorter than mx > DataList2 <- lapply(DataList, function(x) c(x, rep(NA, mx-length(x > # Finally convert back to a data frame > TF2list2 <- do.call(data.frame, DataList2) > > Try this on a portion of the list, say 25 lines and print each object to see > what is happening. > > > David L Carlson > Department of Anthropology > Texas A&M University > College Station, TX 77843-4352 > > > > > > -Original Message- > From: R-help On Behalf Of Matthew > Sent: Tuesday, April 30, 2019 4:31 PM > To: r-help@r-project.org > Subject: [R] Fwd: Re: transpose and split dataframe > > Thanks for your reply. I was trying to simplify it a little, but must > have got it wrong. Here is the real dataframe, TF2list: > > str(TF2list) > 'data.frame': 152 obs. of 2 variables: > $ Regulator: Factor w/ 87 levels "AT1G02065","AT1G13960",..: 17 6 6 54 > 54 82 82 82 82 82 ... > $ hits : Factor w/ 97 levels > "AT1G05675,AT3G12910,AT1G22810,AT1G14540,AT1G21120,AT1G07160,AT5G22520,AT1G56250,AT2G31345,AT5G22530,AT4G11170,A"| > __truncated__,..: 65 57 90 57 87 57 56 91 31 17 ... > > And the first few lines resulting from dput(head(TF2list)): > > dput(head(TF2list)) > structure(list(Regulator = structure(c(17L, 6L, 6L, 54L, 54L, > 82L), .Label = c("AT1G02065", "AT1G13960", "AT1G18860", "AT1G23380", > "AT1G29280", "AT1G29860", "AT1G30650", "AT1G55600", "AT1G62300", > "AT1G62990", "AT1G64000", "AT1G66550", "AT1G66560", "AT1G66600", > "AT1G68150", "AT1G69310", "AT1G69490", "AT1G69810", "AT1G70510", ... > > This is another way of looking at the first 4 entries (Regulator is > tab-separated from hits): > > Regulator > hits > 1 > AT1G69490 > > AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830 > 2 > AT1G29860 > > AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820
Re: [R] Fwd: Re: transpose and split dataframe
If you read the data frame with read.csv() or one of the other read() functions, use the asis=TRUE argument to prevent conversion to factors. If not do the conversion first: # Convert factors to characters DataMatrix <- sapply(TF2list, as.character) # Split the vector of hits DataList <- sapply(DataMatrix[, 2], strsplit, split=",") # Use the values in Regulator to name the parts of the list names(DataList) <- DataMatrix[,"Regulator"] # Now create a data frame # How long is the longest list of hits? mx <- max(sapply(DataList, length)) # Now add NAs to vectors shorter than mx DataList2 <- lapply(DataList, function(x) c(x, rep(NA, mx-length(x # Finally convert back to a data frame TF2list2 <- do.call(data.frame, DataList2) Try this on a portion of the list, say 25 lines and print each object to see what is happening. David L Carlson Department of Anthropology Texas A&M University College Station, TX 77843-4352 -Original Message- From: R-help On Behalf Of Matthew Sent: Tuesday, April 30, 2019 4:31 PM To: r-help@r-project.org Subject: [R] Fwd: Re: transpose and split dataframe Thanks for your reply. I was trying to simplify it a little, but must have got it wrong. Here is the real dataframe, TF2list: str(TF2list) 'data.frame': 152 obs. of 2 variables: $ Regulator: Factor w/ 87 levels "AT1G02065","AT1G13960",..: 17 6 6 54 54 82 82 82 82 82 ... $ hits : Factor w/ 97 levels "AT1G05675,AT3G12910,AT1G22810,AT1G14540,AT1G21120,AT1G07160,AT5G22520,AT1G56250,AT2G31345,AT5G22530,AT4G11170,A"| __truncated__,..: 65 57 90 57 87 57 56 91 31 17 ... And the first few lines resulting from dput(head(TF2list)): dput(head(TF2list)) structure(list(Regulator = structure(c(17L, 6L, 6L, 54L, 54L, 82L), .Label = c("AT1G02065", "AT1G13960", "AT1G18860", "AT1G23380", "AT1G29280", "AT1G29860", "AT1G30650", "AT1G55600", "AT1G62300", "AT1G62990", "AT1G64000", "AT1G66550", "AT1G66560", "AT1G66600", "AT1G68150", "AT1G69310", "AT1G69490", "AT1G69810", "AT1G70510", ... This is another way of looking at the first 4 entries (Regulator is tab-separated from hits): Regulator hits 1 AT1G69490 AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830 2 AT1G29860 AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135 3 AT1G2986 AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830 So, the goal would be to first: Transpose the existing dataframe so that the factor Regulator becomes a column name (column 1 name = AT1G69490, column2 name AT1G29860, etc.) and the hits associated with each Regulator become rows. Hits is a comma separated 'list' ( I do not not know if technically it is an R list.), so it would have to be comma 'unseparated' with each entry becoming a row (col 1 row 1 = AT4G31950, col 1 row 2 - AT5G24410, etc); like this : AT1G69490 AT4G31950 AT5G24110 AT1G05675 AT5G64905 ... I did not include all the rows) I think it would be best to actually make the first entry a separate dataframe ( 1 column with name = AT1G69490 and number of rows depending on the number of hits), then make the second column (column name = AT1G29860, and number of rows depending on the number of hits) into a new dataframe and do a full join of of the two dataframes; continue by making the third column (column name = AT1G2986) into a dataframe and full join it with the previous; continue for the 152 observations so that then end result is a dataframe with 152 columns and number of rows depending on the entry with the greatest number of hits. The full joins I can do with dplyr, but getting up to that point seems rather difficult. This would get me what my ultimate goal would be; each Regulator is a