Re: [R] Fwd: Re: transpose and split dataframe

Jim Lemon Thu, 02 May 2019 17:44:17 -0700

Hi again,
Just noticed that the NA fill in the original solution is unnecessary, thus:


# split the second column at the commas
hitsplit<-strsplit(mmdf$hits,",")
# get all the sorted hits
allhits<-sort(unique(unlist(hitsplit)))
# one row per hit, one column per Regulator, all NA to start
tmmdf<-as.data.frame(matrix(NA,ncol=length(hitsplit),nrow=length(allhits)))
# name the columns after the Regulators
names(tmmdf)<-mmdf$Regulator
# replace the NA values in tmmdf where a hit appears in hitsplit
for(column in seq_along(hitsplit)) {
 hitmatches<-match(hitsplit[[column]],allhits)
 hitmatches<-hitmatches[!is.na(hitmatches)]
 tmmdf[hitmatches,column]<-allhits[hitmatches]
}
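
For anyone following along without the full data, here is a minimal sketch of the same approach run on the two-row toy data from the first message in this thread. The name `toy` is only for illustration and is not part of the original code.

```r
# Toy data from the first message in the thread (hypothetical name: toy)
toy <- data.frame(
  Regulator = c("AT1G69490", "AT2G55980"),
  hits = c("AT4G31950,AT5G24110,AT1G26380,AT1G05675",
           "AT2G85403,AT4G89223"),
  stringsAsFactors = FALSE
)
# split each hits string at the commas
hitsplit <- strsplit(toy$hits, ",")
# one row per distinct hit, in sorted order
allhits <- sort(unique(unlist(hitsplit)))
# one column per Regulator, all NA to begin with
tmmdf <- as.data.frame(matrix(NA, ncol = length(hitsplit),
                              nrow = length(allhits)))
names(tmmdf) <- toy$Regulator
# write each hit into the row that belongs to it
for (column in seq_along(hitsplit)) {
  hitmatches <- match(hitsplit[[column]], allhits)
  tmmdf[hitmatches, column] <- allhits[hitmatches]
}
print(tmmdf)
```

Each row then holds either NA or the same hit in every column, which is the layout asked for further down the thread.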

Jim

On Fri, May 3, 2019 at 10:32 AM Jim Lemon <drjimle...@gmail.com> wrote:
>
> Hi Matthew,
> I'm not sure whether you want something like your initial request or
> David's solution. The result of this can be transformed into the
> latter:
>
> mmdf<-read.table(text="Regulator hits
> AT1G69490 AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830
> AT1G29860 AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135
> AT1G2986 AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830",
> header=TRUE,stringsAsFactors=FALSE)
> # split the second column at the commas
> hitsplit<-strsplit(mmdf$hits,",")
> # define a function that will fill with NAs
> NAfill<-function(x,n) return(x[1:n])
> # get the maximum length of hits
> maxlen<-max(unlist(lapply(hitsplit,length)))
> # fill the list with NAs
> hitsplit<-lapply(hitsplit,NAfill,maxlen)
> # get all the sorted hits
> allhits<-sort(unique(unlist(hitsplit)))
> tmmdf<-as.data.frame(matrix(NA,ncol=length(hitsplit),nrow=length(allhits)))
> # name the columns after the Regulators
> names(tmmdf)<-mmdf$Regulator
> # replace all NA values in tmmdf where they appear in hitsplit
> for(column in 1:length(hitsplit)) {
>  hitmatches<-match(hitsplit[[column]],allhits)
>  hitmatches<-hitmatches[!is.na(hitmatches)]
>  tmmdf[hitmatches,column]<-allhits[hitmatches]
> }
>
> Jim
>
> On Fri, May 3, 2019 at 12:43 AM David L Carlson <dcarl...@tamu.edu> wrote:
> >
> > We still have only the toy version of your data from your first email. The 
> > second email used dput() as I suggested, but you truncated the results so 
> > it is useless for testing purposes.
> >
> > Use the following code after creating DataList (up to mx <- ... ) in my 
> > earlier answer:
> >
> > n <- sapply(DataList, length)
> > hits <- unname(unlist(DataList))
> > Regulator <- unname(unlist(mapply(rep, names(DataList), times=n)))
> > DataTable <- table(hits, Regulator)
> >
> > #            Regulator
> > # hits        AT1G69490 AT2G55980
> > #   AT1G05675         1         0
> > #   AT1G26380         1         0
> > #   AT2G85403         0         1
> > #   AT4G31950         1         0
> > #   AT4G89223         0         1
> > #   AT5G24110         1         0
> >
> > Now the Regulators and the hits will be listed in alphabetical order. The 
> > table has 0's for Regulators that do not have a particular hit. If you want 
> > NAs:
> >
> > DataTable[DataTable==0] <- NA
> > print(DataTable, na.print="NA")
> > #            Regulator
> > # hits        AT1G69490 AT2G55980
> > #   AT1G05675         1        NA
> > #   AT1G26380         1        NA
> > #   AT2G85403        NA         1
> > #   AT4G31950         1        NA
> > #   AT4G89223        NA         1
> > #   AT5G24110         1        NA
> >
> > If you need a data frame instead of a table:
> >
> > as.data.frame.matrix(DataTable)
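
`DataList` comes from an earlier answer that is not quoted in this thread. As a rough stand-in, the sketch below assumes it is a named list holding one character vector of hits per Regulator, typed in by hand from the toy data. That construction is a guess for illustration, not the original code, and `DataFrame` is a hypothetical name.

```r
# Assumed stand-in for DataList: a named list, one character vector per Regulator
DataList <- list(
  AT1G69490 = c("AT4G31950", "AT5G24110", "AT1G26380", "AT1G05675"),
  AT2G55980 = c("AT2G85403", "AT4G89223")
)
# number of hits for each Regulator
n <- sapply(DataList, length)
# all hits as one long vector
hits <- unname(unlist(DataList))
# repeat each Regulator name once per hit; rep() accepts a vector of times
Regulator <- rep(names(DataList), times = n)
# cross-tabulate hits against Regulators
DataTable <- table(hits, Regulator)
# convert the contingency table to a data frame of 0/1 indicators
DataFrame <- as.data.frame.matrix(DataTable)
print(DataFrame)
```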
> >
> > ----------------------------------------
> > David L Carlson
> > Department of Anthropology
> > Texas A&M University
> > College Station, TX 77843-4352
> >
> > -----Original Message-----
> > From: R-help <r-help-boun...@r-project.org> On Behalf Of Matthew
> > Sent: Tuesday, April 30, 2019 4:31 PM
> > To: r-help@r-project.org
> > Subject: [R] Fwd: Re: transpose and split dataframe
> >
> > Thanks for your reply. I was trying to simplify it a little, but must
> > have got it wrong. Here is the real dataframe, TF2list:
> >
> >   str(TF2list)
> > 'data.frame':    152 obs. of  2 variables:
> >   $ Regulator: Factor w/ 87 levels "AT1G02065","AT1G13960",..: 17 6 6 54
> > 54 82 82 82 82 82 ...
> >   $ hits     : Factor w/ 97 levels
> > "AT1G05675,AT3G12910,AT1G22810,AT1G14540,AT1G21120,AT1G07160,AT5G22520,AT1G56250,AT2G31345,AT5G22530,AT4G11170,A"|
> > __truncated__,..: 65 57 90 57 87 57 56 91 31 17 ...
> >
> >     And the first few lines resulting from dput(head(TF2list)):
> >
> > dput(head(TF2list))
> > structure(list(Regulator = structure(c(17L, 6L, 6L, 54L, 54L,
> > 82L), .Label = c("AT1G02065", "AT1G13960", "AT1G18860", "AT1G23380",
> > "AT1G29280", "AT1G29860", "AT1G30650", "AT1G55600", "AT1G62300",
> > "AT1G62990", "AT1G64000", "AT1G66550", "AT1G66560", "AT1G66600",
> > "AT1G68150", "AT1G69310", "AT1G69490", "AT1G69810", "AT1G70510", ...
> >
> > This is another way of looking at the first 4 entries (Regulator is
> > tab-separated from hits):
> >
> >   Regulator	hits
> > 1 AT1G69490	AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830
> > 2 AT1G29860	AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135
> > 3 AT1G2986	AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830
> >
> >     So, the goal would be to
> >
> > first: Transpose the existing dataframe so that the factor Regulator
> > becomes a column name (column 1 name = AT1G69490, column 2 name =
> > AT1G29860, etc.) and the hits associated with each Regulator become
> > rows. Hits is a comma-separated 'list' (I do not know if
> > technically it is an R list), so it would have to be comma
> > 'unseparated' with each entry becoming a row (col 1 row 1 = AT4G31950,
> > col 1 row 2 = AT5G24110, etc.); like this:
> >
> > AT1G69490
> > AT4G31950
> > AT5G24110
> > AT1G05675
> > AT5G64905
> >
> > ... (I did not include all the rows)
> >
> > I think it would be best to actually make the first entry a separate
> > dataframe (1 column with name = AT1G69490 and number of rows depending
> > on the number of hits), then make the second column (column name =
> > AT1G29860, and number of rows depending on the number of hits) into a
> > new dataframe and do a full join of the two dataframes; continue by
> > making the third column (column name = AT1G2986) into a dataframe and
> > full join it with the previous; continue for the 152 observations so
> > that the end result is a dataframe with 152 columns and number of rows
> > depending on the entry with the greatest number of hits. The full joins
> > I can do with dplyr, but getting up to that point seems rather difficult.
> >
> > This would get me what my ultimate goal would be; each Regulator is a
> > column name (152 columns) and a given row has either NA or the same hit.
> >
> >     This seems very difficult to me, but I appreciate any attempt.
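
The join-based plan described above can be sketched in base R. This is only one possible reading of it: each Regulator becomes a small data frame with a shared key column (the hypothetical name `hit`) plus a column named after the Regulator, and `merge(..., all = TRUE)` plays the role of the dplyr full join. It runs on the toy data, not on TF2list.

```r
# Toy input; the columns are factors, as in the str() output above
toy <- data.frame(
  Regulator = c("AT1G69490", "AT2G55980"),
  hits = c("AT4G31950,AT5G24110,AT1G26380,AT1G05675",
           "AT2G85403,AT4G89223"),
  stringsAsFactors = TRUE
)
# factors must be converted to character before splitting
hitsplit <- strsplit(as.character(toy$hits), ",")
names(hitsplit) <- as.character(toy$Regulator)
# one small data frame per Regulator: key column plus the hit itself
pieces <- lapply(names(hitsplit), function(reg) {
  piece <- data.frame(hit = hitsplit[[reg]], stringsAsFactors = FALSE)
  piece[[reg]] <- piece$hit
  piece
})
# successive full joins on the key column
joined <- Reduce(function(a, b) merge(a, b, by = "hit", all = TRUE), pieces)
print(joined)
```

Since the real data has 152 rows but only 87 Regulator levels, repeated Regulators would need distinct column names (or their hits combined) before joining in this way.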
> >
> > Matthew
> >
> > On 4/30/2019 4:34 PM, David L Carlson wrote:
> > > I think we need more information. Can you give us the structure of the 
> > > data with str(YourDataFrame). Alternatively you could copy a small piece 
> > > into your email message by copying and pasting the results of the 
> > > following code:
> > >
> > > dput(head(YourDataFrame))
> > >
> > > The data frame you present could not be a data frame since you say "hits" 
> > > is a factor with a variable number of elements. If each value of "hits" 
> > > was a single character string, it would only have 2 factor levels not 6 
> > > and your efforts to parse the string would make more sense. Transposing 
> > > to a data frame would only be possible if each column was padded with NAs 
> > > to make them equal in length. Since your example tries to use the name 
> > > TF2list, it is possible that you do not have a data frame but a list and 
> > > you have no factor levels, just character vectors.
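
A minimal sketch of the padding idea, using the toy data from the original question: indexing a vector past its end returns NA, so every column can be stretched to the longest length before building the data frame. The names `toy`, `padded` and `wide` are illustrative only.

```r
toy <- data.frame(
  Regulator = c("AT1G69490", "AT2G55980"),
  hits = c("AT4G31950,AT5G24110,AT1G26380,AT1G05675",
           "AT2G85403,AT4G89223"),
  stringsAsFactors = FALSE
)
# split each hits string at the commas
hitsplit <- strsplit(toy$hits, ",")
# length of the longest set of hits
maxlen <- max(lengths(hitsplit))
# x[seq_len(maxlen)] pads the shorter vectors with NA
padded <- lapply(hitsplit, function(x) x[seq_len(maxlen)])
names(padded) <- toy$Regulator
# equal-length columns can now form a data frame
wide <- as.data.frame(padded, stringsAsFactors = FALSE)
print(wide)
```

This gives the layout in the original request (hits listed top-down under each Regulator), where a given row does not necessarily hold the same hit across columns.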
> > >
> > > If you are not familiar with R, it may be helpful to tell us what your 
> > > overall goal is rather than an intermediate step. Very likely R can 
> > > easily handle what you want by doing things a different way.
> > >
> > > ----------------------------------------
> > > David L Carlson
> > > Department of Anthropology
> > > Texas A&M University
> > > College Station, TX 77843-4352
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: R-help<r-help-boun...@r-project.org>  On Behalf Of Matthew
> > > Sent: Tuesday, April 30, 2019 2:25 PM
> > > To: r-help (r-help@r-project.org)<r-help@r-project.org>
> > > Subject: [R] transpose and split dataframe
> > >
> > > I have a data frame that is a lot bigger but for simplicity's sake we can
> > > say it looks like this:
> > >
> > > Regulator    hits
> > > AT1G69490    AT4G31950,AT5G24110,AT1G26380,AT1G05675
> > > AT2G55980    AT2G85403,AT4G89223
> > >
> > >      In other words:
> > >
> > > data.frame : 2 obs. of 2 variables
> > > $Regulator: Factor w/ 2 levels
> > > $hits         : Factor w/ 6 levels
> > >
> > >     I want to transpose it so that Regulator is now the column headings
> > > and each of the AGI numbers now separated by commas is a row. So,
> > > AT1G69490 is now the header of the first column and AT4G31950 is row 1
> > > of column 1, AT5G24110 is row 2 of column 1, etc. AT2G55980 is header of
> > > column 2 and AT2G85403 is row 1 of column 2, etc.
> > >
> > >     I have tried playing around with strsplit(TF2list[2:2]) and
> > > strsplit(as.character(TF2list[2:2])), but I am getting nowhere.
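
One likely reason those attempts stall: `TF2list[2:2]` is still a data frame (with one column), while `strsplit()` needs a character vector. Pulling the column out with `$` and converting the factor with `as.character()` is usually enough. The sketch below uses a small stand-in for TF2list built from the toy data; the real object may differ.

```r
# Stand-in for TF2list with factor columns, as described in the question
TF2list <- data.frame(
  Regulator = c("AT1G69490", "AT2G55980"),
  hits = c("AT4G31950,AT5G24110,AT1G26380,AT1G05675",
           "AT2G85403,AT4G89223"),
  stringsAsFactors = TRUE
)
# TF2list[2:2] is a one-column data frame, not a vector
print(class(TF2list[2:2]))
# extract the column, convert the factor to character, then split
hitsplit <- strsplit(as.character(TF2list$hits), ",")
names(hitsplit) <- as.character(TF2list$Regulator)
str(hitsplit)
```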
> > >
> > > Matthew
> > >
> > > ______________________________________________
> > > R-help@r-project.org  mailing list -- To UNSUBSCRIBE and more, see
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> >
> >         [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
