Re: [R] Fwd: Re: transpose and split dataframe

2019-05-06 Thread Matthew
Thank you very much Jim and David for your scripts and accompanying 
explanations.

I was intrigued at the results that came from David's script.  As seen 
below where I have taken a small piece of his DataTable:

AT1G69490 AT1G29860    AT4G18170 *AT5G46350*
AT1G01560    0    0    0    1
*AT1G02920*    1    2    2    4
AT1G02930    1    2    2    4
AT1G05675    1    1    1    2

    There are numbers other than 1 or 0, which was not what I was 
expecting. The data I am working with come from downloading results of 
an analysis done at a particular web site. I looked at Jim's solution, 
and the equivalent of the above would be:

     AT1G69490 _AT1G29860_ _AT1G29860_ AT4G18170    AT4G18170 
*AT5G46350    AT5G46350 AT5G46350    AT5G46350    AT5G46350*
AT1G01560    NA    NA    NA    NA    NA    NA NA    NA    AT1G01560    NA
*AT1G02920*    AT1G02920    AT1G02920 AT1G02920    AT1G02920    
AT1G02920    AT1G02920    AT1G02920 AT1G02920    AT1G02920    NA
AT1G02930    AT1G02930    AT1G02930    AT1G02930 AT1G02930    
AT1G02930    AT1G02930    AT1G02930    AT1G02930 AT1G02930    NA
AT1G05675    AT1G05675    AT1G05675    NA AT1G05675    NA    
AT1G05675    AT1G05675    NA    NA    NA

   The above is the format that I was desiring, but I was not expecting 
that a single ATG number would be the name of multiple columns. As shown 
above, _AT1G2960_ is the name of two columns and *AT5G46350* is the name 
of 5 columns (You may have to widen the e-mail across the screen to see 
it clearly). When a single ATG number, such as AT5G46350, names multiple 
columns, then the contents of each of those columns may or may not be 
the same. For example, going across a single row looking at *AT1G02920*, 
it occurs in the first column, hence the 1 in David's DataTable. It 
occurs in both AT1G29860 columns, hence the 2 in the DataTable. It again 
occurs in both AT4G18170 columns, so another 2 in the DataTable, and 
finally it occurs in only 4 of the 5 AT5G46350 columns, so the 4 in the 
DataTable.

     When the same ATG number names multiple columns it is because 
different methods were used to determine the content of each column. So, 
if an ATG number such as AT1G05675 occurs in all columns with the same 
name, I then know that it was by multiple methods that this has been 
shown, and if it only occurs in some of the columns, I know that all 
methods did not associate it with the column name ATG.  David's result 
complements Jim's, and both end up being very helpful to me.

   Thanks again to both of you for your time and help.

Matthew


On 5/2/2019 8:40 PM, Jim Lemon wrote:
>  External Email - Use Caution
>
> Hi again,
> Just noticed that the NA fill in the original solution is unnecessary, thus:
>
> # split the second column at the commas
> hitsplit<-strsplit(mmdf$hits,",")
> # get all the sorted hits
> allhits<-sort(unique(unlist(hitsplit)))
> tmmdf<-as.data.frame(matrix(NA,ncol=length(hitsplit),nrow=length(allhits)))
> # change the names of the list
> names(tmmdf)<-mmdf$Regulator
> for(column in 1:length(hitsplit)) {
>   hitmatches<-match(hitsplit[[column]],allhits)
>   hitmatches<-hitmatches[!is.na(hitmatches)]
>   tmmdf[hitmatches,column]<-allhits[hitmatches]
> }
>
> Jim
>
> On Fri, May 3, 2019 at 10:32 AM Jim Lemon  wrote:
>> Hi Matthew,
>> I'm not sure whether you want something like your initial request or
>> David's solution. The result of this can be transformed into the
>> latter:
>>
>> mmdf<-read.table(text="Regulator hits
>> AT1G69490 
>> AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830
>> AT1G29860 
>> AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135
>> AT1G2986 
>> AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830",
>> header=TRUE,stringsAsFactors=

Re: [R] Fwd: Re: transpose and split dataframe

2019-05-02 Thread Jim Lemon
Hi again,
Just noticed that the NA fill in the original solution is unnecessary, thus:

# split the second column at the commas
hitsplit<-strsplit(mmdf$hits,",")
# get all the sorted hits
allhits<-sort(unique(unlist(hitsplit)))
tmmdf<-as.data.frame(matrix(NA,ncol=length(hitsplit),nrow=length(allhits)))
# change the names of the list
names(tmmdf)<-mmdf$Regulator
for(column in 1:length(hitsplit)) {
 hitmatches<-match(hitsplit[[column]],allhits)
 hitmatches<-hitmatches[!is.na(hitmatches)]
 tmmdf[hitmatches,column]<-allhits[hitmatches]
}

Jim

On Fri, May 3, 2019 at 10:32 AM Jim Lemon  wrote:
>
> Hi Matthew,
> I'm not sure whether you want something like your initial request or
> David's solution. The result of this can be transformed into the
> latter:
>
> mmdf<-read.table(text="Regulator hits
> AT1G69490 
> AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830
> AT1G29860 
> AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135
> AT1G2986 
> AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830",
> header=TRUE,stringsAsFactors=FALSE)
> # split the second column at the commas
> hitsplit<-strsplit(mmdf$hits,",")
> # define a function that will fill with NAs
> NAfill<-function(x,n) return(x[1:n])
> # get the maximum length of hits
> maxlen<-max(unlist(lapply(hitsplit,length)))
> # fill the list with NAs
> hitsplit<-lapply(hitsplit,NAfill,maxlen)
> # get all the sorted hits
> allhits<-sort(unique(unlist(hitsplit)))
> tmmdf<-as.data.frame(matrix(NA,ncol=length(hitsplit),nrow=length(allhits)))
> # change the names of the list
> names(tmmdf)<-mmdf$Regulator
> # replace all NA values in tmmdf where they appear in hitsplit
> for(column in 1:length(hitsplit)) {
>  hitmatches<-match(hitsplit[[column]],allhits)
>  hitmatches<-hitmatches[!is.na(hitmatches)]
>  tmmdf[hitmatches,column]<-allhits[hitmatches]
> }
>
> Jim
>
> On Fri, May 3, 2019 at 12:43 AM David L Carlson  wrote:
> >
> > We still have only the toy version of your data from your first email. The 
> > second email used dput() as I suggested, but you truncated the results so 
> > it is useless for testing purposes.
> >
> > Use the following code after creating DataList (up to mx <- ... ) in my 
> > earlier answer:
> >
> > n <- sapply(DataList, length)
> > hits <- unname(unlist(DataList))
> > Regulator <- unname(unlist(mapply(rep, names(DataList), times=n)))
> > DataTable <- table(hits, Regulator)
> >
> > #Regulator
> > # hitsAT1G69490 AT2G55980
> > #  AT1G05675 1 0
> > #  AT1G26380 1 0
> > #  AT2G85403 0 1
> > #  AT4G31950 1 0
> > #  AT4G89223 0 1
> > #  AT5G24110 1 0
> >
> > Now the Regulators and the hits will be listed in alphabetical order. The 
> > table has 0's for Regulators that do not have a particular hit. If you want 
> > NAs:
> >
> > DataTable[DataTable==0] <- NA
> > print(DataTable, na.print="NA")
> > #Regulator
> > # hitsAT1G69490 AT2G55980
> > #   AT1G05675 1NA
> > #   AT1G26380 1NA
> > #   AT2G85403NA 1
> > #   AT4G31950 1NA
> > #   AT4G89223NA 1
> > #   AT5G24110 1NA
> >
> > If you need a data frame instead of a table:
> >
> > as.data.frame.matrix(DataTable)
> >
> > 
> > David L Carlson
> > Department of Anthropology
> > Texas A&M University
> > College Station, TX 77843-4352
> >
> > -Original Message-
> > From: R-help  On Behalf Of Matthew
> > Sent: Tuesday, April 30, 2019 4:31 PM
> > To: r-help@r-project.org
> > Subject: [R] Fwd: Re: transpose and split dataframe
> >
> > Thanks for your reply. I was trying to simplify it a l

Re: [R] Fwd: Re: transpose and split dataframe

2019-05-02 Thread Jim Lemon
Hi Matthew,
I'm not sure whether you want something like your initial request or
David's solution. The result of this can be transformed into the
latter:

mmdf<-read.table(text="Regulator hits
AT1G69490 
AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830
AT1G29860 
AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135
AT1G2986 
AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830",
header=TRUE,stringsAsFactors=FALSE)
# split the second column at the commas
hitsplit<-strsplit(mmdf$hits,",")
# define a function that will fill with NAs
NAfill<-function(x,n) return(x[1:n])
# get the maximum length of hits
maxlen<-max(unlist(lapply(hitsplit,length)))
# fill the list with NAs
hitsplit<-lapply(hitsplit,NAfill,maxlen)
# get all the sorted hits
allhits<-sort(unique(unlist(hitsplit)))
tmmdf<-as.data.frame(matrix(NA,ncol=length(hitsplit),nrow=length(allhits)))
# change the names of the list
names(tmmdf)<-mmdf$Regulator
# replace all NA values in tmmdf where they appear in hitsplit
for(column in 1:length(hitsplit)) {
 hitmatches<-match(hitsplit[[column]],allhits)
 hitmatches<-hitmatches[!is.na(hitmatches)]
 tmmdf[hitmatches,column]<-allhits[hitmatches]
}

Jim

On Fri, May 3, 2019 at 12:43 AM David L Carlson  wrote:
>
> We still have only the toy version of your data from your first email. The 
> second email used dput() as I suggested, but you truncated the results so it 
> is useless for testing purposes.
>
> Use the following code after creating DataList (up to mx <- ... ) in my 
> earlier answer:
>
> n <- sapply(DataList, length)
> hits <- unname(unlist(DataList))
> Regulator <- unname(unlist(mapply(rep, names(DataList), times=n)))
> DataTable <- table(hits, Regulator)
>
> #Regulator
> # hitsAT1G69490 AT2G55980
> #  AT1G05675 1 0
> #  AT1G26380 1 0
> #  AT2G85403 0 1
> #  AT4G31950 1 0
> #  AT4G89223 0 1
> #  AT5G24110 1 0
>
> Now the Regulators and the hits will be listed in alphabetical order. The 
> table has 0's for Regulators that do not have a particular hit. If you want 
> NAs:
>
> DataTable[DataTable==0] <- NA
> print(DataTable, na.print="NA")
> #Regulator
> # hitsAT1G69490 AT2G55980
> #   AT1G05675 1NA
> #   AT1G26380 1NA
> #   AT2G85403NA 1
> #   AT4G31950 1NA
> #   AT4G89223NA 1
> #   AT5G24110 1NA
>
> If you need a data frame instead of a table:
>
> as.data.frame.matrix(DataTable)
>
> 
> David L Carlson
> Department of Anthropology
> Texas A&M University
> College Station, TX 77843-4352
>
> -Original Message-
> From: R-help  On Behalf Of Matthew
> Sent: Tuesday, April 30, 2019 4:31 PM
> To: r-help@r-project.org
> Subject: [R] Fwd: Re: transpose and split dataframe
>
> Thanks for your reply. I was trying to simplify it a little, but must
> have got it wrong. Here is the real dataframe, TF2list:
>
>   str(TF2list)
> 'data.frame':152 obs. of  2 variables:
>   $ Regulator: Factor w/ 87 levels "AT1G02065","AT1G13960",..: 17 6 6 54
> 54 82 82 82 82 82 ...
>   $ hits : Factor w/ 97 levels
> "AT1G05675,AT3G12910,AT1G22810,AT1G14540,AT1G21120,AT1G07160,AT5G22520,AT1G56250,AT2G31345,AT5G22530,AT4G11170,A"|
> __truncated__,..: 65 57 90 57 87 57 56 91 31 17 ...
>
> And the first few lines resulting from dput(head(TF2list)):
>
> dput(head(TF2list))
> structure(list(Regulator = structure(c(17L, 6L, 6L, 54L, 54L,
> 82L), .Label = c("AT1G02065", "AT1G13960", "AT1G18860", "AT1G23380",
> "AT1G29280", "AT1G29860", "AT1G30650", "AT1G55600", "AT1G62300",
> "AT1G62990", "AT1G64000", "AT1G665

Re: [R] Fwd: Re: transpose and split dataframe

2019-05-02 Thread David L Carlson
We still have only the toy version of your data from your first email. The 
second email used dput() as I suggested, but you truncated the results so it is 
useless for testing purposes.

Use the following code after creating DataList (up to mx <- ... ) in my earlier 
answer:

n <- sapply(DataList, length)
hits <- unname(unlist(DataList))
Regulator <- unname(unlist(mapply(rep, names(DataList), times=n)))
DataTable <- table(hits, Regulator)

#Regulator
# hitsAT1G69490 AT2G55980
#  AT1G05675 1 0
#  AT1G26380 1 0
#  AT2G85403 0 1
#  AT4G31950 1 0
#  AT4G89223 0 1
#  AT5G24110 1 0

Now the Regulators and the hits will be listed in alphabetical order. The table 
has 0's for Regulators that do not have a particular hit. If you want NAs:

DataTable[DataTable==0] <- NA
print(DataTable, na.print="NA")
#Regulator
# hitsAT1G69490 AT2G55980
#   AT1G05675 1NA
#   AT1G26380 1NA
#   AT2G85403NA 1
#   AT4G31950 1NA
#   AT4G89223NA 1
#   AT5G24110 1NA

If you need a data frame instead of a table:

as.data.frame.matrix(DataTable)


David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77843-4352

-Original Message-
From: R-help  On Behalf Of Matthew
Sent: Tuesday, April 30, 2019 4:31 PM
To: r-help@r-project.org
Subject: [R] Fwd: Re: transpose and split dataframe

Thanks for your reply. I was trying to simplify it a little, but must 
have got it wrong. Here is the real dataframe, TF2list:

  str(TF2list)
'data.frame':    152 obs. of  2 variables:
  $ Regulator: Factor w/ 87 levels "AT1G02065","AT1G13960",..: 17 6 6 54 
54 82 82 82 82 82 ...
  $ hits : Factor w/ 97 levels 
"AT1G05675,AT3G12910,AT1G22810,AT1G14540,AT1G21120,AT1G07160,AT5G22520,AT1G56250,AT2G31345,AT5G22530,AT4G11170,A"|
 
__truncated__,..: 65 57 90 57 87 57 56 91 31 17 ...

    And the first few lines resulting from dput(head(TF2list)):

dput(head(TF2list))
structure(list(Regulator = structure(c(17L, 6L, 6L, 54L, 54L,
82L), .Label = c("AT1G02065", "AT1G13960", "AT1G18860", "AT1G23380",
"AT1G29280", "AT1G29860", "AT1G30650", "AT1G55600", "AT1G62300",
"AT1G62990", "AT1G64000", "AT1G66550", "AT1G66560", "AT1G66600",
"AT1G68150", "AT1G69310", "AT1G69490", "AT1G69810", "AT1G70510", ...

This is another way of looking at the first 4 entries (Regulator is 
tab-separated from hits):

Regulator
   hits
1
AT1G69490
  
AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830
2
AT1G29860
  
AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135

3
AT1G2986
  
AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830

    So, the goal would be to

first: Transpose the existing dataframe so that the factor Regulator 
becomes a column name (column 1 name = AT1G69490, column2 name 
AT1G29860, etc.) and the hits associated with each Regulator become 
rows. Hits is a comma separated 'list' ( I do not not know if 
technically it is an R list.), so it would have to be comma 
'unseparated' with each entry becoming a row (col 1 row 1 = AT4G31950, 
col 1 row 2 - AT5G24410, etc); like this :

AT1G69490
AT4G31950
AT5G24110
AT1G05675
AT5G64905

... I did not include all the rows)

I think it would be best to actually make the first entry a separate 
dataframe ( 1 column with name = AT1G69490 and number of rows depending 
on the number of hits), then make the second column (column name = 
AT1G29860, and number of rows depending on the number of hits) into a 
new dataframe and do a full join of of the two dataframes; continue by 
making the thi

Re: [R] Fwd: Re: transpose and split dataframe

2019-05-01 Thread Matthew
Thank you very much, David and Jim for your work and solutions.

I have been working through both of them to better learn R. They both 
proceed through a similar logic except David's starts with a character 
matrix and Jim's with a dataframe, and both end with equivalent 
dataframes (  identical(tmmdf, TF2list2)) returns TRUE  ). They have 
both been very helpful. However, there is one attribute of my intended 
final dataframe that is missing.

Looking at part of the final dataframe:

  head(tmmdf)
   AT1G69490 AT1G29860 AT1G29860.1 AT4G18170 AT4G18170.1 AT5G46350
1 *AT4G31950* *AT4G31950*   AT5G64905 *AT4G31950* AT5G64905 *AT4G31950*
2 AT5G24110 AT5G24110   AT1G21120 AT5G24110   AT1G14540 AT5G24110
3 AT1G26380 AT1G05675   AT1G07160 AT1G05675   AT1G21120 AT1G05675

Row 1 has *AT4G31950* in columns 1,2,4 and 6, but AT4G31950 in columns 3 
and 5. What I was aiming at would be that each row would have a unique 
entry so that AT4G31950 is row 1 columns 1,2,4 and 6, and NA is row 1 
columns 3 and 5. AT4G31950 is row 2 columns 3 and 5 and NA is row 2 
columns 1,2,4 and 6. So, it would look like this:

  head(intended_df)
   AT1G69490 AT1G29860 AT1G29860.1 AT4G18170 AT4G18170.1 AT5G46350
1 AT4G31950 AT4G31950 NA                AT4G31950       NA         
AT4G31950

2      NA                NA           AT4G31950       NA            
AT4G31950      NA

I have been trying to adjust the code to get my intended result 
basically by trying to build a dataframe one column at a time from each 
entry in the character matrix, but have not got anything near working yet.

Matthew

On 4/30/2019 6:29 PM, David L Carlson wrote
> If you read the data frame with read.csv() or one of the other read() 
> functions, use the asis=TRUE argument to prevent conversion to factors. If 
> not do the conversion first:
>
> # Convert factors to characters
> DataMatrix <- sapply(TF2list, as.character)
> # Split the vector of hits
> DataList <- sapply(DataMatrix[, 2], strsplit, split=",")
> # Use the values in Regulator to name the parts of the list
> names(DataList) <- DataMatrix[,"Regulator"]
>
> # Now create a data frame
> # How long is the longest list of hits?
> mx <- max(sapply(DataList, length))
> # Now add NAs to vectors shorter than mx
> DataList2 <- lapply(DataList, function(x) c(x, rep(NA, mx-length(x
> # Finally convert back to a data frame
> TF2list2 <- do.call(data.frame, DataList2)
>
> Try this on a portion of the list, say 25 lines and print each object to see 
> what is happening.
>
> 
> David L Carlson
> Department of Anthropology
> Texas A&M University
> College Station, TX 77843-4352
>
>
>
>
>
> -Original Message-
> From: R-help  On Behalf Of Matthew
> Sent: Tuesday, April 30, 2019 4:31 PM
> To: r-help@r-project.org
> Subject: [R] Fwd: Re: transpose and split dataframe
>
> Thanks for your reply. I was trying to simplify it a little, but must
> have got it wrong. Here is the real dataframe, TF2list:
>
>    str(TF2list)
> 'data.frame':    152 obs. of  2 variables:
>    $ Regulator: Factor w/ 87 levels "AT1G02065","AT1G13960",..: 17 6 6 54
> 54 82 82 82 82 82 ...
>    $ hits : Factor w/ 97 levels
> "AT1G05675,AT3G12910,AT1G22810,AT1G14540,AT1G21120,AT1G07160,AT5G22520,AT1G56250,AT2G31345,AT5G22530,AT4G11170,A"|
> __truncated__,..: 65 57 90 57 87 57 56 91 31 17 ...
>
>      And the first few lines resulting from dput(head(TF2list)):
>
> dput(head(TF2list))
> structure(list(Regulator = structure(c(17L, 6L, 6L, 54L, 54L,
> 82L), .Label = c("AT1G02065", "AT1G13960", "AT1G18860", "AT1G23380",
> "AT1G29280", "AT1G29860", "AT1G30650", "AT1G55600", "AT1G62300",
> "AT1G62990", "AT1G64000", "AT1G66550", "AT1G66560", "AT1G66600",
> "AT1G68150", "AT1G69310", "AT1G69490", "AT1G69810", "AT1G70510", ...
>
> This is another way of looking at the first 4 entries (Regulator is
> tab-separated from hits):
>
> Regulator
>     hits
> 1
> AT1G69490
>    
> AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830
> 2
> AT1G29860
>    
> AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820

Re: [R] Fwd: Re: transpose and split dataframe

2019-04-30 Thread David L Carlson
If you read the data frame with read.csv() or one of the other read() 
functions, use the asis=TRUE argument to prevent conversion to factors. If not 
do the conversion first:

# Convert factors to characters
DataMatrix <- sapply(TF2list, as.character)
# Split the vector of hits
DataList <- sapply(DataMatrix[, 2], strsplit, split=",")
# Use the values in Regulator to name the parts of the list
names(DataList) <- DataMatrix[,"Regulator"]

# Now create a data frame
# How long is the longest list of hits?
mx <- max(sapply(DataList, length))
# Now add NAs to vectors shorter than mx
DataList2 <- lapply(DataList, function(x) c(x, rep(NA, mx-length(x
# Finally convert back to a data frame
TF2list2 <- do.call(data.frame, DataList2)

Try this on a portion of the list, say 25 lines and print each object to see 
what is happening. 


David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77843-4352





-Original Message-
From: R-help  On Behalf Of Matthew
Sent: Tuesday, April 30, 2019 4:31 PM
To: r-help@r-project.org
Subject: [R] Fwd: Re: transpose and split dataframe

Thanks for your reply. I was trying to simplify it a little, but must 
have got it wrong. Here is the real dataframe, TF2list:

  str(TF2list)
'data.frame':    152 obs. of  2 variables:
  $ Regulator: Factor w/ 87 levels "AT1G02065","AT1G13960",..: 17 6 6 54 
54 82 82 82 82 82 ...
  $ hits : Factor w/ 97 levels 
"AT1G05675,AT3G12910,AT1G22810,AT1G14540,AT1G21120,AT1G07160,AT5G22520,AT1G56250,AT2G31345,AT5G22530,AT4G11170,A"|
 
__truncated__,..: 65 57 90 57 87 57 56 91 31 17 ...

    And the first few lines resulting from dput(head(TF2list)):

dput(head(TF2list))
structure(list(Regulator = structure(c(17L, 6L, 6L, 54L, 54L,
82L), .Label = c("AT1G02065", "AT1G13960", "AT1G18860", "AT1G23380",
"AT1G29280", "AT1G29860", "AT1G30650", "AT1G55600", "AT1G62300",
"AT1G62990", "AT1G64000", "AT1G66550", "AT1G66560", "AT1G66600",
"AT1G68150", "AT1G69310", "AT1G69490", "AT1G69810", "AT1G70510", ...

This is another way of looking at the first 4 entries (Regulator is 
tab-separated from hits):

Regulator
   hits
1
AT1G69490
  
AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830
2
AT1G29860
  
AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135

3
AT1G2986
  
AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830

    So, the goal would be to

first: Transpose the existing dataframe so that the factor Regulator 
becomes a column name (column 1 name = AT1G69490, column2 name 
AT1G29860, etc.) and the hits associated with each Regulator become 
rows. Hits is a comma separated 'list' ( I do not not know if 
technically it is an R list.), so it would have to be comma 
'unseparated' with each entry becoming a row (col 1 row 1 = AT4G31950, 
col 1 row 2 - AT5G24410, etc); like this :

AT1G69490
AT4G31950
AT5G24110
AT1G05675
AT5G64905

... I did not include all the rows)

I think it would be best to actually make the first entry a separate 
dataframe ( 1 column with name = AT1G69490 and number of rows depending 
on the number of hits), then make the second column (column name = 
AT1G29860, and number of rows depending on the number of hits) into a 
new dataframe and do a full join of of the two dataframes; continue by 
making the third column (column name = AT1G2986) into a dataframe and 
full join it with the previous; continue for the 152 observations so 
that then end result is a dataframe with 152 columns and number of rows 
depending on the entry with the greatest number of hits. The full joins 
I can do with dplyr, but getting up to that point seems rather difficult.

This would get me what my ultimate goal would be; each Regulator is a