Re: [R] seqinr ?: Splitting a factor name into several columns. Dealing with metabarcoding data.

2014-10-19 Thread Jeff Newmiller

On Sat, 18 Oct 2014, Anna Zakrisson Braeunlich wrote:


Thank you! That was an easy and fast solution!


If it was so easy, why couldn't you adapt Ista's solution? I suspect 
it is because you don't understand his suggestion.


May I post a follow-up question? (I am not sure whether this should rather 
be posted as a new question, but I will post it here and can re-post it 
if this is the wrong place to ask.) I am ever so grateful for your help!


This probably should have been a new thread, but I will bite anyway.


/Anna


# FOLLOW-UP QUESTION 


df1 <- data.frame(cbind(Identifier = c("M123.B23.VJHJ", "M123.B24.VJHJ",
                                       "M123.B23.VLKE", "M123.B23.HKJH",
                                       "M123.B24.LKJH"),
                        Sequence = c("ATATATATATA", "ATATATATATA",
                                     "ATATAGCATATA", "ATATATAGGGTA",
                                     "ATCGCGCGAATA")))


R has a habit of making factors at the drop of a hat, but you do not have 
to accept that when your data are not yet ready to be treated as factors. 
When you read the data in via read.csv you can use the colClasses argument 
or the stringsAsFactors argument to stop that. You can also use 
stringsAsFactors when you create the data frame, and you can always go back 
and turn any particular column into a factor after you are done 
manipulating characters.
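
A minimal sketch of that advice, assuming a small inline CSV (the names csv_text, d1 and d2 and the two rows are made up for illustration; a real call would pass a file name instead of text=). In R 4.0.0 and later stringsAsFactors = FALSE is already the default, so the argument only matters on older versions.

```r
# Hypothetical CSV content, inlined so the example is self-contained
csv_text <- "Identifier,Sequence
M123.B23.VJHJ,ATATATATATA
M123.B24.VJHJ,ATATATATATA"

# Option 1: keep every text column as character while reading
d1 <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Option 2: declare the class of each column explicitly
d2 <- read.csv(text = csv_text, colClasses = c("character", "character"))

# Check what was actually read
str(d1)
str(d2)

# Convert to a factor later, once the string manipulation is done
d1$Identifier <- factor(d1$Identifier)
```

Calling str() before and after the last line shows the column changing from chr to Factor.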


Also, somewhere you picked up the bad habit of using cbind before you make 
your data frame... that is almost never a good idea, because in data 
frames each column can have its own storage mode, while cbind creates a 
matrix where every element must have the same storage mode.
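
To see the difference, here is a small sketch with made-up columns (Id, Count and the object names m, bad and good are illustrative only, not from the original data):

```r
# cbind() on plain vectors builds a matrix, so the numbers are
# coerced to character to match the Id column
m <- cbind(Id = c("a", "b"), Count = c(1, 2))
typeof(m)

# Wrapping the matrix in data.frame() keeps that coercion
bad <- data.frame(m, stringsAsFactors = FALSE)

# Passing the columns directly lets each keep its own storage mode
good <- data.frame(Id = c("a", "b"), Count = c(1, 2),
                   stringsAsFactors = FALSE)

str(bad)    # Count is chr
str(good)   # Count is num
```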


df1 <- data.frame( Identifier = c( "M123.B23.VJHJ", "M123.B24.VJHJ"
                                 , "M123.B23.VLKE", "M123.B23.HKJH"
                                 , "M123.B24.LKJH" )
                 , Sequence = c( "ATATATATATA", "ATATATATATA"
                               , "ATATAGCATATA", "ATATATAGGGTA"
                               , "ATCGCGCGAATA" )
                 , stringsAsFactors=FALSE
                 )




# as a follow-up question:
# How can I split the identifier in df1 above into several columns based on the
# separating dots? The real data includes thousands of rows.
# This is what I want it to look like in the end:

df1_solution <- data.frame(cbind(Identifier1 = c("M123", "M123",
                                                 "M123", "M123",
                                                 "M123"),
                                 Identifier2 = c("B23", "B24", "B23", "B23", "B24"),
                                 Identifier3 = c("VJHJ", "VJHJ", "VLKE", "HKJH", "LKJH"),
                                 Sequence = c("ATATATATATA", "ATATATATATA",
                                              "ATATAGCATATA", "ATATATAGGGTA",
                                              "ATCGCGCGAATA")))


df1_solution <- data.frame( Identifier1 = c( "M123", "M123"
                                           , "M123", "M123"
                                           , "M123" )
                          , Identifier2 = c( "B23", "B24", "B23"
                                           , "B23", "B24" )
                          , Identifier3 = c( "VJHJ", "VJHJ"
                                           , "VLKE", "HKJH", "LKJH" )
                          , Sequence = c( "ATATATATATA", "ATATATATATA"
                                        , "ATATAGCATATA", "ATATATAGGGTA"
                                        , "ATCGCGCGAATA" )
                          , stringsAsFactors=FALSE
                          )



# I am very grateful for your help! I am no whiz at R and everything I know
# is self-taught. Therefore, some basics can turn out to be quite some
# obstacles for me.
# /Anna


Pretty much all of us are here to teach ourselves R, Anna. Keep reading 
other people's questions. Learn to try each fragment alone at the command 
line to figure out what is happening. Use the str() function frequently.


# the basic split
parts <- strsplit( as.character( df1$Identifier ), ".", fixed=TRUE )

# extension of Ista's approach to assembly
# (size the result from df1; ans1 does not exist yet at this point)
ans1 <- data.frame( Identifier1 = rep( NA, nrow( df1 ) )
                  , Identifier2 = rep( NA, nrow( df1 ) )
                  , Identifier3 = rep( NA, nrow( df1 ) )
                  , stringsAsFactors = FALSE
                  )
# note all memory is pre-allocated above... avoid successively
# accumulating rows with rbind... that would be very slow
for ( rw in seq_along( df1$Identifier ) ) {
  v <- parts[[ rw ]]
  ans1[ rw, "Identifier1" ] <- v[ 1 ]
  ans1[ rw, "Identifier2" ] <- v[ 2 ]
  ans1[ rw, "Identifier3" ] <- v[ 3 ]
}
ans1$Sequence <- df1$Sequence
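
For comparison, a loop-free sketch in base R. This is not part of the original reply; it assumes every identifier splits into exactly three parts, and the names m and ans3 are made up here. The data frame is rebuilt so the fragment runs on its own.

```r
df1 <- data.frame( Identifier = c( "M123.B23.VJHJ", "M123.B24.VJHJ"
                                 , "M123.B23.VLKE", "M123.B23.HKJH"
                                 , "M123.B24.LKJH" )
                 , Sequence = c( "ATATATATATA", "ATATATATATA"
                               , "ATATAGCATATA", "ATATATAGGGTA"
                               , "ATCGCGCGAATA" )
                 , stringsAsFactors = FALSE
                 )

# split on literal dots, one character vector per row
parts <- strsplit( df1$Identifier, ".", fixed = TRUE )

# assumption: three parts per identifier; stop early if that is not true
stopifnot( all( lengths( parts ) == 3 ) )

# stack the pieces into a character matrix, one row per identifier
m <- do.call( rbind, parts )

ans3 <- data.frame( Identifier1 = m[ , 1 ]
                  , Identifier2 = m[ , 2 ]
                  , Identifier3 = m[ , 3 ]
                  , Sequence = df1$Sequence
                  , stringsAsFactors = FALSE
                  )
str( ans3 )
```

The stopifnot() guard matters because rbind would silently recycle values if some identifier had fewer parts than the others.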

# alternative method of assembly
# uses list-to-data.frame apply from plyr package
library(plyr)
ans2 <- ldply( parts
 , function( v ) { # called once for each item in parts list
  # all single-row data frames created in this
  # function are concatenated at once by ldply to
 


2014-10-19 Thread Anna Zakrisson Braeunlich
Hi Jeff, and many thanks for your time. 
I meant that it was easy as in not requiring so many steps...

I have managed to get it to run and it has solved my problem perfectly. Thank 
you for all your tips and instructions! Up until now, I have used Excel for all 
problems similar to this one and only used R for the statistics part. I intend 
to make a permanent switch to R as I can see the benefits. I am just severely 
frustrated by my own inabilities. Therefore, once again many thanks for your 
time! 

kind regards
Anna


Anna Zakrisson Braeunlich
PhD student

Department of Ecology, Environment and Plant Sciences
Stockholm University
Svante Arrheniusv. 21A
SE-106 91 Stockholm
Sweden/Sverige

Lives in Berlin.
For paper mail:
Katzbachstr. 21
D-10965, Berlin
Germany/Deutschland

E-mail: anna.zakris...@su.se
Tel work: +49-(0)3091541281
Mobile: +49-(0)15777374888
LinkedIn: http://se.linkedin.com/pub/anna-zakrisson-braeunlich/33/5a2/51b



2014-10-18 Thread Anna Zakrisson Braeunlich
Thank you! That was an easy and fast solution!

May I post a follow-up question? (I am not sure whether this should rather be 
posted as a new question, but I will post it here and can re-post it if this 
is the wrong place to ask.) I am ever so grateful for your help!
/Anna


# FOLLOW-UP QUESTION 


df1 <- data.frame(cbind(Identifier = c("M123.B23.VJHJ", "M123.B24.VJHJ",
                                       "M123.B23.VLKE", "M123.B23.HKJH",
                                       "M123.B24.LKJH"),
                        Sequence = c("ATATATATATA", "ATATATATATA",
                                     "ATATAGCATATA", "ATATATAGGGTA",
                                     "ATCGCGCGAATA")))



# as a follow-up question:
# How can I split the identifier in df1 above into several columns based on the 
# separating dots? The real data includes thousands of rows.
# This is what I want it to look like in the end:

df1_solution <- data.frame(cbind(Identifier1 = c("M123", "M123",
                                                 "M123", "M123",
                                                 "M123"),
                                 Identifier2 = c("B23", "B24", "B23", "B23", "B24"),
                                 Identifier3 = c("VJHJ", "VJHJ", "VLKE", "HKJH", "LKJH"),
                                 Sequence = c("ATATATATATA", "ATATATATATA",
                                              "ATATAGCATATA", "ATATATAGGGTA",
                                              "ATCGCGCGAATA")))

# I am very grateful for your help! I am no whiz at R and everything I know
# is self-taught. Therefore, some basics can turn out to be quite some
# obstacles for me. 
# /Anna



2014-10-13 Thread Ista Zahn
Hi Anna,


On Sun, Oct 12, 2014 at 3:24 AM, Anna Zakrisson Braeunlich
anna.zakris...@su.se wrote:
 Hi,

 I have a question about how to split a factor name into different columns. I 
 have metabarcoding data and need to merge the FASTA file with the taxonomy 
 and count table files (data frames). To be able to do this merge, I need to 
 isolate the common identifier, which unfortunately is baked in with a lot of 
 other labels in the factor name, e.g.:
 sequence identifier: 
 M01271_77_0.A8J0P_1_1101_10150_1525.1.322519.sample_1.sample_2

 I want to split this name at every . to get several columns:
 column1: M01271_77_0
 column2: A8J0P_1_1101_10150_1525
 column3: 1
 column4: 322519
 column5: sample_1
 column6: sample_2

 I must add that I have no influence on how these names are given. This is how 
 they are supplied from Illumina MiSeq. I just need to be able to deal with it.

 Here is some extremely simplified dummy data to further show the issue at 
 hand:

 df1 <- data.frame(cbind(X = 1:10, Y = rnorm(10)),
                   Z.identifierA.B1298712 = factor(rep(LETTERS[1:2], each = 5)))
 df2 <- data.frame(cbind(B = 13:22, K = rnorm(10)),
                   Q.identifierA.B4668726 = factor(rep(LETTERS[1:2], each = 5)))

 # I have metabarcoding data with one FASTA file, one count table and one
 # taxonomy file. The dummy data above just shows the issue at hand. I want
 # to be able to merge my three original data frames (here, the dummy data
 # is only two data frames). The problem is that the only identifier that is
 # common to the data frames is hidden in the factor name, e.g.
 # Z.identifierA.B1298712 and Q.identifierA.B4668726. I hence need to be able
 # to split this name up into different columns to get identifierA alone as
 # one column name. Then I can merge the data frames.
 # How can I do this in R? I know that it can be done in Excel, but I would
 # like to produce a complete R script to get a fast pipeline and avoid
 # copy-and-paste errors.
 # This is what I want it to look like:

 df1.goal <- data.frame(cbind(X = 1:10, Y = rnorm(10)),
                        Z = factor(rep(LETTERS[1:2], each = 5)),
                        identifierA = factor(rep(LETTERS[1:2], each = 5)),
                        B1298712 = factor(rep(LETTERS[1:2], each = 5)))

Use strsplit to separate the components, something like

separateNames <- strsplit(names(df1)[3], split = "\\.")[[1]]
for(name in separateNames) {
    df1[[name]] <- df1[[3]]
}
df1[[3]] <- NULL
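
A short sketch of why the dot has to be escaped (or matched with fixed = TRUE), using the sequence identifier from the question. The object names id, pieces_regex, pieces_fixed and unescaped are made up for illustration.

```r
id <- "M01271_77_0.A8J0P_1_1101_10150_1525.1.322519.sample_1.sample_2"

# Escaped dot: the split pattern is a regular expression
pieces_regex <- strsplit(id, "\\.")[[1]]

# Literal dot: no regular expression involved
pieces_fixed <- strsplit(id, ".", fixed = TRUE)[[1]]

# Both give the same six pieces
identical(pieces_regex, pieces_fixed)
length(pieces_fixed)

# Unescaped "." is a regex that matches any character, so every
# character is treated as a delimiter and only empty strings remain
unescaped <- strsplit("a.b", ".")[[1]]
unescaped
```

Either form works on the identifier; fixed = TRUE is arguably easier to read when the delimiter is a plain character.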

Best,
Ista


 # Many thanks and with kind regards
 Anna Zakrisson


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



2014-10-13 Thread David.Kaethner
I'm not sure I understood your problem, maybe like this:

# split identifiers into columns
library(stringr)  # provides str_split()

df1 <- data.frame(cbind(X = 1:10, Y = rnorm(10)),
  Z.identifierA.B1298712 = factor(rep(LETTERS[1:2], each = 5)))

id <- names(df1)[3]
x <- do.call(rbind, str_split(id, "\\."))
# one copy of the original column per name component
y <- sapply(x, function(z) z <- df1[,id])

df1.goal <- data.frame(df1[,-3], y)

-dk
