Re: [R] merging data frames

2013-06-14 Thread Jim Holtman

?merge
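
For readers following along, here is a minimal sketch of what ?merge points to. It assumes A and B are data frames that share the key column COMPID; the toy rows are made up for illustration (including COMPIDs that appear in both tables) and are not the poster's data.

```r
# Toy stand-ins for A and B; the values are illustrative only.
A <- data.frame(COMPID  = c(112, 121, 999),
                CA_TYPE = c("P", "T", "T"),
                CA_YEAR = c(10, 6, 8))
B <- data.frame(COMPID  = c(112, 121, 130),
                ALLOC   = c("NO", "NO", "NO"),
                EXP_SCR = c(5, 5, 5))

# Full outer join: every COMPID from either table appears once;
# columns that are missing on one side are filled with NA.
C <- merge(A, B, by = "COMPID", all = TRUE)

# Inner join (the default): only COMPIDs present in both tables.
C_inner <- merge(A, B, by = "COMPID")
```

With several common columns, by takes a character vector of column names; all.x = TRUE or all.y = TRUE alone gives a left or right outer join.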


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] merging data frames

2013-06-14 Thread Yasin Gocgun

Thanks for your responses.
I have since found that the merge function does what I am looking for.


[R] merging data frames

2013-06-13 Thread Yasin Gocgun

Hi,

I have been struggling to merge data frames that share some columns but
have different dimensions. Although I searched the internet at length, I
could not find a function that performs this operation efficiently. I would
appreciate it if anyone who knows how to solve the problem could explain
the solution to me.

As you will see, the data frames below have one common column (in general
they would have several), and I simply want to create a table, say C, that
is the union of A and B. So the first row of C must include all the
information that A and B provide about the person with the respective COMPID.

A:
 COMPID  CLR_DOT CA_TYPE CA_YEAR DT_FNDNG  DT_BIOP NORMAL
1030956    XXGRX       P      10 19890919 19890919      0
2511425    XXXRX       T       6 19891005 19891030      0
3205129    XXGRX       T       8 19900227 19900227      0
...

B:
COMPID CNTR_ALL ALLOC AG_GRCAL  ENRL_DT EXP_SCR
   112        0    NO        1 19800122       5
   121        0    NO        2 19800121       5
   130        0    NO        3 19800121       5
   149        0    NO        4 19800121       5
...

I tried rbind.fill but it did not work. I am aware that such operations can
be done in SAS in a minute, so I thought R should be as efficient as SAS in
performing such operations...

Thank you in advance,

Yasin


Re: [R] Merging Data Frames in R

2012-07-18 Thread Yasir Kaheil

type ?merge in R

-
Yasir Kaheil


[R] Merging data frames one of which is NULL

2010-11-09 Thread Dimitri Liakhovitski

Hello!

I am running a loop. The result of each run of the loop is a data
frame. I am merging all the data frames.
For example:

The data frame from run 1:
x <- data.frame(a = 1, b = 2, c = 3)

The data frame from run 2:
y <- data.frame(a = 10, b = 20, d = 30)

What I want to get is:
merge(x, y, all.x = TRUE, all.y = TRUE)

Then I want to merge it with the output of the 3rd run, etc.

Unfortunately, I can't create the placeholder for the overall results
BEFORE I run the loop because I don't even know how many columns I'll
end up with after merging all the data frames.
I was thinking of creating an empty list:

first <- NULL

...and then updating it during each run by merging it with the data
frame that is the output of the run. However, when I try to merge the
empty list with any non-empty data frame, the result is empty:
merge(first, a, all.x = TRUE, all.y = TRUE)

Is there a way to make it merge while keeping everything?
Thanks a lot!
-- 
Dimitri Liakhovitski
Ninah Consulting
www.ninah.com


Re: [R] Merging data frames one of which is NULL

2010-11-09 Thread Joshua Wiley

Hi Dimitri,

I have some doubts whether storing the results of a loop in a data
frame and merging it with every run is the most efficient way of doing
things, but I do not know your situation.  This does what you want, I
believe, but I suspect it could be quite slow.  I worked around the
placeholder issue using an if statement.

HTH,

Josh

for (i in 1:10) {
  x <- data.frame(a = 1, b = 2, c = i)
  if (i == 1) {
    y <- x
  } else {
    y <- merge(x, y, all.x = TRUE, all.y = TRUE)
  }
}
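
The same idea can be written so that it starts from the NULL placeholder in the original question. This is only a sketch: merge_or_start and the runs list are hypothetical names introduced here, and the two data frames stand in for the real per-run output.

```r
# Hypothetical helper: treat NULL as "nothing accumulated yet".
merge_or_start <- function(acc, new) {
  if (is.null(acc)) new else merge(acc, new, all = TRUE)
}

result <- NULL                                   # placeholder before the loop
runs <- list(data.frame(a = 1,  b = 2,  c = 3),  # stand-ins for per-run output
             data.frame(a = 10, b = 20, d = 30))
for (run in runs) {
  result <- merge_or_start(result, run)
}
```

Testing for NULL rather than for i == 1 also covers loops in which some iterations produce no data frame at all, provided those iterations are skipped before the helper is called.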

-- 
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/


Re: [R] Merging data frames one of which is NULL

2010-11-09 Thread Dimitri Liakhovitski

Thanks a lot, Joshua.
You might be right.
I am thinking of creating a list (as a placeholder) and then merging
the elements of the list.
Dimitri

-- 
Dimitri Liakhovitski
Ninah Consulting
www.ninah.com


Re: [R] Merging data frames one of which is NULL

2010-11-09 Thread Phil Spector


Dimitri -
   Usually the easiest way to solve problems like this
is to put all the dataframes in a list, and then use
the Reduce() function to merge them all together at the
end.  You don't give many details about how the data frames
are constructed, so it's hard to be specific about the
best way to put them in a list, but this short 
example should give you an idea of what I'm talking about:



x <- data.frame(a = 1, b = 2, c = 3)
y <- data.frame(a = 10, b = 20, d = 30)
z <- data.frame(a = 12, b = 19, f = 25)
a <- data.frame(a = 9, b = 10, g = 15)
Reduce(function(x, y) merge(x, y, all = TRUE), list(x, y, z, a))

   a  b  c  d  f  g
1  1  2  3 NA NA NA
2  9 10 NA NA NA 15
3 10 20 NA 30 NA NA
4 12 19 NA NA 25 NA

Hope this helps.
- Phil Spector
 Statistical Computing Facility
 Department of Statistics
 UC Berkeley
 spec...@stat.berkeley.edu






Re: [R] Merging data frames one of which is NULL

2010-11-09 Thread Dimitri Liakhovitski

Thanks a lot, Phil.
I decided to do it via the list - as you suggested, but had to do some
gymnastics, which Reduce will greatly help me to avoid now!
Dimitri
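
A minimal sketch of how the list approach and Reduce() might fit together inside a loop. The per-run data frames are invented for illustration (each run contributes its own extra column), and n_runs, results and merged are hypothetical names.

```r
n_runs <- 4
results <- vector("list", n_runs)   # one slot per run; no columns needed up front

for (i in seq_len(n_runs)) {
  # Stand-in for the real per-run computation: each run adds its own column.
  run <- data.frame(a = i, b = 2 * i)
  run[[paste0("v", i)]] <- i * 10
  results[[i]] <- run
}

# Merge everything once, after the loop has finished.
merged <- Reduce(function(x, y) merge(x, y, all = TRUE), results)
```

Because the list only stores the pieces, the set of columns does not have to be known before the loop runs; it is worked out by merge() at the end.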

-- 
Dimitri Liakhovitski
Ninah Consulting
www.ninah.com


[R] Merging data frames on a variety of columns

2010-09-17 Thread Chris Poliquin

Hello,

This is a semi-complicated question about comparing two datasets,
probably using merge, but I am open to other ideas.  I have a large
frame of information about companies.  It's over 30,000 rows and looks
something like...

df1 <-

identifier1 identifier2 name    other_name  year
H34         C56         ACME    ACME_LTD    2001
H34         NA          ACME    ACME_LTD    2002
X20         C40         FOO_CO  FOO_CO      2004
NA          NA          BAR_SA  BAR_SAB     2004
NA          NA          BAR_SA  BAR_SAB     2005

As you can see, many observations are missing values.
I have a second data frame with information about these same
companies, in fewer rows, and often with slightly different info...

df2 <-

identifier1 identifier2 name      year
H34         NA          ACME_LTD  2001
H34         NA          ACME_LTD  2002
X20         C40         FOO       2004

The idea is to figure out which companies in the first set are not in
the second set.  My approach so far is to do various merges and then
remove the matches from the original data frame...

m1 <- merge(df1, df2, by = c("identifier1", "identifier2", "year"),
            incomparables = NA)
m2 <- merge(df1, df2, by = c("name", "year"), incomparables = NA)
m3 <- merge(df1, df2, by.x = c("other_name", "year"),
            by.y = c("name", "year"), incomparables = NA)


Is this really the best way to accomplish my goal?

Also, for some reason when I do merges like m3, my resulting data
frame is missing columns and I am getting rows that do not appear to
match on the variables I have specified, e.g. ...

year  other_name  identifier1  name          identifier2
2001  AMDOCS_LTD  G0260210     AMDOCS_LTDED  C42913


Help is much appreciated,
Chris
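
One possible way to approach the question above (which rows of df1 have no counterpart in df2) is an anti-join built from merge(). This is a sketch under stated assumptions, not an answer from the thread: unmatched and the .in_y marker column are hypothetical names, the toy data only loosely follows the tables in the post, and rows whose keys contain NA are treated as unmatched.

```r
df1 <- data.frame(identifier1 = c("H34", "H34", "X20", NA),
                  name        = c("ACME", "ACME", "FOO_CO", "BAR_SA"),
                  year        = c(2001, 2002, 2004, 2004),
                  stringsAsFactors = FALSE)
df2 <- data.frame(identifier1 = c("H34", "H34", "X20"),
                  name        = c("ACME_LTD", "ACME_LTD", "FOO"),
                  year        = c(2001, 2002, 2004),
                  stringsAsFactors = FALSE)

# Hypothetical helper: rows of x that have no partner in y on the given keys.
unmatched <- function(x, y, keys) {
  # Keep only complete key combinations from y, so NA keys never match.
  ok <- complete.cases(y[, keys, drop = FALSE])
  y_keys <- unique(y[ok, keys, drop = FALSE])
  y_keys$.in_y <- rep(TRUE, nrow(y_keys))   # marker that is NA where no match
  out <- merge(x, y_keys, by = keys, all.x = TRUE)
  out[is.na(out$.in_y), names(x), drop = FALSE]
}

only_in_df1 <- unmatched(df1, df2, c("identifier1", "year"))
```

The helper could be applied repeatedly with different key sets (identifiers, then name and year), each time passing on only the rows that are still unmatched, which avoids removing matches from the original data frame by hand.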


[R] merging data frames

2010-06-14 Thread Assa Yeroslaviz

Hi,

is it possible to merge two data frames while preserving the row names of
the bigger data frame?

I have two data frames which I would like to combine. While doing so I
always lose the row names. When I try to append them, I get an error
message saying that I have non-unique names, even though I used the unique
command on the data frame where the duplicated entries supposedly are.

thanks for the help

Assa


Re: [R] merging data frames

2010-06-14 Thread jim holtman

Put the rownames as another column in your dataframe so that it
remains with the data.  After merging, you can then use it as the
rownames
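
A minimal sketch of that suggestion, with invented data: big, small, row_id and the gene and probe labels are all hypothetical names used only for illustration.

```r
big   <- data.frame(ProbeID = c("p1", "p2", "p3"), expr = c(5.1, 6.2, 7.3),
                    row.names = c("geneA", "geneB", "geneC"))
small <- data.frame(ProbeID = c("p3", "p1"), symbol = c("S3", "S1"))

big$row_id <- rownames(big)          # carry the row names along as data
merged <- merge(big, small, by = "ProbeID", all.x = TRUE)

rownames(merged) <- merged$row_id    # works only if row_id is still unique
merged$row_id <- NULL
```

If a key in the larger data frame matches more than one row of the smaller one, row_id will be repeated after the merge and cannot be used as row names until the duplicates are resolved.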

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?


Re: [R] merging data frames

2010-06-14 Thread jim holtman

If you want to keep only the rows that are unique in the first column
then do the following:

workComb1 <- subset(workComb, !duplicated(ProbeID))
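
To illustrate the difference between the two approaches, here is a small sketch with made-up rows; the column names TranscriptID and value are assumptions, not the poster's actual columns.

```r
workComb <- data.frame(ProbeID      = c("p1", "p2", "p2", "p3"),
                       TranscriptID = c("t1", "t2a", "t2b", "t3"),
                       value        = c(1.5, 2.5, 2.5, 3.5))

# unique() compares whole rows, so the two p2 rows both survive.
by_row <- unique(workComb)

# duplicated() on the ID column alone keeps the first row per ProbeID.
by_id <- workComb[!duplicated(workComb$ProbeID), ]
```

Which of the duplicated rows is kept depends on row order, so the data frame may need sorting first if one transcript is preferred over another.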

On Mon, Jun 14, 2010 at 11:20 AM, Assa Yeroslaviz fry...@gmail.com wrote:
 Well, the problem is basically elsewhere. I have a data frame with
 expression data and duplicated IDs in the first column (see example).
 When I want to put them into the row names I get the message that there
 are non-unique items in the data.
 So I tried to delete such rows with unique. The problem is that unique
 doesn't delete all of them.

 I compare two data frames by their probe IDs.
 I would like to delete all duplicated lines with a certain probe ID,
 independent of the rest of the line; that is, I would like a data frame
 with unique identifiers in the ProbeID column.
 merge doesn't give me that. It doesn't delete all similar lines: if the
 lines are not identical in the other columns, it leaves them in the table.

 Is there a way of deleting whole lines with duplicated probe IDs?

 workbook <- read.delim(file = "workbook1.txt", quote = "", sep = "\t")
 GeneID <- read.delim(file = "testTable.txt", quote = "", sep = "\t")
 workComb <- merge(workbook, GeneID, by.x = "ProbeID", by.y = "Probe.Id")
 workComb1 <- unique(workComb)
 write.table(workComb, file = "workComb.txt", sep = "\t", quote = FALSE,
             row.names = FALSE)
 write.table(workComb1, file = "workComb1.txt", sep = "\t", quote = FALSE,
             row.names = FALSE)

 Look at lines 49 and 50 in the file workComb1.txt after using unique on the
 file. The lines are identical with the exception of the Transcript ID. I
 would like to take one of them out of the table.

 THX,

 Assa

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?


[R] Merging data frames on two conditions

2010-04-06 Thread Abhishek Pratap

Hi Guys

I have two data frames which I would like to merge on two conditions.

I am doing the following  (abstract form)

new.data.frame <- merge(df1, df2, by = c("Col1", "Col2"))

It is giving me a null result.

Basically I need to apply two conditions.

I also tried sqldf but it is running forever. Will indexing help ?

temp <- sqldf("select a.chr, a.SNP, a.snp_qual, a.rms_qual, a.depth, b.rsid
               FROM data_lane6_snps a,
                    data_lane6_snps_rsid b
               WHERE a.SNP = b.SNP
                 AND a.chr = b.chr")

Thanks!
-Abhi


Re: [R] Merging data frames on two conditions

2010-04-06 Thread David Winsemius



On Apr 6, 2010, at 3:54 PM, Abhishek Pratap wrote:

 Hi Guys

 I have two data frames which I would like to merge on two conditions.

 I am doing the following (abstract form)

 new.data.frame <- merge(df1, df2, by = c("Col1", "Col2"))

What does

 str(df1); str(df2)

... show?






David Winsemius, MD
West Hartford, CT


Re: [R] Merging data frames on two conditions

2010-04-06 Thread Abhishek Pratap

Hi David

Here it is. You can ignore the bio jargon if it sounds confusing. The
columns on which I am applying merge (SNP, chr) have the same data type in
both data frames.

merge(data_lane6_snps, data_lane6_snps_rsid, by = c("SNP", "chr"))

str(data_lane6_snps)
'data.frame':   7724462 obs. of  10 variables:
 $ chr           : Factor w/ 25 levels "chr1","chr10",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP           : int  100 101 103 108 179 180 191 197 218 222 ...
 $ reference     : Factor w/ 5 levels "A","C","G","N",..: 2 2 5 2 2 5 2 2 1 5 ...
 $ genotype      : Factor w/ 10 levels "A","C","G","K",..: 1 1 1 8 2 2 3 8 2 2 ...
 $ consensus_qual: int  0 0 0 4 33 33 19 19 19 19 ...
 $ snp_qual      : int  0 0 0 4 0 33 19 19 19 19 ...
 $ rms_qual      : int  0 0 0 0 21 21 21 21 21 21 ...
 $ depth         : int  1 1 1 1 2 2 2 2 2 2 ...
 $ bases         : Factor w/ 453774 levels ^!,,^!,^!,,..: 5 5 5 410998 49793 155731 284998 416878 133393 133393 ...
 $ base_quality  : Factor w/ 555104 levels `,``,```,..: 359 359 359 54813 92856 92856 92856 92856 92539 55424 ...

str(data_lane6_snps_rsid)
'data.frame':   797807 obs. of  4 variables:
 $ chr : Factor w/ 24 levels "1","10","11",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ SNP : int  68143872 11071026 69423434 12394791 1302846 95330693 3921381 57122299 41899656 76990037 ...
 $ end : int  68143872 11071026 69423434 12394791 1302846 95330693 3921381 57122299 41899656 76990037 ...
 $ rsid: Factor w/ 797807 levels "rs10","rs1010",..: 100229 685690 505395 470219 780326 29342 29263 327909 434159 723152 ...



Re: [R] Merging data frames on two conditions

2010-04-06 Thread Abhishek Pratap

I should also add that if I merge on only one column it works, but the
result is not what I want.

merge(data_lane6_snps, data_lane6_snps_rsid, by = "SNP")  # works as expected

Is the chr column being a factor causing problems here?

-A
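
Judging only from the str() output above, one table labels chromosomes "chr1", "chr10", ... and the other "1", "10", "11", ..., so it may be the labels rather than the factor type that prevent any match on chr. The following sketch uses tiny invented tables (snps and rsid are hypothetical stand-ins for the two data frames) to show the effect and one possible fix.

```r
snps <- data.frame(chr   = factor(c("chr1", "chr1", "chr11")),
                   SNP   = c(100L, 101L, 68143872L),
                   depth = c(1L, 1L, 2L))
rsid <- data.frame(chr  = factor(c("11", "1")),
                   SNP  = c(68143872L, 100L),
                   rsid = c("rs10", "rs1010"))

# No chr label is shared, so the two-key merge has nothing to match.
shared_labels <- length(intersect(levels(snps$chr), levels(rsid$chr)))
n_before <- nrow(merge(snps, rsid, by = c("SNP", "chr")))

# Bring both columns to one convention as plain character, then merge.
snps$chr <- sub("^chr", "", as.character(snps$chr))
rsid$chr <- as.character(rsid$chr)
both <- merge(snps, rsid, by = c("SNP", "chr"))
```

Checking the overlap of the key values before merging is cheap and shows quickly whether an empty result comes from the data or from the call.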


Re: [R] Merging data frames on two conditions

2010-04-06 Thread David Winsemius



On Apr 6, 2010, at 4:03 PM, Abhishek Pratap wrote:


Hi David

Here it is. You can ignore the bio jargon if it sounds confusing.


Sometimes it is essential to have domain details.

The corresponding data type of column (SNP, chr) on which I am  
applying merge is same.


merge(data_lane6_snps, data_lane6_snps_rsid, by = c("SNP","chr"))


str(data_lane6_snps)
'data.frame':   7724462 obs. of  10 variables:
 $ chr           : Factor w/ 25 levels "chr1","chr10",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP           : int  100 101 103 108 179 180 191 197 218 222 ...
 $ reference     : Factor w/ 5 levels "A","C","G","N",..: 2 2 5 2 2 5 2 2 1 5 ...
 $ genotype      : Factor w/ 10 levels "A","C","G","K",..: 1 1 1 8 2 2 3 8 2 2 ...
 $ consensus_qual: int  0 0 0 4 33 33 19 19 19 19 ...
 $ snp_qual      : int  0 0 0 4 0 33 19 19 19 19 ...
 $ rms_qual      : int  0 0 0 0 21 21 21 21 21 21 ...
 $ depth         : int  1 1 1 1 2 2 2 2 2 2 ...
 $ bases         : Factor w/ 453774 levels ^!,,^!,^!,,..: 5 5 5 410998 49793 155731 284998 416878 133393 133393 ...
 $ base_quality  : Factor w/ 555104 levels `,``,```,..: 359 359 359 54813 92856 92856 92856 92856 92539 55424 ...


str(data_lane6_snps_rsid)
'data.frame':   797807 obs. of  4 variables:
 $ chr : Factor w/ 24 levels "1","10","11",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ SNP : int  68143872 11071026 69423434 12394791 1302846 95330693 3921381 57122299 41899656 76990037 ...


Comparing this line with the SNP line in the data frame above, I do not
see much similarity in range, and this data frame has roughly 10 times
fewer observations. What was your plan for the non-matching cases? Did
you really mean that you wanted a right outer join?


You might get information by trying:

length(intersect(data_lane6_snps$SNP, data_lane6_snps_rsid$SNP))

That would tell you how many potential matches you might have on the
basis of SNP numbers, although an SNP match might or might not be a
full match given the chr matching that is also being specified.



 $ end : int  68143872 11071026 69423434 12394791 1302846 95330693 3921381 57122299 41899656 76990037 ...
 $ rsid: Factor w/ 797807 levels "rs10","rs1010",..: 100229 685690 505395 470219 780326 29342 29263 327909 434159 723152 ...
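
The overlap check suggested above can be extended to both keys at once. What follows is only a sketch on invented toy data (the data frames a and b and their values are hypothetical, not the poster's data): it counts shared values per key and then for the composite (chr, SNP) key built with paste(). If either single key shows zero overlap, a merge on both keys has to come back empty.

```r
# Toy data (hypothetical), using two different chromosome labelling styles
a <- data.frame(chr = c("chr1", "chr1", "chr2"),
                SNP = c(100L, 200L, 100L),
                stringsAsFactors = FALSE)
b <- data.frame(chr = c("1", "2", "3"),
                SNP = c(100L, 100L, 300L),
                rsid = c("rs1", "rs2", "rs3"),
                stringsAsFactors = FALSE)

# Overlap on each key separately
n_snp <- length(intersect(a$SNP, b$SNP))   # positions present in both
n_chr <- length(intersect(a$chr, b$chr))   # chromosome labels present in both

# Overlap on the composite (chr, SNP) key
key_a <- paste(a$chr, a$SNP, sep = ":")
key_b <- paste(b$chr, b$SNP, sep = ":")
n_both <- length(intersect(key_a, key_b))

print(c(SNP = n_snp, chr = n_chr, both = n_both))
```

A large count for one key alongside a zero count for the other points at the column whose coding differs between the two data frames.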



On Tue, Apr 6, 2010 at 3:59 PM, David Winsemius dwinsem...@comcast.net 
 wrote:


On Apr 6, 2010, at 3:54 PM, Abhishek Pratap wrote:

Hi Guys

I have two data frames which I would like to merge on two conditions.

I am doing the following  (abstract form)

new.data.frame <- merge(df1, df2, by=c("Col1","Col2"))


So I am guessing that you really wanted just this:

new.data.frame <- merge(df1, df2)

?merge

Since the default for merge is:  by = intersect(names(x), names(y)),
this would have been equivalent to


new.data.frame <- merge(df1, df2, by=c("chr", "SNP"))

See above regarding the possibility that you have non-congruent SNP  
labeling problems.
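
A small sketch of the point about the default by argument, on made-up data (df1 and df2 here are hypothetical three- and two-row frames, not the real ones). When the only shared column names are chr and SNP, leaving by out should give the same result as naming both columns explicitly.

```r
# Hypothetical toy frames sharing exactly the columns "chr" and "SNP"
df1 <- data.frame(chr = c("1", "1", "2"),
                  SNP = c(100L, 200L, 100L),
                  depth = c(1L, 2L, 2L))
df2 <- data.frame(chr = c("1", "2"),
                  SNP = c(100L, 100L),
                  rsid = c("rs1", "rs2"))

# The columns merge() will join on when 'by' is not supplied
common <- intersect(names(df1), names(df2))

m_default  <- merge(df1, df2)                          # by = common
m_explicit <- merge(df1, df2, by = c("chr", "SNP"))    # same keys, spelled out

print(common)
print(m_default)
```

Only the rows whose chr and SNP both match survive, since merge() performs an inner join unless all, all.x or all.y is set.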






What does

 str(df1) ; str(df2)

... show?



It is giving me a null result.

Basically I need to apply two conditions.

I also tried sqldf but it is running forever. Will indexing help ?

temp <- sqldf("select a.chr,a.SNP,a.snp_qual,a.rms_qual,a.depth,b.rsid FROM
data_lane6_snps a,
data_lane6_snps_rsid b
WHERE
a.SNP = b.SNP
AND
a.chr = b.chr
")

Thanks!
-Abhi

David Winsemius, MD
West Hartford, CT


Re: [R] Merging data frames on two conditions

2010-04-06 Thread Abhishek Pratap

Hi David

I understand that, looking at the SNP values shown, they appear to be
different, which would explain an empty merge. However, the two columns
still have ~700K SNPs in common. What I am looking for is a merge where
both SNP and chr match. If I match only on the SNP column I get partially
correct results, since two chromosomes can each have a SNP at the same bp
location, so the merge needs to take both SNP position and chromosome into
account.

Thanks!
-Abhi


Re: [R] Merging data frames on two conditions

2010-04-06 Thread Abhishek Pratap

Just so you know

length(intersect(data_lane6_snps$SNP, data_lane6_snps_rsid$SNP))
796120

I just need to include the chr condition now, which is where I am stuck.

-Abhi


Re: [R] Merging data frames on two conditions

2010-04-06 Thread David Winsemius

OK, so it is not the SNPs. Look at the chr values instead. I will bet
that you get 0 when you try:


length(intersect(data_lane6_snps$chr, data_lane6_snps_rsid$chr))


... since one is using a format of "chrNN" and the other is using just
"NN". You need to get the chromosome naming convention straightened out.
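
One way to straighten out the naming, sketched on invented data: the frames snps and rsid below are hypothetical stand-ins for data_lane6_snps and data_lane6_snps_rsid. The sketch assumes the only difference is a leading "chr" prefix, strips it with sub(), and converts the factors to character first so that the labels are edited rather than the integer codes.

```r
# Hypothetical stand-ins for the two data frames in the thread
snps <- data.frame(chr = factor(c("chr1", "chr1", "chr2")),
                   SNP = c(100L, 200L, 100L))
rsid <- data.frame(chr = factor(c("1", "2", "2")),
                   SNP = c(100L, 100L, 999L),
                   rsid = c("rs1", "rs2", "rs3"))

# With mismatched labels the two-key merge finds nothing
before <- merge(snps, rsid, by = c("chr", "SNP"))

# Normalise both columns to plain character labels without the "chr" prefix
snps$chr <- sub("^chr", "", as.character(snps$chr))
rsid$chr <- as.character(rsid$chr)

# Now both keys can match
after <- merge(snps, rsid, by = c("chr", "SNP"))

print(nrow(before))
print(after)
```

Going the other way (pasting "chr" onto the shorter labels) would work just as well; what matters is that both columns use one convention before the merge.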


--
David.

David Winsemius, MD
West Hartford, CT

Re: [R] Merging data frames on two conditions

2010-04-06 Thread Abhishek Pratap

You found the error: the two data frames use different naming conventions
for chr. I should be able to fix that pretty easily.

In case the problem persists, I will contact the list.

Thanks!
-Abhi


Re: [R] Merging data frames on two conditions

2010-04-06 Thread Gabor Grothendieck

Yes, indexing will typically make a large difference.

On Tue, Apr 6, 2010 at 3:54 PM, Abhishek Pratap abhishek@gmail.com wrote:
 Hi Guys

 I have two data frames which I would like to merge on two conditions.

 I am doing the following  (abstract form)

 new.data.frame <- merge(df1, df2, by=c("Col1","Col2"))

 It is giving me a null result.

 Basically I need to apply two conditions.

 I also tried sqldf but it is running forever. Will indexing help ?

 temp <- sqldf("select a.chr,a.SNP,a.snp_qual,a.rms_qual,a.depth,b.rsid FROM
 data_lane6_snps a,
 data_lane6_snps_rsid b
 WHERE
 a.SNP = b.SNP
 AND
 a.chr = b.chr
 ")

 Thanks!
 -Abhi

Re: [R] merging data frames gives all NAs

2010-02-02 Thread James Rome


David,

Now the code is:
for (j in seq_along(rwy)) { # subset the data and merge them
    ar4rw = subset(arrgnd, Runway == rwy[j])
    if (j == 1) {
        arrw = ar4rw
    } else {
        arrw = merge(arrw, ar4rw)
    }
}

I attach the data. I needed 500 rows to get both runways in rwy.

The suggestions did not help much, but did get rid of the row of NAs in
ar4rw. Why?
When I run through the loop for 2 runways, I get

# j = 1, Runway = 31L
Browse[1]> arrw[1:3,]
        DateTime       Date month hour minute quarter weekday IATA ICAO Flight
552 1/1/09 23:03 2009-01-01     1   23      3      92       5   AA  AAL  AAL22
563 1/1/09 23:17 2009-01-01     1   23     17      93       5   DL  DAL DAL242
565 1/1/09 23:24 2009-01-01     1   23     24      93       5   DL  DAL DAL624
    AircraftType   Tail  Arrived   STA Runway     FromTo Delay
552         B762 N329AA 23:03:35 23:10    31L   LAX /JFK     0
563         B763 N1611B 23:17:37 23:46    31L KATL /KJFK     0
565         B752 N654DL 23:24:04 23:48    31L   LAS /JFK     0
             Operator            dq gw
552 AMERICAN AIRLINES 2009-01-01 92  1
563   DELTA AIR LINES 2009-01-01 93  1
565   DELTA AIR LINES 2009-01-01 93  1
# j = 2, Runway = 31R
Browse[1]> ar4rw[1:3,]
        DateTime       Date month hour minute quarter weekday IATA ICAO  Flight
529 1/1/09 21:46 2009-01-01     1   21     46      87       5   TA  TAI  TAI570
530 1/1/09 21:48 2009-01-01     1   21     48      87       5   AA  AAL AAL2018
531 1/1/09 21:50 2009-01-01     1   21     50      87       5   BA  BAW  BAW183
    AircraftType   Tail  Arrived   STA Runway     FromTo Delay
529         A320 N496TA 21:46:58 22:30    31R MSLP /KJFK     0
530         B752 N621AM 21:48:43 21:50    31R  TLPL /JFK     0
531         B744 G-CIVI 21:50:26 22:50    31R EGLL /KJFK     0
                       Operator            dq gw
529 TACA INTERNATIONAL AIRLINES 2009-01-01 87  1
530           AMERICAN AIRLINES 2009-01-01 87  1
531             BRITISH AIRWAYS 2009-01-01 87  1
# But the merge gives all NAs!
Browse[1]> arrw[1:3,]
     DateTime Date month hour minute quarter weekday IATA ICAO Flight
NA         NA   NA    NA   NA     NA      NA      NA   NA   NA     NA
NA.1       NA   NA    NA   NA     NA      NA      NA   NA   NA     NA
NA.2       NA   NA    NA   NA     NA      NA      NA   NA   NA     NA
     AircraftType Tail Arrived STA Runway FromTo Delay Operator dq gw
NA             NA   NA      NA  NA     NA     NA    NA       NA NA NA
NA.1           NA   NA      NA  NA     NA     NA    NA       NA NA NA
NA.2           NA   NA      NA  NA     NA     NA    NA       NA NA NA
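
A possible explanation for both NA symptoms, sketched on invented data. The toy arrgnd below, including the flight with a missing Runway, is an assumption made for illustration; the thread does not confirm that the real data contain missing runways. The sketch shows two standard R behaviours: logical indexing with `[` returns an all-NA row wherever the condition is NA, while subset() drops such rows; and merge() without by joins on every shared column, so two subsets with different Runway values have no rows in common, and asking for rows 1:3 of the empty result prints rows of NA.

```r
# Toy data (hypothetical); one flight has a missing Runway value
arrgnd <- data.frame(Flight = c("AAL22", "DAL242", "TAI570", "BAW183", "XXX1"),
                     Runway = c("31L", "31L", "31R", "31R", NA),
                     stringsAsFactors = FALSE)

# 1. `[` keeps an all-NA row where the comparison is NA;
#    subset() treats NA as FALSE and drops it.
by_bracket <- arrgnd[arrgnd$Runway == "31R", ]
by_subset  <- subset(arrgnd, Runway == "31R")

# 2. merge() with no 'by' uses all shared columns as keys (an inner join).
#    The two subsets differ in Runway, so no row can match.
left   <- subset(arrgnd, Runway == "31L")
joined <- merge(left, by_subset)

# Indexing rows that do not exist yields rows filled with NA
shown <- joined[1:3, ]

print(by_bracket)
print(nrow(joined))
print(shown)
```

So an all-NA printout from arrw[1:3,] does not necessarily mean the merge produced NA values; checking nrow(arrw) or dim(arrw) first shows whether the result is simply empty.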

Thanks,
Jim Rome

On Feb 1, 2010, at 5:30 PM, David Winsemius wrote:



On Feb 1, 2010, at 5:16 PM, James Rome wrote:


Dear kind R helpers,

I have a vector of runway names in rwy (31R, 31L,... the number
is user selectable)
arrgnd is a data frame with data for all flights and all runways,
with a Runway column.
I am trying to subset arrgnd into a data frame for each selected
runway, and then combine them back together using the following code:

for (j in 1:nr) { # nr = number of user-selected runways


Safer would be:

for (j in seq_along(rwy)) {


ar4rw = arrgnd[arrgnd$Runway==rwy[j],]


Clearer would be:

ar4rw <- subset(arrgnd, Runway == rwy[j])  # and I think the NA lines will also disappear.



if (j == 1) {
arrw = ar4rw
}
else {
arrw = merge(arrw, ar4rw)
}
}


You really should give us something like:

dput(rwy)
dput( head(arrgnd, 10) )


but, the merge step gives me a data frame with all NAs. In addition,
ar4rw always gets a row with NAs at the start, which I do not
understand. There are no rows with all NAs in the arrgnd data frame.
> ar4rw[1:2,] # first time through for 31R
        DateTime       Date month hour minute quarter weekday IATA ICAO Flight
NA            NA         NA    NA   NA     NA      NA      NA   NA   NA     NA
529 1/1/09 21:46 2009-01-01     1   21     46      87       5   TA  TAI TAI570
    AircraftType   Tail  Arrived   STA Runway     FromTo Delay
NA            NA     NA       NA    NA     NA         NA    NA
529         A320 N496TA 21:46:58 22:30    31R MSLP /KJFK     0
                       Operator            dq gw
NA                           NA            NA NA
529 TACA INTERNATIONAL AIRLINES 2009-01-01 87  1

> ar4rw[1:2,] # second time through for 31L
        DateTime       Date month hour minute quarter weekday IATA ICAO Flight
NA            NA         NA    NA   NA     NA      NA      NA   NA   NA     NA
552 1/1/09 23:03 2009-01-01     1   23      3      92       5   AA  AAL  AAL22
    AircraftType   Tail  Arrived   STA Runway   FromTo Delay          Operator
NA            NA     NA       NA    NA     NA       NA    NA                NA
552         B762 N329AA 23:03:35 23:10    31L LAX /JFK     0 AMERICAN AIRLINES
               dq gw
NA             NA NA
552 2009-01-01 92  1

But after the merge, I get all NAs. What am I doing wrong?


The data layout gets mangled and I cannot tell what rows are being
matched to what. Use dput to convey an unambiguous, and easily
replicated example.


Thanks,
Jim Rome


Re: [R] merging data frames gives all NAs

2010-02-02 Thread James Rome

On 2/1/2010 5:51 PM, David Winsemius wrote:
I figured this out finally. I really believe that the R help write-ups
are sorely lacking. As soon as I looked at
http://www.statmethods.net/management/merging.html, it was obvious:


Adding Columns

To merge two dataframes (datasets) horizontally, use the *merge*
function. In most cases, you join two dataframes by one or more common
key variables (i.e., an inner join).

# merge two dataframes by ID
total <- merge(dataframeA,dataframeB,by="ID")

# merge two dataframes by ID and Country
total <- merge(dataframeA,dataframeB,by=c("ID","Country"))


Adding Rows

To join two dataframes (datasets) vertically, use the *rbind* function.
The two dataframes *must* have the same variables, but they do not have
to be in the same order.

total <- rbind(dataframeA, dataframeB)

I needed to add rows, and had to use rbind. If the help for merge said
"To merge two dataframes (datasets) horizontally", I would have known
right away that it was the wrong function to use.
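
For the runway problem, the row-stacking approach might look like the sketch below. The toy arrgnd and rwy are invented for illustration. Two variants are shown: stacking the per-runway subsets with rbind(), and, as an alternative not mentioned in the thread, selecting all wanted runways in one step with %in%, which avoids the loop altogether.

```r
# Toy data (hypothetical): five flights on three runways
arrgnd <- data.frame(Flight = c("AAL22", "DAL242", "TAI570", "BAW183", "JBU1"),
                     Runway = c("31L", "31L", "31R", "31R", "04L"),
                     stringsAsFactors = FALSE)
rwy <- c("31R", "31L")   # user-selected runways

# Option 1: one subset per runway, then stack the pieces vertically
pieces <- lapply(rwy, function(r) subset(arrgnd, Runway == r))
arrw_rbind <- do.call(rbind, pieces)

# Option 2: a single subset covering all selected runways
arrw_in <- subset(arrgnd, Runway %in% rwy)

print(arrw_rbind)
print(arrw_in)
```

Both give the same set of flights; they differ only in row order, since the rbind() version follows the order of rwy while the %in% version keeps the original order of arrgnd.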

Thanks for the help,
Jim Rome



Re: [R] merging data frames gives all NAs

2010-02-02 Thread Erik Iverson




James Rome wrote:

> On 2/1/2010 5:51 PM, David Winsemius wrote:
> I figured this out finally. I really believe that the R help write-ups
> are sorely lacking.


The help docs are probably not the best way to learn R, but they are 
great for users of the functions.  I have found that after going through 
an introduction book on R or online tutorial (plus experience), that the 
help system in R is really, really good at *documenting the behavior of 
the functions*, which is the point of them.


As another general hint from someone who has learned R slowly over time, 
when something happens that you don't understand on real data, 
construct a minimal example data.frame and try out your code on that. 
Also, learning how to use browser() or the debug package has been very 
useful.
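
As a sketch of that advice applied to this thread's problem, a three-row stand-in for arrgnd (invented values; only the names arrgnd and Runway come from the original post) is enough to show why merge() cannot recombine subsets of the same table, and why rbind() can:

```r
# Toy stand-in for arrgnd; the flights and runways are invented.
arrgnd <- data.frame(Flight = c("TAI570", "AAL22", "DAL1"),
                     Runway = c("31R", "31L", "31R"))

a <- arrgnd[arrgnd$Runway == "31R", ]
b <- arrgnd[arrgnd$Runway == "31L", ]

# merge() with no 'by' joins on all shared columns, so it keeps only rows
# that occur in both inputs: there are none here.
merged <- merge(a, b)

# rbind() stacks the two subsets, which is what was wanted.
stacked <- rbind(a, b)

print(nrow(merged))
print(nrow(stacked))
```

Placing a call to browser() inside a loop body pauses execution at that point, so intermediate objects such as a and b can be inspected interactively.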



Re: [R] merging data frames gives all NAs

2010-02-02 Thread James Rome

I agree. I have a foot of books on R now, for example The R Book by
Michael Crawley. But so far, Googling the archives of this list has been
the most help. Nonetheless, if I cannot understand the documentation of
a function, then the documentation needs to be updated. For example,
there needs to be a "Returns" section at the top of every function's
help page, so one can see what type of thing the function returns.

The help for merge() needs to start with "To merge two dataframes
(datasets) horizontally, use the *merge* function." rather than

"Merge two data frames by common columns or row names, or do other
versions of database /join/ operations", which does not at all say that
it does a horizontal merge if one does not know SQL. I do know SQL, and
it is still not clear to me. And the /merge/ documentation should then
refer users to /rbind/ for vertical merges.

I hope that someone on the list can actually change this file for
the benefit of others.

Thanks,
Jim



Re: [R] merging data frames gives all NAs

2010-02-02 Thread David Winsemius

Yeah, sometimes the vocabulary we bring to a task does not match up
(or merge properly) with the vocabulary that the developers use. In
this case the merge operation is one that has a precise meaning in
database lingo, which apparently you do not have a background in. My
experience in trying to append objects ran into similar frustrations
early in my R endeavors. For the life of me, I could not find any
instances of "append" in the index of the references I was using.


I am glad that you found that material helpful, but I think its use of
the terms "join" and "merge" is incorrect in a database framework as
well, so I do not think it could be used as an unambiguous guide. Your
use of "combine" was likewise ambiguous. In composing questions to
R-help, it is advised that you post a small example and illustrate what
you want to see as a result.
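
A minimal sketch of what such a small example could look like, assuming a two-row toy table (the object name toy and its values are invented for illustration): dput() writes an object as R code that a reader can paste back to obtain the same object.

```r
# Invented toy objects standing in for the poster's real data.
rwy <- c("31R", "31L")
toy <- data.frame(Flight = c("TAI570", "AAL22"), Runway = rwy)

dput(rwy)  # prints c("31R", "31L")

# What a reader of the post would do: evaluate the pasted dput() text.
txt <- capture.output(dput(toy))
rebuilt <- eval(parse(text = txt))
print(all.equal(rebuilt, toy))
```

Posting the dput() text together with the result you expect lets others reproduce the problem without guessing at column types.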


--
David.



On Feb 2, 2010, at 1:47 PM, James Rome wrote:

> On 2/1/2010 5:51 PM, David Winsemius wrote:
> I figured this out finally. I really believe that the R help
> write-ups are sorely lacking.

You should ponder whether you actually know enough to criticize the
help page when it describes the merge function as performing database
join operations. My guess is that you don't. The help pages are not
designed to teach basic computer programming concepts.

> As soon as I looked at
> http://www.statmethods.net/management/merging.html, it was obvious:
>
> Adding Columns
> To merge two dataframes (datasets) horizontally, use the merge
> function. In most cases, you join two dataframes by one or more
> common key variables (i.e., an inner join).
>
> # merge two dataframes by ID
> total <- merge(dataframeA, dataframeB, by="ID")
>
> # merge two dataframes by ID and Country
> total <- merge(dataframeA, dataframeB, by=c("ID","Country"))
>
> Adding Rows
> To join two dataframes (datasets) vertically, use the rbind
> function. The two dataframes must have the same variables, but they
> do not have to be in the same order.
>
> total <- rbind(dataframeA, dataframeB)
>
> I needed to add rows, and had to use rbind. If the help for merge
> said "To merge two dataframes (datasets) horizontally" I would have
> known right away that it was the wrong function to use.
>
> Thanks for the help,
> Jim Rome



David Winsemius, MD
Heritage Laboratories
West Hartford, CT

[R] merging data frames gives all NAs

2010-02-01 Thread James Rome


Dear kind R helpers,

I have a vector of runway names in rwy ("31R", "31L", ...; the number is
user selectable).
arrgnd is a data frame with data for all flights and all runways, with a
Runway column.
I am trying to subset arrgnd into a data frame for each selected runway,
and then combine them back together using the following code:


for (j in 1:nr) {  # nr = number of user-selected runways
    ar4rw = arrgnd[arrgnd$Runway == rwy[j], ]
    if (j == 1) {
        arrw = ar4rw
    }
    else {
        arrw = merge(arrw, ar4rw)
    }
}

but the merge step gives me a data frame with all NAs. In addition,
ar4rw always gets a row with NAs at the start, which I do not
understand. There are no rows with all NAs in the arrgnd data frame.

> ar4rw[1:2,]  # first time through for 31R
        DateTime       Date month hour minute quarter weekday IATA ICAO Flight
NA            NA         NA    NA   NA     NA      NA      NA   NA   NA     NA
529 1/1/09 21:46 2009-01-01     1   21     46      87       5   TA  TAI TAI570
    AircraftType   Tail  Arrived   STA Runway     FromTo Delay
NA            NA     NA       NA    NA     NA         NA    NA
529         A320 N496TA 21:46:58 22:30    31R MSLP /KJFK     0
                       Operator            dq gw
NA                           NA            NA NA
529 TACA INTERNATIONAL AIRLINES 2009-01-01 87  1

> ar4rw[1:2,]  # second time through for 31L
        DateTime       Date month hour minute quarter weekday IATA ICAO Flight
NA            NA         NA    NA   NA     NA      NA      NA   NA   NA     NA
552 1/1/09 23:03 2009-01-01     1   23      3      92       5   AA  AAL  AAL22
    AircraftType   Tail  Arrived   STA Runway    FromTo Delay          Operator
NA            NA     NA       NA    NA     NA        NA    NA                NA
552         B762 N329AA 23:03:35 23:10    31L LAX  /JFK     0 AMERICAN AIRLINES
               dq gw
NA             NA NA
552 2009-01-01 92  1

But after the merge, I get all NAs. What am I doing wrong?

Thanks,
Jim Rome


Re: [R] merging data frames gives all NAs

2010-02-01 Thread David Winsemius



On Feb 1, 2010, at 5:16 PM, James Rome wrote:

> Dear kind R helpers,
>
> I have a vector of runway names in rwy ("31R", "31L", ...; the
> number is user selectable).
> arrgnd is a data frame with data for all flights and all runways,
> with a Runway column.
> I am trying to subset arrgnd into a data frame for each selected
> runway, and then combine them back together using the following code:
>
> for (j in 1:nr) {  # nr = number of user-selected runways

Safer would be:

    for (j in seq_along(rwy)) {

>     ar4rw = arrgnd[arrgnd$Runway == rwy[j], ]

Clearer would be:

    ar4rw <- subset(arrgnd, Runway == rwy[j])  # and I think the NA lines
                                               # will also disappear.

>     if (j == 1) {
>         arrw = ar4rw
>     }
>     else {
>         arrw = merge(arrw, ar4rw)
>     }
> }

You really should give us something like:

    dput(rwy)
    dput(head(arrgnd, 10))


> but the merge step gives me a data frame with all NAs. In addition,
> ar4rw always gets a row with NAs at the start, which I do not
> understand. There are no rows with all NAs in the arrgnd data frame.
>
> > ar4rw[1:2,]  # first time through for 31R
>         DateTime       Date month hour minute quarter weekday IATA ICAO Flight
> NA            NA         NA    NA   NA     NA      NA      NA   NA   NA     NA
> 529 1/1/09 21:46 2009-01-01     1   21     46      87       5   TA  TAI TAI570
>     AircraftType   Tail  Arrived   STA Runway     FromTo Delay
> NA            NA     NA       NA    NA     NA         NA    NA
> 529         A320 N496TA 21:46:58 22:30    31R MSLP /KJFK     0
>                        Operator            dq gw
> NA                           NA            NA NA
> 529 TACA INTERNATIONAL AIRLINES 2009-01-01 87  1
>
> > ar4rw[1:2,]  # second time through for 31L
>         DateTime       Date month hour minute quarter weekday IATA ICAO Flight
> NA            NA         NA    NA   NA     NA      NA      NA   NA   NA     NA
> 552 1/1/09 23:03 2009-01-01     1   23      3      92       5   AA  AAL  AAL22
>     AircraftType   Tail  Arrived   STA Runway    FromTo Delay          Operator
> NA            NA     NA       NA    NA     NA        NA    NA                NA
> 552         B762 N329AA 23:03:35 23:10    31L LAX  /JFK     0 AMERICAN AIRLINES
>                dq gw
> NA             NA NA
> 552 2009-01-01 92  1
>
> But after the merge, I get all NAs. What am I doing wrong?

The data layout gets mangled and I cannot tell what rows are being
matched to what. Use dput to convey an unambiguous and easily
replicated example.
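
The two code suggestions earlier in this reply can be checked on toy data. The sketch below uses an invented four-row stand-in for arrgnd in which one flight has no recorded runway; whether the real data contain such NA values is an assumption, but it is one plausible source of the all-NA row. The last line shows a loop-free alternative with %in%, which is not from the thread.

```r
# Invented stand-in for arrgnd; the first flight has a missing Runway.
arrgnd <- data.frame(Flight = c("X1", "TAI570", "AAL22", "DAL1"),
                     Runway = c(NA, "31R", "31L", "22L"),
                     stringsAsFactors = FALSE)
rwy <- c("31R", "31L")

# 1:n counts down when n is 0, so the loop body would run twice;
# seq_along() gives an empty sequence instead.
n_bad  <- length(1:0)
n_good <- length(seq_along(character(0)))

# A logical index containing NA yields an all-NA row ...
by_bracket <- arrgnd[arrgnd$Runway == "31R", ]
# ... whereas subset() treats NA as FALSE and drops that row.
by_subset <- subset(arrgnd, Runway == "31R")

# All selected runways at once, with no loop and no merge():
arrw <- arrgnd[arrgnd$Runway %in% rwy, ]

print(by_bracket)
print(by_subset)
print(arrw)
```

Because %in% returns FALSE rather than NA for missing values, the final subset contains no all-NA rows.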




David Winsemius, MD
Heritage Laboratories
West Hartford, CT


[R] merging data frames with matrix objects when missing cases

2009-09-18 Thread Kari Ruohonen

Hi,
I have faced a problem with the merge() function when trying to merge
two data frames that have a common index but the second one does not
have cases for all indexes in the first one. With usual variables R
fills in the missing cases with NA if all=T is requested. But if the
variable is a matrix, R seems to insert NA only into the first column of
the matrix and fill in the rest of the columns by recycling the values.
Here is a toy example:

> df1 <- data.frame(a=1:3, X1=I(matrix(1:6, ncol=2)))
> df2 <- data.frame(a=1:2, X2=I(matrix(11:14, ncol=2)))
> merge(df1, df2)
  a X1.1 X1.2 X2.1 X2.2
1 1    1    4   11   13
2 2    2    5   12   14
# no all=T, missing cases are dropped

> merge(df1, df2, all=T)
  a X1.1 X1.2 X2.1 X2.2
1 1    1    4   11   13
2 2    2    5   12   14
3 3    3    6   NA   13
# X2.1 set to NA correctly but X2.2 set to 13 by recycling.

Can I somehow get the behaviour that the third row of the second matrix
X2 in the above example would be filled with NA for all columns? None of
the merge() options seems to provide a solution.

regards, Kari


Re: [R] merging data frames with matrix objects when missing cases

2009-09-18 Thread johannes rara

This has something to do with your data.frame structure

see

> str(df1)
'data.frame':   3 obs. of  2 variables:
 $ a : int  1 2 3
 $ X1: 'AsIs' int [1:3, 1:2] 1 2 3 4 5 6
> str(df2)
'data.frame':   2 obs. of  2 variables:
 $ a : int  1 2
 $ X2: 'AsIs' int [1:2, 1:2] 11 12 13 14

This seems to work

> df1 <- data.frame(a=1:3, b = 1:3, c = 4:6)
> str(df1)
'data.frame':   3 obs. of  3 variables:
 $ a: int  1 2 3
 $ b: int  1 2 3
 $ c: int  4 5 6
> df2 <- data.frame(a=1:2, d = 11:12, e = 13:14)
> str(df2)
'data.frame':   2 obs. of  3 variables:
 $ a: int  1 2
 $ d: int  11 12
 $ e: int  13 14
> merge(df1, df2)
  a b c  d  e
1 1 1 4 11 13
2 2 2 5 12 14
> merge(df1, df2, all=T)
  a b c  d  e
1 1 1 4 11 13
2 2 2 5 12 14
3 3 3 6 NA NA



Re: [R] merging data frames with matrix objects when missing cases

2009-09-18 Thread Kari Ruohonen

Yes, that was the original question: when a variable in a data frame is
a matrix instead of an ordinary variable, merge() handles the missing
cases so that only the first column of the matrix gets NA and the rest
are recycled. If the matrix is broken into several variables, everything
works fine.

Why then have a matrix in a data frame as a variable? In chemometrics,
for example, it is usual to have e.g. NIR spectra stored in the data
frame in this way. This eases the use of such spectra as a predictor in
the model formula (may contain hundreds of variables depending on the
wavelength binning used). It is also helpful in grouping variables in a
data frame to different predictor sets. See examples in the pls
package. 
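
As a rough illustration of why a matrix-valued column is convenient, the sketch below builds random stand-in "spectra" (the names spectra, NIR, d and fit are invented, and base lm() is used instead of the pls package) and refers to the whole block of predictors by a single name in the formula:

```r
set.seed(1)                                  # reproducible toy data
spectra <- matrix(rnorm(20 * 3), ncol = 3)   # stand-in for spectra at 3 wavelengths
d <- data.frame(y = rnorm(20), NIR = I(spectra))

print(dim(d))                  # NIR counts as one matrix-valued column
fit <- lm(y ~ NIR, data = d)   # one term covers every column of the matrix
print(length(coef(fit)))       # intercept plus one coefficient per matrix column
```

With hundreds of wavelengths the formula stays the same, which is the convenience described above.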

There is a workaround: search for the NAs in the first column and set
all other columns in those rows to NA as well. But my question was meant
more as a caution about unexpected behaviour that someone could consider
an unwanted feature.
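
A different route, sketched here under the assumption that a left join on the key column a is what is wanted, avoids merge() for the matrix column altogether and uses match() to pick rows. This is a substitute technique, not the workaround described above; the names idx and res are invented.

```r
df1 <- data.frame(a = 1:3, X1 = I(matrix(1:6, ncol = 2)))
df2 <- data.frame(a = 1:2, X2 = I(matrix(11:14, ncol = 2)))

idx <- match(df1$a, df2$a)             # NA where a key of df1 is absent from df2
res <- df1
res$X2 <- df2$X2[idx, , drop = FALSE]  # an NA row index yields a full row of NAs

X2 <- unclass(res$X2)                  # plain matrix, for inspection
print(X2)
```

Every column of X2 is NA in the unmatched third row, and X2 remains a single matrix-valued column of res.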

Kari


Re: [R] Merging data frames, or one column/vector with a data frame filling out empty rows with NA's

2009-04-27 Thread stephenb


You are exceeding your maximum memory here, so R will not be able to do
that. Dump both tables into a database such as MySQL and then run the
query either from RMySQL or from MySQL directly. Then output the result
and import it back into R.

That will take care of the merge, but I am not sure what will happen when
you actually try to run some stats on the object. It is very likely the
operation will exceed memory again.

In the end you may have to write your own code which does not attempt to
load everything into memory; it could be either R or a lower-level
language.

If you have SAS it will probably work, as it deals well with large sets
in long format. Depending on what you do, R may be able to deal with it
after a reshape() to a wide format.
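
Before moving to a database, one in-memory option that may be worth trying is a lookup with match(), which adds a single column instead of building a whole joined table. This is only a sketch: it assumes each (Animal, Marker) pair occurs at most once in SNP1, the toy values are copied from the head() output in the question, and no memory savings have been measured here.

```r
# Toy versions of the poster's tables (first rows only; SNP1 lacks P1007).
SNP4 <- data.frame(Animal = 194073197L,
                   Marker = c("P1001", "P1002", "P1004", "P1005", "P1006", "P1007"),
                   Y = 0.021088)
SNP1 <- data.frame(Animal = 194073197L,
                   Marker = c("P1001", "P1002", "P1004", "P1005", "P1006"),
                   x = c(2L, 1L, 2L, 0L, 2L))

# Composite key; assumes (Animal, Marker) is unique within SNP1.
key4 <- paste(SNP4$Animal, SNP4$Marker, sep = "\r")
key1 <- paste(SNP1$Animal, SNP1$Marker, sep = "\r")

SNP5 <- SNP4
SNP5$x <- SNP1$x[match(key4, key1)]  # NA where SNP1 has no matching row
print(SNP5)
```

The result has the same rows as SNP4, with x filled in where a match exists and NA elsewhere.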



[R] Merging data frames, or one column/vector with a data frame filling out empty rows with NA's

2009-04-22 Thread joe1985


Hello

I have two data frames, SNP4 and SNP1:

> head(SNP4)
        Animal Marker        Y
3213 194073197  P1001 0.021088
1295 194073197  P1002 0.021088
915  194073197  P1004 0.021088
2833 194073197  P1005 0.021088
1487 194073197  P1006 0.021088
1885 194073197  P1007 0.021088

> head(SNP1)
        Animal Marker x
3213 194073197  P1001 2
1295 194073197  P1002 1
915  194073197  P1004 2
2833 194073197  P1005 0
1487 194073197  P1006 2
1885 194073197  P1007 0

I want these two data frames merged by 'Marker', but when I try

> SNP5 <- merge(SNP4, SNP1, by = 'Marker', all = TRUE)
Error: cannot allocate vector of size 2.4 Gb
In addition: Warning messages:
1: In merge.data.frame(SNP4, SNP1, by = "Marker", all = TRUE) :
  Reached total allocation of 1535Mb: see help(memory.size)
2: In merge.data.frame(SNP4, SNP1, by = "Marker", all = TRUE) :
  Reached total allocation of 1535Mb: see help(memory.size)
3: In merge.data.frame(SNP4, SNP1, by = "Marker", all = TRUE) :
  Reached total allocation of 1535Mb: see help(memory.size)
4: In merge.data.frame(SNP4, SNP1, by = "Marker", all = TRUE) :
  Reached total allocation of 1535Mb: see help(memory.size)

an error occurs.

What I want is the column SNP1$x merged together with SNP4 by Marker, so
some markers will have NAs in the 'x' column of the SNP5 dataset.

I also tried this

> SNP5 <- merge(SNP4, SNP1$x, by.x = 'Marker', by.y = 'Marker', all = TRUE)
Error in fix.by(by.y, y) : 'by' must specify valid column(s)

It won't work either.

Does anyone have any idea how to solve this?

Regards,

Johannes.






Re: [R] Merging data frames, or one column/vector with a data frame filling out empty rows with NA's

2009-04-22 Thread Gabor Grothendieck

Try this (where SNP1x is same as SNP1 from your post
but without the last line).  If the merge below does not work
on real data set due to size then try the sqldf alternative
as it

> SNP1x <-
+ structure(list(Animal = c(194073197L, 194073197L, 194073197L,
+ 194073197L, 194073197L), Marker = structure(1:5, .Label = c("P1001",
+ "P1002", "P1004", "P1005", "P1006", "P1007"), class = "factor"),
+ x = c(2L, 1L, 2L, 0L, 2L)), .Names = c("Animal", "Marker",
+ "x"), row.names = c("3213", "1295", "915", "2833", "1487"),
+ class = "data.frame")

> SNP4 <-
+ structure(list(Animal = c(194073197L, 194073197L, 194073197L,
+ 194073197L, 194073197L, 194073197L), Marker = structure(1:6,
+ .Label = c("P1001", "P1002", "P1004", "P1005", "P1006", "P1007"),
+ class = "factor"),
+ Y = c(0.021088, 0.021088, 0.021088, 0.021088, 0.021088, 0.021088
+ )), .Names = c("Animal", "Marker", "Y"), class = "data.frame",
+ row.names = c("3213", "1295", "915", "2833", "1487", "1885"))

> merge(SNP1x, SNP4, all = TRUE)
     Animal Marker  x        Y
1 194073197  P1001  2 0.021088
2 194073197  P1002  1 0.021088
3 194073197  P1004  2 0.021088
4 194073197  P1005  0 0.021088
5 194073197  P1006  2 0.021088
6 194073197  P1007 NA 0.021088
> library(sqldf)
> sqldf("select * from SNP4 left join SNP1x using (Animal, Marker)")
     Animal Marker        Y  x
1 194073197  P1001 0.021088  2
2 194073197  P1002 0.021088  1
3 194073197  P1004 0.021088  2
4 194073197  P1005 0.021088  0
5 194073197  P1006 0.021088  2
6 194073197  P1007 0.021088 NA
> # or, if that does not work due to size, force it to create, use
> # and destroy an external database
> sqldf("select * from SNP4 left join SNP1x using (Animal, Marker)",
+       dbname = "temp.db")
     Animal Marker        Y  x
1 194073197  P1001 0.021088  2
2 194073197  P1002 0.021088  1
3 194073197  P1004 0.021088  2
4 194073197  P1005 0.021088  0
5 194073197  P1006 0.021088  2
6 194073197  P1007 0.021088 NA
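
For readers more familiar with SQL join names than with merge() arguments, the correspondence can be sketched on two invented three-row tables (x, y and their columns are illustrative only): the default is an inner join, all.x = TRUE is a left join like the SQL above, and all = TRUE is a full outer join like the merge() call above.

```r
# Invented tables sharing the key column "key".
x <- data.frame(key = c(1, 2, 3), vx = c("a", "b", "c"))
y <- data.frame(key = c(2, 3, 4), vy = c("B", "C", "D"))

inner <- merge(x, y, by = "key")                # keys 2 and 3
left  <- merge(x, y, by = "key", all.x = TRUE)  # keys 1, 2, 3; vy is NA for key 1
full  <- merge(x, y, by = "key", all = TRUE)    # keys 1, 2, 3, 4

print(inner)
print(left)
print(full)
```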



On Wed, Apr 22, 2009 at 5:22 AM, Johannes G. Madsen
j...@dansksvineproduktion.dk wrote:
 Hello

 I have two data frames, SNP4 and SNP1:

 head(SNP4)
          Animal     Marker        Y
 3213 194073197  P1001 0.021088
 1295 194073197  P1002 0.021088
 915   194073197  P1004 0.021088
 2833 194073197  P1005 0.021088
 1487 194073197  P1006 0.021088
 1885 194073197  P1007 0.021088

 head(SNP1)
           Animal    Marker x
 3213 194073197  P1001 2
 1295 194073197  P1002 1
 915   194073197  P1004 2
 2833 194073197  P1005 0
 1487 194073197  P1006 2
 1885 194073197  P1007 0

 I want these two data frames merged by 'Marker', but when i try

 SNP5 - merge(SNP4, SNP1, by = 'Marker', all = TRUE)
 Error: cannot allocate vector of size 2.4 Gb
 In addition: Warning messages:
 1: In merge.data.frame(SNP4, SNP1, by = Marker, all = TRUE) :
  Reached total allocation of 1535Mb: see help(memory.size)
 2: In merge.data.frame(SNP4, SNP1, by = Marker, all = TRUE) :
  Reached total allocation of 1535Mb: see help(memory.size)
 3: In merge.data.frame(SNP4, SNP1, by = Marker, all = TRUE) :
  Reached total allocation of 1535Mb: see help(memory.size)
 4: In merge.data.frame(SNP4, SNP1, by = Marker, all = TRUE) :
  Reached total allocation of 1535Mb: see help(memory.size)

 And error occurs.

 What i want is the column SNP1$x merged together with SNP4 by Marker, so some
 markers will have NA's in the 'x'-column in the SNP5 dataset.

 I also tried this

 SNP5 - merge(SNP4, SNP1$x, by.x = 'Marker', by.y = 'Marker', all = TRUE)
 Error in fix.by(by.y, y) : 'by' must specify valid column(s)

 I won't work either.

 Does anyone have any idea how to solve this.

 Regards,

 Johannes.


Re: [R] Merging data frames, or one column/vector with a data frame filling out empty rows with NA's

2009-04-22 Thread David Winsemius



On Apr 22, 2009, at 5:22 AM, Johannes G. Madsen wrote:


Hello

I have two data frames, SNP4 and SNP1:


head(SNP4)

        Animal Marker        Y
3213 194073197  P1001 0.021088
1295 194073197  P1002 0.021088
915  194073197  P1004 0.021088
2833 194073197  P1005 0.021088
1487 194073197  P1006 0.021088
1885 194073197  P1007 0.021088


head(SNP1)

        Animal Marker x
3213 194073197  P1001 2
1295 194073197  P1002 1
915  194073197  P1004 2
2833 194073197  P1005 0
1487 194073197  P1006 2
1885 194073197  P1007 0

I want these two data frames merged by 'Marker', but when I try


SNP5 <- merge(SNP4, SNP1, by = 'Marker', all = TRUE)

Error: cannot allocate vector of size 2.4 Gb
In addition: Warning messages:
1: In merge.data.frame(SNP4, SNP1, by = "Marker", all = TRUE) :
 Reached total allocation of 1535Mb: see help(memory.size)
2: In merge.data.frame(SNP4, SNP1, by = "Marker", all = TRUE) :
 Reached total allocation of 1535Mb: see help(memory.size)
3: In merge.data.frame(SNP4, SNP1, by = "Marker", all = TRUE) :
 Reached total allocation of 1535Mb: see help(memory.size)
4: In merge.data.frame(SNP4, SNP1, by = "Marker", all = TRUE) :
 Reached total allocation of 1535Mb: see help(memory.size)

An error occurs.


So what are the results of:

str(SNP4); str(SNP1)  # this will tell us how large these objects are


And are you sure you don't want the merge to occur by Animal as well?
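
To illustrate why the choice of key matters for memory, here is a small sketch with hypothetical toy data (3 animals, 2 markers; none of these values come from the thread). If each marker occurs once per animal in both data frames, merging on Marker alone pairs every row of one with every row of the other that shares the marker, so the result grows roughly with the square of the number of animals.

```r
# Hypothetical toy data: 3 animals, each genotyped at the same 2 markers
animals <- c(101, 102, 103)
markers <- c("P1001", "P1002")
SNP4 <- expand.grid(Animal = animals, Marker = markers,
                    stringsAsFactors = FALSE)
SNP4$Y <- 0.021088
SNP1 <- expand.grid(Animal = animals, Marker = markers,
                    stringsAsFactors = FALSE)
SNP1$x <- c(2, 1, 2, 0, 2, 0)

# Merging on Marker alone: 3 * 3 row pairs for each of the 2 markers
by_marker <- merge(SNP4, SNP1, by = "Marker", all = TRUE)
nrow(by_marker)   # 18

# Merging on both keys: one row per Animal/Marker combination
by_both <- merge(SNP4, SNP1, by = c("Animal", "Marker"), all = TRUE)
nrow(by_both)     # 6
```

With thousands of animals per marker the first form can easily exceed available memory, which may be what happened here; that is a guess, since the sizes of the real objects were not posted.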




What I want is the column SNP1$x merged together with SNP4 by Marker, so some
markers will have NA's in the 'x'-column in the SNP5 dataset.


I also tried this

SNP5 <- merge(SNP4, SNP1$x, by.x = 'Marker', by.y = 'Marker', all = TRUE)

Error in fix.by(by.y, y) : 'by' must specify valid column(s)

It won't work either.

Does anyone have any idea how to solve this?


The second error seems pretty obvious. You are trying to merge a
vector that no longer has a Marker column with a data frame that does.



Regards,

Johannes.


David Winsemius, MD
Heritage Laboratories
West Hartford, CT


Re: [R] Merging data frames, or one column/vector with a data frame filling out empty rows with NA's

2009-04-22 Thread Sarah Goslee

Hi,

How about this:

 SNP5 <- merge(SNP4, SNP1[,2:3], all.x=TRUE)
 SNP5
  Marker    Animal        Y x
1  P1001 194073197 0.021088 2
2  P1002 194073197 0.021088 1
3  P1004 194073197 0.021088 2
4  P1005 194073197 0.021088 0
5  P1006 194073197 0.021088 2
6  P1007 194073197 0.021088 0

This ignores Animal, and that may or may not be what you want -
it wasn't clear from your question.

But your error is due to memory limitations - could be due to
specifying the wrong merge, or to having files larger than your
computer can handle. This is a good job for a proper database.

 SNP5 <- merge(SNP4, SNP1$x, by.x = 'Marker', by.y = 'Marker', all = TRUE)
 Error in fix.by(by.y, y) : 'by' must specify valid column(s)

If you pass only SNP1$x, there is no Marker column to merge on. You
need to include at least two columns: the key and the value.
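
A short sketch of the difference, using made-up three-row stand-ins for SNP4 and SNP1 (assumptions for illustration, not the poster's data). Selecting columns by name rather than by position keeps the key next to the value and does not depend on column order.

```r
# Made-up stand-ins for the data frames in the thread
SNP4 <- data.frame(Animal = 194073197,
                   Marker = c("P1001", "P1002", "P1004"),
                   Y = 0.021088)
SNP1 <- data.frame(Animal = 194073197,
                   Marker = c("P1001", "P1002"),
                   x = c(2, 1))

# SNP1$x is a bare numeric vector, so it carries no Marker column
is.data.frame(SNP1$x)   # FALSE

# The failing call, wrapped so the error message is captured instead of stopping
failed <- tryCatch(
  merge(SNP4, SNP1$x, by.x = "Marker", by.y = "Marker", all = TRUE),
  error = function(e) conditionMessage(e)
)

# Keep the key (Marker) together with the value (x)
SNP5 <- merge(SNP4, SNP1[, c("Marker", "x")], by = "Marker", all.x = TRUE)
print(SNP5)
```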

-- 
Sarah Goslee
http://www.functionaldiversity.org


Re: [R] Merging data frames of different length

2008-12-20 Thread Dimitri Liakhovitski

Thanks a lot, Gabor - it's perfect!
Dimitri


-- 
Dimitri Liakhovitski
MarketTools, Inc.
dimitri.liakhovit...@markettools.com


[R] Merging data frames of different length

2008-12-19 Thread Dimitri Liakhovitski

Hello, everyone!

I have a list L that contains 99 data frames. All data frames have only
one row, but a different number of columns. Some data frames have 3
columns, some - 6 columns, and some - 9 columns. The names of the
first 3 columns are identical in all 99 data frames (e.g., A, B, and
C). The names of columns 4:6 are identical in data frames that contain
6 and 9 columns (e.g., D, E, and F). So that L looks like this:

L[[1]]
A B C
2 3 4
L[[2]]
A B C D E F
2 1 3 2 4 5
L[[3]]
A B C D E F G H I
1 2 4 3 2 4 5 4 2
L[[4]]
...


How can I merge all of those data frames into one large data frame -
with 99 rows - such that all data are in the columns with correct
names. Of course, I'd like the rows of the new large data frame that
contain the data for less than 9 columns to have NAs in columns 4:9
(or 7:9).
In other words, I want the first 3 rows of the new large data frame to
look like this:
A B C D E F G H I
2 3 4 NA NA NA NA NA NA
2 1 3 2 4 5 NA NA NA
1 2 4 3 2 4 5 4 2

Ideally, I'd like this merge to work for ANY number of individual
small data frames in L - even if their total number within L is
unknown.

I tried merge - but it seems to me that it only works for 2 data
frames, not for many.
Thank you very much!
-- 
Dimitri Liakhovitski
MarketTools, Inc.


Re: [R] Merging data frames of different length

2008-12-19 Thread Gabor Grothendieck

Try this:

 L <- list(data.frame(A=2, B=3, C=4),
           data.frame(A=2, B=1, C=3, D=2, E=4, F=5),
           data.frame(A=1, B=2, C=4, D=3, E=2, F=4, G=5, H=4, I=2))

 library(plyr)
 do.call(rbind.fill, L)
  A B C  D  E  F  G  H  I
1 2 3 4 NA NA NA NA NA NA
2 2 1 3  2  4  5 NA NA NA
3 1 2 4  3  2  4  5  4  2
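
If plyr is not available, roughly the same result can be had in base R. This is only a sketch of one way to do it, not what rbind.fill does internally: pad each data frame with NA for the columns it lacks, then rbind. The helper names all_cols and padded are made up for this example.

```r
L <- list(data.frame(A = 2, B = 3, C = 4),
          data.frame(A = 2, B = 1, C = 3, D = 2, E = 4, F = 5),
          data.frame(A = 1, B = 2, C = 4, D = 3, E = 2, F = 4, G = 5, H = 4, I = 2))

# Union of all column names, in order of first appearance
all_cols <- unique(unlist(lapply(L, names)))

# Add each missing column as NA, then put the columns in a common order
padded <- lapply(L, function(df) {
  for (nm in setdiff(all_cols, names(df))) df[[nm]] <- NA
  df[all_cols]
})

# Works for any number of data frames in the list
combined <- do.call(rbind, padded)
print(combined)
```

For a long list with many columns this will likely be slower than rbind.fill, but it has no package dependency.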




[R] merging data frames

2008-05-16 Thread Srinivas Iyyer

Dear group, 
I have 3 different data frames. I want to merge all 3
data frames wherever they intersect.

Say DF1 and DF2 have 100 common elements in column 1.

DF3 does not have many elements in common with either DF1 or
DF2.

For names in column 1 not present in DF3, I want to
introduce NA.
DF1:
Name   Age
A   21
B   45
C   30

DF2:
Name   Age
A   50
B   20
X   10

DF3:
Name   Age
B   40
Y   21
K   30

I want to merge all 3 into one:


Df4:

Name.1  Age.1  Age.2  Age.3
A       21     50     NA
B       45     20     40
C       30     NA     NA
K       NA     NA     30
X       NA     10     NA
Y       NA     NA     21


Could anyone help me merge the 3 data frames?

I appreciate your help. Thank you.

srini


Re: [R] merging data frames

2008-05-16 Thread Yasir Kaheil


 DF1 <- data.frame(Name=as.factor(c("A","B","C")), Age=c(21,45,30))
 DF2 <- data.frame(Name=as.factor(c("A","B","X")), Age=c(50,20,10))
 DF3 <- data.frame(Name=as.factor(c("B","Y","K")), Age=c(40,21,30))

 merge(merge(DF1, DF2, by.x="Name", by.y="Name", all=TRUE),
       DF3, by.x="Name", by.y="Name", all=TRUE)
  Name Age.x Age.y Age
1    A    21    50  NA
2    B    45    20  40
3    C    30    NA  NA
4    X    NA    10  NA
5    K    NA    NA  30
6    Y    NA    NA  21
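
The nested call above can be generalised to any number of data frames. The following is a sketch under the assumption that every data frame has the columns Name and Age; the list name dfs and the result name DF4 are chosen for this example. Renaming Age before merging gives the Age.1, Age.2, Age.3 headers asked for, instead of relying on merge()'s automatic .x/.y suffixes.

```r
DF1 <- data.frame(Name = c("A", "B", "C"), Age = c(21, 45, 30))
DF2 <- data.frame(Name = c("A", "B", "X"), Age = c(50, 20, 10))
DF3 <- data.frame(Name = c("B", "Y", "K"), Age = c(40, 21, 30))
dfs <- list(DF1, DF2, DF3)

# Give each Age column a distinct name (Age.1, Age.2, ...) before merging
for (i in seq_along(dfs)) {
  names(dfs[[i]])[names(dfs[[i]]) == "Age"] <- paste0("Age.", i)
}

# Fold a full outer join on Name over the list, then sort by Name
DF4 <- Reduce(function(a, b) merge(a, b, by = "Name", all = TRUE), dfs)
DF4 <- DF4[order(DF4$Name), ]
print(DF4)
```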

thanks
y


-
Yasir H. Kaheil
Catchment Research Facility
The University of Western Ontario 

-- 
View this message in context: 
http://www.nabble.com/merging-data-frames-tp17286503p17287302.html
Sent from the R help mailing list archive at Nabble.com.
