Thank you very much for these clarifying responses, Gabor. I had mistakenly assumed that creating the index on Tid restricted the natural join to joining on Tid. Can you describe when and how indices speed up joins, or can you point me to resources that address this? Is it only for natural joins or any joins (including, say, a select statement with where clause)?
thanks, nick On Thu, May 20, 2010 at 11:42 AM, Gabor Grothendieck < ggrothendi...@gmail.com> wrote: > Although that works I had meant to write: > > > names(B)[2] <- "dfNameB" > > # ... other commands > > sqldf('select * from main.A natural join main.B') > > so that now only Tid is in common so the natural join just picks it up > and also the heuristic works again since we no longer retrieve > duplicate column names. > > On Thu, May 20, 2010 at 12:32 PM, Gabor Grothendieck > <ggrothendi...@gmail.com> wrote: > > There are two problems: > > > > 1. A natural join will join all columns with the same names in the two > > tables and that includes not only Tid but also dfName and since there > > are no rows that have the same Tid and dfName the result has zero > > rows. > > > > 2. the heuristic it uses fails when you retrieve the same column name > > from multiple tables so use method = "raw" to turn off the heuristic. > > The heuristic will be improved to cover this case in the future. > > Read FAQ #1 on the home page: > > > http://code.google.com/p/sqldf/#1._How_does_sqldf_handle_classes_and_factors > ? > > > > This should work: > > > >> sqldf('select * from main.A join main.B using(Tid)', method = "raw") > > Tid dfName dfName > > 1 AES 01-01-02 11:53:00 a b > > 2 AES 01-01-05\n10:58:00 a b > > 3 AES 01-01-11 12:30:00 a b > > > > This works too as the double dfName no longer exists to confuse the > heuristic: > > > > names(B)[2] <- "dfNameB" > > sqldf('select * from main.A join main.B using(Tid)') > > > > > > > > On Thu, May 20, 2010 at 12:04 PM, Nick Switanek <nswita...@gmail.com> > wrote: > >> Hello, > >> > >> I'm having trouble discovering what's going wrong with my use of natural > >> joins via sqldf. > >> > >> Following the instructions under 4i at http://code.google.com/p/sqldf/, > >> which discusses creating indices to speed joins, I have been only > unreliably > >> able to get natural joins to work. > >> > >> For example, > >> > >>> Tid <- c('AES 01-01-02 10:58:00', 'AES 01-01-02 11:53:00', 'AES > 01-01-05 > >> 10:58:00', 'AES 01-01-11 12:30:00') > >>> A <- data.frame(Tid, dfName = 'a') > >>> B <- data.frame(Tid = Tid[2:4], dfName = 'b') > >>> C <- data.frame(Tid = Tid[1:3], dfName = 'c') > >> > >> # then use the sqldf library > >>> library(sqldf) > >>> sqldf() > >> > >> # to create indices on the Tid variable shared across data.frames > >>> sqldf('create index indA on A(Tid)') > >>> sqldf('create index indB on B(Tid)') > >>> sqldf('create index indC on C(Tid)') > >> > >> # check to make sure everything is there > >>> sqldf('select * from sqlite_master') > >> > >> # doing a natural join (implicitly on Tid) > >> # does not give the expected joins > >>> sqldf('select * from main.A natural join main.B') > >> [1] Tid dfName > >> <0 rows> (or 0-length row.names) > >>> sqldf('select * from main.A natural join main.C') > >> [1] Tid dfName > >> <0 rows> (or 0-length row.names) > >>> sqldf('select * from main.B natural join main.C') > >> [1] Tid dfName > >> <0 rows> (or 0-length row.names) > >> > >> # even using a where clause (which doesn't have the efficiency qualities > I > >> need the indexed natural joins for) is problematic, setting values of > the > >> dfName variable incorrectly for the data from C > >>> sqldf('select * from main.B b, main.C c where b.Tid = c.Tid') > >> Tid dfName Tid dfName > >> 1 AES 01-01-02 11:53:00 b AES 01-01-02 11:53:00 b > >> 2 AES 01-01-05 10:58:00 b AES 01-01-05 10:58:00 b > >> > >> I'm grateful for your guidance on what I'm doing wrong with the natural > join > >> in sqldf. > >> > >> many thanks, > >> Nick > > > [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.