Re: [R] RMySQL - Bulk loading data and creating FK links

2010-01-28 Thread Gabor Grothendieck
Regarding the explanation of where the time goes, it might be parsing
the statement or developing the query plan.  The SQL statement
for the more complex query is obviously much longer, and its generated
query plan involves 95 lines of byte code vs 19 lines of generated
code for the simpler query.
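For anyone wanting to reproduce such counts, SQLite's EXPLAIN returns the
generated byte-code program as an ordinary result set, so it can in principle
be run through sqldf like any other statement. A hedged sketch (the data frame
a here is only a placeholder; the two queries are the ones from the thread,
and this assumes EXPLAIN output comes back through sqldf like a normal query):

library(sqldf)

# placeholder data frame standing in for the real 'a' used in the thread
a <- data.frame(var1 = 1L, var2 = 1L, var3 = 1L, dt = 1L)

# byte code for the simpler query
plan_simple <- sqldf("explain select * from a")

# byte code for the more complex query posted earlier in the thread
plan_complex <- sqldf("explain select var1, var2, var3, dt from a, (select var1, var2,
  var3, min(dt) mindt from a group by var1, var2, var3) using(var1, var2,
  var3) where dt - mindt < 7")

nrow(plan_simple)    # number of byte-code lines for the simple query
nrow(plan_complex)   # number of byte-code lines for the complex query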

On Thu, Jan 28, 2010 at 2:02 PM, Matthew Dowle  wrote:
> I'm talking about ease of use too.  The first line of the Details section in
> ?"[.data.table" says :
>   "Builds on base R functionality to reduce 2 types of time :
>       1. programming time (easier to write, read, debug and maintain)
>       2. compute time"
>
> Once again, I am merely saying that the user has choices, and the best
> choice (and there are many choices including plyr, and lots of other great
> packages and base methods) depends on the task and the real goal.   This
> choice is not restricted to compute time only, as you seem to suggest.  In
> fact I listed programming time first (i.e. ease of use).
>
> To answer your points :
>
> This is the SQL code you posted and I used in the comparison. Notice it's
> quite long, repeats the text "var1,var2,var3" 4 times, and contains two
> 'select's and a 'using'.
>> system.time(sqldf("select var1, var2, var3, dt from a, (select var1, var2,
>> var3, min(dt) mindt from a group by var1, var2, var3) using(var1, var2,
>> var3) where dt - mindt < 7"))
>   user  system elapsed
>  103.13    2.17  106.23
>
> Isolating the series of operations you described :
>> system.time(sqldf("select * from a"))
>   user  system elapsed
>  39.00    0.63   39.62
>
> So that's roughly 40% of the time. What's happening in the remaining 66 secs?
>
> Here's a repeat of the equivalent in data.table :
>
>> system.time({adt<<-data.table(a)})
>   user  system elapsed
>   0.90    0.13    1.03
>> system.time(adt[ , list(dt=dt[dt-min(dt)<7]) , by="var1,var2,var3"])
>> #  is that so hard to use compared to the SQL above ?
>   user  system elapsed
>   3.92    0.78    4.71
>
> I looked at the news section, but I didn't find the benchmarks quickly or
> easily.  The links I saw took me to the FAQs.
>
>
>
> "Gabor Grothendieck"  wrote in message
> news:971536df1001280855i1d5f7c03v46f7a3e58ff93...@mail.gmail.com...
> I think one would only be concerned about such internals if one were
> primarily interested in performance; otherwise, one would be more
> interested in ease of specification and part of that ease is having it
> independent of implementation and separating implementation from
> specification activities.  An example of separation of specification
> and implementation is that by simply specifying a disk-based database
> rather than an in-memory database SQL can perform queries that take
> more space than memory.  The query itself need not be modified.
>
> I think the viewpoint you are discussing is primarily one of
> performance whereas the viewpoint I was discussing is primarily ease
> of use and that accounts for the difference.
>
> I believe your performance comparison is comparing a sequence of
> operations that include building a database, transferring data to it,
> performing the operation, reading it back in and destroying the
> database to an internal manipulation.  I would expect the internal
> manipulation, particularly one done primarily in C code as is the case
> with data.table, to be faster although some benchmarks of the database
> approach found that it compared surprisingly well to straight R code
> -- some users of sqldf found that for an 8000 row data frame sqldf
> actually ran faster than aggregate and also faster than tapply.  The
> News section on the sqldf home page provides links to their
> benchmarks.  Thus if R is fast enough then it's likely that the
> database approach is fast enough too, since in those benchmarks it was even faster.
>
> On Thu, Jan 28, 2010 at 8:52 AM, Matthew Dowle 
> wrote:
>> Are you claiming that SQL is that utopia? SQL is a row store. It cannot
>> give the user the benefits of column store.
>>
>> For example, why does SQL take 113 seconds in the example in this thread :
>> http://tolstoy.newcastle.edu.au/R/e9/help/10/01/1872.html
>> but data.table takes 5 seconds to get the same result ? How come the high
>> level language SQL doesn't appear to hide the user from this detail ?
>>
>> If you are just describing utopia, then of course I agree. It would be
>> great to have a language which hid us from this. In the meantime the user
>> has choices, and the best choice depends on the task and the real goal.
>>
>> "Gabor Grothendieck"  wrote in message
>> news:971536df1001280428p345f8ff4v5f3a80c13f96d...@mail.gmail.com...
>> Its only important internally. Externally its undesirable that the
>> user have to get involved in it. The idea of making software easy to
>> write and use is to hide the implementation and focus on the problem.
>> That is why we use high level languages, object orientation, etc.
>>
>> On Thu, Jan 28, 2010 at 4:37 AM, Matthew Dowle 
>> wrote:
>>> How it represents data internally is very important, depending on the real
>>> goal :
>>> http://en.wikipedia.org/wiki/Column-oriented_DBMS

Re: [R] RMySQL - Bulk loading data and creating FK links

2010-01-28 Thread Matthew Dowle
I'm talking about ease of use too.  The first line of the Details section in 
?"[.data.table" says :
   "Builds on base R functionality to reduce 2 types of time :
   1. programming time (easier to write, read, debug and maintain)
   2. compute time"

Once again, I am merely saying that the user has choices, and the best 
choice (and there are many choices including plyr, and lots of other great 
packages and base methods) depends on the task and the real goal.   This 
choice is not restricted to compute time only, as you seem to suggest.  In 
fact I listed programming time first (i.e. ease of use).

To answer your points :

This is the SQL code you posted and I used in the comparison. Notice it's 
quite long, repeats the text "var1,var2,var3" 4 times, and contains two 
'select's and a 'using'.
> system.time(sqldf("select var1, var2, var3, dt from a, (select var1, var2, 
> var3, min(dt) mindt from a group by var1, var2, var3) using(var1, var2, 
> var3) where dt - mindt < 7"))
   user  system elapsed
 103.13    2.17  106.23

Isolating the series of operations you described :
> system.time(sqldf("select * from a"))
   user  system elapsed
  39.00    0.63   39.62

So that's roughly 40% of the time. What's happening in the remaining 66 secs?

Here's a repeat of the equivalent in data.table :

> system.time({adt<<-data.table(a)})
   user  system elapsed
   0.90    0.13    1.03
> system.time(adt[ , list(dt=dt[dt-min(dt)<7]) , by="var1,var2,var3"]) 
> #  is that so hard to use compared to the SQL above ?
   user  system elapsed
   3.92    0.78    4.71
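For completeness, a self-contained sketch of the comparison above. The data
frame a is not shown in this excerpt, so it is simulated here; the column
names, the two calls and the "< 7" threshold are taken from the thread, while
the size and distribution of the simulated data are arbitrary assumptions (so
the timings will differ from those quoted):

library(sqldf)
library(data.table)

set.seed(1)
n <- 1e6
a <- data.frame(var1 = sample(100, n, replace = TRUE),
                var2 = sample(100, n, replace = TRUE),
                var3 = sample(100, n, replace = TRUE),
                dt   = sample(1000, n, replace = TRUE))

# the sqldf version posted earlier in the thread
system.time(sqldf("select var1, var2, var3, dt from a, (select var1, var2,
  var3, min(dt) mindt from a group by var1, var2, var3) using(var1, var2,
  var3) where dt - mindt < 7"))

# the data.table equivalent
system.time({adt <- data.table(a)})
system.time(adt[ , list(dt = dt[dt - min(dt) < 7]) , by = "var1,var2,var3"])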

I looked at the news section, but I didn't find the benchmarks quickly or 
easily.  The links I saw took me to the FAQs.



"Gabor Grothendieck"  wrote in message 
news:971536df1001280855i1d5f7c03v46f7a3e58ff93...@mail.gmail.com...
I think one would only be concerned about such internals if one were
primarily interested in performance; otherwise, one would be more
interested in ease of specification and part of that ease is having it
independent of implementation and separating implementation from
specification activities.  An example of separation of specification
and implementation is that by simply specifying a disk-based database
rather than an in-memory database SQL can perform queries that take
more space than memory.  The query itself need not be modified.

I think the viewpoint you are discussing is primarily one of
performance whereas the viewpoint I was discussing is primarily ease
of use and that accounts for the difference.

I believe your performance comparison is comparing a sequence of
operations that include building a database, transferring data to it,
performing the operation, reading it back in and destroying the
database to an internal manipulation.  I would expect the internal
manipulation, particularly one done primarily in C code as is the case
with data.table, to be faster although some benchmarks of the database
approach found that it compared surprisingly well to straight R code
-- some users of sqldf found that for an 8000 row data frame sqldf
actually ran faster than aggregate and also faster than tapply.  The
News section on the sqldf home page provides links to their
benchmarks.  Thus if R is fast enough then it's likely that the
database approach is fast enough too, since in those benchmarks it was even faster.

On Thu, Jan 28, 2010 at 8:52 AM, Matthew Dowle  
wrote:
> Are you claiming that SQL is that utopia? SQL is a row store. It cannot
> give the user the benefits of column store.
>
> For example, why does SQL take 113 seconds in the example in this thread :
> http://tolstoy.newcastle.edu.au/R/e9/help/10/01/1872.html
> but data.table takes 5 seconds to get the same result ? How come the high
> level language SQL doesn't appear to hide the user from this detail ?
>
> If you are just describing utopia, then of course I agree. It would be
> great to have a language which hid us from this. In the meantime the user
> has choices, and the best choice depends on the task and the real goal.
>
> "Gabor Grothendieck"  wrote in message
> news:971536df1001280428p345f8ff4v5f3a80c13f96d...@mail.gmail.com...
> Its only important internally. Externally its undesirable that the
> user have to get involved in it. The idea of making software easy to
> write and use is to hide the implementation and focus on the problem.
> That is why we use high level languages, object orientation, etc.
>
> On Thu, Jan 28, 2010 at 4:37 AM, Matthew Dowle 
> wrote:
>> How it represents data internally is very important, depending on the 
>> real
>> goal :
>> http://en.wikipedia.org/wiki/Column-oriented_DBMS
>>
>>
>> "Gabor Grothendieck"  wrote in message
>> news:971536df1001271710o4ea62333l7f1230b860114...@mail.gmail.com...
>> How it represents data internally should not be important as long as
>> you can do what you want. SQL is declarative so you just specify what
>> you want rather than how to get it and invisibly to the user it
>> automatically draws up a query plan and then uses that plan to get the
>> result.

Re: [R] RMySQL - Bulk loading data and creating FK links

2010-01-28 Thread Gabor Grothendieck
I think one would only be concerned about such internals if one were
primarily interested in performance; otherwise, one would be more
interested in ease of specification and part of that ease is having it
independent of implementation and separating implementation from
specification activities.  An example of separation of specification
and implementation is that by simply specifying a disk-based database
rather than an in-memory database SQL can perform queries that take
more space than memory.  The query itself need not be modified.
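As an aside, with sqldf the switch between the two is just the dbname
argument; a minimal sketch (the data frame DF and the query are illustrative
assumptions):

library(sqldf)
DF <- data.frame(g = rep(1:2, 5), x = 1:10)

# default: an in-memory SQLite database
sqldf("select g, sum(x) as total from DF group by g")

# the same query against a temporary on-disk database; only dbname changes
sqldf("select g, sum(x) as total from DF group by g", dbname = tempfile())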

I think the viewpoint you are discussing is primarily one of
performance whereas the viewpoint I was discussing is primarily ease
of use and that accounts for the difference.

I believe your performance comparison is comparing a sequence of
operations that include building a database, transferring data to it,
performing the operation, reading it back in and destroying the
database to an internal manipulation.  I would expect the internal
manipulation, particularly one done primarily in C code as is the case
with data.table, to be faster although some benchmarks of the database
approach found that it compared surprisingly well to straight R code
-- some users of sqldf found that for an 8000 row data frame sqldf
actually ran faster than aggregate and also faster than tapply.  The
News section on the sqldf home page provides links to their
benchmarks.  Thus if R is fast enough then it's likely that the
database approach is fast enough too, since in those benchmarks it was even faster.

On Thu, Jan 28, 2010 at 8:52 AM, Matthew Dowle  wrote:
> Are you claiming that SQL is that utopia?  SQL is a row store.  It cannot
> give the user the benefits of column store.
>
> For example, why does SQL take 113 seconds in the example in this thread :
> http://tolstoy.newcastle.edu.au/R/e9/help/10/01/1872.html
> but data.table takes 5 seconds to get the same result ? How come the high
> level language SQL doesn't appear to hide the user from this detail ?
>
> If you are just describing utopia, then of course I agree.  It would be
> great to have a language which hid us from this.  In the meantime the user
> has choices, and the best choice depends on the task and the real goal.
>
> "Gabor Grothendieck"  wrote in message
> news:971536df1001280428p345f8ff4v5f3a80c13f96d...@mail.gmail.com...
> Its only important internally.  Externally its undesirable that the
> user have to get involved in it.  The idea of making software easy to
> write and use is to hide the implementation and focus on the problem.
> That is why we use high level languages, object orientation, etc.
>
> On Thu, Jan 28, 2010 at 4:37 AM, Matthew Dowle 
> wrote:
>> How it represents data internally is very important, depending on the real
>> goal :
>> http://en.wikipedia.org/wiki/Column-oriented_DBMS
>>
>>
>> "Gabor Grothendieck"  wrote in message
>> news:971536df1001271710o4ea62333l7f1230b860114...@mail.gmail.com...
>> How it represents data internally should not be important as long as
>> you can do what you want. SQL is declarative so you just specify what
>> you want rather than how to get it and invisibly to the user it
>> automatically draws up a query plan and then uses that plan to get the
>> result.
>>
>> On Wed, Jan 27, 2010 at 12:48 PM, Matthew Dowle 
>> wrote:
>>>
 sqldf("select * from BOD order by Time desc limit 3")
>>> Exactly. SQL requires use of order by. It knows the order, but it isn't
>>> ordered. Thats not good, but might be fine, depending on what the real
>>> goal
>>> is.
>>>
>>>
>>> "Gabor Grothendieck"  wrote in message
>>> news:971536df1001270629w4795da89vb7d77af6e4e8b...@mail.gmail.com...
>>> On Wed, Jan 27, 2010 at 8:56 AM, Matthew Dowle 
>>> wrote:
 How many columns, and of what type are the columns ? As Olga asked too,
 it
 would be useful to know more about what you're really trying to do.

 3.5m rows is not actually that many rows, even for 32bit R. Its depends
 on
 the columns and what you want to do with those columns.

 At the risk of suggesting something before we know the full facts, one
 possibility is to load the data from flat file into data.table. Use
 setkey()
 to set your keys. Use tables() to summarise your various tables. Then do
 your joins etc all-in-R. data.table has fast ways to do those sorts of
 joins (but we need more info about your task).

 Alternatively, you could check out the sqldf website. There is an
 sqlread.csv (or similar name) which can read your files directly into
 SQL
>>>
>>> read.csv.sql
>>>
 instead of going via R. Gabor has some nice examples there about that
 and
 its faster.

 You use some buzzwords which makes me think that SQL may not be
 appropriate
 for your task though. Can't say for sure (because we don't have enough
 information) but its possible you are struggling because SQL has no row
 ordering concept built in. That might be why you've created an increment
>>>
>>> 

Re: [R] RMySQL - Bulk loading data and creating FK links

2010-01-28 Thread Matthew Dowle
Are you claiming that SQL is that utopia?  SQL is a row store.  It cannot 
give the user the benefits of column store.

For example, why does SQL take 113 seconds in the example in this thread :
http://tolstoy.newcastle.edu.au/R/e9/help/10/01/1872.html
but data.table takes 5 seconds to get the same result ? How come the
high-level language SQL doesn't appear to shield the user from this detail ?

If you are just describing utopia, then of course I agree.  It would be
great to have a language which hid this from us.  In the meantime the user
has choices, and the best choice depends on the task and the real goal.

"Gabor Grothendieck"  wrote in message 
news:971536df1001280428p345f8ff4v5f3a80c13f96d...@mail.gmail.com...
It's only important internally.  Externally it's undesirable that the
user have to get involved in it.  The idea of making software easy to
write and use is to hide the implementation and focus on the problem.
That is why we use high level languages, object orientation, etc.

On Thu, Jan 28, 2010 at 4:37 AM, Matthew Dowle  
wrote:
> How it represents data internally is very important, depending on the real
> goal :
> http://en.wikipedia.org/wiki/Column-oriented_DBMS
>
>
> "Gabor Grothendieck"  wrote in message
> news:971536df1001271710o4ea62333l7f1230b860114...@mail.gmail.com...
> How it represents data internally should not be important as long as
> you can do what you want. SQL is declarative so you just specify what
> you want rather than how to get it and invisibly to the user it
> automatically draws up a query plan and then uses that plan to get the
> result.
>
> On Wed, Jan 27, 2010 at 12:48 PM, Matthew Dowle 
> wrote:
>>
>>> sqldf("select * from BOD order by Time desc limit 3")
>> Exactly. SQL requires use of order by. It knows the order, but it isn't
>> ordered. Thats not good, but might be fine, depending on what the real
>> goal
>> is.
>>
>>
>> "Gabor Grothendieck"  wrote in message
>> news:971536df1001270629w4795da89vb7d77af6e4e8b...@mail.gmail.com...
>> On Wed, Jan 27, 2010 at 8:56 AM, Matthew Dowle 
>> wrote:
>>> How many columns, and of what type are the columns ? As Olga asked too,
>>> it
>>> would be useful to know more about what you're really trying to do.
>>>
>>> 3.5m rows is not actually that many rows, even for 32bit R. Its depends
>>> on
>>> the columns and what you want to do with those columns.
>>>
>>> At the risk of suggesting something before we know the full facts, one
>>> possibility is to load the data from flat file into data.table. Use
>>> setkey()
>>> to set your keys. Use tables() to summarise your various tables. Then do
>>> your joins etc all-in-R. data.table has fast ways to do those sorts of
>>> joins (but we need more info about your task).
>>>
>>> Alternatively, you could check out the sqldf website. There is an
>>> sqlread.csv (or similar name) which can read your files directly into 
>>> SQL
>>
>> read.csv.sql
>>
>>> instead of going via R. Gabor has some nice examples there about that 
>>> and
>>> its faster.
>>>
>>> You use some buzzwords which makes me think that SQL may not be
>>> appropriate
>>> for your task though. Can't say for sure (because we don't have enough
>>> information) but its possible you are struggling because SQL has no row
>>> ordering concept built in. That might be why you've created an increment
>>
>> In the SQLite database it automatically assigns a self incrementing
>> hidden column called rowid to each row. e.g. using SQLite via the
>> sqldf package on CRAN and the BOD data frame which is built into R we
>> can display the rowid column explicitly by referring to it in our
>> select statement:
>>
>>> library(sqldf)
>>> BOD
>> Time demand
>> 1 1 8.3
>> 2 2 10.3
>> 3 3 19.0
>> 4 4 16.0
>> 5 5 15.6
>> 6 7 19.8
>>> sqldf("select rowid, * from BOD")
>> rowid Time demand
>> 1 1 1 8.3
>> 2 2 2 10.3
>> 3 3 3 19.0
>> 4 4 4 16.0
>> 5 5 5 15.6
>> 6 6 7 19.8
>>
>>
>>> field? Do your queries include "order by incrementing field"? SQL is not
>>> good at "first" and "last" type logic. An all-in-R solution may well be
>>
>> In SQLite you can get the top 3 values, say, like this (continuing the
>> prior example):
>>
>>> sqldf("select * from BOD order by Time desc limit 3")
>> Time demand
>> 1 7 19.8
>> 2 5 15.6
>> 3 4 16.0
>>
>>> better, since R is very good with ordered vectors. A 1GB data.table (or
>>> data.frame) for example, at 3.5m rows, could have 76 integer columns, or
>>> 38 double columns. 1GB is well within 32bit and allows some space for
>>> working copies, depending on what you want to do with the data. If you
>>> have
>>> 38 or less columns, or you have 64bit, then an all-in-R solution *might*
>>> get your task done quicker, depending on what your real goal is.
>>>
>>> If this sounds plausible, you could post more details and, if its
>>> appropriate, and luck is on your side, someone might even sketch out how
>>> to
>>> do an all-in-R solution.
>>>
>>>
>>> "Nathan S. Watson-Haigh"  wrote in message
>>> news:4b5fde1b.10...@csiro.au...

Re: [R] RMySQL - Bulk loading data and creating FK links

2010-01-28 Thread Gabor Grothendieck
It's only important internally.  Externally it's undesirable that the
user have to get involved in it.  The idea of making software easy to
write and use is to hide the implementation and focus on the problem.
That is why we use high level languages, object orientation, etc.

On Thu, Jan 28, 2010 at 4:37 AM, Matthew Dowle  wrote:
> How it represents data internally is very important,  depending on the real
> goal :
> http://en.wikipedia.org/wiki/Column-oriented_DBMS
>
>
> "Gabor Grothendieck"  wrote in message
> news:971536df1001271710o4ea62333l7f1230b860114...@mail.gmail.com...
> How it represents data internally should not be important as long as
> you can do what you want.  SQL is declarative so you just specify what
> you want rather than how to get it and invisibly to the user it
> automatically draws up a query plan and then uses that plan to get the
> result.
>
> On Wed, Jan 27, 2010 at 12:48 PM, Matthew Dowle 
> wrote:
>>
>>> sqldf("select * from BOD order by Time desc limit 3")
>> Exactly. SQL requires use of order by. It knows the order, but it isn't
>> ordered. Thats not good, but might be fine, depending on what the real
>> goal
>> is.
>>
>>
>> "Gabor Grothendieck"  wrote in message
>> news:971536df1001270629w4795da89vb7d77af6e4e8b...@mail.gmail.com...
>> On Wed, Jan 27, 2010 at 8:56 AM, Matthew Dowle 
>> wrote:
>>> How many columns, and of what type are the columns ? As Olga asked too,
>>> it
>>> would be useful to know more about what you're really trying to do.
>>>
>>> 3.5m rows is not actually that many rows, even for 32bit R. Its depends
>>> on
>>> the columns and what you want to do with those columns.
>>>
>>> At the risk of suggesting something before we know the full facts, one
>>> possibility is to load the data from flat file into data.table. Use
>>> setkey()
>>> to set your keys. Use tables() to summarise your various tables. Then do
>>> your joins etc all-in-R. data.table has fast ways to do those sorts of
>>> joins (but we need more info about your task).
>>>
>>> Alternatively, you could check out the sqldf website. There is an
>>> sqlread.csv (or similar name) which can read your files directly into SQL
>>
>> read.csv.sql
>>
>>> instead of going via R. Gabor has some nice examples there about that and
>>> its faster.
>>>
>>> You use some buzzwords which makes me think that SQL may not be
>>> appropriate
>>> for your task though. Can't say for sure (because we don't have enough
>>> information) but its possible you are struggling because SQL has no row
>>> ordering concept built in. That might be why you've created an increment
>>
>> In the SQLite database it automatically assigns a self incrementing
>> hidden column called rowid to each row. e.g. using SQLite via the
>> sqldf package on CRAN and the BOD data frame which is built into R we
>> can display the rowid column explicitly by referring to it in our
>> select statement:
>>
>>> library(sqldf)
>>> BOD
>> Time demand
>> 1 1 8.3
>> 2 2 10.3
>> 3 3 19.0
>> 4 4 16.0
>> 5 5 15.6
>> 6 7 19.8
>>> sqldf("select rowid, * from BOD")
>> rowid Time demand
>> 1 1 1 8.3
>> 2 2 2 10.3
>> 3 3 3 19.0
>> 4 4 4 16.0
>> 5 5 5 15.6
>> 6 6 7 19.8
>>
>>
>>> field? Do your queries include "order by incrementing field"? SQL is not
>>> good at "first" and "last" type logic. An all-in-R solution may well be
>>
>> In SQLite you can get the top 3 values, say, like this (continuing the
>> prior example):
>>
>>> sqldf("select * from BOD order by Time desc limit 3")
>> Time demand
>> 1 7 19.8
>> 2 5 15.6
>> 3 4 16.0
>>
>>> better, since R is very good with ordered vectors. A 1GB data.table (or
>>> data.frame) for example, at 3.5m rows, could have 76 integer columns, or
>>> 38 double columns. 1GB is well within 32bit and allows some space for
>>> working copies, depending on what you want to do with the data. If you
>>> have
>>> 38 or less columns, or you have 64bit, then an all-in-R solution *might*
>>> get your task done quicker, depending on what your real goal is.
>>>
>>> If this sounds plausible, you could post more details and, if its
>>> appropriate, and luck is on your side, someone might even sketch out how
>>> to
>>> do an all-in-R solution.
>>>
>>>
>>> "Nathan S. Watson-Haigh"  wrote in message
>>> news:4b5fde1b.10...@csiro.au...
I have a table (contact) with several fields and it's PK is an auto
increment field. I'm bulk loading data to this table from files which if
successful will be about 3.5million rows (approx 16000 rows per file).
However, I have a linking table (an_contact) to resolve a m:m
relationship
between the an and contact tables. How can I retrieve the PK's for the
data
bulk loaded into contact so I can insert the relevant data into
an_contact.

 I currently load the data into contact using:
 dbWriteTable(con, "contact", dat, append=TRUE, row.names=FALSE)

 But I then need to get all the PK's which this dbWriteTable() appended
 to
>  the contact table so I can load the data into my an_contact link table.

Re: [R] RMySQL - Bulk loading data and creating FK links

2010-01-28 Thread Matthew Dowle
How it represents data internally is very important,  depending on the real 
goal :
http://en.wikipedia.org/wiki/Column-oriented_DBMS


"Gabor Grothendieck"  wrote in message 
news:971536df1001271710o4ea62333l7f1230b860114...@mail.gmail.com...
How it represents data internally should not be important as long as
you can do what you want.  SQL is declarative so you just specify what
you want rather than how to get it and invisibly to the user it
automatically draws up a query plan and then uses that plan to get the
result.

On Wed, Jan 27, 2010 at 12:48 PM, Matthew Dowle  
wrote:
>
>> sqldf("select * from BOD order by Time desc limit 3")
> Exactly. SQL requires use of order by. It knows the order, but it isn't
> ordered. Thats not good, but might be fine, depending on what the real 
> goal
> is.
>
>
> "Gabor Grothendieck"  wrote in message
> news:971536df1001270629w4795da89vb7d77af6e4e8b...@mail.gmail.com...
> On Wed, Jan 27, 2010 at 8:56 AM, Matthew Dowle 
> wrote:
>> How many columns, and of what type are the columns ? As Olga asked too, 
>> it
>> would be useful to know more about what you're really trying to do.
>>
>> 3.5m rows is not actually that many rows, even for 32bit R. Its depends 
>> on
>> the columns and what you want to do with those columns.
>>
>> At the risk of suggesting something before we know the full facts, one
>> possibility is to load the data from flat file into data.table. Use
>> setkey()
>> to set your keys. Use tables() to summarise your various tables. Then do
>> your joins etc all-in-R. data.table has fast ways to do those sorts of
>> joins (but we need more info about your task).
>>
>> Alternatively, you could check out the sqldf website. There is an
>> sqlread.csv (or similar name) which can read your files directly into SQL
>
> read.csv.sql
>
>> instead of going via R. Gabor has some nice examples there about that and
>> its faster.
>>
>> You use some buzzwords which makes me think that SQL may not be
>> appropriate
>> for your task though. Can't say for sure (because we don't have enough
>> information) but its possible you are struggling because SQL has no row
>> ordering concept built in. That might be why you've created an increment
>
> In the SQLite database it automatically assigns a self incrementing
> hidden column called rowid to each row. e.g. using SQLite via the
> sqldf package on CRAN and the BOD data frame which is built into R we
> can display the rowid column explicitly by referring to it in our
> select statement:
>
>> library(sqldf)
>> BOD
> Time demand
> 1 1 8.3
> 2 2 10.3
> 3 3 19.0
> 4 4 16.0
> 5 5 15.6
> 6 7 19.8
>> sqldf("select rowid, * from BOD")
> rowid Time demand
> 1 1 1 8.3
> 2 2 2 10.3
> 3 3 3 19.0
> 4 4 4 16.0
> 5 5 5 15.6
> 6 6 7 19.8
>
>
>> field? Do your queries include "order by incrementing field"? SQL is not
>> good at "first" and "last" type logic. An all-in-R solution may well be
>
> In SQLite you can get the top 3 values, say, like this (continuing the
> prior example):
>
>> sqldf("select * from BOD order by Time desc limit 3")
> Time demand
> 1 7 19.8
> 2 5 15.6
> 3 4 16.0
>
>> better, since R is very good with ordered vectors. A 1GB data.table (or
>> data.frame) for example, at 3.5m rows, could have 76 integer columns, or
>> 38 double columns. 1GB is well within 32bit and allows some space for
>> working copies, depending on what you want to do with the data. If you
>> have
>> 38 or less columns, or you have 64bit, then an all-in-R solution *might*
>> get your task done quicker, depending on what your real goal is.
>>
>> If this sounds plausible, you could post more details and, if its
>> appropriate, and luck is on your side, someone might even sketch out how
>> to
>> do an all-in-R solution.
>>
>>
>> "Nathan S. Watson-Haigh"  wrote in message
>> news:4b5fde1b.10...@csiro.au...
>>>I have a table (contact) with several fields and it's PK is an auto
>>>increment field. I'm bulk loading data to this table from files which if
>>>successful will be about 3.5million rows (approx 16000 rows per file).
>>>However, I have a linking table (an_contact) to resolve a m:m 
>>>relationship
>>>between the an and contact tables. How can I retrieve the PK's for the
>>>data
>>>bulk loaded into contact so I can insert the relevant data into
>>>an_contact.
>>>
>>> I currently load the data into contact using:
>>> dbWriteTable(con, "contact", dat, append=TRUE, row.names=FALSE)
>>>
>>> But I then need to get all the PK's which this dbWriteTable() appended 
>>> to
>>> the contact table so I can load the data into my an_contact link table. 
>>> I
>>> don't want to issue a separate INSERT query for each row in dat and then
>>> use MySQLs LAST_INSERT_ID() functionnot when I have 3.5million rows
>>> to
>>> insert!
>>>
>>> Any pointers welcome,
>>> Nathan
>>>
>>> --
>>> 
>>> Dr. Nathan S. Watson-Haigh
>>> OCE Post Doctoral Fellow
>>> CSIRO Livestock Industries
>>> University Drive
>>> Townsville, QLD 4810

Re: [R] RMySQL - Bulk loading data and creating FK links

2010-01-27 Thread Gabor Grothendieck
How it represents data internally should not be important as long as
you can do what you want.  SQL is declarative, so you just specify what
you want rather than how to get it; invisibly to the user, it
automatically draws up a query plan and then uses that plan to get the
result.

On Wed, Jan 27, 2010 at 12:48 PM, Matthew Dowle  wrote:
>
>> sqldf("select * from BOD order by Time desc limit 3")
> Exactly. SQL requires use of order by. It knows the order, but it isn't
> ordered. Thats not good, but might be fine, depending on what the real goal
> is.
>
>
> "Gabor Grothendieck"  wrote in message
> news:971536df1001270629w4795da89vb7d77af6e4e8b...@mail.gmail.com...
> On Wed, Jan 27, 2010 at 8:56 AM, Matthew Dowle 
> wrote:
>> How many columns, and of what type are the columns ? As Olga asked too, it
>> would be useful to know more about what you're really trying to do.
>>
>> 3.5m rows is not actually that many rows, even for 32bit R. Its depends on
>> the columns and what you want to do with those columns.
>>
>> At the risk of suggesting something before we know the full facts, one
>> possibility is to load the data from flat file into data.table. Use
>> setkey()
>> to set your keys. Use tables() to summarise your various tables. Then do
>> your joins etc all-in-R. data.table has fast ways to do those sorts of
>> joins (but we need more info about your task).
>>
>> Alternatively, you could check out the sqldf website. There is an
>> sqlread.csv (or similar name) which can read your files directly into SQL
>
> read.csv.sql
>
>> instead of going via R. Gabor has some nice examples there about that and
>> its faster.
>>
>> You use some buzzwords which makes me think that SQL may not be
>> appropriate
>> for your task though. Can't say for sure (because we don't have enough
>> information) but its possible you are struggling because SQL has no row
>> ordering concept built in. That might be why you've created an increment
>
> In the SQLite database it automatically assigns a self incrementing
> hidden column called rowid to each row. e.g. using SQLite via the
> sqldf package on CRAN and the BOD data frame which is built into R we
> can display the rowid column explicitly by referring to it in our
> select statement:
>
>> library(sqldf)
>> BOD
>  Time demand
> 1    1    8.3
> 2    2   10.3
> 3    3   19.0
> 4    4   16.0
> 5    5   15.6
> 6    7   19.8
>> sqldf("select rowid, * from BOD")
>  rowid Time demand
> 1     1    1    8.3
> 2     2    2   10.3
> 3     3    3   19.0
> 4     4    4   16.0
> 5     5    5   15.6
> 6     6    7   19.8
>
>
>> field? Do your queries include "order by incrementing field"? SQL is not
>> good at "first" and "last" type logic. An all-in-R solution may well be
>
> In SQLite you can get the top 3 values, say, like this (continuing the
> prior example):
>
>> sqldf("select * from BOD order by Time desc limit 3")
>  Time demand
> 1    7   19.8
> 2    5   15.6
> 3    4   16.0
>
>> better, since R is very good with ordered vectors. A 1GB data.table (or
>> data.frame) for example, at 3.5m rows, could have 76 integer columns, or
>> 38 double columns. 1GB is well within 32bit and allows some space for
>> working copies, depending on what you want to do with the data. If you
>> have
>> 38 or less columns, or you have 64bit, then an all-in-R solution *might*
>> get your task done quicker, depending on what your real goal is.
>>
>> If this sounds plausible, you could post more details and, if its
>> appropriate, and luck is on your side, someone might even sketch out how
>> to
>> do an all-in-R solution.
>>
>>
>> "Nathan S. Watson-Haigh"  wrote in message
>> news:4b5fde1b.10...@csiro.au...
>>>I have a table (contact) with several fields and it's PK is an auto
>>>increment field. I'm bulk loading data to this table from files which if
>>>successful will be about 3.5million rows (approx 16000 rows per file).
>>>However, I have a linking table (an_contact) to resolve a m:m relationship
>>>between the an and contact tables. How can I retrieve the PK's for the
>>>data
>>>bulk loaded into contact so I can insert the relevant data into
>>>an_contact.
>>>
>>> I currently load the data into contact using:
>>> dbWriteTable(con, "contact", dat, append=TRUE, row.names=FALSE)
>>>
>>> But I then need to get all the PK's which this dbWriteTable() appended to
>>> the contact table so I can load the data into my an_contact link table. I
>>> don't want to issue a separate INSERT query for each row in dat and then
>>> use MySQLs LAST_INSERT_ID() functionnot when I have 3.5million rows
>>> to
>>> insert!
>>>
>>> Any pointers welcome,
>>> Nathan
>>>
>>> --
>>> 
>>> Dr. Nathan S. Watson-Haigh
>>> OCE Post Doctoral Fellow
>>> CSIRO Livestock Industries
>>> University Drive
>>> Townsville, QLD 4810
>>> Australia
>>>
>>> Tel: +61 (0)7 4753 8548
>>> Fax: +61 (0)7 4753 8600
>>> Web: http://www.csiro.au/people/Nathan.Watson-Haigh.html
>>>
>>
>> __

Re: [R] RMySQL - Bulk loading data and creating FK links

2010-01-27 Thread Matthew Dowle

> sqldf("select * from BOD order by Time desc limit 3")
Exactly. SQL requires use of order by. It knows the order, but it isn't 
ordered. That's not good, but might be fine, depending on what the real goal 
is.
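To make the point concrete: because BOD's rows are already stored in Time
order, the same three rows fall out of base R with no sort specification at
all (a small illustrative sketch):

# last three rows of the built-in BOD data frame, which is stored in Time order
tail(BOD, 3)

# reversed, to match the descending order of the SQL result above
tail(BOD, 3)[3:1, ]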


"Gabor Grothendieck"  wrote in message 
news:971536df1001270629w4795da89vb7d77af6e4e8b...@mail.gmail.com...
On Wed, Jan 27, 2010 at 8:56 AM, Matthew Dowle  
wrote:
> How many columns, and of what type are the columns ? As Olga asked too, it
> would be useful to know more about what you're really trying to do.
>
> 3.5m rows is not actually that many rows, even for 32bit R. Its depends on
> the columns and what you want to do with those columns.
>
> At the risk of suggesting something before we know the full facts, one
> possibility is to load the data from flat file into data.table. Use 
> setkey()
> to set your keys. Use tables() to summarise your various tables. Then do
> your joins etc all-in-R. data.table has fast ways to do those sorts of
> joins (but we need more info about your task).
>
> Alternatively, you could check out the sqldf website. There is an
> sqlread.csv (or similar name) which can read your files directly into SQL

read.csv.sql

> instead of going via R. Gabor has some nice examples there about that and
> its faster.
>
> You use some buzzwords which makes me think that SQL may not be 
> appropriate
> for your task though. Can't say for sure (because we don't have enough
> information) but its possible you are struggling because SQL has no row
> ordering concept built in. That might be why you've created an increment

In the SQLite database it automatically assigns a self incrementing
hidden column called rowid to each row. e.g. using SQLite via the
sqldf package on CRAN and the BOD data frame which is built into R we
can display the rowid column explicitly by referring to it in our
select statement:

> library(sqldf)
> BOD
  Time demand
1    1    8.3
2    2   10.3
3    3   19.0
4    4   16.0
5    5   15.6
6    7   19.8
> sqldf("select rowid, * from BOD")
  rowid Time demand
1     1    1    8.3
2     2    2   10.3
3     3    3   19.0
4     4    4   16.0
5     5    5   15.6
6     6    7   19.8


> field? Do your queries include "order by incrementing field"? SQL is not
> good at "first" and "last" type logic. An all-in-R solution may well be

In SQLite you can get the top 3 values, say, like this (continuing the
prior example):

> sqldf("select * from BOD order by Time desc limit 3")
  Time demand
1    7   19.8
2    5   15.6
3    4   16.0

> better, since R is very good with ordered vectors. A 1GB data.table (or
> data.frame) for example, at 3.5m rows, could have 76 integer columns, or
> 38 double columns. 1GB is well within 32bit and allows some space for
> working copies, depending on what you want to do with the data. If you 
> have
> 38 or less columns, or you have 64bit, then an all-in-R solution *might*
> get your task done quicker, depending on what your real goal is.
>
> If this sounds plausible, you could post more details and, if its
> appropriate, and luck is on your side, someone might even sketch out how 
> to
> do an all-in-R solution.
>
>
> "Nathan S. Watson-Haigh"  wrote in message
> news:4b5fde1b.10...@csiro.au...
>>I have a table (contact) with several fields and it's PK is an auto
>>increment field. I'm bulk loading data to this table from files which if
>>successful will be about 3.5million rows (approx 16000 rows per file).
>>However, I have a linking table (an_contact) to resolve a m:m relationship
>>between the an and contact tables. How can I retrieve the PK's for the 
>>data
>>bulk loaded into contact so I can insert the relevant data into 
>>an_contact.
>>
>> I currently load the data into contact using:
>> dbWriteTable(con, "contact", dat, append=TRUE, row.names=FALSE)
>>
>> But I then need to get all the PK's which this dbWriteTable() appended to
>> the contact table so I can load the data into my an_contact link table. I
>> don't want to issue a separate INSERT query for each row in dat and then
>> use MySQLs LAST_INSERT_ID() functionnot when I have 3.5million rows 
>> to
>> insert!
>>
>> Any pointers welcome,
>> Nathan
>>
>> --
>> 
>> Dr. Nathan S. Watson-Haigh
>> OCE Post Doctoral Fellow
>> CSIRO Livestock Industries
>> University Drive
>> Townsville, QLD 4810
>> Australia
>>
>> Tel: +61 (0)7 4753 8548
>> Fax: +61 (0)7 4753 8600
>> Web: http://www.csiro.au/people/Nathan.Watson-Haigh.html
>>
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] RMySQL - Bulk loading data and creating FK links

2010-01-27 Thread Gabor Grothendieck
On Wed, Jan 27, 2010 at 8:56 AM, Matthew Dowle  wrote:
> How many columns, and of what type are the columns ? As Olga asked too, it
> would be useful to know more about what you're really trying to do.
>
> 3.5m rows is not actually that many rows, even for 32bit R.  Its depends on
> the columns and what you want to do with those columns.
>
> At the risk of suggesting something before we know the full facts,  one
> possibility is to load the data from flat file into data.table. Use setkey()
> to set your keys. Use tables() to summarise your various tables. Then do
> your joins etc all-in-R.  data.table has fast ways to do those sorts of
> joins (but we need more info about your task).
>
> Alternatively,  you could check out the sqldf website. There is an
> sqlread.csv (or similar name) which can read your files directly into SQL

read.csv.sql

> instead of going via R. Gabor has some nice examples there about that and
> its faster.
>
> You use some buzzwords which makes me think that SQL may not be appropriate
> for your task though.  Can't say for sure (because we don't have enough
> information) but its possible you are struggling because SQL has no row
> ordering concept built in.  That might be why you've created an increment

In the SQLite database it automatically assigns a self-incrementing
hidden column called rowid to each row. For example, using SQLite via the
sqldf package on CRAN and the BOD data frame which is built into R, we
can display the rowid column explicitly by referring to it in our
select statement:

> library(sqldf)
> BOD
  Time demand
1    1    8.3
2    2   10.3
3    3   19.0
4    4   16.0
5    5   15.6
6    7   19.8
> sqldf("select rowid, * from BOD")
  rowid Time demand
1     1    1    8.3
2     2    2   10.3
3     3    3   19.0
4     4    4   16.0
5     5    5   15.6
6     6    7   19.8


> field?  Do your queries include "order by incrementing field"? SQL is not
> good at "first" and "last" type logic. An all-in-R solution may well be

In SQLite you can get the top 3 values, say, like this (continuing the
prior example):

> sqldf("select * from BOD order by Time desc limit 3")
  Time demand
1    7   19.8
2    5   15.6
3    4   16.0

> better, since R is very good with ordered vectors. A 1GB data.table (or
> data.frame) for example,  at 3.5m rows,  could have 76 integer columns, or
> 38 double columns. 1GB is well within 32bit and allows some space for
> working copies, depending on what you want to do with the data.  If you have
> 38 or less columns,  or you have 64bit,  then an all-in-R solution *might*
> get your task done quicker,  depending on what your real goal is.
>
> If this sounds plausible,  you could post more details and, if its
> appropriate, and luck is on your side, someone might even sketch out how to
> do an all-in-R solution.
>
>
> "Nathan S. Watson-Haigh"  wrote in message
> news:4b5fde1b.10...@csiro.au...
>>I have a table (contact) with several fields and it's PK is an auto
>>increment field. I'm bulk loading data to this table from files which if
>>successful will be about 3.5million rows (approx 16000 rows per file).
>>However, I have a linking table (an_contact) to resolve a m:m relationship
>>between the an and contact tables. How can I retrieve the PK's for the data
>>bulk loaded into contact so I can insert the relevant data into an_contact.
>>
>> I currently load the data into contact using:
>> dbWriteTable(con, "contact", dat, append=TRUE, row.names=FALSE)
>>
>> But I then need to get all the PK's which this dbWriteTable() appended to
>> the contact table so I can load the data into my an_contact link table. I
>> don't want to issue a separate INSERT query for each row in dat and then
>> use MySQLs LAST_INSERT_ID() functionnot when I have 3.5million rows to
>> insert!
>>
>> Any pointers welcome,
>> Nathan
>>
>> --
>> 
>> Dr. Nathan S. Watson-Haigh
>> OCE Post Doctoral Fellow
>> CSIRO Livestock Industries
>> University Drive
>> Townsville, QLD 4810
>> Australia
>>
>> Tel: +61 (0)7 4753 8548
>> Fax: +61 (0)7 4753 8600
>> Web: http://www.csiro.au/people/Nathan.Watson-Haigh.html
>>
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] RMySQL - Bulk loading data and creating FK links

2010-01-27 Thread Matthew Dowle
How many columns, and of what type are the columns ? As Olga asked too, it 
would be useful to know more about what you're really trying to do.

3.5m rows is not actually that many rows, even for 32bit R.  It depends on 
the columns and what you want to do with those columns.

At the risk of suggesting something before we know the full facts,  one 
possibility is to load the data from flat file into data.table. Use setkey() 
to set your keys. Use tables() to summarise your various tables. Then do 
your joins etc all-in-R.  data.table has fast ways to do those sorts of 
joins (but we need more info about your task).
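A hedged sketch of that workflow, as described above. File names, column
names and keys are illustrative assumptions about Nathan's data; the
data.table calls themselves (data.table(), setkey(), tables() and an X[Y]
join) are the ones meant:

library(data.table)

# flat files -> data.tables (file and column names assumed for illustration)
contact    <- data.table(read.csv("contact.csv"))
an         <- data.table(read.csv("an.csv"))
an_contact <- data.table(read.csv("an_contact.csv"))

setkey(contact, contact_id)        # keys enable fast binary-search joins
setkey(an, an_id)
setkey(an_contact, contact_id)

tables()                           # one-line summary of each data.table in memory

# join the link table to contact on the shared key, all in R
linked <- contact[an_contact]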

Alternatively,  you could check out the sqldf website. There is an 
sqlread.csv (or similar name) which can read your files directly into SQL 
instead of going via R. Gabor has some nice examples there about that and 
it's faster.
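For reference, the sqldf function Gabor corrects this to later in the thread
is read.csv.sql; a hedged example (the file name and the filter are
illustrative assumptions):

library(sqldf)

# reads the csv straight into SQLite and returns only the selected rows;
# the file is referred to as 'file' in the sql argument
dat <- read.csv.sql("contact.csv",
                    sql = "select * from file where type = 'primary'")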

You use some buzzwords which make me think that SQL may not be appropriate 
for your task though.  Can't say for sure (because we don't have enough 
information) but it's possible you are struggling because SQL has no row 
ordering concept built in.  That might be why you've created an increment 
field?  Do your queries include "order by incrementing field"? SQL is not 
good at "first" and "last" type logic. An all-in-R solution may well be 
better, since R is very good with ordered vectors. A 1GB data.table (or 
data.frame) for example,  at 3.5m rows,  could have 76 integer columns, or 
38 double columns. 1GB is well within 32bit and allows some space for 
working copies, depending on what you want to do with the data.  If you have 
38 or fewer columns,  or you have 64bit,  then an all-in-R solution *might* 
get your task done quicker,  depending on what your real goal is.
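(A quick check of those figures, counting 4 bytes per integer and 8 per
double and ignoring vector overhead:)

3.5e6 * 76 * 4 / 2^30   # ~1 GB for 76 integer columns at 3.5m rows
3.5e6 * 38 * 8 / 2^30   # ~1 GB for 38 double columns at 3.5m rows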

If this sounds plausible,  you could post more details and, if it's 
appropriate, and luck is on your side, someone might even sketch out how to 
do an all-in-R solution.


"Nathan S. Watson-Haigh"  wrote in message 
news:4b5fde1b.10...@csiro.au...
>I have a table (contact) with several fields and its PK is an auto 
>increment field. I'm bulk loading data to this table from files which, if 
>successful, will be about 3.5 million rows (approx 16000 rows per file). 
>However, I have a linking table (an_contact) to resolve a m:m relationship 
>between the an and contact tables. How can I retrieve the PK's for the data 
>bulk loaded into contact so I can insert the relevant data into an_contact.
>
> I currently load the data into contact using:
> dbWriteTable(con, "contact", dat, append=TRUE, row.names=FALSE)
>
> But I then need to get all the PK's which this dbWriteTable() appended to 
> the contact table so I can load the data into my an_contact link table. I 
> don't want to issue a separate INSERT query for each row in dat and then 
> use MySQL's LAST_INSERT_ID() function... not when I have 3.5 million rows 
> insert!
>
> Any pointers welcome,
> Nathan
>
> -- 
> 
> Dr. Nathan S. Watson-Haigh
> OCE Post Doctoral Fellow
> CSIRO Livestock Industries
> University Drive
> Townsville, QLD 4810
> Australia
>
> Tel: +61 (0)7 4753 8548
> Fax: +61 (0)7 4753 8600
> Web: http://www.csiro.au/people/Nathan.Watson-Haigh.html
>

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] RMySQL - Bulk loading data and creating FK links

2010-01-27 Thread Olga Lyashevska

Hi Nathan,

> I have a table (contact) with several fields and its PK is an auto
> increment field. I'm bulk loading data to this table from files
> which, if successful, will be about 3.5 million rows (approx 16000 rows
> per file). However, I have a linking table (an_contact) to resolve a
> m:m relationship between the an and contact tables. How can I
> retrieve the PK's for the data bulk loaded into contact so I can
> insert the relevant data into an_contact.
>
> I currently load the data into contact using:
> dbWriteTable(con, "contact", dat, append=TRUE, row.names=FALSE)
>
> But I then need to get all the PK's which this dbWriteTable()
> appended to the contact table so I can load the data into my
> an_contact link table. I don't want to issue a separate INSERT query
> for each row in dat and then use MySQL's LAST_INSERT_ID()
> function... not when I have 3.5 million rows to insert!
>
> Any pointers welcome,
> Nathan

It looks to me more like an SQL question. Why don't you use SQL to write
a query which will join all the tables for you, and then use RMySQL to
retrieve the relevant data?

Of course, you could also use dbGetQuery.
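One possible shape for that, sketched under the assumption that contact has
a column (here called contact_name) that is unique and also present in dat,
so the generated keys can be fetched back in a single query rather than
3.5 million LAST_INSERT_ID() round trips. All table and column names beyond
those already in the thread are illustrative assumptions about the schema:

library(RMySQL)

con <- dbConnect(MySQL(), dbname = "mydb")   # connection details assumed

# bulk append, as already done
dbWriteTable(con, "contact", dat, append = TRUE, row.names = FALSE)

# fetch the auto-increment PKs back in one query and match them to dat
keys <- dbGetQuery(con, "SELECT contact_id, contact_name FROM contact")
dat_keyed <- merge(dat, keys, by = "contact_name")

# build and bulk load the link table
an_contact <- data.frame(an_id      = dat_keyed$an_id,
                         contact_id = dat_keyed$contact_id)
dbWriteTable(con, "an_contact", an_contact, append = TRUE, row.names = FALSE)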

Can you please explain a bit what your entity-relationship diagram looks
like?



Cheers,
Olga

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.