On Mon, Nov 14, 2011 at 12:50 PM, Simon Slavin <slav...@bigfraud.org> wrote:
>
> On 14 Nov 2011, at 5:11pm, Gabor Grothendieck wrote:
>
>> The requirement for a large number of columns is actually one thing
>> that is often needed when using sqlite from R.  Typically the use case
>> is that a user wishes to read a portion of an external file into R and
>> that file has thousands of columns.  For example, each row might be an
>> individual and each column is a gene.
>
> That would be a text file, right ?  So SQLite isn't involved in that.  You 
> have other routines to read text files.
>

Yes. It's text.  Also, this is not application specific.  It's a general
facility that everyone uses, so writing application-specific routines
is completely out of the question here.

>> Or each row is a time point and
>> each column is a security (stock, bond, etc.)  The file may too large
>> to reasonably handle in memory so rather than deal with it in chunks
>> at a time its easier to just read it into sqlite in its entirety and
>> then pick off the portion you want into R using sql.
>
> So I actually have R on my Mac and I went and had a look.  You're talking 
> about using RSQLite ?  Yes, there are fast ways to move data between a SQLite 
> table and an R matrix.  I can see the appeal.  But you can also execute 
> arbitrary SQL commands.  So you can write your own import/export routine 
> which takes a very wide matrix from R but stores it in a less wide table in 
> SQLite.
>

What users want is to get access to their data with as little hassle
as possible, and currently it's possible to do it all in one line of R
code.  That one line sets up an sqlite database and table, reads the
data into it, applies a given sql statement to it and finally destroys
the database.  It's trivial to do.  The only limitation is that the
file can have no more than 999 columns as it's currently set up.  In
most cases that works, but some people have wider files and there are
constantly requests to increase the limit.  All you do is provide the
filename, certain parameters such as the input field separator, and
optionally the sql statement (which defaults to select * from file).
It's very easy from the user's point of view; it's just one line of
code.
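
To make that concrete, here is roughly what such a call looks like.
(This is only a sketch: it assumes the sqldf package and its
read.csv.sql function, and the file name, separator and sql are purely
illustrative.)

    library(sqldf)
    # the whole tab-separated file goes into a temporary sqlite
    # database behind the scenes; only the result of the sql
    # statement comes back to R as a data frame
    DF <- read.csv.sql("genes.txt", sep = "\t",
                       sql = "select * from file")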

> What I think you're trying to do is use a SQLite table as a data frame.  If I 
> understand correctly, this means you can use the commands you'd normally use 
> with an R matrix, but with data still stored in a SQLite table, without 
> having to rewrite the code of your program.  It's a really neat hack.
>
> Trouble is, SQLite isn't efficient with such wide tables.  Your code is going 
> to operate really slowly.  I'm going to have to get backup from SQLite 
> experts here, but I understand that unlike how matrices are stored in R, a 
> table's columns are stored in a structure like a linked list.  So if you try 
> to access the 1,400th column it has to walk a list of 1,399 items to find it. 
>  Which isn't efficient.
>

I am not aware of any performance tests on very wide files with sqlite
followed by moving them into R, but with the usual files of just a
handful of columns it is sufficiently fast -- so fast that at times it
can be faster to read a file into sqlite and then from there into R
than to read the file straight into R (in those cases where both are
possible).  We will see what happens when it gets expanded beyond 999
columns.  A previous thread on this list suggested that there was no
real downside to expanding the limit.  I asked for clarification at
the time but no one responded.
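
As an informal check on any particular file, the two approaches can
simply be timed side by side (again assuming sqldf; the file name is
made up and the outcome will depend on the file and machine):

    # read the file straight into R
    system.time(DF1 <- read.csv("big.csv"))
    # read it via a temporary sqlite database instead
    system.time(DF2 <- read.csv.sql("big.csv"))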

> So fine.  Use SQLite to store tables as wide as you like.  But write your own 
> import/export commands to fetch appropriate parts into memory.  There's no 
> need to use a SQLite table 2000 columns wide just because your matrix is 2000 
> columns wide.  Purely a recommendation for serious software intended for 
> proper use.  Do anything you like in quick hacks: CPU time and memory usage 
> can be stupid big for those.
>

I am not clear on what you are suggesting, but the way it works is
that the file gets read in its entirety into an sqlite database, an
sql statement specified by the user is applied to that, and only the
output of the sql statement ever gets sent to R.  So even if the input
has thousands of columns, the data sent from sqlite to R might not.
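
Spelled out with RSQLite, the pipeline amounts to something like the
sketch below (it assumes dbWriteTable's file-import form and uses
made-up file and column names; the one-liner just wraps these steps
and cleans up afterwards):

    library(RSQLite)
    con <- dbConnect(SQLite(), dbname = tempfile())
    # bulk-load the entire wide file into an sqlite table
    dbWriteTable(con, "file", "securities.csv", sep = ",", header = TRUE)
    # only the columns and rows named in the sql ever reach R
    DF <- dbGetQuery(con, "select date, ibm, msft from file
                           where date >= '2010-01-01'")
    dbDisconnect(con)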

Hope that clarifies the situation.

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com