The RSQLite package can read files into an SQLite database without the data going through R. The sqldf package provides a front end that makes this particularly easy to use; basically you need only a couple of lines of code. Other databases have similar facilities. See:
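A rough illustration of the "couple of lines" mentioned above, as a minimal sketch rather than a definitive recipe. It assumes the sqldf package (which pulls in RSQLite) is installed and uses its read.csv.sql() function. The file, its columns (id, grp, x) and the filter are hypothetical and exist only to make the example self-contained.

```r
library(sqldf)  # loads RSQLite as a dependency

# Hypothetical input file, created here only so the sketch runs on its own.
# In practice csv_path would point at the large file already on disk.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id  = 1:6,
             grp = rep(c("a", "b"), 3),
             x   = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5)),
  csv_path, row.names = FALSE, quote = FALSE
)

# read.csv.sql() loads the file into a temporary SQLite database, runs the
# SQL statement there, and returns only the query result to R as a data
# frame. Inside the SQL, the table name `file` refers to the file being read.
big_x <- read.csv.sql(csv_path, sql = "select * from file where x > 3")

print(big_x)
```

Because the filtering happens inside SQLite, only the rows selected by the query have to fit in R's memory, not the whole file.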
http://sqldf.googlecode.com

On Thu, Aug 21, 2008 at 2:32 PM, Avram Aelony <[EMAIL PROTECTED]> wrote:
>
> Dear R community,
>
> I find R fantastic and use R whenever I can for my data analytic needs.
> Certain data sets, however, are so large that other tools seem to be needed
> to pre-process data such that it can be brought into R for further analysis.
>
> Questions I have for the many expert contributors on this list are:
>
> 1. How do others handle situations of large data sets (gigabytes, terabytes)
> for analysis in R ?
>
> 2. Are there existing ways or plans to devise ways to use the R language to
> interact with Hadoop or PIG ? The Hadoop project by Apache has been
> successful at processing data on a large scale using the map-reduce
> algorithm. A sister project uses an emerging language called "PIG-latin" or
> simply "PIG" for using the Hadoop framework in a manner reminiscent of the
> look and feel of R. Is there an opportunity here to create a conceptual
> bridge since these projects are also open-source? Does it already exist?
>
> Thanks in advance for your comments.
>
> -Avram
>
> ---------------------------
> Information about Hadoop:
> http://wiki.apache.org/hadoop/
> http://en.wikipedia.org/wiki/Hadoop
>
> "Apache Hadoop is a free Java software framework that supports data
> intensive distributed applications running on large clusters of commodity
> computers.[1] It enables applications to work with thousands of nodes and
> petabytes of data. Hadoop was inspired by Google's MapReduce and Google File
> System (GFS) papers."
>
> ---------------------------
> Information about PIG:
>
> http://incubator.apache.org/pig/
>
> "Pig is a platform for analyzing large data sets that consists of a
> high-level language for expressing data analysis programs, coupled with
> infrastructure for evaluating these programs. The salient property of Pig
> programs is that their structure is amenable to substantial parallelization,
> which in turns enables them to handle very large data sets.
> At the present time, Pig's infrastructure layer consists of a compiler that
> produces sequences of Map-Reduce programs, for which large-scale parallel
> implementations already exist (e.g., the Hadoop subproject). Pig's language
> layer currently consists of a textual language called Pig Latin, which has
> the following key properties:
>
> * Ease of programming. It is trivial to achieve parallel execution of
> simple, "embarrassingly parallel" data analysis tasks. Complex tasks
> comprised of multiple interrelated data transformations are explicitly
> encoded as data flow sequences, making them easy to write, understand, and
> maintain.
> * Optimization opportunities. The way in which tasks are encoded permits the
> system to optimize their execution automatically, allowing the user to focus
> on semantics rather than efficiency.
> * Extensibility. Users can create their own functions to do special-purpose
> processing."
>
> ---------------------------
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.