Re: [R] Large data and space use

2021-11-28 Thread Avi Gross via R-help
Richard,

 

I currently have no problem with running out of memory. I was referring to 
people who have said they use LARGE structures, and I am pointing out how those 
structures can temporarily get much larger than expected. Functions whose memory 
use temporarily balloons might come with a note saying so. And, yes, some 
transformations may well be doable outside R or in chunks. What gets me is how 
often users have no idea what happens when they invoke a package.

 

I am not against transformations and needed duplications. I am more interested 
in whether some existing code might be evaluated and updated in fairly harmless 
ways, such as removing objects as soon as they are definitely not needed. Of 
course there are tradeoffs. I have seen cases where only one column of a data.frame 
was needed, yet the entire data.frame was copied and then returned. That is OK, 
but clearly it might be more economical to ask for just the single column to be 
changed in place. People often use a sledgehammer when a thumbtack will do.
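
To illustrate roughly what I mean, here is a sketch (the column name "when" is 
made up, and how much actually gets duplicated depends on R's copy-on-modify details):

# A helper that takes and returns the whole data.frame:
convert_frame <- function(df) {
  df$when <- as.Date(df$when)
  df                        # the whole frame comes back though only one column changed
}
big <- convert_frame(big)
# versus asking only for the single column at the call site:
big$when <- as.Date(big$when)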

 

But as noted, R has features that often delay copying, so a full copy is not made 
and less memory is ever used. And people seem to think that since all 
“local” memory is generally returned when the function ends, there is no reason 
to micromanage it while the function runs.
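
A quick base-R way to watch when a copy really happens is tracemem(); a rough 
sketch (needs an R build with memory profiling enabled, and exact copy counts 
can vary by version and code path):

x <- data.frame(a = runif(1e6), b = runif(1e6))
tracemem(x)            # prints a message whenever this object is duplicated
y <- x                 # no message yet: x and y still share the same memory
y$a[1] <- 0            # modifying y finally triggers the copy ("copy on modify")
untracemem(x)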

 

Arguably, some R packages differ in what is kept and for how long. Standard R 
lets you specify which rows and which columns of a data.frame to keep in a single 
expression, as in df[rows, columns], while something like dplyr offers multiple 
smaller steps in a grammar of sorts, so you do something like a select() followed 
(often in a pipeline) by a filter(), or in the opposite order. Programmers 
sometimes express each change as a separate minimal step, which can make a more 
efficient combined implementation harder, since each step does just one thing 
well. That may also be a plus, especially if pipelined objects are released as 
the pipeline progresses and not all at its end.
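
Roughly the contrast I have in mind, as a sketch (column names are made up, and 
whether a pipeline ever fully materialises its intermediate result depends on the 
implementation):

# base R: rows and columns kept in one subsetting expression
kept <- df[df$age > 30, c("age", "score")]

# dplyr: the same selection as two smaller steps in a pipeline
library(dplyr)
kept <- df %>%
  filter(age > 30) %>%
  select(age, score)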

 

From: Richard O'Keefe  
Sent: Sunday, November 28, 2021 3:54 AM
To: Avi Gross 
Cc: R-help Mailing List 
Subject: Re: [R] Large data and space use

 

If you have enough data that running out of memory is a serious problem,
then a language like R or Python or Octave or Matlab that offers you NO
control over storage may not be the best choice.  You might need to
consider Julia or even Rust.

However, if you have enough data that running out of memory is a serious
problem, your problems may be worse than you think.  In 2021, Linux is
*still* having OOM Killer problems.
https://haydenjames.io/how-to-diagnose-oom-errors-on-linux-systems/
Your process hogging memory may cause some other process to be killed.
Even if that doesn't happen, your process may be simply thrown off the
machine without being warned.

It may be one of the biggest problems around in statistical computing:
how to make it straightforward to carve up a problem so that it can be
run on many machines.  R has the 'Rmpi' and 'snow' packages, amongst others.
https://CRAN.R-project.org/view=HighPerformanceComputing

Another approach is to select and transform data outside R.  If you have
data in some kind of data base then doing select and transform in the
data base may be a good approach.

On Sun, 28 Nov 2021 at 06:57, Avi Gross via R-help <r-help@r-project.org> wrote:

Several recent questions and answers have made me look at some code and I
realized that some functions may not be great to use when you are dealing
with very large amounts of data that may already be getting close to limits
of your memory. Does the function you call to do one thing to your object
perhaps overdo it and make multiple copies and not delete them as soon as
they are not needed?



An example was a recent post suggesting a nice set of tools you can use to
convert your data.frame so the columns are integers or dates no matter how
they were read in from a CSV file or created.



What I noticed is that often copies of a sort were made by trying to change
the original say to one date format or another and then deciding which, if
any to keep. Sometimes multiple transformations are tried and this may be
done repeatedly with intermediates left lying around. Yes, the memory will
all be implicitly returned when the function completes. But often these
functions invoke yet other functions which work on their copies. You can end
up with your original data temporarily using multiple times as much actual
memory.



R does have features so some things are "shared" unless one copy or another
changes. But in the cases I am looking at, changes are the whole idea.



What I wonder is whether such functions should clearly call an rm() or the
equivalent as soon as possible when something is no longer needed.



The various kinds of pipelines are another case in point as they involve all
kinds of hidden temporary variables that eventually need to be cleaned up.
When are they 

Re: [R] Large data and space use

2021-11-28 Thread Richard O'Keefe
If you have enough data that running out of memory is a serious problem,
then a language like R or Python or Octave or Matlab that offers you NO
control over storage may not be the best choice.  You might need to
consider Julia or even Rust.

However, if you have enough data that running out of memory is a serious
problem, your problems may be worse than you think.  In 2021, Linux is
*still* having OOM Killer problems.
https://haydenjames.io/how-to-diagnose-oom-errors-on-linux-systems/
Your process hogging memory may cause some other process to be killed.
Even if that doesn't happen, your process may be simply thrown off the
machine without being warned.

It may be one of the biggest problems around in statistical computing:
how to make it straightforward to carve up a problem so that it can be
run on many machines.  R has the 'Rmpi' and 'snow' packages, amongst others.
https://CRAN.R-project.org/view=HighPerformanceComputing
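
For distributing work over local cores, the base 'parallel' package (which
absorbed the snow interface) is often enough; a minimal sketch with made-up work:

library(parallel)
cl <- makeCluster(4)                                 # four local worker processes
idx <- split(seq_len(1e6), rep(1:4, length.out = 1e6))
partial <- parLapply(cl, idx, function(i) sum(sqrt(i)))
stopCluster(cl)
total <- Reduce(`+`, partial)                        # combine the per-worker results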

Another approach is to select and transform data outside R.  If you have
data in some kind of data base then doing select and transform in the
data base may be a good approach.


On Sun, 28 Nov 2021 at 06:57, Avi Gross via R-help 
wrote:

> Several recent questions and answers have made me look at some code and I
> realized that some functions may not be great to use when you are dealing
> with very large amounts of data that may already be getting close to limits
> of your memory. Does the function you call to do one thing to your object
> perhaps overdo it and make multiple copies and not delete them as soon as
> they are not needed?
>
>
>
> An example was a recent post suggesting a nice set of tools you can use to
> convert your data.frame so the columns are integers or dates no matter how
> they were read in from a CSV file or created.
>
>
>
> What I noticed is that often copies of a sort were made by trying to change
> the original say to one date format or another and then deciding which, if
> any to keep. Sometimes multiple transformations are tried and this may be
> done repeatedly with intermediates left lying around. Yes, the memory will
> all be implicitly returned when the function completes. But often these
> functions invoke yet other functions which work on their copies. You can end
> up with your original data temporarily using multiple times as much actual
> memory.
>
>
>
> R does have features so some things are "shared" unless one copy or another
> changes. But in the cases I am looking at, changes are the whole idea.
>
>
>
> What I wonder is whether such functions should clearly call an rm() or the
> equivalent as soon as possible when something is no longer needed.
>
>
>
> The various kinds of pipelines are another case in point as they involve
> all
> kinds of hidden temporary variables that eventually need to be cleaned up.
> When are they removed? I have seen pipelines with 10 or more steps as
> perhaps data is read in, has rows removed or columns removed or re-ordered
> and grouping applied and merged with others and reports generated. The
> intermediates are often of similar sizes with the data and if large, can
> add
> up. If writing the code linearly using temp1 and temp2 type of variables to
> hold the output of one stage and the input of the next stage, I would be
> tempted to add a rm(temp1) as soon as it was finished being used, or just
> reuse the same name of temp1 so the previous contents are no longer being
> pointed to and can be taken by the garbage collector at some time.
>
>
>
> So I wonder if some functions should have a note in their manual pages
> specifying what may happen to the volume of data as they run. An example
> would be if I had a function that took a matrix and simply squared it using
> matrix multiplication. There are various ways to do this and one of them
> simply makes a copy and invokes the built-in way in R that multiplies two
> matrices. It then returns the result. So you end up storing basically three
> times the size  of the matrix right before you return it. Other methods
> might do the actual multiplication in loops operating on subsections of the
> matrix and if done carefully, never keep more than say 2.1 times as much
> data around.
>
>
>
> Or is this not important often enough? All I know, is data may be getting
> larger much faster than memory in our machines gets larger.
>
>
>
>
>
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.h

Re: [R] Large data and space use

2021-11-27 Thread Jeff Newmiller
First priority is to obtain a correct answer. Second priority is to document it 
and write tests for it. Third priority is to optimize it. Sometimes it is 
useful to keep intermediate values around to support supplemental calculations 
a la "summary", which may explain why rm is not used where you might think it 
should be. But often the optimization step is simply neglected.

On November 27, 2021 9:56:50 AM PST, Avi Gross via R-help 
 wrote:
>Several recent questions and answers have made me look at some code and I
>realized that some functions may not be great to use when you are dealing
>with very large amounts of data that may already be getting close to limits
>of your memory. Does the function you call to do one thing to your object
>perhaps overdo it and make multiple copies and not delete them as soon as
>they are not needed?
>
> 
>
>An example was a recent post suggesting a nice set of tools you can use to
>convert your data.frame so the columns are integers or dates no matter how
>they were read in from a CSV file or created.
>
> 
>
>What I noticed is that often copies of a sort were made by trying to change
>the original say to one date format or another and then deciding which, if
>any to keep. Sometimes multiple transformations are tried and this may be
>done repeatedly with intermediates left lying around. Yes, the memory will
>all be implicitly returned when the function completes. But often these
>functions invoke yet other functions which work on their copies. You can end
>up with your original data temporarily using multiple times as much actual
>memory.
>
> 
>
>R does have features so some things are "shared" unless one copy or another
>changes. But in the cases I am looking at, changes are the whole idea.
>
> 
>
>What I wonder is whether such functions should clearly call an rm() or the
>equivalent as soon as possible when something is no longer needed.
>
> 
>
>The various kinds of pipelines are another case in point as they involve all
>kinds of hidden temporary variables that eventually need to be cleaned up.
>When are they removed? I have seen pipelines with 10 or more steps as
>perhaps data is read in, has rows removed or columns removed or re-ordered
>and grouping applied and merged with others and reports generated. The
>intermediates are often of similar sizes with the data and if large, can add
>up. If writing the code linearly using temp1 and temp2 type of variables to
>hold the output of one stage and the input of the next stage, I would be
>tempted to add a rm(temp1) as soon as it was finished being used, or just
>reuse the same name of temp1 so the previous contents are no longer being
>pointed to and can be taken by the garbage collector at some time.
>
> 
>
>So I wonder if some functions should have a note in their manual pages
>specifying what may happen to the volume of data as they run. An example
>would be if I had a function that took a matrix and simply squared it using
>matrix multiplication. There are various ways to do this and one of them
>simply makes a copy and invokes the built-in way in R that multiplies two
>matrices. It then returns the result. So you end up storing basically three
>times the size  of the matrix right before you return it. Other methods
>might do the actual multiplication in loops operating on subsections of the
>matrix and if done carefully, never keep more than say 2.1 times as much
>data around. 
>
> 
>
>Or is this not important often enough? All I know, is data may be getting
>larger much faster than memory in our machines gets larger.
>
> 
>
> 
>
>
>   [[alternative HTML version deleted]]
>
>__
>R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

-- 
Sent from my phone. Please excuse my brevity.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Large data set

2012-07-23 Thread arun
HI,

You can try dbLoad() from hash package.  Not sure whether it will be successful.

A.K.



- Original Message -
From: Lorcan Treanor 
To: r-help@r-project.org
Cc: 
Sent: Monday, July 23, 2012 8:02 AM
Subject: [R] Large data set

Hi all,

Have a problem. Trying to read in a data set that has about 112,000,000
rows and 8 columns and obviously enough it was too big for R to handle. The
columns are made up of 2 integer columns and 6 logical columns. The text
file is about 4.2 Gb in size. Also I have 4 Gb of RAM and 218 Gb of
available space on the hard drive. I tried the dumpDF function but it was
too big. Also tried bringing in the data in 10 sets of about 12,000,000. Are
there other ways of getting around the size of the data.

Regards,

Lorcan

    [[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Large data set

2012-07-23 Thread jim holtman
First of all, try to determine the largest file you can read with an
empty workspace.  Once you have done that, break up your file
into sets of that size and read them in.  The next question is what do
you want to do with 112M rows of data.  Can you process them a set at a
time and then aggregate the results?  I have no problem reading in
files with 10M rows on a 32-bit version of R on Windows with 3GB of
memory.

So a little more information on "what is the problem you are trying to
solve" would be useful.

On Mon, Jul 23, 2012 at 8:02 AM, Lorcan Treanor
 wrote:
> Hi all,
>
> Have a problem. Trying to read in a data set that has about 112,000,000
> rows and 8 columns and obviously enough it was too big for R to handle. The
> columns are made up of 2 integer columns and 6 logical columns. The text
> file is about 4.2 Gb in size. Also I have 4 Gb of RAM and 218 Gb of
> available space on the hard drive. I tried the dumpDF function but it was
> too big. Also tried bringing in the data in 10 sets of about 12,000,000. Are
> there other ways of getting around the size of the data.
>
> Regards,
>
> Lorcan
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] large data set (matrix) using image()

2011-12-29 Thread Uwe Ligges
Works perfectly well with R-2.14.1 32-bit on a Windows device. Since you 
have not followed the posting guide and forgot to give details about 
your platform, there is not much we can do.


Uwe Ligges



On 22.12.2011 23:08, Karen Liu wrote:


When I use the image() function for a relatively small matrix it works perfectly, 
e.g.:

x <- 1:100
z <- matrix(rnorm(10^4), 10^2, 10^2)
image(x=x, y=x, z=z, col=rainbow(3))

but when I want to plot a larger matrix, it doesn't really work. Most of the time 
it just plots a few intermittent points:

x <- 1:1000
z <- matrix(rnorm(10^6), 10^3, 10^3)
image(x=x, y=x, z=z, col=rainbow(3))

Generating the matrix didn't seem to be a problem. I would appreciate any 
thoughts and ideas.
I have tried using heatmap in bioconductor. However, I want to substitute the 
dendrograms with axes, but when I suppressed the dendrogram, I can't successfully 
add any axis. If anyone knows heatmap() well and would like to help via this 
function, it would work also.
Cheers!
Karen Liu
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Large Data

2010-06-14 Thread Joris Meys
http://www.google.com/#hl=en&source=hp&q=R+big+data+sets&aq=f&aqi=g1&aql=&oq=&gs_rfai=&fp=686584f57664

Cheers
Joris

On Mon, Jun 14, 2010 at 12:07 PM, Meenakshi
 wrote:
>
> HI,
>
> I want to import 1.5G CSV file in R.
> But the following error comes:
>
> 'Victor allocation 12.4 size'
>
> How to read the large CSV file in R .
>
> Any one can help me?
>
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/Large-Data-tp2254130p2254130.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

tel : +32 9 264 59 87
joris.m...@ugent.be
---
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Large Data

2010-06-14 Thread Joris Meys
And this one is only from last week. Please, read the posting guides carefully.

 Cheers
Joris


-- Forwarded message --
From: Joris Meys 
Date: Sat, Jun 5, 2010 at 11:04 PM
Subject: Re: [R] What is the largest in memory data object you've
worked with in R?
To: Nathan Stephens 
Cc: r-help 


You have to take some things into account :
- the maximum memory set for R might not be the maximum memory available
- R needs the memory not only for the dataset. Matrix manipulations
frequently require double the amount of memory taken by the dataset.
- memory allocation is important when dealing with large datasets.
There is plenty of information about that
- R has some packages to get around memory problems with big datasets.

Read this discussion for example:
http://tolstoy.newcastle.edu.au/R/help/05/05/4507.html

and this page of Matthew Keller is a good summary too :
http://www.matthewckeller.com/html/memory.html

Cheers
Joris

On Sat, Jun 5, 2010 at 12:32 AM, Nathan Stephens  wrote:
> For me, I've found that I can easily work with 1 GB datasets.  This includes
> linear models and aggregations.  Working with 5 GB becomes cumbersome.
> Anything over that, and R croaks.  I'm using a dual quad core Dell with 48
> GB of RAM.
>
> I'm wondering if there is anyone out there running jobs in the 100 GB
> range.  If so, what does your hardware look like?
>
> --Nathan
>
>[[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



--
Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

tel : +32 9 264 59 87
joris.m...@ugent.be
---
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php



On Mon, Jun 14, 2010 at 12:07 PM, Meenakshi
 wrote:
>
> HI,
>
> I want to import 1.5G CSV file in R.
> But the following error comes:
>
> 'Victor allocation 12.4 size'
>
> How to read the large CSV file in R .
>
> Any one can help me?
>
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/Large-Data-tp2254130p2254130.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

tel : +32 9 264 59 87
joris.m...@ugent.be
---
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Large data set in R

2009-03-02 Thread Hardi
Thanks Kjetil. This is exactly what I wanted.

 Hardi






From: Kjetil Halvorsen 

Cc: r-help 
Sent: Monday, March 2, 2009 9:45:43 PM
Subject: Re: [R] Large data set in R

install.packages("biglm", dep=TRUE)
library(help=biglm)

kjetil





Hello,

I'm trying to use R statistical packages to do ANOVA analysis using aov() and 
lm().
I'm having a problem when I have a large data set for input data from Full 
Factorial Design Experiment with replications.
R seems to store everything in the memory and it fails when memory is not 
enough to hold the massive computation.

Have anyone successfully used R to do such analysis before? Are there any work 
around on this problem?

Thanks,

Hardi

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


  
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Large data set in R

2009-03-02 Thread Kjetil Halvorsen
install.packages("biglm", dep=TRUE)
library(help=biglm)
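
For instance, a rough sketch of fitting a linear model in pieces with biglm
(the chunk data frames and the formula are made up):

library(biglm)
fit <- biglm(y ~ x1 + x2, data = chunk1)   # first chunk of rows
fit <- update(fit, chunk2)                 # feed in further chunks as they are read
summary(fit)
# note: as I recall, with factor terms every level must appear in the first chunk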

kjetil

On Mon, Mar 2, 2009 at 7:06 AM, Hardi  wrote:

>
> Hello,
>
> I'm trying to use R statistical packages to do ANOVA analysis using aov()
> and lm().
> I'm having a problem when I have a large data set for input data from Full
> Factorial Design Experiment with replications.
> R seems to store everything in the memory and it fails when memory is not
> enough to hold the massive computation.
>
> Have anyone successfully used R to do such analysis before? Are there any
> work around on this problem?
>
> Thanks,
>
> Hardi
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Large data sets with R (binding to hadoop available?)

2008-08-29 Thread Avram Aelony


Hi Martin,

Sorry for the late reply.  I realize this might now be straying too
far from r-help; if there is a better forum for this topic (R use
with Hadoop) please let me know.


I agree it would indeed be great to leverage Hadoop via R syntax or R  
itself.  A first step is figuring out how computations could be  
translated into map and reduce steps.  I am beginning to see efforts  
in this direction:


http://ml-site.grantingersoll.com/index.php?title=Incubator_proposal
http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf

http://cwiki.apache.org/MAHOUT/

Per Wikipedia, "A mahout is a person who drives an elephant".  It  
would be nice if PIG and R either played well together or adopted  
each other's strengths (in driving the Hadoop elephant)!



Avram






On Aug 22, 2008, at 9:24 AM, Martin Morgan wrote:


Hi Avram --

My understanding is that Google-like map / reduce achieves throughput
by coordinating distributed calculation with distributed data.

snow, Rmpi, nws, etc provide a way of distributing calculations, but
don't help with coordinating distributed calculation with distributed
data.

SQL (at least naively implemented as a single database server) doesn't
help with distributed data and the overhead of data movement from the
server to compute nodes might be devastating. A shared file system
across compute nodes (the implicit approach usually taken parallel R
applications) offloads data distribution to the file system, which may
be effective for not-too-large (10's of GB?) data.

Many non-trivial R algorithms are not directly usable in distributed
map, because they expect to operate on 'all of the data' rather than
on data chunks. Out-of-the-box 'reduce' in R is limited really to
collation (the parallel lapply-like functions) or sapply-like
simplification; one would rather have more talented reducers (e.g., to
aggregate bootstrap results).

The list of talents required to exploit Hadoop starts to become
intimidating (R, Java, Hadoop, PIG, + cluster management, etc), so it
would certainly be useful to have that encapsulated in a way that
requires only R skills!

Martin

<[EMAIL PROTECTED]> writes:


Hi

Apart from database interfaces such as sqldf which Gabor has
mentioned, there are also packages specifically for handling large
data: see the "ff" package, for instance.

I am currently playing with parallelizing R computations via Hadoop. I
haven't looked at PIG yet though.

Rory


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Roland Rau
Sent: 21 August 2008 20:04
To: Avram Aelony
Cc: r-help@r-project.org
Subject: Re: [R] Large data sets with R (binding to hadoop available?)
Re: [R] Large data sets with R (binding to hadoop available?)

Hi

Avram Aelony wrote:

Dear R community,
I find R fantastic and use R whenever I can for my data analytic
needs.  Certain data sets, however, are so large that other tools
seem to be needed to pre-process data such that it can be brought
into R for further analysis.
Questions I have for the many expert contributors on this list are:
1. How do others handle situations of large data sets (gigabytes,
terabytes) for analysis in R ?


I usually try to store the data in an SQLite database and interface
via functions from the packages RSQLite (and DBI).


No idea about Question No. 2, though.

Hope this helps,
Roland


P.S. When I am sure that I only need a certain subset of large data
sets, I still prefer to do some pre-processing in awk (gawk).
2.P.S. The size of my data sets are in the gigabyte range (not
terabyte range). This might be important if your data sets are
*really large* and you want to use sqlite:
http://www.sqlite.org/whentouse.html


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide

http://www.R-project.org/posting-guide.html

and provide commented, minimal, self-contained, reproducible code.

***
The Royal Bank of Scotland plc. Registered in Scotland No
90312. Registered Office: 36 St Andrew Square, Edinburgh EH2 2YB.
Authorised and regulated by the Financial Services Authority

This e-mail message is confidential and for use by the=2...{{dropped:22}}


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide

http://www.R-project.org/posting-guide.html

and provide commented, minimal, self-contained, reproducible code.


--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting 

Re: [R] Large Data Set Help

2008-08-25 Thread Charles C. Berry

On Mon, 25 Aug 2008, Roland Rau wrote:


Hi,

Jason Thibodeau wrote:

 I am attempting to perform some simple data manipulation on a large data
 set. I have a snippet of the whole data set, and my small snippet is 2GB
 in
 CSV.

 Is there a way I can read my csv, select a few columns, and write it to an
 output file in real time? This is what I do right now to a small test
 file:

 data <- read.csv('data.csv', header = FALSE)

 data_filter <- data[c(1,3,4)]

 write.table(data_filter, file = "filter_data.csv", sep = ",", row.names =
 FALSE, col.names = FALSE)


in this case, I think R is not the best tool for the job. I would rather 
suggest to use an implementation of the awk language (e.g. gawk).
I just tried the following on WinXP (zipped file (87MB zipped, 1.2GB 
unzipped), piped into gawk)

unzip -p myzipfile.zip | gawk '{print $1, $3, $4}' > myfiltereddata.txt


Or

unzip -p myzipfile.zip | cut -d, -f1,3,4 > myfiltereddata.txt

But beware that both this and Roland's solution will return

a,c,d

for an input line consisting of

a,"b,c",d,e,f

HTH,

Chuck


and it took about 90 seconds.

Please note that you might need to specify your delimiter (field separator 
(FS) and output field separator (OFS)) =>

gawk '{FS=","; OFS=","} {print $1, $3, $4}' data.csv > filter_data.csv

I hope this helps (despite not encouraging the usage of R),
Roland

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




Charles C. Berry                            (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:[EMAIL PROTECTED]  UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Large Data Set Help

2008-08-25 Thread Roland Rau

Hi,

Jason Thibodeau wrote:

I am attempting to perform some simple data manipulation on a large data
set. I have a snippet of the whole data set, and my small snippet is 2GB in
CSV.

Is there a way I can read my csv, select a few columns, and write it to an
output file in real time? This is what I do right now to a small test file:

data <- read.csv('data.csv', header = FALSE)

data_filter <- data[c(1,3,4)]

write.table(data_filter, file = "filter_data.csv", sep = ",", row.names =
FALSE, col.names = FALSE)


in this case, I think R is not the best tool for the job. I would rather 
suggest to use an implementation of the awk language (e.g. gawk).
I just tried the following on WinXP (zipped file (87MB zipped, 1.2GB 
unzipped), piped into gawk)

unzip -p myzipfile.zip | gawk '{print $1, $3, $4}' > myfiltereddata.txt
and it took about 90 seconds.

Please note that you might need to specify your delimiter (field 
separator (FS) and output field separator (OFS)) =>

gawk '{FS=","; OFS=","} {print $1, $3, $4}' data.csv > filter_data.csv

I hope this helps (despite not encouraging the usage of R),
Roland

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Large Data Set Help

2008-08-25 Thread jim holtman
Establish a "connection" with the file you want to read, read in 1,000
rows (or whatever you want).  If you are using read.csv and there is a
header, you might want to skip it initially since there will be no
header when you read the next 1000 rows.  Also use 'as.is=TRUE' so
that character fields are not converted to factors.  You can then
write out the columns that you want.  You can put this in a loop till
you reach the end of file.
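
A sketch of that loop for the example below (chunk size and file names follow the
original post, otherwise illustrative):

con <- file("data.csv", open = "r")
repeat {
  chunk <- tryCatch(read.csv(con, header = FALSE, nrows = 1000, as.is = TRUE),
                    error = function(e) NULL)       # NULL once the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  write.table(chunk[, c(1, 3, 4)], "filter_data.csv", sep = ",", append = TRUE,
              row.names = FALSE, col.names = FALSE)
}
close(con)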

On Mon, Aug 25, 2008 at 3:34 PM, Jason Thibodeau <[EMAIL PROTECTED]> wrote:
> I am attempting to perform some simple data manipulation on a large data
> set. I have a snippet of the whole data set, and my small snippet is 2GB in
> CSV.
>
> Is there a way I can read my csv, select a few columns, and write it to an
> output file in real time? This is what I do right now to a small test file:
>
> data <- read.csv('data.csv', header = FALSE)
>
> data_filter <- data[c(1,3,4)]
>
> write.table(data_filter, file = "filter_data.csv", sep = ",", row.names =
> FALSE, col.names = FALSE)
>
> This test file writes the three columns to my desired output file. Can I do
> this while bypassing the storage of the entire array in memory?
>
> Thank you very much for the help.
> --
> Jason
>
>[[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Large data sets with R (binding to hadoop available?)

2008-08-22 Thread Martin Morgan
Hi Avram --

My understanding is that Google-like map / reduce achieves throughput
by coordinating distributed calculation with distributed data.

snow, Rmpi, nws, etc provide a way of distributing calculations, but
don't help with coordinating distributed calculation with distributed
data.

SQL (at least naively implemented as a single database server) doesn't
help with distributed data and the overhead of data movement from the
server to compute nodes might be devastating. A shared file system
across compute nodes (the implicit approach usually taken parallel R
applications) offloads data distribution to the file system, which may
be effective for not-too-large (10's of GB?) data.

Many non-trivial R algorithms are not directly usable in distributed
map, because they expect to operate on 'all of the data' rather than
on data chunks. Out-of-the-box 'reduce' in R is limited really to
collation (the parallel lapply-like functions) or sapply-like
simplification; one would rather have more talented reducers (e.g., to
aggregate bootstrap results).
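
A toy example of the mechanics (not a statistically careful bootstrap, just to
show a map over chunks followed by a reduce that collates the results):

set.seed(1)
chunks  <- split(rnorm(1e5), rep(1:10, each = 1e4))           # ten data chunks
mapped  <- lapply(chunks, function(x)
  replicate(100, mean(sample(x, replace = TRUE))))            # 'map': bootstrap means per chunk
reduced <- Reduce(c, mapped)                                  # 'reduce': collate the replicates
quantile(reduced, c(0.025, 0.975))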

The list of talents required to exploit Hadoop starts to become
intimidating (R, Java, Hadoop, PIG, + cluster management, etc), so it
would certainly be useful to have that encapsulated in a way that
requires only R skills!

Martin

<[EMAIL PROTECTED]> writes:

> Hi
>
> Apart from database interfaces such as sqldf which Gabor has
> mentioned, there are also packages specifically for handling large
> data: see the "ff" package, for instance.
>
> I am currently playing with parallelizing R computations via Hadoop. I
> haven't looked at PIG yet though.
>
> Rory
>
>
> -Original Message- From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Roland Rau Sent: 21
> August 2008 20:04 To: Avram Aelony Cc: r-help@r-project.org Subject:
> Re: [R] Large data sets with R (binding to hadoop available?)
>
> Hi
>
> Avram Aelony wrote:
>> Dear R community,
>> I find R fantastic and use R whenever I can for my data analytic
>> needs.  Certain data sets, however, are so large that other tools
>> seem to be needed to pre-process data such that it can be brought
>> into R for further analysis.
>> Questions I have for the many expert contributors on this list are:
>> 1. How do others handle situations of large data sets (gigabytes,
>> terabytes) for analysis in R ?
>>
> I usually try to store the data in an SQLite database and interface
>> via functions from the packages RSQLite (and DBI).
>
> No idea about Question No. 2, though.
>
> Hope this helps,
> Roland
>
>
> P.S. When I am sure that I only need a certain subset of large data
>> sets, I still prefer to do some pre-processing in awk (gawk).
> 2.P.S. The size of my data sets are in the gigabyte range (not
>> terabyte range). This might be important if your data sets are
>> *really large* and you want to use sqlite:
>> http://www.sqlite.org/whentouse.html
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
> ***
> The Royal Bank of Scotland plc. Registered in Scotland No
>> 90312. Registered Office: 36 St Andrew Square, Edinburgh EH2 2YB.
> Authorised and regulated by the Financial Services Authority
>
> This e-mail message is confidential and for use by
>> the=2...{{dropped:22}}
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Large data sets with R (binding to hadoop available?)

2008-08-22 Thread Thomas Lumley

On Thu, 21 Aug 2008, Roland Rau wrote:

Hi

Avram Aelony wrote: (in part)


1. How do others handle situations of large data sets (gigabytes, 
terabytes) for analysis in R ?


I usually try to store the data in an SQLite database and interface via 
functions from the packages RSQLite (and DBI).


No idea about Question No. 2, though.

Hope this helps,
Roland


P.S. When I am sure that I only need a certain subset of large data sets, I 
still prefer to do some pre-processing in awk (gawk).
2.P.S. The size of my data sets are in the gigabyte range (not terabyte 
range). This might be important if your data sets are *really large* and you 
want to use sqlite: http://www.sqlite.org/whentouse.html




I use netCDF for (genomic) datasets in the 100Gb range, with the ncdf 
package, because SQLite was too slow for the sort of queries I needed. 
HDF5 would be another possibility; I'm not sure of the current status of 
the HDF5 support in Bioconductor, though.


-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Large data sets with R (binding to hadoop available?)

2008-08-22 Thread Rory.WINSTON
Hi

Apart from database interfaces such as sqldf which Gabor has mentioned, there 
are also packages specifically for handling large data: see the "ff" package, 
for instance.
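
For example, a sketch of the ff interface from memory (file name made up; argument
names may differ slightly between versions):

library(ff)
big <- read.csv.ffdf(file = "big.csv", header = TRUE,
                     next.rows = 500000)     # grow the on-disk object in 500k-row slabs
nrow(big)                                    # dimensions known without loading the data
slab <- big[1:10000, ]                       # pull a manageable slab into RAM as a data.frame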

I am currently playing with parallelizing R computations via Hadoop. I haven't 
looked at PIG yet though.

Rory


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Roland Rau
Sent: 21 August 2008 20:04
To: Avram Aelony
Cc: r-help@r-project.org
Subject: Re: [R] Large data sets with R (binding to hadoop available?)

Hi

Avram Aelony wrote:
>
> Dear R community,
>
> I find R fantastic and use R whenever I can for my data analytic needs.
> Certain data sets, however, are so large that other tools seem to be
> needed to pre-process data such that it can be brought into R for
> further analysis.
>
> Questions I have for the many expert contributors on this list are:
>
> 1. How do others handle situations of large data sets (gigabytes,
> terabytes) for analysis in R ?
>
I usually try to store the data in an SQLite database and interface via 
functions from the packages RSQLite (and DBI).

No idea about Question No. 2, though.

Hope this helps,
Roland


P.S. When I am sure that I only need a certain subset of large data sets, I 
still prefer to do some pre-processing in awk (gawk).
2.P.S. The size of my data sets are in the gigabyte range (not terabyte range). 
This might be important if your data sets are *really large* and you want to 
use sqlite: http://www.sqlite.org/whentouse.html

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

***
The Royal Bank of Scotland plc. Registered in Scotland No 90312. Registered 
Office: 36 St Andrew Square, Edinburgh EH2 2YB. 
Authorised and regulated by the Financial Services Authority 

This e-mail message is confidential and for use by the=2...{{dropped:22}}

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Large data sets with R (binding to hadoop available?)

2008-08-21 Thread Roland Rau

Hi

Avram Aelony wrote:


Dear R community,

I find R fantastic and use R whenever I can for my data analytic needs.  
Certain data sets, however, are so large that other tools seem to be 
needed to pre-process data such that it can be brought into R for 
further analysis.


Questions I have for the many expert contributors on this list are:

1. How do others handle situations of large data sets (gigabytes, 
terabytes) for analysis in R ?


I usually try to store the data in an SQLite database and interface via 
functions from the packages RSQLite (and DBI).
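
A minimal sketch of that workflow with current DBI/RSQLite syntax (file, table and
column names are made up):

library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "big.sqlite")
dbWriteTable(con, "mydata", chunk1)                  # load once, or append chunk by chunk
dbWriteTable(con, "mydata", chunk2, append = TRUE)
piece <- dbGetQuery(con, "SELECT a, b FROM mydata WHERE a > 30")  # only what you need comes back
dbDisconnect(con)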


No idea about Question No. 2, though.

Hope this helps,
Roland


P.S. When I am sure that I only need a certain subset of large data 
sets, I still prefer to do some pre-processing in awk (gawk).
2.P.S. The size of my data sets are in the gigabyte range (not terabyte 
range). This might be important if your data sets are *really large* and 
you want to use sqlite: http://www.sqlite.org/whentouse.html


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Large data sets with R (binding to hadoop available?)

2008-08-21 Thread Gabor Grothendieck
RSQLite package can read files into an SQLite database without the data going
through R. sqldf package provides a front end that makes it
particularly easy to
use - basically you need only a couple of lines of code.  Other databases have
similar facilities.  See:

http://sqldf.googlecode.com
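
Something along these lines, as a sketch (file and column names are made up):

library(sqldf)
# the CSV is loaded into a temporary SQLite database behind the scenes;
# only the rows/columns selected by the SQL ever become an R data.frame
small <- read.csv.sql("big.csv",
                      sql = "select col1, col3 from file where col3 > 0")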

On Thu, Aug 21, 2008 at 2:32 PM, Avram Aelony <[EMAIL PROTECTED]> wrote:
>
> Dear R community,
>
> I find R fantastic and use R whenever I can for my data analytic needs.
>  Certain data sets, however, are so large that other tools seem to be needed
> to pre-process data such that it can be brought into R for further analysis.
>
> Questions I have for the many expert contributors on this list are:
>
> 1. How do others handle situations of large data sets (gigabytes, terabytes)
> for analysis in R ?
>
> 2. Are there existing ways or plans to devise ways to use the R language to
> interact with Hadoop or PIG ?  The Hadoop project by Apache has been
> successful at processing data on a large scale using the map-reduce
> algorithm.  A sister project uses an emerging language called "PIG-latin" or
> simply "PIG" for using the Hadoop framework in a manner reminiscent of the
> look and feel of R.  Is there an opportunity here to create a conceptual
> bridge since these projects are also open-source?  Does it already exist?
>
>
> Thanks in advance for your comments.
>
> -Avram
>
>
>
>
> ---
> Information about Hadoop:
> http://wiki.apache.org/hadoop/
> http://en.wikipedia.org/wiki/Hadoop
>
> "Apache Hadoop is a free Java software framework that supports data
> intensive distributed applications running on large clusters of commodity
> computers.[1] It enables applications to work with thousands of nodes and
> petabytes of data. Hadoop was inspired by Google's MapReduce and Google File
> System (GFS) papers."
>
>
>
> ---
> Information about PIG:
>
> http://incubator.apache.org/pig/
>
> "Pig is a platform for analyzing large data sets that consists of a
> high-level language for expressing data analysis programs, coupled with
> infrastructure for evaluating these programs. The salient property of Pig
> programs is that their structure is amenable to substantial parallelization,
> which in turns enables them to handle very large data sets.
> At the present time, Pig's infrastructure layer consists of a compiler that
> produces sequences of Map-Reduce programs, for which large-scale parallel
> implementations already exist (e.g., the Hadoop subproject). Pig's language
> layer currently consists of a textual language called Pig Latin, which has
> the following key properties:
>
> * Ease of programming. It is trivial to achieve parallel execution of
> simple, "embarrassingly parallel" data analysis tasks. Complex tasks
> comprised of multiple interrelated data transformations are explicitly
> encoded as data flow sequences, making them easy to write, understand, and
> maintain.
> * Optimization opportunities. The way in which tasks are encoded permits the
> system to optimize their execution automatically, allowing the user to focus
> on semantics rather than efficiency.
> * Extensibility. Users can create their own functions to do special-purpose
> processing."
>
> ---__
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.