Re: [R] Huge data sets and RAM problems

2010-04-22 Thread Stella Pachidi
Dear all,

Thank you very much for your replies and help. I will try to work with
your suggestions and come back to you if I need something more.

Kind regards,
Stella Pachidi

-- 
Stella Pachidi
Master in Business Informatics student
Utrecht University
email: s.pach...@students.uu.nl
tel: +31644478898



Re: [R] Huge data sets and RAM problems

2010-04-21 Thread kMan
Perhaps you set some records to NULL (deleted, shifted up). Or perhaps your
system is susceptible to butterflies on the other side of the world.

Your code may have 'worked' on a small section of data, but the data used
did not include all of the cases needed to fully test your code. So... test
your code!

scan(), used with 'nlines', 'skip', 'sep', and 'what', will cut your read
time by at least half while using less RAM. Read the file in chunks, do most
of your post-processing chunk by chunk, and you will have something you can
actually test. If you omit 'nlines', though, you lose the time/memory
advantage over read.table(). 'skip' will get you right to the point just
before where things failed -- that would be an interesting small segment of
data to test with.
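
For illustration, a minimal sketch of that chunked approach, assuming a
tab-separated file with a header row and 18 columns read as character (the
file name, chunk size, and column types are placeholders to adapt):

## read "file.txt" in chunks of 100,000 rows without ever holding the whole
## table in RAM; reading from an open connection means 'skip' is not needed
cols <- rep(list(character()), 18)          # one 'what' entry per column
input <- file("file.txt", "r")
invisible(readLines(input, n = 1))          # skip the header line
repeat {
  chunk <- scan(input, what = cols, sep = "\t", nlines = 100000,
                quote = "", quiet = TRUE)
  if (length(chunk[[1]]) == 0) break        # end of file reached
  ## ... post-process this chunk here, keep only what you need ...
}
close(input)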

WordPad can read your file (and then some). Eventually.

Sincerely,
KeithC.

-Original Message-
From: Stella Pachidi [mailto:stella.pach...@gmail.com] 
Sent: Monday, April 19, 2010 2:07 PM
To: r-h...@stat.math.ethz.ch
Subject: [R] Huge data sets and RAM problems

Dear all,

This is the first time I am sending mail to the mailing list, so I hope I do
not make a mistake...

For the last few months I have been working on my MSc thesis project, applying
data mining techniques to the user logs of a software-as-a-service application.
The main problem I am experiencing is how to process the huge amount of data.
More specifically:

I am using R 2.10.1 on a laptop with 32-bit Windows 7, 2 GB of RAM and an
Intel Core Duo 2 GHz CPU.

The user log data come from a Crystal Reports query (.rpt file), which I
transform with some Java code into a tab-separated file.

Although everything runs fine on a small subset of my data, when I increase
the data set I run into several problems:

The first problem is with read.delim(). When I try to read a large amount of
data (over 2,400,000 rows with 18 attributes each), it does not seem to turn
the whole table into a data frame: the data frame returned has only 1,220,987
rows.
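
A quick diagnostic sketch, assuming the file is really called "file.txt" and
is purely tab-separated; it checks whether quote handling is what makes
read.delim() stop early:

## how many rows does R see when quote handling is switched off?
n.fields <- count.fields("file.txt", sep = "\t", quote = "", comment.char = "")
length(n.fields)   # should be the number of data rows plus the header
table(n.fields)    # do all rows really have 18 fields?
## if these counts look right, an unmatched quote character is the likely
## culprit, and read.delim("file.txt", quote = "") may read the full table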

Furthermore, one of the attributes is a DateTime. When I try to split this
column into two columns (one with the date and one with the time), the result
is quite strange, as the two new columns appear to have more rows than the
data frame:

applicLog.dat <- read.delim("file.txt")

# Process the syscreated column (date time --> date + time)
copyDate <- applicLog.dat[["ï..syscreated"]]
copyDate <- as.character(copyDate)
splitDate <- strsplit(copyDate, " ")
splitDate <- unlist(splitDate)
splitDateIndex <- c(1:length(splitDate))
sysCreatedDate <- splitDate[splitDateIndex %% 2 == 1]  # odd positions: dates
sysCreatedTime <- splitDate[splitDateIndex %% 2 == 0]  # even positions: times
sysCreatedDate <- strptime(sysCreatedDate, format = "%Y-%m-%d")
op <- options(digits.secs = 3)
sysCreatedTime <- strptime(sysCreatedTime, format = "%H:%M:%OS")
applicLog.dat[["ï..syscreated"]] <- NULL
applicLog.dat <- cbind(sysCreatedDate, sysCreatedTime, applicLog.dat)

Then I get the error: Error in data.frame(..., check.names = FALSE) :
  arguments imply differing number of rows: 1221063, 1221062, 1220987
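
One way to avoid the row-count mismatch is a per-row split that cannot change
the number of rows; a minimal sketch, assuming the column is really called
"ï..syscreated" and formatted like "2010-04-19 22:07:03.123" (rows that do
not match simply become NA instead of shifting the rest):

copyDate <- as.character(applicLog.dat[["ï..syscreated"]])
## everything before the first space is the date, everything after it the time
sysCreatedDate <- strptime(sub(" .*$", "", copyDate), format = "%Y-%m-%d")
op <- options(digits.secs = 3)
sysCreatedTime <- strptime(sub("^[^ ]* ", "", copyDate), format = "%H:%M:%OS")
applicLog.dat[["ï..syscreated"]] <- NULL
applicLog.dat <- cbind(sysCreatedDate, sysCreatedTime, applicLog.dat)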


Finally, another problem occurs when I perform association mining on the data
set using the arules package: I turn the data frame into a transactions object
and then run the apriori algorithm. When I set the support too low, in order
to find the rules I need, the vector of rules becomes too big and I get memory
problems such as:

Error: cannot allocate vector of size 923.1 Mb
In addition: Warning messages:
1: In items(x) : Reached total allocation of 153Mb: see help(memory.size)
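
For reference, a minimal sketch of that workflow, assuming the relevant
columns of applicLog.dat are factors; the support and confidence values are
placeholders, not the ones actually used:

library(arules)
## coerce the data frame of factors to a transactions object, then mine rules
trans <- as(applicLog.dat, "transactions")
rules <- apriori(trans, parameter = list(support = 0.001, confidence = 0.5))
summary(rules)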

Could you please help me with how to allocate more RAM? Or do you think there
is a way to process the data from a file on disk instead of loading everything
into RAM? Do you know how I could manage to read my whole data set?

I would really appreciate your help.

Kind regards,
Stella Pachidi

PS: Do you know any text editor that can read huge .txt files?





--
Stella Pachidi
Master in Business Informatics student
Utrecht University



Re: [R] Huge data sets and RAM problems

2010-04-20 Thread Jay Emerson
Stella,

A few brief words of advice:

1. Work through your code a line at a time, making sure that each step gives
you what you would expect. I think some of your later problems are the result
of something early on not being as expected. For example, if read.delim() is
in fact not giving you what you expect, stop there before moving onwards. I
suspect some funny character(s) or character encodings might be the problem.

2. 32-bit Windows can be limiting. With 2 GB of RAM, you're probably not
going to be able to work effectively in native R with objects over 200-300 MB,
and the error indicates that something (you or a package you're using) has
simply run out of memory. So...

3. Consider more RAM (and preferably with 64-bit R). Other solutions might be
possible, such as using a database to handle the data transition into R (a
rough sketch follows below). 2.5 million rows by 18 columns is apt to be
around 360 MB. Although you can afford one (or a few) copies of this, it
doesn't leave you much room for the memory overhead of working with such an
object.
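
On the memory ceiling, a couple of checks worth running first (these
functions exist only in R for Windows):

memory.size(max = TRUE)   # MB obtained from the OS so far
memory.limit()            # current ceiling in MB; memory.limit(size = ...) can try to raise it

And a rough sketch of the database route, assuming the RSQLite package is
installed; the file, table and database names are placeholders:

library(RSQLite)
db <- dbConnect(SQLite(), dbname = "userlogs.db")

## load the tab-separated file into SQLite in chunks, so no full copy of the
## table ever has to sit in RAM
input <- file("file.txt", "r")
header <- strsplit(readLines(input, n = 1), "\t")[[1]]
repeat {
  lines <- readLines(input, n = 100000)
  if (length(lines) == 0) break
  tc <- textConnection(lines)
  chunk <- read.delim(tc, header = FALSE, col.names = header,
                      stringsAsFactors = FALSE)
  close(tc)
  dbWriteTable(db, "logs", chunk, append = TRUE)
}
close(input)

## pull manageable subsets back into R with SQL
first.rows <- dbGetQuery(db, "SELECT * FROM logs LIMIT 100000")
dbDisconnect(db)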

Part of the original message is below.

Jay

-

Message: 80
Date: Mon, 19 Apr 2010 22:07:03 +0200
From: Stella Pachidi 
To: r-h...@stat.math.ethz.ch
Subject: [R]  Huge data sets and RAM problems
Message-ID:
   
Content-Type: text/plain; charset=ISO-8859-1

Dear all,



I am using R 2.10.1 in a laptop with Windows 7 - 32bit system, 2GB RAM
and CPU Intel Core Duo 2GHz.

.

Finally, another problem I have is when I perform association mining
on the data set using the package arules: I turn the data frame into
transactions table and then run the apriori algorithm. When I put too
low support in order to manage to find the rules I need, the vector of
rules becomes too big and I get problems with the memory such as:
Error: cannot allocate vector of size 923.1 Mb
In addition: Warning messages:
1: In items(x) : Reached total allocation of 153Mb: see help(memory.size)

Could you please help me with how I could allocate more RAM? Or, do
you think there is a way to process the data by loading them into a
document instead of loading all into RAM? Do you know how I could
manage to read all my data set?

I would really appreciate your help.

Kind regards,
Stella Pachidi


-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.