[R] Two Problems while trying to aggregate a dataframe

2007-03-24 Thread Delcour Libertus
Hello!

Given is an Excel-Sheet with actually 11,000 rows and 9 columns. I want
to work with the data in R. The contents are similar to my following
example.

I have a list with ID-number, personal name and two kinds of
loan-values. I want to aggregate the list, that for each person only one
row remains and where the loan-values are added.

First I tried some commands with tapply but had no success at all. Then
I found in this mailing list a hint for aggregate (though I did not
understand most of that mail).

So I made some efforts with aggregate() and it seems to lead the right way:

[code]
 atest - read.csv2 (aggregatetest.csv)
 str(atest)
`data.frame':   10 obs. of  4 variables:
 $ PrsNr  : int  1 2 2 3 4 5 6 6 6 7
 $ Namen  : Factor w/ 7 levels Holla,Mabba,..: 1 2 2 4 5 6 7 7 7 3
 $ Betrag1: num  1.99 2.34 5.23 4.23 2.23 2.77 3.83 2.76 6.32 2.88
 $ Betrag2: num  3.44 5.32 5.21 9.12 7.32 8.32 6.99 4.45 5.34 3.81
 atest
   PrsNr Namen Betrag1 Betrag2
1  1 Holla1.993.44
2  2 Mabba2.345.32
3  2 Mabba5.235.21
4  3  Pisa4.239.12
5  4 Pulla2.237.32
6  5  Raba2.778.32
7  6  Saba3.836.99
8  6  Saba2.764.45
9  6  Saba6.325.34
10 7 Mulla2.883.81
 aggregate(list(Betrag1=atest$Betrag1),  by=list(PsrNr=atest$PrsNr,
Namen=atest$Namen),  sum)
  PsrNr Namen Betrag1
1 1 Holla1.99
2 2 Mabba7.57
3 7 Mulla2.88
4 3  Pisa4.23
5 4 Pulla2.23
6 5  Raba2.77
7 6  Saba   12.91
[/code]

The result is nearly that I want.

First problem:

How do I get all columnss in my result. Betrag2 is missing.

Second problem:

If I use the aggregate-command on the real data then it is for me
impossible to use more than on by-grouping variable (my example above
has two). Impossible because 1 GB RAM and 1.5 GB SWAP are not enough to
process my command. My computer (Ubuntu Linux, Gmome) freezes. So I
doubt wether I use the appropriate method to follow my target.

Which ist the best way to aggregate dataframes as I want? Are there any
better functions/commands or do I have to learn programming for this?

Greetings

Delcour

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Two Problems while trying to aggregate a dataframe

2007-03-24 Thread jim holtman
On 3/24/07, Delcour Libertus [EMAIL PROTECTED] wrote:

 Hello!

 Given is an Excel-Sheet with actually 11,000 rows and 9 columns. I want
 to work with the data in R. The contents are similar to my following
 example.

 I have a list with ID-number, personal name and two kinds of
 loan-values. I want to aggregate the list, that for each person only one
 row remains and where the loan-values are added.

 First I tried some commands with tapply but had no success at all. Then
 I found in this mailing list a hint for aggregate (though I did not
 understand most of that mail).

 So I made some efforts with aggregate() and it seems to lead the right
 way:

 [code]
  atest - read.csv2 (aggregatetest.csv)
  str(atest)
 `data.frame':   10 obs. of  4 variables:
 $ PrsNr  : int  1 2 2 3 4 5 6 6 6 7
 $ Namen  : Factor w/ 7 levels Holla,Mabba,..: 1 2 2 4 5 6 7 7 7 3
 $ Betrag1: num  1.99 2.34 5.23 4.23 2.23 2.77 3.83 2.76 6.32 2.88
 $ Betrag2: num  3.44 5.32 5.21 9.12 7.32 8.32 6.99 4.45 5.34 3.81
  atest
   PrsNr Namen Betrag1 Betrag2
 1  1 Holla1.993.44
 2  2 Mabba2.345.32
 3  2 Mabba5.235.21
 4  3  Pisa4.239.12
 5  4 Pulla2.237.32
 6  5  Raba2.778.32
 7  6  Saba3.836.99
 8  6  Saba2.764.45
 9  6  Saba6.325.34
 10 7 Mulla2.883.81
  aggregate(list(Betrag1=atest$Betrag1),  by=list(PsrNr=atest$PrsNr,
 Namen=atest$Namen),  sum)
 PsrNr Namen Betrag1
 1 1 Holla1.99
 2 2 Mabba7.57
 3 7 Mulla2.88
 4 3  Pisa4.23
 5 4 Pulla2.23
 6 5  Raba2.77
 7 6  Saba   12.91
 [/code]

 The result is nearly that I want.

 First problem:

 How do I get all columnss in my result. Betrag2 is missing.


It really doesn't make sense to have the column Betrag2.  For example, for
'Saba' which of the three values to you want?  If you want the sum of that
column also, then::

 aggregate(list(Betrag1=x.in$Betrag1, Betrag2=x.in$Betrag2),
+ list(PrsNr=x.in$PrsNr, Namen=x.in$Namen), sum)
  PrsNr Namen Betrag1 Betrag2
1 1 Holla1.993.44
2 2 Mabba7.57   10.53
3 7 Mulla2.883.81
4 3  Pisa4.239.12
5 4 Pulla2.237.32
6 5  Raba2.778.32
7 6  Saba   12.91   16.78

would do it.

Second problem:

 If I use the aggregate-command on the real data then it is for me
 impossible to use more than on by-grouping variable (my example above
 has two). Impossible because 1 GB RAM and 1.5 GB SWAP are not enough to
 process my command. My computer (Ubuntu Linux, Gmome) freezes. So I
 doubt wether I use the appropriate method to follow my target.


A data.frame 11000x9 is not very large for processing in R.  If most of the
columns are numeric, this is less than 1MB.  You need to provide exactly
what you are doing, since I process dataframe with 100,000 row and 10
columns without any problems

Here is a quick test of creating a 11000x9 data frame and then aggregating
the columns; takes less than a second.

 n - 11000
 x - data.frame(id=sample(1:20,n,T),runif(n), runif(n), runif(n),
runif(n),
+ runif(n), runif(n), runif(n), runif(n))
 str(x)
'data.frame':   11000 obs. of  9 variables:
 $ id: int  10 7 2 7 19 10 20 18 5 5 ...
 $ runif.n.  : num  0.590 0.960 0.493 0.497 0.746 ...
 $ runif.n..1: num  0.5058 0.0354 0.9629 0.0861 0.5001 ...
 $ runif.n..2: num  0.693 0.389 0.277 0.731 0.726 ...
 $ runif.n..3: num  0.1442 0.0585 0.5610 0.4206 0.4768 ...
 $ runif.n..4: num  0.7917 0.0351 0.5035 0.2469 0.1013 ...
 $ runif.n..5: num  0.259 0.157 0.492 0.264 0.502 ...
 $ runif.n..6: num  0.769 0.236 0.295 0.138 0.776 ...
 $ runif.n..7: num  0.278 0.613 0.219 0.199 0.689 ...
 object.size(x)
[1] 748984
 system.time(print(aggregate(x[2:9], list(x$id), sum)))
   Group.1 runif.n. runif.n..1 runif.n..2 runif.n..3 runif.n..4 runif.n..5
runif.n..6 runif.n..7
11 292.8513   289.6298   290.0143   293.4324   293.3280   284.4095
295.2547   290.5937
22 262.3319   273.1242   271.9370   269.8798   277.1156   273.3304
282.2954   267.8959
33 289.1224   273.2288   270.3815   278.3573   269.7454   267.2329
269.0993   277.7569
44 284.3488   284.9606   281.8343   278.8463   281.6107   277.7917
279.8643   286.0857
55 285.0568   281.0035   286.2571   283.1572   279.0046   275.8996
288.9442   268.2956
66 271.3570   254.0755   273.3773   264.6663   260.6819   264.5135
263.4219   273.2450
77 260.7022   269.0428   266.2272   273.6763   273.4340   258.7884
265.0706   267.7709
88 272.3414   264.2267   276.7643   271.0522   266.9568   268.4925
276.1819   270.2247
99 303.8744   302.0581   283.6805   292.9412   305.3064   291.2466
308.8895   291.9919
10  10 282.4658   265.5525   265.3477   278.4066   267.9801   272.9873
267.4629   276.1851
11  11 272.4771   256.4079   255.4102   261.3409   264.8875   265.5583
274.0994   253.7260
12  12 293.0385   281.3208   288.6191   268.4726   

Re: [R] Two Problems while trying to aggregate a dataframe

2007-03-24 Thread Gabor Grothendieck
Try this:

aggregate(atest[3:4], atest[1:2], sum)

Use a data base and SQL is you don't otherwise have enough
computer resources.


On 3/24/07, Delcour Libertus [EMAIL PROTECTED] wrote:
 Hello!

 Given is an Excel-Sheet with actually 11,000 rows and 9 columns. I want
 to work with the data in R. The contents are similar to my following
 example.

 I have a list with ID-number, personal name and two kinds of
 loan-values. I want to aggregate the list, that for each person only one
 row remains and where the loan-values are added.

 First I tried some commands with tapply but had no success at all. Then
 I found in this mailing list a hint for aggregate (though I did not
 understand most of that mail).

 So I made some efforts with aggregate() and it seems to lead the right way:

 [code]
  atest - read.csv2 (aggregatetest.csv)
  str(atest)
 `data.frame':   10 obs. of  4 variables:
  $ PrsNr  : int  1 2 2 3 4 5 6 6 6 7
  $ Namen  : Factor w/ 7 levels Holla,Mabba,..: 1 2 2 4 5 6 7 7 7 3
  $ Betrag1: num  1.99 2.34 5.23 4.23 2.23 2.77 3.83 2.76 6.32 2.88
  $ Betrag2: num  3.44 5.32 5.21 9.12 7.32 8.32 6.99 4.45 5.34 3.81
  atest
   PrsNr Namen Betrag1 Betrag2
 1  1 Holla1.993.44
 2  2 Mabba2.345.32
 3  2 Mabba5.235.21
 4  3  Pisa4.239.12
 5  4 Pulla2.237.32
 6  5  Raba2.778.32
 7  6  Saba3.836.99
 8  6  Saba2.764.45
 9  6  Saba6.325.34
 10 7 Mulla2.883.81
  aggregate(list(Betrag1=atest$Betrag1),  by=list(PsrNr=atest$PrsNr,
 Namen=atest$Namen),  sum)
  PsrNr Namen Betrag1
 1 1 Holla1.99
 2 2 Mabba7.57
 3 7 Mulla2.88
 4 3  Pisa4.23
 5 4 Pulla2.23
 6 5  Raba2.77
 7 6  Saba   12.91
 [/code]

 The result is nearly that I want.

 First problem:

 How do I get all columnss in my result. Betrag2 is missing.

 Second problem:

 If I use the aggregate-command on the real data then it is for me
 impossible to use more than on by-grouping variable (my example above
 has two). Impossible because 1 GB RAM and 1.5 GB SWAP are not enough to
 process my command. My computer (Ubuntu Linux, Gmome) freezes. So I
 doubt wether I use the appropriate method to follow my target.

 Which ist the best way to aggregate dataframes as I want? Are there any
 better functions/commands or do I have to learn programming for this?

 Greetings

 Delcour

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.