Re: [R] aggregate formula - differing results

Ivan Calandra Mon, 04 Sep 2023 05:00:53 -0700

Thanks Rui for your help; that would be one possibility indeed.

But am I the only one who finds that behavior of aggregate() completely unexpected and confusing? Especially considering that dplyr::summarise() and doBy::summaryBy() deal with NAs differently, even though they all use mean(na.rm = TRUE) to calculate the group stats.


Best wishes,
Ivan

On 04/09/2023 13:46, Rui Barradas wrote:

At 10:44 on 04/09/2023, Ivan Calandra wrote:
Dear useRs,
I have just stumbled across a behavior in aggregate() that I cannot explain. Any help would be appreciated!
Sample data:
my_data <- structure(list(
  ID = c("FLINT-1", "FLINT-10", "FLINT-100", "FLINT-101", "FLINT-102",
         "HORN-10", "HORN-100", "HORN-102", "HORN-103", "HORN-104"),
  EdgeLength = c(130.75, 168.77, 142.79, 130.1, 140.41, 121.37, 70.52, 122.3, 71.01, 104.5),
  SurfaceArea = c(1736.87, 1571.83, 1656.46, 1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2),
  Length = c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854, 32.075, 21.337, 35.459),
  Width = c(45.982, 67.303, 52.679, 26.42, 25.149, 33.427, 20.683, 62.783, 26.417, 35.297),
  PLATWIDTH = c(38.84, NA, 15.33, 30.37, 11.44, 14.88, 13.86, NA, NA, 26.71),
  PLATTHICK = c(8.67, NA, 7.99, 11.69, 3.3, 16.52, 4.58, NA, NA, 9.35),
  EPA = c(78, NA, 78, 54, 72, 49, 56, NA, NA, 56),
  THICKNESS = c(10.97, NA, 9.36, 6.4, 5.89, 11.05, 4.9, NA, NA, 10.08),
  WEIGHT = c(34.3, NA, 25.5, 18.6, 14.9, 29.5, 4.5, NA, NA, 23),
  RAWMAT = c("FLINT", "FLINT", "FLINT", "FLINT", "FLINT",
             "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS")),
  row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L),
  class = "data.frame")
1) Simple aggregation with 2 variables:
aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data, FUN = mean, na.rm = TRUE)
2) Using the dot notation - different results:
aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE)

3) Using dplyr, I get the same results as #1:
library(dplyr)
group_by(my_data, RAWMAT) %>%
   summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
4) It gets weirder: using all columns in #1 gives the same results as in #2 but different from #1 and #3:
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width, PLATWIDTH, PLATTHICK, EPA, THICKNESS, WEIGHT) ~ RAWMAT, data = my_data, FUN = mean, na.rm = TRUE)
So it seems it is not only due to the notation (cbind() vs. dot). Is it a bug? A peculiar thing in my dataset? I tend to think this could be due to some variables (or their names), as all notations seem to agree when I remove some variables (although I haven't found out which variable(s) is (are) at fault), e.g.:
my_data2 <- structure(list(
  ID = c("FLINT-1", "FLINT-10", "FLINT-100", "FLINT-101", "FLINT-102",
         "HORN-10", "HORN-100", "HORN-102", "HORN-103", "HORN-104"),
  EdgeLength = c(130.75, 168.77, 142.79, 130.1, 140.41, 121.37, 70.52, 122.3, 71.01, 104.5),
  SurfaceArea = c(1736.87, 1571.83, 1656.46, 1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2),
  Length = c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854, 32.075, 21.337, 35.459),
  Width = c(45.982, 67.303, 52.679, 26.42, 25.149, 33.427, 20.683, 62.783, 26.417, 35.297),
  RAWMAT = c("FLINT", "FLINT", "FLINT", "FLINT", "FLINT",
             "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS")),
  row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L),
  class = "data.frame")
aggregate(cbind(EdgeLength, SurfaceArea, Length, Width) ~ RAWMAT, data = my_data2, FUN = mean, na.rm = TRUE)
aggregate(. ~ RAWMAT, data = my_data2[-1], FUN = mean, na.rm = TRUE)

group_by(my_data2, RAWMAT) %>%
   summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
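
One way to find out which variables are at fault is to count the NAs per column and to check which rows are complete, since the formula method of aggregate() uses na.action = na.omit by default. This is only a minimal sketch on a small made-up data frame; `toy` and its values are hypothetical and are not taken from my_data:

```r
# Hypothetical toy data: Width is complete, WEIGHT has two NAs
toy <- data.frame(
  RAWMAT = c("FLINT", "FLINT", "FLINT", "HORNFELS", "HORNFELS", "HORNFELS"),
  Width  = c(10, 20, 60, 40, 50, 60),
  WEIGHT = c(1, NA, 3, NA, 5, 6)
)

# Number of NAs per column: columns with a non-zero count are the suspects
na_per_col <- colSums(is.na(toy))

# Rows without any NA, i.e. the rows kept by the default na.action = na.omit
kept <- complete.cases(toy)

# Width alone on the left-hand side: all six rows are used
res_one <- aggregate(Width ~ RAWMAT, data = toy, FUN = mean, na.rm = TRUE)

# Dot notation: rows 2 and 4 are dropped before mean() is ever called,
# so the Width means change although Width itself contains no NA
res_all <- aggregate(. ~ RAWMAT, data = toy, FUN = mean, na.rm = TRUE)

print(na_per_col)
print(res_one)
print(res_all)
```

With these made-up numbers, the Width means should be 30 and 50 in the first call but 35 and 55 in the second, which would mirror the difference between #1 and #2.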


Thank you in advance for any hint.
Best wishes,
Ivan




LEIBNIZ-ZENTRUM FÜR ARCHÄOLOGIE

Dr. Ivan CALANDRA
Head of IMPALA (IMaging Platform At LeizA)

*MONREPOS* Archaeological Research Centre, Schloss Monrepos
56567 Neuwied, Germany

T: +49 2631 9772 243
T: +49 6131 8885 543
ivan.calan...@leiza.de

leiza.de <http://www.leiza.de/>
ORCID <https://orcid.org/0000-0003-3816-6359>
ResearchGate <https://www.researchgate.net/profile/Ivan_Calandra>
LEIZA is a foundation under public law of the State of Rhineland-Palatinate and the City of Mainz. Its headquarters are in Mainz. Supervision is carried out by the Ministry of Science and Health of the State of Rhineland-Palatinate. LEIZA is a research museum of the Leibniz Association.
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Hello,
You can define a vector of the columns of interest and subset the data with it. Then the default na.action = na.omit will no longer remove the rows that have NA values in at least one of the other columns, and the results are the same.
However, this will not give the mean values of the other numeric columns, just of those two.
# define a vector of columns of interest
cols <- c("Length", "Width", "RAWMAT")

# 1) Simple aggregation with 2 variables, select cols:
aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data[cols], FUN = mean, na.rm = TRUE)
# 2) Using the dot notation - if cols are selected, equal results:
aggregate(. ~ RAWMAT, data = my_data[cols], FUN = mean, na.rm = TRUE)

# 3) Using dplyr, the results are now the same as #1 and #2:
library(dplyr)
my_data %>%
  select(all_of(cols)) %>%
  group_by(RAWMAT) %>%
  summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
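
If the means of all numeric columns are wanted, another possibility might be to override the default with na.action = na.pass. Then no rows are dropped up front and na.rm = TRUE deals with the NAs column by column. A minimal sketch, again on a made-up data frame (`toy` and its values are hypothetical, not the poster's data):

```r
# Hypothetical toy data: Width is complete, WEIGHT has two NAs
toy <- data.frame(
  RAWMAT = c("FLINT", "FLINT", "FLINT", "HORNFELS", "HORNFELS", "HORNFELS"),
  Width  = c(10, 20, 60, 40, 50, 60),
  WEIGHT = c(1, NA, 3, NA, 5, 6)
)

# Keep all rows; mean(na.rm = TRUE) then ignores NAs within each column
res_pass <- aggregate(. ~ RAWMAT, data = toy, FUN = mean, na.rm = TRUE,
                      na.action = na.pass)

# Cross-check in base R: group means computed one column at a time
by_col <- sapply(toy[c("Width", "WEIGHT")],
                 function(x) tapply(x, toy$RAWMAT, mean, na.rm = TRUE))

print(res_pass)
print(by_col)
```

Both results should agree column by column, which is what dplyr::summarise() with mean(.x, na.rm = TRUE) computes as well.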


Hope this helps,

Rui Barradas

