oliviermeslin opened a new issue, #40547: URL: https://github.com/apache/arrow/issues/40547
### Describe the enhancement requested

__TL;DR: I suggest complementing the `arrow` documentation with an introduction to how `arrow` works, specifically designed for R users with a limited computer science background. I wrote such an [introduction in French](https://www.book.utilitr.org/03_fiches_thematiques/fiche_arrow) ([link to automated English translation](https://www-book-utilitr-org.translate.goog/03_fiches_thematiques/fiche_arrow?_x_tr_sl=fr&_x_tr_tl=en&_x_tr_hl=fr&_x_tr_pto=wapp)); it is intentionally written in plain and sometimes imprecise language, avoiding most technical terms that usually puzzle newcomers. I am ready to translate it into English, provided that the arrow team agrees to add it to the documentation.__

## Context

I work in a major European statistical organization. My organization decided to move away from proprietary statistical software (mostly SAS) and embrace open source alternatives (mostly R and Python). It recently decided that data should be stored as Parquet files, making `arrow/dplyr` the standard approach to data processing when working with R.

Most of my colleagues have a strong background in statistics and data processing, an intermediate level in R, and a limited background in computer science. I have noticed repeatedly that `R` users who are new to `arrow` do not use it properly, because of an imperfect understanding of how `arrow` works. For instance, they typically use `collect()` on large Parquet files (resulting in RAM saturation) because they do not understand the difference between `compute()` and `collect()`, or they write extremely long arrow/dplyr queries, resulting in session crashes.

## The arrow documentation

In my experience, this imperfect understanding comes from two causes:

- Some parts of the documentation are quite difficult for newcomers to understand because of unfamiliar technical terms.
For instance, my colleagues are often unable to understand the [following paragraph from the arrow documentation](https://arrow.apache.org/docs/r/) because they do not really know what the words interface, API or backend mean: "The arrow R package exposes an interface to the Arrow C++ library, enabling access to many of its features in R. It provides low-level access to the Arrow C++ library API and higher-level access through a [dplyr](https://dplyr.tidyverse.org/) backend and familiar R functions." As for myself, I struggled for months before understanding these notions.
- The [Apache Arrow R Cookbook](https://arrow.apache.org/cookbook/r/index.html) does not really give a general overview of how `arrow` works. For instance, the differences between `compute()` and `collect()` and between the dplyr and Acero execution engines, as well as the limitations of lazy evaluation, are not really explained, although they are essential when working on large datasets.

Disclaimer: I want to stress that the point made here is not a criticism of the Arrow documentation; I just want to point out that this documentation may not be well suited to newcomers with a limited background in computer science.

## My suggestion

I noticed that my colleagues were able to use `arrow` properly once the way `arrow` works had been explained to them in plain, non-technical terms. That's why I wrote a [long and gentle introduction to `arrow` in French](https://www.book.utilitr.org/03_fiches_thematiques/fiche_arrow) ([link to automated English translation](https://www-book-utilitr-org.translate.goog/03_fiches_thematiques/fiche_arrow?_x_tr_sl=fr&_x_tr_tl=en&_x_tr_hl=fr&_x_tr_pto=wapp)), specifically designed for R users with a limited computer science background. This introduction explains how `arrow` works in plain language, avoiding most technical terms that usually puzzle newcomers.
I explain important notions (for instance: lazy evaluation, execution engine) in intuitive and somewhat imprecise terms, so that newcomers get a first rough understanding of how things work with `arrow` before switching to the current documentation.

I now think that this introduction could be a valuable contribution to the `arrow` documentation. I am ready to translate and adapt it into English, for instance as a vignette for the `R` package, or as an overview chapter in the Apache Arrow R Cookbook. However, I am very aware that this introduction could be considered subpar, not precise enough, or even misleading. That's why I would like to know whether the arrow team is willing to consider this suggestion.

### Component(s)

Documentation, R

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org