[I] Add a non-technical introduction to the functioning of arrow [arrow]

via GitHub Thu, 14 Mar 2024 03:06:30 -0700


oliviermeslin opened a new issue, #40547:
URL: https://github.com/apache/arrow/issues/40547


   ### Describe the enhancement requested
   
   __TL;DR: I suggest complementing the `arrow` documentation with an 
introduction to the functioning of `arrow`, specifically designed for R users 
with limited computer science background. I wrote such an [introduction in 
French](https://www.book.utilitr.org/03_fiches_thematiques/fiche_arrow) ([link 
to automated English 
translation](https://www-book-utilitr-org.translate.goog/03_fiches_thematiques/fiche_arrow?_x_tr_sl=fr&_x_tr_tl=en&_x_tr_hl=fr&_x_tr_pto=wapp));
 it's intentionally written in plain and sometimes imprecise language, avoiding 
most technical terms that usually puzzle newcomers. I'm ready to translate it 
into English, provided that the arrow team agrees to add it to the 
documentation.__
   
   ## Context
   
   I work in a major European statistical organization, which decided to move 
away from proprietary statistical software (mostly SAS) and embrace open 
source alternatives (mostly R and Python). It recently decided that data 
should be stored as Parquet files, making `arrow/dplyr` the standard approach 
to data processing when working with R.
   
   Most of my colleagues have a strong background in statistics and data 
processing, an intermediate level in R, and a limited background in computer 
science.
   
   I noticed repeatedly that `R` users who are new to `arrow` do not use 
`arrow` properly, because of an imperfect understanding of the way `arrow` 
works. For instance, they typically use `collect()` on large Parquet files 
(resulting in RAM saturation), because they do not understand the difference 
between `compute()` and `collect()`, or they write extremely long arrow/dplyr 
queries, resulting in session crashes.
   
   ## The arrow documentation
   
   In my experience, this imperfect understanding has two causes:
   
   - Some parts of the documentation are quite difficult to understand for 
newcomers because of unfamiliar technical terms. For instance, my colleagues are 
often unable to understand the [following paragraph from the arrow 
documentation](https://arrow.apache.org/docs/r/) because they do not really 
know what the words interface, API or backend mean: "The arrow R package 
exposes an interface to the Arrow C++ library, enabling access to many of its 
features in R. It provides low-level access to the Arrow C++ library API and 
higher-level access through a [dplyr](https://dplyr.tidyverse.org/) backend and 
familiar R functions." As for myself, I struggled for months before 
understanding these notions.
   
   - The [Apache Arrow R 
Cookbook](https://arrow.apache.org/cookbook/r/index.html) does not really give 
a general overview of the functioning of `arrow`. For instance, neither the 
differences between `compute()` and `collect()` and between the dplyr and Acero 
execution engines, nor the limitations of lazy evaluation are really explained, 
although they are essential when working on large datasets.
   
   Disclaimer: I want to stress that the point made here is not a criticism of 
the Arrow documentation; I just want to point out that this documentation may 
not be well suited for newcomers with a limited background in computer science.
   
   ## My suggestion
   
   I noticed that my colleagues were able to use `arrow` properly once the 
functioning of `arrow` was explained to them in plain, non-technical terms. That's 
why I wrote a [long and gentle introduction to `arrow` in 
French](https://www.book.utilitr.org/03_fiches_thematiques/fiche_arrow) ([link 
to automated English 
translation](https://www-book-utilitr-org.translate.goog/03_fiches_thematiques/fiche_arrow?_x_tr_sl=fr&_x_tr_tl=en&_x_tr_hl=fr&_x_tr_pto=wapp)),
 specifically designed for R users with limited computer science background. 
This introduction explains the functioning of `arrow` in plain language, 
avoiding most technical terms that usually puzzle newcomers. I explain 
important notions (for instance: lazy evaluation, execution engine) in 
intuitive and somewhat imprecise terms, so that newcomers get a first rough 
understanding of how things work with `arrow`, before switching to the current 
documentation.
   
   I now think that this introduction could be a valuable contribution to the 
`arrow` documentation. I'm ready to translate and adapt it into English, for 
instance as a vignette for the `R` package, or as an overview chapter in the 
Apache Arrow R Cookbook. However, I'm very aware that this introduction could 
be considered subpar, not precise enough or even misleading. That's why I would 
like to know whether the arrow team is willing to consider this suggestion.
   
   
   ### Component(s)
   
   Documentation, R


