[ https://issues.apache.org/jira/browse/ARROW-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626980#comment-17626980 ]
Danielle Navarro edited comment on ARROW-18148 at 11/1/22 6:54 AM:
-------------------------------------------------------------------

Tentatively offering some thoughts :-)

If I'm understanding this properly, we have two problems:

- The first problem is that the history of serializing Arrow objects is messy and has left us with three names that people might recognize: Feather, IPC, Arrow. We'd like users to transition to using "Arrow" as the preferred name, and to give them an API that reflects that terminology.
- The second problem is that we use "file format" and "stream format" to mean something subtly different from "files" and "streams". The file format wraps the stream format with magic numbers at the start and end, with a footer written after the stream. These two formats aren't *inherently* tied to files and streams. The user can write a "stream formatted" file if they want (i.e., no magic numbers, no footers) and they can also send a "file formatted" serialization (i.e., with the magic number and footer) to an output stream if they want to. The current API allows this, but users would be forgiven for missing this subtle detail!

h2. Option 1: Don't change the API, only the docs

This option would leave `read_ipc_file()`, `write_ipc_file()`, `read_ipc_stream()`, and `write_ipc_stream()` as the four user-facing functions (treating `read_feather()` and `write_feather()` as soft-deprecated, and leaving `write_to_raw()` untouched).

The only thing that would change in this version is that we would consistently refer to "Arrow IPC file" and "Arrow IPC stream" everywhere (i.e., never truncating it to "IPC"). Language around "feather" would be relegated to a secondary position (e.g., "formerly known as Feather"), and we would emphasize that the preferred file extension is `.arrow`.

h2. Option 2: New names for the existing four functions

This option would replace `read_ipc_file()` with `read_arrow_file()`, `read_ipc_stream()` with `read_arrow_stream()`, and so on.
The `ipc` and `feather` versions would be soft-deprecated. The documentation would be updated accordingly. We'd now refer to "Arrow file" and "Arrow stream" everywhere. As with option 1, we'd use language like "formerly known as Feather" to explain the history (perhaps linking back to the old repo just to highlight the origin). We would also, where relevant, note that "Arrow stream" is a conventional name for the "Arrow inter-process communication (IPC) streaming format", as a way of (a) explaining the ipc versions of the functions, and (b) helping users find the relevant part of the Arrow specification.

h2. Option 3: Reduce API to two functions

This option would have only two functions, `read_arrow()` and `write_arrow()`. Both functions would have a new argument called `format` (or something similar). Users could specify either `format = "stream"` or `format = "file"`.

From a documentation perspective this would require a little more finessing: we might have to separate the help topics for the new API from those for the older versions of the API to avoid a mess. But it might have the advantage of making it clearer to users that the terms `"stream"` and `"file"` don't actually refer to *where* you're writing the data, but to how you *encode* the data when you write it.

h2. Preferences?

I am not sure what I prefer, but I can at least say what I think the strengths and weaknesses are for each proposal:

Option 3 seems like the cleanest in terms of making the Arrow/Feather/IPC functions feel analogous to the other functions in the read/write API: `read_arrow()` and `write_arrow()` feel closely aligned with `read_parquet()` and `write_parquet()`. It makes very clear that these functions are designed to read and write Arrow objects in an "Arrow-like" way.
However, it does have the disadvantage that the encoding vs destination complexity gets pushed into the arguments: users will need to understand why there is a `format` argument that is distinct from the `file`/`sink` argument, and the documentation will need to explain that.

Option 2 has the advantage of preserving the same "four-function structure" as the existing serialization API, but it does come at the expense of being a little misleading to anyone who doesn't understand that the function names refer to the encoding, not the destination: `write_arrow_stream()` can in fact write to a file, and `write_arrow_file()` can write to a stream. That's potentially even more confusing.

Option 1 has the advantage of not confusing existing users. The API doesn't change, and the documentation becomes slightly more informative. The disadvantage is that it leaves new users a bit confused about what the heck an "IPC" is, which means the documentation will have to carry the load.

h2. Additional documentation thoughts

Regardless of what option we go with, I'll write the user-facing vignettes to use only the newest version of the API, especially in the `arrow.Rmd` vignette and the `read_write.Rmd` vignette where new users are most likely to run across these concepts. In those places I would try my best not to dive into too much detail, because it's a complexity that new users don't need.

The question that arises is "where do we talk about the nuance?" To some extent I think we could move some of that to the "details" section of various help topics, but... (and I hate saying this)... it might make sense to write an "Arrow serialization" vignette that would be loosely analogous to the "Data object layout" vignette that I'm proposing to introduce in https://github.com/apache/arrow/pull/14514.
On the documentation page it would be grouped in with the developer vignettes (to signal that it's advanced content), but just like I'm doing with "Data object layout", I'll cross reference it from the user-facing vignettes. For instance, in the section on reading and writing arrow (formerly feather) files, there would be a short paragraph that hints at these issues, and then links the user to the serialization vignette where all the detail is unpacked.

> [R] Rename read_ipc_file to read_arrow_file & highlight arrow over feather
> --------------------------------------------------------------------------
>
> Key: ARROW-18148
> URL: https://issues.apache.org/jira/browse/ARROW-18148
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Documentation, R
> Reporter: Stephanie Hazlitt
> Priority: Minor
> Labels: feather
>
> Following up from [this mailing list
> conversation|https://lists.apache.org/thread/nxncph842h8tyovxp04hrzq4y35lq4xq],
> I am wondering if the R package should rename `read_ipc_file()` /
> `write_ipc_file()` to `read_arrow_file()` / `write_arrow_file()`, or add an
> additional alias for both. It might also be helpful to update the
> documentation so that users read "Write an Arrow file (formerly known as a
> Feather file)" rather than the current Feather-named first approach, assuming
> there is a community decision to coalesce around the name Arrow for the file
> format, and the project is moving on from the name Feather.

--
This message was sent by Atlassian Jira (v8.20.10#820010)