thisisnic commented on issue #50009:
URL: https://github.com/apache/arrow/issues/50009#issuecomment-4556643806

   I had Claude take a look at this, and here's its summary/suggestions. This 
is a bit beyond my C++ knowledge; @jonkeane - does this look about right to you?
   
   ---
   
   ## Analysis
   
   The R package registers a finalizer via `reg.finalizer(s3_finalizer, 
finalize_s3, onexit = TRUE)` that calls `FinalizeS3()` when R exits. This 
ultimately calls `Aws::ShutdownAPI()` in the AWS C++ SDK.
   
   The likely sequence of events:
   
   1. `read_parquet("s3://...")` at the start of the process opens HTTP 
connections to S3, which get cached in the AWS SDK's connection pool
   2. ~42 minutes of processing with no S3 activity — during this time, 
something in the network path (S3 endpoint, load balancer, NAT gateway) closes 
the idle TCP connections
   3. `write_parquet("s3://...")` at the end succeeds (the SDK opens fresh 
connections), but the old dead connections from step 1 may still be in the pool
   4. `quit()` triggers the R finalizer → `FinalizeS3()` → `Aws::ShutdownAPI()`
   5. The SDK walks through all connections in the pool during teardown and 
hits the dead ones, triggering a SIGPIPE (writing to a closed socket)
   6. The C++ `fs::FinalizeS3()` returns a non-OK status
   7. **The R binding wraps this in `StopIfNotOk()`, which calls 
`cpp11::stop()` — this does a `longjmp` out of the middle of SDK teardown**, 
leaving it half-destroyed
   8. R continues its exit process, and subsequent cleanup (GC, static 
destructors) hits the half-torn-down SDK state → segfault at address 0x30
   
   This is consistent with the reporter's finding that replacing `read_parquet` 
from S3 with a local file read eliminated the segfault — no stale connections 
in the pool means clean shutdown.
   
   ## The `StopIfNotOk` problem
   
   The core issue is in `r/src/filesystem.cpp:355`:
   
   ```cpp
   void FinalizeS3() {
   #if defined(ARROW_R_WITH_S3)
     StopIfNotOk(fs::FinalizeS3());
   #endif
   }
   ```
   
   `StopIfNotOk` converts any non-OK C++ status into an R error via 
`cpp11::stop()`, which does a `longjmp`. This is fine for normal operations, 
but during SDK shutdown it's dangerous — you can't safely jump out of the 
middle of `Aws::ShutdownAPI()`. The SDK ends up in a partially-finalized state, 
which causes the actual segfault.
   
   ## Suggested fix
   
   Replace `StopIfNotOk` with a warning so the shutdown completes even if there 
are errors:
   
   ```cpp
   void FinalizeS3() {
   #if defined(ARROW_R_WITH_S3)
     auto status = fs::FinalizeS3();
     if (!status.ok()) {
       cpp11::warning("FinalizeS3 failed: %s", status.ToString().c_str());
     }
   #endif
   }
   ```
   
   This way:
   - The SDK shutdown runs to completion rather than being interrupted 
mid-teardown
   - Users still get a visible message if something went wrong
   - Mid-session callers (e.g. via `arrow:::FinalizeS3()`) see the warning too


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to