nkemnitz opened a new issue, #774:
URL: https://github.com/apache/arrow-rs-object-store/issues/774
**Describe the bug**
`ObjectStore::get` on the GCS backend fails with `Error::Generic { store:
"GCS", source: Header { source: MissingContentLength } }` for any object stored
with `Content-Encoding: gzip`. GCS serves these via `Transfer-Encoding:
chunked` with **no `Content-Length`**, and object_store treats a missing
`Content-Length` as fatal.
It manifests in two ways, and crucially **a client cannot fully avoid it**:
- **Default reads (no `Accept-Encoding`):** GCS applies *decompressive
transcoding* — it decompresses the object server-side and streams the result
chunked, with no `Content-Length`. Every gzip object fails, at any size.
- **Even with `Accept-Encoding: gzip`** (which returns the raw stored
bytes): objects whose **stored size exceeds ~8 MiB** are *still* served chunked
with no `Content-Length` (empirically the cutover is between 9 and 10 MB), so
they fail too. Only *small* objects read with `Accept-Encoding: gzip` succeed.
So whether you receive the transcoded (uncompressed) bytes or the raw
(compressed) bytes, the response is chunked with no `Content-Length` — and
object_store rejects it before returning any data. `head()` and range reads on
gzip objects fail as well.
These are valid HTTP/1.1 responses: a chunked body is self-delimiting, and
[RFC 9112 §6.2](https://www.rfc-editor.org/rfc/rfc9112#section-6.2) states a
sender **MUST NOT** send `Content-Length` together with `Transfer-Encoding`;
[RFC 9112 §6.3](https://www.rfc-editor.org/rfc/rfc9112#section-6.3) says the
chunked framing determines the body length and overrides any `Content-Length`.
object_store is therefore requiring a header the spec forbids on exactly these
responses.
**To Reproduce**
Observe the offending response with no credentials against Google's public
demo data (the `?cb=` cache-buster forces origin past the public edge cache):
```console
$ curl -sD - -o /dev/null \
    "https://storage.googleapis.com/neuroglancer-public-data/kasthuri2011/ground_truth/6_6_30/5376-5440_6656-6720_896-960?cb=$RANDOM" \
    | grep -iE "HTTP/|content-length|transfer-encoding|x-goog-stored-content-encoding"
HTTP/2 200
transfer-encoding: chunked
x-goog-stored-content-encoding: gzip
# (no content-length)
```
object_store then fails on that same public object. This is an authenticated
read, using either a service account or ambient credentials on GCP; verified
with object_store 0.13.1 and main `de0029a`:
```rust
use object_store::gcp::GoogleCloudStorageBuilder;
use object_store::{path::Path, ObjectStore, ObjectStoreExt};

// `sa` is the path to a service account key file.
let store = GoogleCloudStorageBuilder::new()
    .with_bucket_name("neuroglancer-public-data")
    .with_service_account_path(sa) // or workload identity on GCP
    .build()?;

store
    .get(&Path::from(
        "kasthuri2011/ground_truth/6_6_30/5376-5440_6656-6720_896-960",
    ))
    .await?;
// Err: Generic { store: "GCS", source: Header { source: MissingContentLength } }
```
For a self-contained reproduction, upload a gzip-encoded object of any size to
your own bucket (one larger than 8 MiB fails even with `Accept-Encoding: gzip`):
```console
head -c 20000000 /dev/urandom | gzip | gsutil -h "Content-Encoding:gzip" cp - gs://YOUR_BUCKET/big.gz
```
> **Note:** the bundled `fake-gcs-server` does **not** emulate decompressive transcoding (it always
> returns a `Content-Length`), so it cannot reproduce this. The crate's own `MockServer` can: push a
> response with a chunked body and no `Content-Length` header.
**Expected behavior**
A full GET of a valid, self-delimiting (chunked / HTTP-2) response should
succeed by reading the body to completion rather than requiring a
`Content-Length` header that the HTTP spec forbids on chunked responses.
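To illustrate why no `Content-Length` is needed, here is a minimal std-only sketch of decoding a chunked body to completion. `decode_chunked` is a hypothetical helper written for this report, not an object_store function; in practice the HTTP client already performs this decoding before object_store sees the body, and the sketch ignores trailers.

```rust
/// Decode an HTTP/1.1 chunked body into its payload bytes.
/// Hypothetical helper for illustration; not part of object_store.
fn decode_chunked(mut input: &[u8]) -> Result<Vec<u8>, String> {
    let mut out = Vec::new();
    loop {
        // Each chunk starts with a hexadecimal size line terminated by CRLF.
        let line_end = input
            .windows(2)
            .position(|w| w == b"\r\n")
            .ok_or("missing CRLF after chunk size")?;
        let line = std::str::from_utf8(&input[..line_end]).map_err(|e| e.to_string())?;
        // Optional chunk extensions after ';' are ignored.
        let size_str = line.split(';').next().unwrap_or("").trim();
        let size = usize::from_str_radix(size_str, 16).map_err(|e| e.to_string())?;
        input = &input[line_end + 2..];
        if size == 0 {
            // Zero-sized last chunk: the body is complete (trailers ignored).
            return Ok(out);
        }
        if input.len() < size.saturating_add(2) {
            return Err("truncated chunk".to_string());
        }
        out.extend_from_slice(&input[..size]);
        if &input[size..size + 2] != b"\r\n" {
            return Err("missing CRLF after chunk data".to_string());
        }
        input = &input[size + 2..];
    }
}

fn main() {
    // Two data chunks followed by the terminating zero-sized chunk.
    let wire = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n";
    let body = decode_chunked(wire).unwrap();
    assert_eq!(body, b"hello world");
    println!("decoded {} bytes without a Content-Length header", body.len());
}
```

The end of the body is signalled by the zero-sized chunk, so the total length is only known after the last byte has been read.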
**Additional context**
- Affects gzip only; `br`/`zstd`/uncompressed objects are served identity
(with `Content-Length`) and read fine.
- Root cause: `header_meta` requires `CONTENT_LENGTH` unconditionally
([`header.rs#L144`](https://github.com/apache/arrow-rs-object-store/blob/de0029aa91f7727015fab37e623fdbe11672914b/src/client/header.rs#L144),
in [`header_meta`
L114](https://github.com/apache/arrow-rs-object-store/blob/de0029aa91f7727015fab37e623fdbe11672914b/src/client/header.rs#L114)),
and the GET path derives `ObjectMeta.size`/`range` from it before streaming
([`get.rs#L314`](https://github.com/apache/arrow-rs-object-store/blob/de0029aa91f7727015fab37e623fdbe11672914b/src/client/get.rs#L314)
→
[`#L333`](https://github.com/apache/arrow-rs-object-store/blob/de0029aa91f7727015fab37e623fdbe11672914b/src/client/get.rs#L333)).
- Wire evidence (>8 MiB gzip object, authenticated, `Accept-Encoding: gzip`):
`Transfer-Encoding: chunked`, no `Content-Length`,
`x-goog-stored-content-encoding: gzip`,
`x-goog-stored-content-length: <n>`.
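One possible shape for a fix is to make the size optional instead of mandatory. The sketch below is an illustration only: `BodySize` and `body_size` are hypothetical names, not object_store APIs. It also rests on an assumption that would need checking against GCS behavior: that `x-goog-stored-content-length` matches the delivered body only when the response `Content-Encoding` equals `x-goog-stored-content-encoding`, i.e. when no transcoding took place. For brevity, a malformed length is treated as absent.

```rust
/// How the length of a response body can be established.
/// Hypothetical type for illustration; not part of object_store.
#[derive(Debug, PartialEq)]
enum BodySize {
    /// Exact length declared by `Content-Length`.
    Exact(u64),
    /// Length taken from `x-goog-stored-content-length` (assumed valid only
    /// when the stored bytes are served without transcoding).
    FromStored(u64),
    /// No usable length: read until the framing signals the end of the body.
    ReadToEnd,
}

/// Pick a size source from already-extracted header values.
fn body_size(
    content_length: Option<&str>,
    content_encoding: Option<&str>,
    stored_encoding: Option<&str>,
    stored_length: Option<&str>,
) -> BodySize {
    // Prefer Content-Length whenever the server sent a parseable one.
    if let Some(n) = content_length.and_then(|v| v.trim().parse::<u64>().ok()) {
        return BodySize::Exact(n);
    }
    // Assumption: matching encodings mean the raw stored bytes are served.
    let served_raw = match (content_encoding, stored_encoding) {
        (Some(a), Some(b)) => a.eq_ignore_ascii_case(b),
        _ => false,
    };
    if served_raw {
        if let Some(n) = stored_length.and_then(|v| v.trim().parse::<u64>().ok()) {
            return BodySize::FromStored(n);
        }
    }
    BodySize::ReadToEnd
}

fn main() {
    // Identity-served object: Content-Length is present.
    assert_eq!(body_size(Some("42"), None, None, None), BodySize::Exact(42));
    // Raw gzip bytes served chunked: fall back to the stored length.
    assert_eq!(
        body_size(None, Some("gzip"), Some("gzip"), Some("100")),
        BodySize::FromStored(100)
    );
    // Transcoded response: the delivered size is unknown until EOF.
    assert_eq!(
        body_size(None, None, Some("gzip"), Some("100")),
        BodySize::ReadToEnd
    );
    println!("ok");
}
```

In the `ReadToEnd` case the size reported to the caller would have to be filled in after streaming, or exposed as unknown.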
<details><summary>Prior art — every other major GCS client reads to EOF
(commit-pinned)</summary>
| client | behavior | source |
|---|---|---|
| google-cloud-storage (Python) | streams via `response.iter_content()`; `x-goog-stored-content-length` only a retry heuristic | [download.py#L145](https://github.com/googleapis/python-storage/blob/ab4997ce0f7b85947e84b226bd0edf6d714a946a/google/cloud/storage/_media/requests/download.py#L145), [#L163](https://github.com/googleapis/python-storage/blob/ab4997ce0f7b85947e84b226bd0edf6d714a946a/google/cloud/storage/_media/requests/download.py#L163) |
| cloud.google.com/go/storage (Go) | `Reader.Remain()` returns `-1` for chunked / `Decompressed`; CRC skipped on transcoding | [reader.go#L436](https://github.com/googleapis/google-cloud-go/blob/a25e93d25635b8fd42985edbe0290ba9a8cf2169/storage/reader.go#L436), [#L74](https://github.com/googleapis/google-cloud-go/blob/a25e93d25635b8fd42985edbe0290ba9a8cf2169/storage/reader.go#L74), [http_client.go#L1455](https://github.com/googleapis/google-cloud-go/blob/a25e93d25635b8fd42985edbe0290ba9a8cf2169/storage/http_client.go#L1455) |
| google-cloud-cpp (Apache Arrow C++) | reads to EOF via `HasUnreadData()`; **falls back to `x-goog-stored-content-length`** for size | [object_read_source.cc#L114](https://github.com/googleapis/google-cloud-cpp/blob/149ca440cc492a66e612e2e1f1fb385136530110/google/cloud/storage/internal/rest/object_read_source.cc#L114), [#L55](https://github.com/googleapis/google-cloud-cpp/blob/149ca440cc492a66e612e2e1f1fb385136530110/google/cloud/storage/internal/rest/object_read_source.cc#L55) |
| TensorStore (C++ GCS kvstore) | accumulates body via libcurl callback to EOF; enables Accept-Encoding | [gcs_key_value_store.cc#L578](https://github.com/google/tensorstore/blob/613280f459520c7dddc9aa11a41412a0c2a6b913/tensorstore/kvstore/gcs_http/gcs_key_value_store.cc#L578) |
| gcsfs (fsspec) | reads the aiohttp stream to EOF | [core.py#L118](https://github.com/fsspec/gcsfs/blob/f707e61fa75dcb4dc6b7bad0bc2321d425336a3a/gcsfs/core.py#L118) |
| rclone (Go) | GCS backend returns `res.Body`, reads to EOF | [googlecloudstorage.go#L1385](https://github.com/rclone/rclone/blob/59c86b01bb39624650badd39f3acfd20be2b743b/backend/googlecloudstorage/googlecloudstorage.go#L1385), [issue #2658](https://github.com/rclone/rclone/issues/2658) |
| smart_open (Python) | same bug class, fixed by delegating to google-cloud-storage | [issue #422](https://github.com/piskvorky/smart_open/issues/422) |
</details>
:robot: helped writing the report.