zachliu commented on PR #67144:
URL: https://github.com/apache/airflow/pull/67144#issuecomment-4635175322
> > we have remote logging on s3 but still read task logs from local (s3
> > reads add noticeable UI latency) and all our k8s pods/containers share one
> > `logs` volume on efs, so the local file is the complete, authoritative
> > history. this pr breaks that behavior
>
> 🤦 Oof. Presumably you have `delete_local_logs` set to false and now
> they're getting truncated. And in your environment downloading and appending
> the log is unnecessary overhead.

yeah. let me try overriding `upload()` in `S3RemoteLogIO` to preserve
the local log files:
```python
import os
import pathlib
import shutil

from airflow.providers.amazon.aws.log.s3_task_handler import S3RemoteLogIO


class PatchedS3RemoteLogIO(S3RemoteLogIO):
    def upload(self, path, ti):
        """Snapshot the local log to S3 without accumulating or truncating.

        We read task logs from local (S3 reads add noticeable UI latency) and
        all pods/containers share one logs volume (EFS RWX in k8s, a bind mount
        locally), so the local file is the complete, authoritative history.

        amazon provider 9.30.0 (PR #67144) truncates the local file after each
        upload to stop reschedule-mode sensors from growing the S3 object
        O(N^2); that empties the files our UI reads from. Instead we upload with
        append=False so each upload overwrites S3 with the full local file. This
        skips the per-poke S3 read (the actual OOM cause) and never accumulates,
        while leaving the local copy intact for fast UI reads. This is only safe
        because the shared volume guarantees local already holds every poke.
        """
        path = pathlib.Path(path)
        if path.is_absolute():
            local_loc = path
            remote_loc = os.path.join(self.remote_base, path.relative_to(self.base_log_folder))
        else:
            local_loc = self.base_log_folder.joinpath(path)
            remote_loc = os.path.join(self.remote_base, path)

        if local_loc.is_file():
            # Upload the whole local file as a snapshot; leave the local file untouched.
            log = local_loc.read_text()
            has_uploaded = self.write(log, remote_loc, append=False)
            if has_uploaded and self.delete_local_copy:
                shutil.rmtree(os.path.dirname(local_loc))
```
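
for anyone wondering why `append=False` matters when the local file is never truncated, here is a rough toy model of the growth described in the docstring. It is not Airflow code: `simulate_uploads` is a hypothetical helper, and it assumes each poke adds exactly one line locally and then uploads the whole local file.

```python
def simulate_uploads(pokes: int, append: bool) -> tuple[int, int]:
    """Return (local_lines, remote_lines) after `pokes` upload cycles.

    Toy model: every poke adds one line to the local file, which is never
    truncated, and then uploads the whole local file.
    """
    local: list[str] = []
    remote: list[str] = []
    for i in range(pokes):
        local.append(f"poke {i}")
        if append:
            # Append mode: the full local file is added to what is already remote.
            remote = remote + local
        else:
            # Overwrite mode: remote becomes a snapshot of the local file.
            remote = list(local)
    return len(local), len(remote)


print(simulate_uploads(100, append=True))   # (100, 5050)
print(simulate_uploads(100, append=False))  # (100, 100)
```

With appending, the remote object holds 1 + 2 + ... + N = N(N+1)/2 lines after N pokes, which is the O(N^2) growth; with overwriting it always matches the local file.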