Thanks! I'll try to test it, but the failure only happens sporadically, so I'm not sure I'll be able to reproduce and verify it in a test cluster.
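A minimal sketch of the 410 check discussed in the quoted thread below, lightly adjusted so it runs standalone. Two things differ from the quoted helper, and both are assumptions rather than anything confirmed by wal-e or the Google library docs: the quoted version only assigns `gcs` when the import fails, so here the exceptions module is bound to a name (`gcs_exceptions`) on success as well; and the `base` parameter is a hypothetical addition so the check can be exercised without the Google library installed.

```python
try:
    # The traceback places GoogleAPICallError in google.api_core.exceptions.
    from google.api_core import exceptions as gcs_exceptions
except ImportError:
    # Google client library not installed; the check then always reports False.
    gcs_exceptions = None


def is_gcs_response_error(typ, value, base=None):
    """Return True when (typ, value) is a Google API call error with code 410.

    `base` is a hypothetical override for the exception base class, used only
    so the function can be tested without the Google library.
    """
    if base is None:
        if gcs_exceptions is None:
            return False
        base = gcs_exceptions.GoogleAPICallError

    # Guard against non-class values before calling issubclass().
    if not (isinstance(typ, type) and issubclass(typ, base)):
        return False

    # The HTTP status is read from the exception instance's `code` attribute.
    return getattr(value, "code", None) == 410
```

The function is meant to be called with the `(typ, value)` pair from `sys.exc_info()`. The re-raised copy in the log prints `None 410 PUT`, which suggests the copy built by `typ(value)` no longer carries a `code`, so the check should probably look at the original `value` rather than the re-raised one.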
On Fri, Jun 21, 2019 at 2:56 PM Samuel Kohonen <[email protected]> wrote:

> Looks OK on a quick look. I tried to verify that it is the exception that
> bubbles up at the point in code I marked in my first email. If I didn't
> completely misread the traceback, that should be the correct exception to catch.
>
> Hope you can test this in a test cluster before going for production. I
> guess the worst case for this change is that WAL uploads would start
> failing if there is a typo or something somewhere.
>
> On Fri, Jun 21, 2019 at 2:29 PM 'Yun Guo' via wal-e <
> [email protected]> wrote:
>
>> Thanks Samuel!
>> Can you help me review whether the function below can be used to check for the 410
>> exception?
>>
>> try:
>>     import google.cloud.exceptions
>> except ImportError:
>>     gcs = None
>>
>> def is_gcs_response_error(typ, value):
>>     if gcs is None:
>>         return False
>>
>>     if not issubclass(typ, google.api_core.exceptions.GoogleAPICallError):
>>         return False
>>
>>     if value.code == 410:
>>         return True
>>
>>     return False
>>
>>
>> On Fri, Jun 21, 2019 at 11:27 AM Samuel Kohonen <[email protected]>
>> wrote:
>>
>>> Hey,
>>>
>>> Seems like for some reason the resumable upload session that the google
>>> python library uses for large files disappeared. No idea how common or why
>>> that would happen, but unfortunately the google library doesn't seem to
>>> retry those errors itself anymore now that we stopped using the
>>> deprecated num_retries parameter directly.
>>>
>>> Are you open to hacking your wal-e installation a bit to see if just
>>> checking for the GoogleAPICallError (and maybe specifically 410) and
>>> retrying would fix this? We can think about cleaner solutions
>>> afterwards. Checking for the exception somewhere in this if-branch (
>>> https://github.com/wal-e/wal-e/blob/master/wal_e/worker/upload.py#L119)
>>> and making sure it doesn't get to the else block and raised should force
>>> wal-e to retry the upload.
>>> Is this something you could try adding to your
>>> local installation and see if it fixes the situation for you?
>>>
>>> Cheers,
>>> Samuel
>>>
>>> On Fri, Jun 21, 2019 at 9:43 AM 'Yun Guo' via wal-e <
>>> [email protected]> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> We are using wal-e v1.1 to back up to GCS. The total backup is around 3.2T.
>>>> We noticed the wal-e process fails sporadically with HTTP 410; the log
>>>> is below.
>>>>
>>>> Jun 21 02:30:57 wal_e.worker.upload INFO MSG: beginning volume
>>>> compression#012 DETAIL: Building volume 1142.#012
>>>> STRUCTURED: time=2019-06-21T02:30:57.666929-00 pid=37373
>>>> Jun 21 02:30:58 wal_e.worker.upload INFO MSG: beginning volume compression#012
>>>> DETAIL: Building volume 1143.#012 STRUCTURED:
>>>> time=2019-06-21T02:30:58.958880-00 pid=37373
>>>> Jun 21 02:31:13 wal_e.worker.upload INFO MSG: beginning volume compression#012
>>>> DETAIL: Building volume 1144.#012 STRUCTURED:
>>>> time=2019-06-21T02:31:13.820819-00 pid=37373
>>>> Jun 21 02:31:14 wal_e.operator.backup WARNING MSG: blocking on sending WAL segments#012
>>>> DETAIL: The backup was not completed successfully, but we have to
>>>> wait anyway.
>>>> See README: TODO about pg_cancel_backup#012
>>>> STRUCTURED: time=2019-06-21T02:31:14.716392-00 pid=37373
>>>> Jun 21 02:31:17 wal_e.main CRITICAL MSG: An unprocessed exception has avoided all error
>>>> handling#012 DETAIL: Traceback (most recent call last):#012
>>>> File
>>>> "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py",
>>>> line 1041, in upload_from_file#012 size, num_retries,
>>>> predefined_acl)#012 File
>>>> "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py",
>>>> line 957, in _do_upload#012 num_retries, predefined_acl)#012
>>>> File
>>>> "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py",
>>>> line 904, in _do_resumable_upload#012 response =
>>>> upload.transmit_next_chunk(transport)#012 File
>>>> "/opt/wal-e/lib/python3.5/site-packages/google/resumable_media/requests/upload.py",
>>>> line 396, in transmit_next_chunk#012
>>>> self._process_response(result, len(payload))#012 File
>>>> "/opt/wal-e/lib/python3.5/site-packages/google/resumable_media/_upload.py",
>>>> line 574, in _process_response#012 self._get_status_code,
>>>> callback=self._make_invalid)#012 File
>>>> "/opt/wal-e/lib/python3.5/site-packages/google/resumable_media/_helpers.py",
>>>> line 93, in require_status_code#012 status_code, u'Expected
>>>> one of', *status_codes)#012
>>>> google.resumable_media.common.InvalidResponse: ('Request failed with
>>>> status code', 410, 'Expected one of', <HTTPStatus.OK: 200>, 308)#012
>>>> #012 During handling of the above exception, another exception
>>>> occurred:#012 #012 Traceback (most recent call last):#012
>>>> File "/opt/wal-e/lib/python3.5/site-packages/wal_e/retries.py", line
>>>> 87, in shim#012 return f(*args, **kwargs)#012 File
>>>> "/opt/wal-e/lib/python3.5/site-packages/wal_e/worker/upload.py", line 140,
>>>> in put_file_helper#012 return
>>>> self.blobstore.uri_put_file(self.creds, url, tf)#012 File
>>>> "/opt/wal-e/lib/python3.5/site-packages/wal_e/blobstore/gs/utils.py", line
>>>> 38, in uri_put_file#012 blob.upload_from_file(fp, size=size,
>>>> content_type=content_type)#012 File
>>>> "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py",
>>>> line 1044, in upload_from_file#012
>>>> _raise_from_invalid_response(exc)#012 File
>>>> "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py",
>>>> line 1914, in _raise_from_invalid_response#012
>>>> response.status_code, message, response=response)#012
>>>> google.api_core.exceptions.GoogleAPICallError: 410 PUT
>>>> https://www.googleapis.com/upload/storage/v1/b/gitlab-gprd-postgres-backup/o?uploadType=resumable&upload_id=AEnB2UrKU4zHPqzF4fGPeEvhoxJ-2qeIK5xY9SI8O1NIhtOaDn1GC7Q_D4XQVFFXvMVVzuhCLJvUmzTkkKui6M8mpb3BedH15g:
>>>> ('Request failed with status code', 410, 'Expected one of',
>>>> <HTTPStatus.OK: 200>, 308)#012 #012 During handling of the
>>>> above exception, another exception occurred:#012 #012
>>>> Traceback (most recent call last):#012 File
>>>> "/opt/wal-e/lib/python3.5/site-packages/wal_e/cmd.py", line 652, in
>>>> main#012 pool_size=args.pool_size)#012 File
>>>> "/opt/wal-e/lib/python3.5/site-packages/wal_e/operator/backup.py", line
>>>> 197, in database_backup#012 **kwargs)#012 File
>>>> "/opt/wal-e/lib/python3.5/site-packages/wal_e/operator/backup.py", line
>>>> 500, in _upload_pg_cluster_dir#012 pool.put(tpart)#012
>>>> File "/opt/wal-e/lib/python3.5/site-packages/wal_e/worker/upload_pool.py",
>>>> line 108, in put#012 self._wait()#012 File
>>>> "/opt/wal-e/lib/python3.5/site-packages/wal_e/worker/upload_pool.py", line
>>>> 65, in _wait#012 raise val#012 File
>>>> "src/gevent/greenlet.py", line 716, in gevent._greenlet.Greenlet.run#012
>>>> File
>>>> "/opt/wal-e/lib/python3.5/site-packages/wal_e/worker/upload.py", line 145,
>>>> in __call__#012 k = put_file_helper()#012 File
>>>> "/opt/wal-e/lib/python3.5/site-packages/wal_e/retries.py", line 101, in
>>>> shim#012 exc_processor_cxt=exc_processor_cxt)#012 File
>>>> "/opt/wal-e/lib/python3.5/site-packages/wal_e/retries.py", line 139, in
>>>> retry_with_count_internal#012 side_effect_func(exc_tup,
>>>> exc_processor_cxt)#012 File
>>>> "/opt/wal-e/lib/python3.5/site-packages/wal_e/worker/upload.py", line 135,
>>>> in log_volume_failures_on_error#012 raise
>>>> typ(value).with_traceback(tb)#012 File
>>>> "/opt/wal-e/lib/python3.5/site-packages/wal_e/retries.py", line 87, in
>>>> shim#012 return f(*args, **kwargs)#012 File
>>>> "/opt/wal-e/lib/python3.5/site-packages/wal_e/worker/upload.py", line 140,
>>>> in put_file_helper#012 return
>>>> self.blobstore.uri_put_file(self.creds, url, tf)#012 File
>>>> "/opt/wal-e/lib/python3.5/site-packages/wal_e/blobstore/gs/utils.py", line
>>>> 38, in uri_put_file#012 blob.upload_from_file(fp, size=size,
>>>> content_type=content_type)#012 File
>>>> "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py",
>>>> line 1044, in upload_from_file#012
>>>> _raise_from_invalid_response(exc)#012 File
>>>> "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py",
>>>> line 1914, in _raise_from_invalid_response#012
>>>> response.status_code, message, response=response)#012
>>>> google.api_core.exceptions.GoogleAPICallError: None 410 PUT
>>>> https://www.googleapis.com/upload/storage/v1/b/gitlab-gprd-postgres-backup/o?uploadType=resumable&upload_id=AEnB2UrKU4zHPqzF4fGPeEvhoxJ-2qeIK5xY9SI8O1NIhtOaDn1GC7Q_D4XQVFFXvMVVzuhCLJvUmzTkkKui6M8mpb3BedH15g:
>>>> ('Request failed with status code', 410, 'Expected one of',
>>>> <HTTPStatus.OK: 200>, 308)#012 #012 STRUCTURED:
>>>> time=2019-06-21T02:31:17.960909-00 pid=37373
>>>>
>>>>
>>>> Any idea what we can do to fix it?
>>>>
>>>> Thanks
>>>>
>>>>
>>>> --
>>>>
>>>> Yun Guo
>>>> Senior Database Engineer | GitLab
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "wal-e" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/wal-e/CAJsFAOz8wwdwGRgV9u4CFnrdQ0QKYMcArpVLOQ%3D%3DvVynT5q-Pw%40mail.gmail.com
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
>> --
>>
>> Yun Guo
>> Senior Database Engineer | GitLab
>>
>
--
Yun Guo
Senior Database Engineer | GitLab
