DickJC123 opened a new issue #20738: URL: https://github.com/apache/incubator-mxnet/issues/20738
## Description

Here are two independent PRs with the failure:

- https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-20635/38/pipeline
- https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-20734/5/pipeline

The failure has been reported as an issue with the mirrors supplying oneAPI: https://community.intel.com/t5/Registration-Download-Licensing/OneAPI-apt-repository-broken/m-p/1329104

I'm a little suspicious there might be more to it, based on two observations:

1. The oneDNN lib is installed by a RUN command in Dockerfile.build.ubuntu. This creates an intermediate docker image that is pulled in from cache in the failing builds:

   ```
   [2021-11-11T23:00:22.939Z] Step 5/20 : RUN export DEBIAN_FRONTEND=noninteractive ...
   [2021-11-11T23:00:23.196Z] ---> Using cache
   [2021-11-11T23:00:23.196Z] ---> 1a09ef0af63e
   ```

   The image ID is the same as we've seen for a week or more, well before the apparent changes to the mirrors. So are we not handling cached docker images properly?

2. The actual error is in an `apt-get update` performed by a later RUN command that installs TensorRT and cuDNN. Perhaps the Intel repo used to install oneDNN in the earlier RUN command should be removed from the container in that same step, since the installation is complete by then? It's possible that the command `add-apt-repository -r "deb https://apt.repos.intel.com/oneapi all main"` would perform that action. If the Intel repo were no longer in /etc/apt/sources.list, presumably the currently failing `apt-get update` would succeed.

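
To illustrate observation 2, here is a minimal sketch of dropping the Intel repo entry once oneDNN is installed, so that later `apt-get update` calls no longer contact that mirror. The helper name `remove_apt_repo` is hypothetical and not part of the MXNet Dockerfiles. It assumes the repo was registered as a plain `deb` line in an apt sources file, and it is offered as an alternative in case `add-apt-repository -r` does not behave as hoped; it has not been tried in CI.

```shell
# Hypothetical helper (not from the MXNet repo): delete every line that
# contains the given fixed string from each apt sources file passed in.
remove_apt_repo() {
    pattern="$1"
    shift
    for list_file in "$@"; do
        # Skip paths that do not exist (e.g. an unmatched glob).
        [ -f "$list_file" ] || continue
        tmp_file="$(mktemp)"
        # grep -v exits non-zero when no lines remain; that is fine here.
        grep -vF -- "$pattern" "$list_file" > "$tmp_file" || true
        # Overwrite in place so the file keeps its owner and permissions.
        cat "$tmp_file" > "$list_file"
        rm -f "$tmp_file"
    done
}
```

Inside the RUN step that installs oneDNN, the call might look like `remove_apt_repo "https://apt.repos.intel.com/oneapi" /etc/apt/sources.list /etc/apt/sources.list.d/*.list`, placed after the install command. Doing this in the same RUN step matters because each RUN produces its own cached layer.
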
### Error Message

```
[2021-11-11T23:00:39.105Z] Err:9 https://apt.repos.intel.com/oneapi all/main all Packages
[2021-11-11T23:00:39.105Z] Hash Sum mismatch
[2021-11-11T23:00:39.105Z] Hashes of expected file:
[2021-11-11T23:00:39.105Z] - Filesize:21072 [weak]
[2021-11-11T23:00:39.105Z] - SHA512:7082767f95f6e40ad31deb8a9df205fa726ef3f4821ff6982d507f2f91adb57c282d1fbe3253f610b3e07f77a0c3c2320ed2c78b8d4b5b648928dd5c1fea271e
[2021-11-11T23:00:39.105Z] - SHA256:7e91d4ace2815407f999e88e5296f678447b9577e1f84af4addc7212c8eb32b0
[2021-11-11T23:00:39.105Z] - SHA1:53e523680f4f09015f82673434772a6ec112e8f2 [weak]
[2021-11-11T23:00:39.105Z] - MD5Sum:3f125fa13d509dd4e66fa49ae3d5af96 [weak]
[2021-11-11T23:00:39.105Z] Hashes of received file:
[2021-11-11T23:00:39.105Z] - SHA512:5af0e2266d2ef7cfd42b907c68d21b020e8e1f6c516e9fb35c7affcd52d047ffedec885f14685eaf6539edfc23c0da8e9c7035bcede483a331d9c66e5dce8c54
[2021-11-11T23:00:39.105Z] - SHA256:97bb376982553d6f5ae07c29a79fd653295caf7599cd6deb3c051c90a0290af1
[2021-11-11T23:00:39.105Z] - SHA1:9e1ac9d3f961d4e376cbc55758a334cc158a9603 [weak]
[2021-11-11T23:00:39.105Z] - MD5Sum:db23233f3ef8572c745ff537a2b2fdb8 [weak]
[2021-11-11T23:00:39.105Z] - Filesize:21072 [weak]
[2021-11-11T23:00:39.105Z] Last modification reported: Tue, 05 Oct 2021 04:38:36 +0000
```

## To Reproduce

Have not repro'd outside of CI runs.

### Steps to reproduce

## What have you tried to solve it?

I was not able to repro the failure using the recipe posted to the Intel site, i.e. it worked fine for me.

## Environment

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
