potiuk commented on a change in pull request #4936: [AIRFLOW-4115] Multi-staging Aiflow Docker image [Step 1/3] URL: https://github.com/apache/airflow/pull/4936#discussion_r268880220
########## File path: .dockerignore ########## @@ -0,0 +1,107 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# NOTE! This docker ignore uses recommended technique +# Where everything is excluded by default and you deliberately +# Add only those directories/files you need. This is very useful +# To make sure that Docker context is always the same on any machine +# So that generated files are not accidentally added to the context +# This allows Docker's `COPY .` to behave in predictable way + +# Ignore everything +** Review comment: Correct with the duplicates. I reviewed and removed it. There is little worry that we will forget something - the tests will catch it (and it should rarely be needed - adding a new folder or file at the top level should be really rare event) And yes - you need those extra specs because they will exclude what should not be part of those included dirs above. The pattern I follow is (and I saw it suggested by several people): 1) ** -> Exclude everything 2) !/airflow -> include only those top-level folders/top-level files that you need 3) **/__pycache__/ -> from those folders that you included in 2) exclude all the generated content that follows common patterns for generated stuff (like *.pyc file or /__pycache__/ folders) I feel very strong about it - having to debug the context changes (and figuring out what the hell changed to trigger copying of the files again) for many hours. Example: only after a while I realised that there is a generated folder in docs which is _build next to _template (which is not generated, thought the name kind of suggest it). And just generating the docs locally triggered rebuild of Dockerfile (that's why I did not include the whole docs folder :) ). Otherwise I would have to specifically exclude that particular folder (if I knew that I have to do it). I really think adding stuff deliberately to Docker context makes much more sense. Alternative is to move ALL generated content to specific directory, and exclude that directory - but this is not always possible (node_modules) and from my experience people will not "stick" to it - they will start generating random files in different folders. Of course we are not protected if someone starts generating "xyz" file/folder inside "airflow" folder. But it's rather unlikely someone will do something that crazy. I think generating something in sources is rather bad practice and you do it only if you cannot do it otherwise (that's one of the parts I don't like about python - .pyc files are generated next to actual source files. It reminds me of times when I used RCS versioning system and we had source.c,v next to every file.... BRRR) But to be honest - It was much bigger problem and I eventually introduced checksums for important files (setup/docker) to check if we need to run docker build at all. But still I think it makes it much more resilient for future changes that might produce uncontrollable and unpredictable context "growth". ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services