potiuk commented on issue #27476:
URL: https://github.com/apache/airflow/issues/27476#issuecomment-1310191484

   > @potiuk, the biggest gap with git-sync is that we effectively have no way 
to poll less frequently. This isn't a concern at small scale, but there is a 
point with many instances using a monorepo where it becomes problematic.
   
   Hmm. Unless you mean something else, this is entirely possible. You could 
even set one period for workers and a different one for the scheduler, and 
disable periodic syncing completely for K8S pods (they only need the init 
container to run the sync once when they start): 
https://github.com/kubernetes/git-sync
   
   ```
       --period <duration>, $GIT_SYNC_PERIOD
               How long to wait between sync attempts.  This must be at least
               10ms.  This flag obsoletes --wait, but if --wait is specified, it
               will take precedence.  If not specified, this defaults to 10
               seconds ("10s").
   ```
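
   To make that concrete, here is a minimal sketch of per-component settings. 
`GIT_SYNC_PERIOD` comes from the help text above; `GIT_SYNC_REPO` and 
`GIT_SYNC_ONE_TIME` are what I'd expect from the git-sync README, so check them 
against the version you run. The helper name, the repo URL and the periods are 
made up for illustration.

```python
# Hypothetical helper: builds environment variables for one git-sync container.
# GIT_SYNC_PERIOD is taken from the help text above. GIT_SYNC_REPO and
# GIT_SYNC_ONE_TIME are assumptions to verify against your git-sync version.

def git_sync_env(repo, period=None, one_time=False):
    """Return the env vars for a single git-sync container."""
    env = {"GIT_SYNC_REPO": repo}
    if one_time:
        # Sync once and exit: enough for the init container of a task pod.
        env["GIT_SYNC_ONE_TIME"] = "true"
    elif period is not None:
        # Go-style duration string, e.g. "10s" or "5m".
        env["GIT_SYNC_PERIOD"] = period
    return env


REPO = "https://git.example.com/dags.git"  # placeholder URL

# Illustrative values only: each component gets its own polling period.
COMPONENT_ENV = {
    "scheduler": git_sync_env(REPO, period="60s"),
    "worker": git_sync_env(REPO, period="5m"),
    "k8s_task_pod": git_sync_env(REPO, one_time=True),
}

if __name__ == "__main__":
    for component, env in COMPONENT_ENV.items():
        print(component, env)
```

   The point is only that the period is a per-container setting, so nothing 
forces every component to poll at the same rate.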
   
   And let's not forget that schedulers, and even workers, running on 
persistent volumes poll far more frequently. When no files have changed, a 
git-sync poll is just a single HTTP call that checks the latest commit hash of 
the branch; if the files were not updated, it simply returns "no change". 
Polling of the persistence server by the scheduler and workers is far worse, 
because both the scheduler (continuously) and the workers (every time a new 
task is picked up) scan the remote directory listing of the persistent volume. 
When the volume is NFS, which is the vast majority of persistent volume 
implementations, the scheduler is basically polling and flooding the NFS 
server non-stop, sometimes with hundreds or even thousands of roundtrips per 
second (see my article). Git-sync, on the other hand (unless there are changes 
coming), makes pretty much one very small HTTP request every 10 seconds by 
default, and you can make the period longer.
   
   And even when there are changes, Git does a fantastic job of compressing 
and sending only the incremental changes. This is in stark contrast to 
persistent volumes. They are far less optimised and, in the vast majority of 
implementations, if a single line changes in your DAG file, the whole DAG file 
has to be sent. It is very different when you use Git's sync protocol. I'd 
argue that because of those optimisations, and because git-sync has a super 
efficient way of checking whether "anything" changed, polling with persistency 
is a few orders of magnitude worse than polling Git.
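
   The cheap "did anything change?" check can be sketched roughly like this. 
This is not git-sync's actual code, just an illustration of the idea using 
`git ls-remote`, which asks the server for the commit a ref points to without 
transferring any objects. The function names are mine.

```python
import subprocess


def parse_ls_remote(output):
    """Map ref name -> commit hash from `git ls-remote` output."""
    refs = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        # Each line is "<commit hash><TAB><ref name>".
        commit, ref = line.split(None, 1)
        refs[ref.strip()] = commit
    return refs


def remote_head(repo_url, branch):
    """Ask the server which commit the branch points to (no objects fetched)."""
    ref = f"refs/heads/{branch}"
    result = subprocess.run(
        ["git", "ls-remote", repo_url, ref],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_ls_remote(result.stdout).get(ref)


def needs_fetch(local_commit, remote_commit):
    """Only transfer data when the branch head actually moved."""
    return remote_commit is not None and remote_commit != local_commit
```

   When the hashes match, the poll ends there; a fetch (and the incremental 
transfer) only happens when `needs_fetch` says so.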
   
   Question @jedcunningham: have you ever had a case, or seen a single user 
complaining, about the "efficiency" of their Git server? I have not seen any. 
But I have seen many, many, many users complaining that EFS (the NFS-based 
persistency offered by Amazon) was unstable and inefficient for them, and that 
they had to pay a lot for IOPS to make it stable. This is because the Airflow 
use case is "low number of DAG changes, huge amount of polling of DAG volumes 
even if no DAG changed". This makes Git a far superior solution. It's not 
"generally" better, but IMHO it's always better for Airflow.
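
   As a back-of-the-envelope check of the "orders of magnitude" claim, here 
is a tiny calculation. The inputs are the rough figures from this comment (one 
git-sync request per 10-second period, hundreds to thousands of NFS roundtrips 
per second), not measurements, and the function name is made up.

```python
import math


def polling_ratio(nfs_roundtrips_per_second, git_sync_period_seconds):
    """How many NFS roundtrips happen per single git-sync poll."""
    return nfs_roundtrips_per_second * git_sync_period_seconds


if __name__ == "__main__":
    period = 10  # default --period from the help text, in seconds
    for nfs_rate in (100, 1000):  # illustrative "hundreds" and "thousands"
        ratio = polling_ratio(nfs_rate, period)
        orders = math.log10(ratio)
        print(f"{nfs_rate} roundtrips/s vs 1 poll per {period}s: "
              f"{ratio}x, about {orders:.0f} orders of magnitude")
```

   Lengthening the git-sync period widens the gap further, while the NFS side 
stays the same.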
   
   > 
   > There are probably better solutions to that problem, but polling less 
frequently when you aren't on 1 LocalExecutor is asking for heartache 
eventually. And "polling less frequently" is a natural knob to reach for, 
unfortunately, if you hit that case or think you might, try and get ahead of 
the game, etc.
   
    
   > Relying on (presumably) syncing from github also introduces risk in your 
production system, persistence is at least "local" to some degree. I'm 
generally pro gitsync without persistence, but there are definitely scenarios 
where it isn't enough.
   
   This can be very easily solved by polling a local Git server. Git has a 
fantastic, super-efficient synchronisation protocol, and if users are keeping 
their files in GitHub anyway, it's extremely simple to have a local mirror. 
Just as with persistent volumes, all major cloud providers let you set up not 
only "persistency" but also a "local GitHub mirror" with a single click, 
simply because this is all built into the Git protocol. If you need 
independence from GitHub, you just set up a mirror and use that mirror as your 
Git repo source. Case solved. Works.
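
   If you run the mirror yourself rather than clicking it together in a cloud 
console, a minimal sketch could look like the one below. It only uses plain 
Git commands (`git clone --mirror` and `git remote update --prune`); the 
helper names, URL and path are placeholders, and how you schedule the refresh 
and serve the mirror is up to you.

```python
import subprocess


def mirror_clone_cmd(upstream_url, mirror_dir):
    """One-off: create a bare mirror containing every ref of the upstream repo."""
    return ["git", "clone", "--mirror", upstream_url, mirror_dir]


def mirror_update_cmd(mirror_dir):
    """Periodic: fetch new commits and prune refs that were deleted upstream."""
    return ["git", f"--git-dir={mirror_dir}", "remote", "update", "--prune"]


def run(cmd):
    """Execute a command and fail loudly if Git reports an error."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    upstream = "https://github.com/example-org/dags.git"  # placeholder
    mirror = "/srv/git/dags.git"  # placeholder
    # Printed rather than executed, so the sketch has no side effects.
    print(" ".join(mirror_clone_cmd(upstream, mirror)))
    print(" ".join(mirror_update_cmd(mirror)))
```

   Airflow's git-sync containers then point at the mirror instead of GitHub, 
so a GitHub outage only delays new DAG changes.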
   
   
   
   Setting up a mirror is also an easy way to distribute the "polling" 
workload if you actually hit the limits of your Git server. And it is much 
better and more controllable than shared persistent volumes. With git-sync you 
can set up mirrors and use them if you do not want to "overload" the main 
repo. With EFS, for example, the only knob to turn when you start hitting the 
limits is to start paying A LOT for IOPS provisioning. Many of our users 
complained that they could not get stability without paying a lot for 
pre-provisioned IOPS once they reached a certain size. It was a completely 
unexpected and huge expense that they had not foreseen, but because they were 
already invested in persistency, they were simply "forced" to pay it. With a 
pure git-sync approach you have many more knobs to turn, and they are super 
easy to use, both on premises and in the cloud.
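
   One possible way to spread the polling over several mirrors is to assign 
each Airflow instance to a mirror deterministically. This is just a sketch of 
one option; the function name, instance names and mirror URLs are invented.

```python
import hashlib


def pick_mirror(instance_name, mirrors):
    """Deterministically assign an instance to one of the mirrors."""
    if not mirrors:
        raise ValueError("at least one mirror is required")
    # A stable hash keeps the assignment the same across restarts.
    digest = hashlib.sha256(instance_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(mirrors)
    return mirrors[index]


if __name__ == "__main__":
    mirrors = [
        "https://git-mirror-1.example.com/dags.git",  # placeholder
        "https://git-mirror-2.example.com/dags.git",  # placeholder
    ]
    for name in ("airflow-team-a", "airflow-team-b", "airflow-team-c"):
        print(name, "->", pick_mirror(name, mirrors))
```

   Adding another mirror to the list is then the "knob": no instance needs 
more than a changed repo URL.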
   
   I am still not convinced at all of any benefit of Persistency + Git Sync. 
   
   * Just persistency - for some cases yes.
   * But when you already use Git to store your files and you want to use 
git-sync, adding persistency IMHO only makes things worse, never better.
   

