bolkedebruin commented on PR #34729:
URL: https://github.com/apache/airflow/pull/34729#issuecomment-1754626593

   @jens-scheffler-bosch AFS works on a fundamentally different layer than Datasets do. Datasets are currently just metadata pointing to arbitrary data and are treated as such in Airflow. They are also only evaluated at parsing time, which is a major shortcoming in my opinion. AFS, by contrast, gives a consistent, generic API on top of storage. While the two might look the same or closely related at first, they are not once you consider that Datasets could point to Tables, which are a structured way of representing a consistent set of data, which in turn are backed by a (set of) file(s) in a filesystem residing on storage. As mentioned on the AIP, I do think bringing Datasets closer to AFS makes sense, but I also think it is out of scope for this AIP. It is more a "for further research" kind of thing than something to be included now.
   
   On mount points, I am willing to consider letting them go, but I think it is also too early for that now. In the arguments against them I have only seen "it is confusing for users", where users are DAG authors _only_. As mentioned, AFS targets DAGs, but also XCom and DAG processing. That perspective seems not to be included when arguing against using mount points. Then, on the content level, there are other examples that use a Virtual File System (as AFS is). The primary one that comes to mind is Databricks FS (dbfs), which does use mount points virtually. It does not seem to be confusing for their users, so 'user data' does not necessarily point towards 'confusing'. Another reason is to provide efficiency in **cross** file system operations, for example s3 to gcs, or **same** file system operations, like s3 to s3, without requiring the user to deal with that. Again, this is not considered when arguing against mount points. Finally, using mount points allows users **not** to think about Airflow's complexity. By just changing the target of a mount point your code operates on local storage, s3 or gcs.
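   A hypothetical sketch of that idea; `mount` and `copy` and their signatures are illustrative only, not the API proposed in this PR:

```python
# Hypothetical mount-point style API; the imports and names below are
# assumptions made for illustration, not what this PR ships.
from airflow.io import mount, copy  # hypothetical imports

# Task code only ever sees the mount point; swapping the backing store does
# not require touching the code below.
mount("s3://warehouse-bucket/raw", "/raw")           # could just as well be file:///tmp/raw
mount("gs://analytics-bucket/curated", "/curated")

# Cross file system copy (s3 -> gcs): the library can pick the most efficient
# strategy (streaming, or a server-side copy for same-store operations)
# without the DAG author having to deal with which backends are involved.
copy("/raw/events.parquet", "/curated/events.parquet")
```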
   
   I am not sure you read the PR correctly, as it already supports Connections and Providers. `gs, adls, s3` are filesystems obtained from their respective Providers, and Connections are used if provided or are otherwise obtained from the environment. This is how Operators in general should behave (not all of them do). Even if solely using Paths for dealing with file operations, we need to store connection details somewhere. This needs to be done centrally, in order to be able to reuse connections and for efficiency. That could be done in a cache without exposing it to the user, or it could be exposed through something like mount points so it can be referenced by the user.
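   A hedged sketch of what central connection reuse behind a Path-like interface could look like; `ObjectStoragePath` and the `conn_id` argument are assumptions here, not necessarily the exact spelling used in the PR:

```python
from airflow.io.path import ObjectStoragePath  # assumed import

# The connection id selects credentials from Airflow's connection store; when
# omitted, the default connection for the scheme (or the environment) is used.
src = ObjectStoragePath("s3://my-bucket/data.csv", conn_id="aws_default")

# A second path against the same store and connection reuses the same cached,
# authenticated filesystem object rather than creating a new client.
dst = ObjectStoragePath("s3://my-bucket/copies/data.csv", conn_id="aws_default")
dst.write_bytes(src.read_bytes())
```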
   
   I intend to keep the PR updated. There are some disadvantages to that, but it keeps the discussion grounded and allows people to play and experiment with it as well. Otherwise we will just be doing a paper exercise on pros and cons that no one has practical experience with. DAG code is in this PR, although the code is not even required to be in a DAG(!). It works the same across all contexts, which was the intention.
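   To illustrate "works the same across all contexts", a small sketch using the same assumed `ObjectStoragePath` interface as above:

```python
from airflow.decorators import task
from airflow.io.path import ObjectStoragePath  # assumed import

def copy_file(src: str, dst: str) -> None:
    # Plain Python, callable from a script, a test, or a task body alike.
    ObjectStoragePath(dst).write_bytes(ObjectStoragePath(src).read_bytes())

@task
def copy_task() -> None:
    # The exact same call, now inside a DAG task.
    copy_file("s3://my-bucket/in.csv", "gs://my-other-bucket/out.csv")
```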
   
   Note that 'Path' and 'mount points' can coexist; they are not mutually exclusive. In fact, as mentioned above, a `Path` interface requires a central registration. Whether or not that is exposed is up for discussion.

