dstandish commented on issue #25186:
URL: https://github.com/apache/airflow/issues/25186#issuecomment-1191854525

   I am actually thinking now that we should abandon URI as the identifier for 
datasets.  I think part of the motivation for choosing URI was [this document 
in 
OpenLineage](https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md).
  But the thing is, OpenLineage doesn't actually use URI for datasets!  The 
document is misleading.  In OpenLineage, what defines a dataset is actually 
`namespace + name`.  To me, this is much better and I think we should change 
Dataset to do something more similar to that -- if not exactly that.
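
   To make the `namespace + name` idea concrete, here is a minimal sketch of 
what such an identifier could look like.  `DatasetId` and the example values 
are hypothetical, made up for illustration; this is not an existing Airflow or 
OpenLineage class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetId:
    """Hypothetical identifier: a dataset is defined by namespace + name."""

    namespace: str  # where the dataset lives (illustrative value below)
    name: str       # human-friendly name within that namespace


# Two references built from the same pair are the same dataset:
# equality and hashing come straight from the two fields.
a = DatasetId(namespace="warehouse-prod", name="analytics.orders")
b = DatasetId(namespace="warehouse-prod", name="analytics.orders")
c = DatasetId(namespace="warehouse-dev", name="analytics.orders")

same_dataset = a == b
different_namespace = a != c
```

   With a pair like this, the `name` doubles as the label you would show in 
the UI, and identity is a plain field comparison.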
   
   As we've seen, URI is problematic for a number of reasons.  Just to name a 
few... A resource isn't "uniquely" identified by one string -- there are 
_many_ URIs that correspond to the same exact resource.  A URI is not human 
friendly, and really requires that you _also_ have a friendly _name_ you can 
refer to it with (e.g. to present in UI).  It can be confused for a 
"connection string" -- when we're merely looking for a label for a dataset.  
URI makes more sense as something generated by machines -- not as the 
human-entered label for a dataset.  But OpenLineage has no need for it anyway.  
I plan to look at some other lineage standards to see how they identify 
datasets.  But for now I am really feeling that we shouldn't use URI for this 
purpose.
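
   As a small illustration of the "many URIs, one resource" problem, the 
sketch below compares three spellings of the same location.  The example URIs 
and the `naive_normalize` helper are assumptions made up for this comment; the 
helper only handles scheme/host case and default ports, which is nowhere near 
full URI normalization.

```python
from urllib.parse import urlsplit

# Illustrative default ports only; a real normalizer would need many more rules.
DEFAULT_PORTS = {"http": 80, "https": 443}


def naive_normalize(uri: str) -> str:
    """Hypothetical helper: lowercase scheme and host, drop default ports."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is not None and DEFAULT_PORTS.get(scheme) != port:
        netloc = f"{host}:{port}"
    else:
        netloc = host
    path = parts.path or "/"
    return f"{scheme}://{netloc}{path}"


variants = [
    "HTTP://Example.com/data/orders",
    "http://example.com:80/data/orders",
    "http://example.com/data/orders",
]

# Compared as raw strings these are three different identifiers...
distinct_as_strings = len(set(variants))
# ...but after even this partial normalization they collapse into one.
distinct_after_normalizing = len({naive_normalize(v) for v in variants})
```

   Any scheme that keys datasets on the raw string would treat these as three 
datasets unless every producer and consumer agrees on the same normalization.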


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
