shahar1 commented on issue #29115:
URL: https://github.com/apache/airflow/issues/29115#issuecomment-1537162519

   > > filtering according to suffixes shouldn't be done as part of the API 
call to GCS but as separate logic within the hook
   > 
   > I have ambivalent feelings about hooks doing more than the API/service 
allows. If GCS decided to offer `prefix` & `delimiter`, I'm not sure we 
should create additional functionality.
   > 
   > I think it would be easier to discuss over a PR where we can also see the 
test cases and get a sense of how much further customizations we do from 
Airflow side.
   
   I opened a draft PR (#31107). The bottom line is that the current 
implementation is suboptimal and needs a more thorough change. There are two 
main issues with the GCS hook/operators:
   1. As I previously stated, `delimiter` is mostly misused as `suffix`. The 
right way to do this filtering is with a relatively new parameter called 
`matchGlob`, which filters objects by matching glob patterns against their 
paths. Unfortunately, the GCS Python client doesn't support `matchGlob` yet, 
but it's quite easy to patch the appropriate function until it is natively 
integrated.
   
   2. Another issue in the same area is that we have *very* complex logic for 
supporting wildcards (*) in source object names, even though the GCS client 
does not support wildcards at all. I'd like to deprecate the usage of 
wildcards, as it would become very difficult to handle alongside the 
deprecation of `delimiter`. Should I handle it in a separate PR and/or create 
a new issue for it? (I implemented the solution in the draft, but I'm afraid 
it might make things complex to test.)
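
   To illustrate both points, here is a minimal, purely local sketch of how 
glob-based filtering differs from the prefix + suffix split that a single 
wildcard is usually reduced to. It does not call GCS. The helper names 
(`glob_to_regex`, `filter_by_glob`, `split_single_wildcard`, 
`filter_by_prefix_and_suffix`) are hypothetical and are not part of Airflow or 
the GCS client, and the glob semantics are an assumed simplification (`*` and 
`?` do not cross `/`, `**` does), not the full `matchGlob` grammar:

```python
import re


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a simplified glob into a compiled regex.

    Assumed semantics (a simplification, not the full matchGlob grammar):
    `**` matches any characters including `/`, `*` matches any characters
    except `/`, and `?` matches exactly one character except `/`.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")
            i += 1
        else:
            # Everything else is treated as a literal character.
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def filter_by_glob(names, pattern):
    """Keep only the names that match the whole glob pattern."""
    regex = glob_to_regex(pattern)
    return [name for name in names if regex.fullmatch(name)]


def split_single_wildcard(source_object):
    """Split a name with one `*` into (prefix, suffix). Hypothetical helper."""
    prefix, _, suffix = source_object.partition("*")
    return prefix, suffix


def filter_by_prefix_and_suffix(names, prefix, suffix):
    """Plain prefix + suffix filtering, done client-side."""
    return [n for n in names if n.startswith(prefix) and n.endswith(suffix)]


names = ["data/a.csv", "data/b.json", "data/2023/c.csv", "logs/d.csv"]

# `*` stays within one path segment under the assumed glob semantics.
one_level = filter_by_glob(names, "data/*.csv")

# `**` is allowed to cross `/`.
any_depth = filter_by_glob(names, "data/**.csv")

# The prefix + suffix split ignores path segments entirely.
prefix, suffix = split_single_wildcard("data/*.csv")
by_affixes = filter_by_prefix_and_suffix(names, prefix, suffix)

print(one_level)
print(any_depth)
print(by_affixes)
```

   In this sketch the same string `data/*.csv` selects different objects 
depending on whether it is read as a glob or as a prefix + suffix pair, which 
is part of why mixing wildcards with a `delimiter` deprecation gets hard to 
reason about and to test.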


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
