shahar1 commented on issue #29115: URL: https://github.com/apache/airflow/issues/29115#issuecomment-1537162519
> > filtering according to suffixes shouldn't be done as part of the API call to GCS but as separate logic within the hook
>
> I have ambivalent feelings about hooks doing more than the API/service allows. If GCS decided to offer `prefix` & `delimiter`, I'm not sure if we should create additional functionality?
>
> I think it would be easier to discuss over a PR where we can also see the test cases and get a sense of how much further customization we do on the Airflow side.

I created a draft PR (#31107). The bottom line is that the current implementation is suboptimal, and a more thorough change is needed. There are two main issues with the GCS hook/operators:

1. As I previously stated, `delimiter` is mostly misused as a `suffix`. The right way to do this filtering is with a relatively new parameter called `matchGlob`, which uses glob patterns to filter objects by their paths. Unfortunately, the GCS Python client doesn't support `matchGlob` yet, but it's quite easy to patch the appropriate function until it is natively integrated.
2. The other issue in the same area is that we have *very* complex logic for supporting wildcards (`*`) in source object names, which the GCS client does not support at all. I'd like to deprecate the usage of wildcards, as it would become very difficult to handle alongside the deprecation of `delimiter`.

Should I treat the wildcard deprecation in a different PR and/or create a new issue for it? (I implemented the solution in the draft, but I'm afraid it might make things complex to test.)
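
To make the suffix-vs-glob distinction concrete, here is a rough local sketch of glob-style filtering over object names. It is only an illustration: `glob_to_regex` and `filter_by_glob` are hypothetical helpers, not part of the GCS client or the Airflow hook, and the translation rules (`*` stays within one path segment, `**` crosses `/`) are a simplified assumption rather than the exact `matchGlob` semantics.

```python
import re
from typing import Iterable, List


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a simplified glob into a regex (hypothetical helper).

    Assumed rules: `**` matches anything including `/`, `*` matches
    anything except `/`, `?` matches one non-`/` character.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")
            i += 1
        else:
            # Escape everything else so it matches literally.
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def filter_by_glob(object_names: Iterable[str], match_glob: str) -> List[str]:
    """Keep only the object names that fully match the glob."""
    regex = glob_to_regex(match_glob)
    return [name for name in object_names if regex.fullmatch(name)]


names = ["data/a.csv", "data/b.json", "data/2023/c.csv", "top.csv"]
csv_in_data = filter_by_glob(names, "data/*.csv")
csv_anywhere = filter_by_glob(names, "**.csv")
```

With these assumed rules, a single pattern covers what `prefix` plus a `delimiter`-as-suffix workaround tries to express today, which is why filtering by glob looks like the cleaner long-term option.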
