kennknowles commented on code in PR #33384:
URL: https://github.com/apache/beam/pull/33384#discussion_r1890706630
##########
sdks/python/apache_beam/io/gcp/gcsfilesystem.py:
##########
@@ -377,3 +377,17 @@ def report_lineage(self, path, lineage, level=None):
# bucket only
components = components[:-1]
lineage.add('gcs', *components)
+
+ def check_splittability(self, path):
+ try:
+ file_metadata = self._gcsIO()._status(path)
+ if file_metadata.get('content_encoding', None) == 'gzip':
Review Comment:
Doesn't the content-type also have to be a specific value, in addition to
the content-encoding being set to gzip?
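
For illustration, a minimal sketch of a slightly fuller check. It assumes the metadata is available as a plain dict with optional `content_encoding` and `cache_control` keys, and that a `no-transform` Cache-Control directive disables transcoding. The helper name is hypothetical and not part of this PR. Whether content-type needs to be checked as well is left open and should be confirmed against the GCS transcoding documentation.

```python
def is_decompressive_transcoding_likely(metadata):
  """Hypothetical helper: guess whether GCS may serve an object decompressed.

  Assumption: `metadata` is a dict with optional 'content_encoding' and
  'cache_control' keys. This does not inspect content-type.
  """
  encoding = (metadata.get('content_encoding') or '').strip().lower()
  cache_control = (metadata.get('cache_control') or '').lower()
  # Split the Cache-Control header into individual directives.
  directives = [d.strip() for d in cache_control.split(',')]
  # Assumption: 'no-transform' means the stored bytes are served unchanged.
  return encoding == 'gzip' and 'no-transform' not in directives
```

If a check like this returns True, byte offsets into the stored object would not line up with the bytes actually read, so the file should be treated as unsplittable.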
##########
sdks/python/apache_beam/io/filebasedsource.py:
##########
@@ -259,7 +259,15 @@ def splittable(self):
return self._splittable
+def _is_decompressive_transcoding_enabled(file_path):
+
+ return True
Review Comment:
Am I parsing this right? This looks like a function definition at the top
level, but with a leading underscore and a body that is only a stub.
##########
sdks/python/apache_beam/io/filesystem.py:
##########
@@ -945,3 +945,6 @@ def report_lineage(self, path, unused_lineage, level=None):
Unless override by FileSystem implementations, default to no-op.
"""
pass
+
+ def check_splittability(self, path):
+ return True
Review Comment:
This should probably not always be true. If this is meant as a default,
perhaps it should instead be abstract, with an implementation for each
filesystem. If it does stay as the default, add a comment so we understand
that this is why it ignores the argument.
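
A rough sketch of the abstract variant, to make the suggestion concrete. The class names here are hypothetical stand-ins, not the real `FileSystem` hierarchy, and the local-filesystem answer is an assumption for illustration only.

```python
import abc


class FileSystemSketch(abc.ABC):
  """Hypothetical stand-in for the FileSystem base class."""
  @abc.abstractmethod
  def check_splittability(self, path):
    """Returns True if the file at `path` can safely be read in splits."""
    raise NotImplementedError


class LocalFileSystemSketch(FileSystemSketch):
  """Hypothetical subclass that must state its own answer explicitly."""
  def check_splittability(self, path):
    # Assumption: local files are read byte-for-byte with no transcoding,
    # so the path does not affect the answer here.
    return True
```

With this shape, a filesystem that forgets to implement the method fails at instantiation instead of silently reporting every file as splittable.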