Shengnan YU created FLINK-32562:
-----------------------------------
Summary: FileSink Compactor Service should not use FileWriter from
Sink for OutputStreamBasedFileCompactor
Key: FLINK-32562
URL: https://issues.apache.org/jira/browse/FLINK-32562
Project: Flink
Issue Type: Improvement
Components: Connectors / FileSystem
Affects Versions: 1.18.0
Reporter: Shengnan YU
The gzip format is designed to be concatenable, but the Compactor in FileSink
breaks this property.
When the Compactor Service creates a new compacted file, it writes through a
GzipOutputStream, which emits its own gzip header, so the final file starts
with extra header bytes. (Every finished part file already contains a gzip
header, so the compacted file does not need an additional one.) This happens
because the Compactor Service uses the FileWriter specified in the FileSink to
create the compacted output stream. I think we should use a simple byte
output stream to concatenate the part files instead, or at least provide an
option to do so.
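To illustrate why writing already-compressed part files through another gzip
writer is a problem, here is a minimal, self-contained sketch. It is not Flink
code and may not reproduce the exact failure mode of the compactor; the class
GzipWrapDemo and its helpers are hypothetical names used only for this
example. It shows that if finished gzip bytes pass through a second
GZIPOutputStream, a reader that decompresses once gets gzip data back rather
than the original records.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of Flink.
public class GzipWrapDemo {

    /** Compresses the input into one complete gzip member (header, data, trailer). */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Decompresses gzip data exactly once. */
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Returns true if the bytes begin with the gzip magic number 0x1f 0x8b. */
    static boolean startsWithGzipMagic(byte[] data) {
        return data.length >= 2 && (data[0] & 0xff) == 0x1f && (data[1] & 0xff) == 0x8b;
    }

    /**
     * Simulates a finished part file (already gzipped) being written through a
     * second gzip writer, then read back by a consumer that decompresses once.
     */
    static byte[] decodeOnceAfterRewrap(String text) {
        byte[] partFile = gzip(text.getBytes(StandardCharsets.UTF_8));
        byte[] rewrapped = gzip(partFile); // the extra gzip layer
        return gunzip(rewrapped);
    }

    public static void main(String[] args) {
        byte[] seenByReader = decodeOnceAfterRewrap("hello\n");
        // The reader still sees gzip bytes instead of the original text.
        System.out.println("still gzip after one decode: " + startsWithGzipMagic(seenByReader));
    }
}
```

Copying the part-file bytes to the target unchanged avoids the extra layer,
which is what the plain byte output stream suggested above would do.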
Currently the ConcatFileCompactor only supports plain text files. Many
compression codecs, such as gzip and bzip2, support concatenation. I think we
should support this kind of concatenation; otherwise people must use
RecordWiseFileCompactor, which is very inefficient.
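As a rough sketch of the proposed behaviour, the following standalone Java
example concatenates gzip part files byte for byte and checks that the result
still decompresses to the records in order. It uses only the JDK, not the
Flink compactor API; RawConcat, concatRaw, writeGzipPart, readGzip and
roundTrip are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of Flink.
public class RawConcat {

    /** Appends the bytes of each part file to target without decoding or re-encoding. */
    static void concatRaw(List<Path> parts, Path target) {
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // raw byte copy, no gzip wrapper
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes text into a temporary file as a single complete gzip member. */
    static Path writeGzipPart(String text) {
        try {
            Path part = Files.createTempFile("part-", ".gz");
            try (OutputStream gz = new GZIPOutputStream(Files.newOutputStream(part))) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return part;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Decompresses a gzip file, including any concatenated members, into a string. */
    static String readGzip(Path file) {
        try (InputStream gz = new GZIPInputStream(Files.newInputStream(file))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds two gzip parts, concatenates them raw, and returns the decoded result. */
    static String roundTrip(String first, String second) {
        try {
            Path a = writeGzipPart(first);
            Path b = writeGzipPart(second);
            Path target = Files.createTempFile("compacted-", ".gz");
            concatRaw(Arrays.asList(a, b), target);
            return readGzip(target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(roundTrip("record-1\n", "record-2\n"));
    }
}
```

Because no record is decoded or re-encoded, the cost of this approach is
essentially that of a file copy, which is where the efficiency gain over a
record-wise compactor would come from.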