Shengnan YU created FLINK-32562:
-----------------------------------

             Summary: FileSink Compactor Service should not use FileWriter from 
Sink for OutputStreamBasedFileCompactor
                 Key: FLINK-32562
                 URL: https://issues.apache.org/jira/browse/FLINK-32562
             Project: Flink
          Issue Type: Improvement
          Components: Connectors / FileSystem
    Affects Versions: 1.18.0
            Reporter: Shengnan YU


The gzip format is designed to be concatenable, but the Compactor in FileSink 
breaks this property. 
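
As a minimal illustration of the concatenation property, the sketch below 
uses only java.util.zip and no Flink code. The class and method names 
(GzipConcatDemo, gzip, concat, gunzip) are made up for this example. It 
compresses two strings separately, joins the compressed bytes unchanged, and 
decompresses the result as one stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of Flink.
public class GzipConcatDemo {

    /** Compresses a string into one complete gzip member (header, data, trailer). */
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    /** Joins already-compressed parts byte for byte, without re-encoding them. */
    public static byte[] concat(byte[]... parts) {
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        for (byte[] part : parts) {
            joined.write(part, 0, part.length);
        }
        return joined.toByteArray();
    }

    /** Decompresses a gzip stream that may consist of several members. */
    public static String gunzip(byte[] data) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] compacted = concat(gzip("part-1\n"), gzip("part-2\n"));
        // Prints the contents of both parts, in order.
        System.out.print(gunzip(compacted));
    }
}
```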

This happens because the Compactor Service creates the new compacted file 
through a GzipOutputStream, which writes its own header bytes, so the final 
file ends up with extra bytes at the start. (A gzip header is already present 
in every finished part file, so the compacted file does not need another one.) 
The root cause is that the Compactor Service uses the FileWriter specified in 
FileSink to create the compacted output stream. I think we should use a simple 
byte output stream to concatenate the streams instead, or at least provide an 
option for it.
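
The extra header bytes can be reproduced outside Flink. The sketch below 
assumes the stream in question behaves like java.util.zip.GZIPOutputStream, 
which writes its header in the constructor. The report does not say which 
gzip stream class is actually used, and EagerGzipHeaderDemo is a made-up name.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of Flink.
public class EagerGzipHeaderDemo {

    /** Returns what is on the raw stream right after wrapping it, before any write. */
    public static byte[] bytesAfterWrapping() throws IOException {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(raw);
        // No data has been written through gz, yet raw already holds a gzip header.
        byte[] snapshot = raw.toByteArray();
        gz.close();
        return snapshot;
    }

    public static void main(String[] args) throws IOException {
        byte[] header = bytesAfterWrapping();
        System.out.println("bytes written before any data: " + header.length);
    }
}
```

If already-compressed part files are then copied to the underlying stream, 
this header sits in front of them without belonging to any of them.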

 

Currently the ConcatFileCompactor only supports plain text files. Many 
compression codecs, such as gzip and bzip2, support concatenation. I think we 
should support this kind of concatenation; otherwise people must use 
RecordWiseFileCompactor, which is very inefficient.
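
For reference, the byte-level concatenation asked for here needs no codec at 
all. The following is a plain-Java sketch under that assumption, not Flink's 
compactor API. RawConcat and concatFiles are hypothetical names.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of Flink.
public class RawConcat {

    /** Copies each part file unchanged, in order, into a single target file. */
    public static void concatFiles(List<Path> parts, Path target) throws IOException {
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                // No decoding or re-encoding: bytes are copied exactly as stored.
                Files.copy(part, out);
            }
        }
    }
}
```

Because the parts are never decoded, the cost is one sequential copy per 
part file, in contrast to reading and rewriting every record.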



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
