[ https://issues.apache.org/jira/browse/FLINK-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kostas Kloudas closed FLINK-9753.
---------------------------------
    Resolution: Fixed

Merged on master with 66b1f854a0250bdd048808d40f93aa2990476841

> Support Parquet for StreamingFileSink
> -------------------------------------
>
>                 Key: FLINK-9753
>                 URL: https://issues.apache.org/jira/browse/FLINK-9753
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Streaming Connectors
>            Reporter: Stephan Ewen
>            Assignee: Kostas Kloudas
>            Priority: Major
>             Fix For: 1.6.0
>
>
> Formats like Parquet and ORC are great at compressing data and making it fast to scan, filter, and project the data.
> However, these formats are only efficient if they can columnarize and compress a significant amount of data in their columnar format. If they compress only a few rows at a time, they produce many short column vectors and are thus much less efficient.
> The BucketingSink requires that data be persisted on the target FileSystem on each checkpoint.
> Pushing data through a Parquet or ORC encoder and flushing on each checkpoint means that, with frequent checkpoints, the amount of data compressed/columnarized per block is small. Hence, the result is an inefficiently compressed file.
> Making this efficient independently of the checkpoint interval would mean that the sink first collects (and persists) a good amount of data and only then pushes it through the Parquet/ORC writers.
> I would suggest approaching this as follows:
> - When writing to the "in progress" files, write the raw records (TypeSerializer encoding).
> - When an "in progress" file is rolled over (published), the sink pushes its data through the encoder.
> - This is not much work on top of the new abstraction and will result in large blocks, and hence in efficient compression.
> Alternatively, we can support directly encoding the stream to the "in progress" files via Parquet/ORC, if users know that their combination of data rate and checkpoint interval will result in large enough chunks of data per checkpoint interval.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
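For reference, a minimal sketch of how the feature merged here is typically used: StreamingFileSink.forBulkFormat with a Parquet bulk writer from the flink-parquet module. The Event POJO, source, and output path are hypothetical stand-ins; bulk formats roll on every checkpoint, so the block-size trade-off described above still hinges on the checkpoint interval.

{code:java}
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class ParquetSinkExample {

    /** Hypothetical event type; the Parquet schema is derived via Avro reflection. */
    public static class Event {
        public long timestamp;
        public String payload;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Part files become visible only when the checkpoint completes,
        // so this interval bounds how much data each Parquet file can accumulate.
        env.enableCheckpointing(60_000);

        DataStream<Event> events = env.fromElements(new Event()); // stand-in source

        StreamingFileSink<Event> sink = StreamingFileSink
                .forBulkFormat(
                        new Path("hdfs:///tmp/parquet-out"), // hypothetical target path
                        ParquetAvroWriters.forReflectRecord(Event.class))
                .build();

        events.addSink(sink);
        env.execute("StreamingFileSink Parquet example");
    }
}
{code}

This corresponds to the "alternatively" path above: records are encoded directly into the in-progress files, so compression efficiency depends on enough data accumulating per checkpoint interval.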