[ https://issues.apache.org/jira/browse/FLINK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152293#comment-16152293 ]
Mikhail Lipkovich commented on FLINK-5944:
------------------------------------------
I have started working on this issue and would like your opinion on one
question.
The codec for an InputFormat is selected based on the file extension (e.g.
'.gzip' or '.snappy'). The question is how to tell whether a given file needs
the Hadoop Snappy codec or the Java Snappy codec.
I can propose the following options:
1. Add a new config option to flink-conf.yaml, such as fs.hadoop-snappy, and
select the InputStreamFactory based on it.
2. Add a flag parameter to the readTextFile API method that indicates whether
the file is Hadoop Snappy.
3. Add a separate API method for reading Snappy-compressed files.
4. Ask users to use the '.snappy' extension for Java Snappy and a different
extension, such as '.hsnappy', for Hadoop Snappy.
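
To make option 4 concrete, here is a minimal sketch of extension-based codec
selection in plain Java. It does not use Flink's actual classes: the class
CodecByExtension, the Codec enum and the codecFor method are hypothetical names
invented for illustration, and '.hsnappy' is only the extension suggested above,
not an established convention.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration of option 4; none of these names exist in Flink.
class CodecByExtension {

    // Stand-ins for the decompression stream factories that would be registered.
    enum Codec { GZIP, SNAPPY_JAVA, SNAPPY_HADOOP }

    private static final Map<String, Codec> BY_EXTENSION = new HashMap<>();

    static {
        BY_EXTENSION.put("gzip", Codec.GZIP);
        // '.snappy' keeps its meaning: the xerial (Java) Snappy stream format.
        BY_EXTENSION.put("snappy", Codec.SNAPPY_JAVA);
        // Assumed extension for files written by Hadoop's Snappy codec.
        BY_EXTENSION.put("hsnappy", Codec.SNAPPY_HADOOP);
    }

    // Returns the codec registered for the file's extension, or empty if the
    // file has no extension or the extension is unknown (read as uncompressed).
    static Optional<Codec> codecFor(String path) {
        int dot = path.lastIndexOf('.');
        if (dot < 0 || dot == path.length() - 1) {
            return Optional.empty();
        }
        String extension = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(BY_EXTENSION.get(extension));
    }

    public static void main(String[] args) {
        for (String path : new String[] {"a.snappy", "b.hsnappy", "c.txt"}) {
            System.out.println(path + " -> " + codecFor(path));
        }
    }
}
```

With this approach the choice is made per file and needs no new config option
or API parameter, but it relies on users renaming files that Hadoop typically
writes with the '.snappy' extension.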
> Flink should support reading Snappy Files
> -----------------------------------------
>
> Key: FLINK-5944
> URL: https://issues.apache.org/jira/browse/FLINK-5944
> Project: Flink
> Issue Type: New Feature
> Components: Batch Connectors and Input/Output Formats
> Reporter: Ilya Ganelin
> Assignee: Mikhail Lipkovich
> Labels: features
>
> Snappy is a widely used compression format that offers fast compression and
> decompression.
> This can be easily implemented by creating a SnappyInflaterInputStreamFactory
> and updating the initDefaultInflateInputStreamFactories in FileInputFormat.
> Flink already includes the Snappy dependency in the project.
> There is a minor gotcha in this. If we wish to use this with Hadoop, then we
> must provide two separate implementations since Hadoop uses a different
> version of the snappy format than Snappy Java (which is the xerial/snappy
> included in Flink).