[jira] [Commented] (FLINK-5944) Flink should support reading Snappy Files

2021-04-27 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17334280#comment-17334280
 ] 

Flink Jira Bot commented on FLINK-5944:
---

This issue was marked "stale-assigned" and has not received an update in 7 
days. It is now automatically unassigned. If you are still working on it, you 
can assign it to yourself again. Please also give an update about the status of 
the work.

> Flink should support reading Snappy Files
> -
>
> Key: FLINK-5944
> URL: https://issues.apache.org/jira/browse/FLINK-5944
> Project: Flink
>  Issue Type: New Feature
>  Components: Connectors / Hadoop Compatibility
>Reporter: Ilya Ganelin
>Assignee: Mikhail Lipkovich
>Priority: Major
>  Labels: features, stale-assigned
>
> Snappy is an extremely performant compression format that's widely used 
> offering fast decompression/compression. 
> This can be easily implemented by creating a SnappyInflaterInputStreamFactory 
> and updating the initDefaultInflateInputStreamFactories in FileInputFormat.
> Flink already includes the Snappy dependency in the project. 
> There is a minor gotcha in this. If we wish to use this with Hadoop, then we 
> must provide two separate implementations since Hadoop uses a different 
> version of the snappy format than Snappy Java (which is the xerial/snappy 
> included in Flink). 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-5944) Flink should support reading Snappy Files

2021-04-16 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323669#comment-17323669
 ] 

Flink Jira Bot commented on FLINK-5944:
---

This issue is assigned but has not received an update in 7 days so it has been 
labeled "stale-assigned". If you are still working on the issue, please give an 
update and remove the label. If you are no longer working on the issue, please 
unassign so someone else may work on it. In 7 days the issue will be 
automatically unassigned.

> Flink should support reading Snappy Files
> -
>
> Key: FLINK-5944
> URL: https://issues.apache.org/jira/browse/FLINK-5944
> Project: Flink
>  Issue Type: New Feature
>  Components: Connectors / Hadoop Compatibility
>Reporter: Ilya Ganelin
>Assignee: Mikhail Lipkovich
>Priority: Major
>  Labels: features, stale-assigned
>
> Snappy is an extremely performant compression format that's widely used 
> offering fast decompression/compression. 
> This can be easily implemented by creating a SnappyInflaterInputStreamFactory 
> and updating the initDefaultInflateInputStreamFactories in FileInputFormat.
> Flink already includes the Snappy dependency in the project. 
> There is a minor gotcha in this. If we wish to use this with Hadoop, then we 
> must provide two separate implementations since Hadoop uses a different 
> version of the snappy format than Snappy Java (which is the xerial/snappy 
> included in Flink). 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-5944) Flink should support reading Snappy Files

2017-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203408#comment-16203408
 ] 

ASF GitHub Bot commented on FLINK-5944:
---

Github user mlipkovich commented on the issue:

https://github.com/apache/flink/pull/4683
  
@aljoscha @haohui ,
thank you for your comments. Marked hadoop dependency as provided, set 
Hadoop codec as default one


> Flink should support reading Snappy Files
> -
>
> Key: FLINK-5944
> URL: https://issues.apache.org/jira/browse/FLINK-5944
> Project: Flink
>  Issue Type: New Feature
>  Components: Batch Connectors and Input/Output Formats
>Reporter: Ilya Ganelin
>Assignee: Mikhail Lipkovich
>  Labels: features
>
> Snappy is an extremely performant compression format that's widely used 
> offering fast decompression/compression. 
> This can be easily implemented by creating a SnappyInflaterInputStreamFactory 
> and updating the initDefaultInflateInputStreamFactories in FileInputFormat.
> Flink already includes the Snappy dependency in the project. 
> There is a minor gotcha in this. If we wish to use this with Hadoop, then we 
> must provide two separate implementations since Hadoop uses a different 
> version of the snappy format than Snappy Java (which is the xerial/snappy 
> included in Flink). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-5944) Flink should support reading Snappy Files

2017-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16183002#comment-16183002
 ] 

ASF GitHub Bot commented on FLINK-5944:
---

Github user mlipkovich commented on a diff in the pull request:

https://github.com/apache/flink/pull/4683#discussion_r141424281
  
--- Diff: flink-core/pom.xml ---
@@ -52,6 +52,12 @@ under the License.
flink-shaded-asm

 
+   
+   org.apache.flink
+   flink-shaded-hadoop2
+   ${project.version}
+   
--- End diff --

Thanks for your comment Aljoscha,
So there are at least three ways on how to achieve it: either mark this 
dependency as 'provided', move Hadoop Snappy Codec related classes to 
flink-java module or move it to some separate module as suggested @haohui, but 
I'm not sure what should be inside this module


> Flink should support reading Snappy Files
> -
>
> Key: FLINK-5944
> URL: https://issues.apache.org/jira/browse/FLINK-5944
> Project: Flink
>  Issue Type: New Feature
>  Components: Batch Connectors and Input/Output Formats
>Reporter: Ilya Ganelin
>Assignee: Mikhail Lipkovich
>  Labels: features
>
> Snappy is an extremely performant compression format that's widely used 
> offering fast decompression/compression. 
> This can be easily implemented by creating a SnappyInflaterInputStreamFactory 
> and updating the initDefaultInflateInputStreamFactories in FileInputFormat.
> Flink already includes the Snappy dependency in the project. 
> There is a minor gotcha in this. If we wish to use this with Hadoop, then we 
> must provide two separate implementations since Hadoop uses a different 
> version of the snappy format than Snappy Java (which is the xerial/snappy 
> included in Flink). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-5944) Flink should support reading Snappy Files

2017-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182389#comment-16182389
 ] 

ASF GitHub Bot commented on FLINK-5944:
---

Github user aljoscha commented on a diff in the pull request:

https://github.com/apache/flink/pull/4683#discussion_r141312154
  
--- Diff: flink-core/pom.xml ---
@@ -52,6 +52,12 @@ under the License.
flink-shaded-asm

 
+   
+   org.apache.flink
+   flink-shaded-hadoop2
+   ${project.version}
+   
--- End diff --

As of recently everything except the Hadoop compatibility package is free 
from Hadoop dependencies. This should also stay like this so that we can 
provide a Flink distribution without Hadoop dependencies because this was 
causing problems for some users.


> Flink should support reading Snappy Files
> -
>
> Key: FLINK-5944
> URL: https://issues.apache.org/jira/browse/FLINK-5944
> Project: Flink
>  Issue Type: New Feature
>  Components: Batch Connectors and Input/Output Formats
>Reporter: Ilya Ganelin
>Assignee: Mikhail Lipkovich
>  Labels: features
>
> Snappy is an extremely performant compression format that's widely used 
> offering fast decompression/compression. 
> This can be easily implemented by creating a SnappyInflaterInputStreamFactory 
> and updating the initDefaultInflateInputStreamFactories in FileInputFormat.
> Flink already includes the Snappy dependency in the project. 
> There is a minor gotcha in this. If we wish to use this with Hadoop, then we 
> must provide two separate implementations since Hadoop uses a different 
> version of the snappy format than Snappy Java (which is the xerial/snappy 
> included in Flink). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-5944) Flink should support reading Snappy Files

2017-09-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16179308#comment-16179308
 ] 

ASF GitHub Bot commented on FLINK-5944:
---

Github user mlipkovich commented on a diff in the pull request:

https://github.com/apache/flink/pull/4683#discussion_r140833786
  
--- Diff: flink-core/pom.xml ---
@@ -52,6 +52,12 @@ under the License.
flink-shaded-asm

 
+   
+   org.apache.flink
+   flink-shaded-hadoop2
+   ${project.version}
+   
--- End diff --

Yes, it is a good point to make Hadoop Snappy a default codec. I think we 
still could support a Xerial Snappy since it comes for free. I will do these 
changes once we agree on dependencies

Regarding to separate module what would be the content of this model?  
As I understand a user which would like to read HDFS files will need 
flink-java module anyway since it contains Hadoop wrappers like 
HadoopInputSplit and so on. How do you think if it makes sense to put this 
Hadoop codec there?


> Flink should support reading Snappy Files
> -
>
> Key: FLINK-5944
> URL: https://issues.apache.org/jira/browse/FLINK-5944
> Project: Flink
>  Issue Type: New Feature
>  Components: Batch Connectors and Input/Output Formats
>Reporter: Ilya Ganelin
>Assignee: Mikhail Lipkovich
>  Labels: features
>
> Snappy is an extremely performant compression format that's widely used 
> offering fast decompression/compression. 
> This can be easily implemented by creating a SnappyInflaterInputStreamFactory 
> and updating the initDefaultInflateInputStreamFactories in FileInputFormat.
> Flink already includes the Snappy dependency in the project. 
> There is a minor gotcha in this. If we wish to use this with Hadoop, then we 
> must provide two separate implementations since Hadoop uses a different 
> version of the snappy format than Snappy Java (which is the xerial/snappy 
> included in Flink). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-5944) Flink should support reading Snappy Files

2017-09-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178593#comment-16178593
 ] 

ASF GitHub Bot commented on FLINK-5944:
---

Github user haohui commented on a diff in the pull request:

https://github.com/apache/flink/pull/4683#discussion_r140696992
  
--- Diff: flink-core/pom.xml ---
@@ -52,6 +52,12 @@ under the License.
flink-shaded-asm

 
+   
+   org.apache.flink
+   flink-shaded-hadoop2
+   ${project.version}
+   
--- End diff --


Internally we have several users that try Flink to read the files generated 
by Hadoop (e.g. lz4 / gz / snappy). I think the support of Hadoop is quite 
important.

I'm not sure supporting the xerial snappy format is a good idea. The two 
file formats are actually incompatible -- it would be quite confusing for the 
users to find out that they can't access the files using Spark / MR / Hive due 
to a missed configuration.

I suggest at least we should make the Hadoop file format as the default -- 
or to just get rid of the xerial version of the file format.

Putting the dependency in provided sounds fine to me -- if we need even 
tighter controls on the dependency, we can start thinking about having a 
separate module for it.

What do you think?



> Flink should support reading Snappy Files
> -
>
> Key: FLINK-5944
> URL: https://issues.apache.org/jira/browse/FLINK-5944
> Project: Flink
>  Issue Type: New Feature
>  Components: Batch Connectors and Input/Output Formats
>Reporter: Ilya Ganelin
>Assignee: Mikhail Lipkovich
>  Labels: features
>
> Snappy is an extremely performant compression format that's widely used 
> offering fast decompression/compression. 
> This can be easily implemented by creating a SnappyInflaterInputStreamFactory 
> and updating the initDefaultInflateInputStreamFactories in FileInputFormat.
> Flink already includes the Snappy dependency in the project. 
> There is a minor gotcha in this. If we wish to use this with Hadoop, then we 
> must provide two separate implementations since Hadoop uses a different 
> version of the snappy format than Snappy Java (which is the xerial/snappy 
> included in Flink). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-5944) Flink should support reading Snappy Files

2017-09-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178140#comment-16178140
 ] 

ASF GitHub Bot commented on FLINK-5944:
---

Github user mlipkovich commented on a diff in the pull request:

https://github.com/apache/flink/pull/4683#discussion_r140652438
  
--- Diff: flink-core/pom.xml ---
@@ -52,6 +52,12 @@ under the License.
flink-shaded-asm

 
+   
+   org.apache.flink
+   flink-shaded-hadoop2
+   ${project.version}
+   
--- End diff --

What do you think about adding this dependency to compile-time only?

Regarding to difference between codecs as I understand the thing is that 
Snappy compressed files are not splittable. So Hadoop splits raw files into 
blocks and compresses each block separately using regular Snappy. If you 
download the whole Hadoop Snappy compressed file regular Snappy will not be 
able to decompress it since it's not aware of block boundaries


> Flink should support reading Snappy Files
> -
>
> Key: FLINK-5944
> URL: https://issues.apache.org/jira/browse/FLINK-5944
> Project: Flink
>  Issue Type: New Feature
>  Components: Batch Connectors and Input/Output Formats
>Reporter: Ilya Ganelin
>Assignee: Mikhail Lipkovich
>  Labels: features
>
> Snappy is an extremely performant compression format that's widely used 
> offering fast decompression/compression. 
> This can be easily implemented by creating a SnappyInflaterInputStreamFactory 
> and updating the initDefaultInflateInputStreamFactories in FileInputFormat.
> Flink already includes the Snappy dependency in the project. 
> There is a minor gotcha in this. If we wish to use this with Hadoop, then we 
> must provide two separate implementations since Hadoop uses a different 
> version of the snappy format than Snappy Java (which is the xerial/snappy 
> included in Flink). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-5944) Flink should support reading Snappy Files

2017-09-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178113#comment-16178113
 ] 

ASF GitHub Bot commented on FLINK-5944:
---

Github user haohui commented on a diff in the pull request:

https://github.com/apache/flink/pull/4683#discussion_r140649772
  
--- Diff: flink-core/pom.xml ---
@@ -52,6 +52,12 @@ under the License.
flink-shaded-asm

 
+   
+   org.apache.flink
+   flink-shaded-hadoop2
+   ${project.version}
+   
--- End diff --

Including hadoop as a dependency in flink-core can be problematic for a 
number of downstream projects.

I wonder what is the exact difference between the Hadoop and vanilla snappy 
codec? Is it just due to the fact that there are additional framings in the 
snappy codec in Hadoop?





> Flink should support reading Snappy Files
> -
>
> Key: FLINK-5944
> URL: https://issues.apache.org/jira/browse/FLINK-5944
> Project: Flink
>  Issue Type: New Feature
>  Components: Batch Connectors and Input/Output Formats
>Reporter: Ilya Ganelin
>Assignee: Mikhail Lipkovich
>  Labels: features
>
> Snappy is an extremely performant compression format that's widely used 
> offering fast decompression/compression. 
> This can be easily implemented by creating a SnappyInflaterInputStreamFactory 
> and updating the initDefaultInflateInputStreamFactories in FileInputFormat.
> Flink already includes the Snappy dependency in the project. 
> There is a minor gotcha in this. If we wish to use this with Hadoop, then we 
> must provide two separate implementations since Hadoop uses a different 
> version of the snappy format than Snappy Java (which is the xerial/snappy 
> included in Flink). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-5944) Flink should support reading Snappy Files

2017-09-19 Thread Mikhail Lipkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171805#comment-16171805
 ] 

Mikhail Lipkovich commented on FLINK-5944:
--

The build has failed again but errors seem to be unrelated (kafka-connector and 
scala examples). Locally it was built and tested.
[~Zentol] could you please review changes?

> Flink should support reading Snappy Files
> -
>
> Key: FLINK-5944
> URL: https://issues.apache.org/jira/browse/FLINK-5944
> Project: Flink
>  Issue Type: New Feature
>  Components: Batch Connectors and Input/Output Formats
>Reporter: Ilya Ganelin
>Assignee: Mikhail Lipkovich
>  Labels: features
>
> Snappy is an extremely performant compression format that's widely used 
> offering fast decompression/compression. 
> This can be easily implemented by creating a SnappyInflaterInputStreamFactory 
> and updating the initDefaultInflateInputStreamFactories in FileInputFormat.
> Flink already includes the Snappy dependency in the project. 
> There is a minor gotcha in this. If we wish to use this with Hadoop, then we 
> must provide two separate implementations since Hadoop uses a different 
> version of the snappy format than Snappy Java (which is the xerial/snappy 
> included in Flink). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-5944) Flink should support reading Snappy Files

2017-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171711#comment-16171711
 ] 

ASF GitHub Bot commented on FLINK-5944:
---

GitHub user mlipkovich opened a pull request:

https://github.com/apache/flink/pull/4683

[FLINK-5944] Support reading of Snappy files

## What is the purpose of the change

Support reading of Snappy compressed text files (both Xerial and Hadoop 
snappy)

## Brief change log

  - *Added InputStreamFactories for Xerial and Hadoop snappy*
  - *Added config parameter to control whether Xerial or Hadoop snappy 
should be used*

## Verifying this change

  - *Manually verified the change by running word count for text files 
compressed using different Snappy versions*

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): no
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
  - The serializers: no
  - The runtime per-record code paths (performance sensitive): no
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: no

## Documentation

  - Does this pull request introduce a new feature? yes
  - If yes, how is the feature documented? JavaDocs 



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mlipkovich/flink FLINK-5944

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/4683.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4683


commit c4d4016f1e6b44833d24994c97532b4c5243e4d2
Author: Mikhail Lipkovich 
Date:   2017-09-19T13:34:10Z

[FLINK-5944] Support reading of Snappy files




> Flink should support reading Snappy Files
> -
>
> Key: FLINK-5944
> URL: https://issues.apache.org/jira/browse/FLINK-5944
> Project: Flink
>  Issue Type: New Feature
>  Components: Batch Connectors and Input/Output Formats
>Reporter: Ilya Ganelin
>Assignee: Mikhail Lipkovich
>  Labels: features
>
> Snappy is an extremely performant compression format that's widely used 
> offering fast decompression/compression. 
> This can be easily implemented by creating a SnappyInflaterInputStreamFactory 
> and updating the initDefaultInflateInputStreamFactories in FileInputFormat.
> Flink already includes the Snappy dependency in the project. 
> There is a minor gotcha in this. If we wish to use this with Hadoop, then we 
> must provide two separate implementations since Hadoop uses a different 
> version of the snappy format than Snappy Java (which is the xerial/snappy 
> included in Flink). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-5944) Flink should support reading Snappy Files

2017-09-11 Thread Chesnay Schepler (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161590#comment-16161590
 ] 

Chesnay Schepler commented on FLINK-5944:
-

Option 2/3 are out imo as they extend the existing API which is already really 
loaded.
Option 4 is not really viable, since realistically, users are just not gonna do 
it.

So i would go for Option 1.

> Flink should support reading Snappy Files
> -
>
> Key: FLINK-5944
> URL: https://issues.apache.org/jira/browse/FLINK-5944
> Project: Flink
>  Issue Type: New Feature
>  Components: Batch Connectors and Input/Output Formats
>Reporter: Ilya Ganelin
>Assignee: Mikhail Lipkovich
>  Labels: features
>
> Snappy is an extremely performant compression format that's widely used 
> offering fast decompression/compression. 
> This can be easily implemented by creating a SnappyInflaterInputStreamFactory 
> and updating the initDefaultInflateInputStreamFactories in FileInputFormat.
> Flink already includes the Snappy dependency in the project. 
> There is a minor gotcha in this. If we wish to use this with Hadoop, then we 
> must provide two separate implementations since Hadoop uses a different 
> version of the snappy format than Snappy Java (which is the xerial/snappy 
> included in Flink). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-5944) Flink should support reading Snappy Files

2017-09-08 Thread Mikhail Lipkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16158295#comment-16158295
 ] 

Mikhail Lipkovich commented on FLINK-5944:
--

[~aljoscha], [~Zentol] sorry if I'm bothering you. What do you think about 
comment above?

> Flink should support reading Snappy Files
> -
>
> Key: FLINK-5944
> URL: https://issues.apache.org/jira/browse/FLINK-5944
> Project: Flink
>  Issue Type: New Feature
>  Components: Batch Connectors and Input/Output Formats
>Reporter: Ilya Ganelin
>Assignee: Mikhail Lipkovich
>  Labels: features
>
> Snappy is an extremely performant compression format that's widely used 
> offering fast decompression/compression. 
> This can be easily implemented by creating a SnappyInflaterInputStreamFactory 
> and updating the initDefaultInflateInputStreamFactories in FileInputFormat.
> Flink already includes the Snappy dependency in the project. 
> There is a minor gotcha in this. If we wish to use this with Hadoop, then we 
> must provide two separate implementations since Hadoop uses a different 
> version of the snappy format than Snappy Java (which is the xerial/snappy 
> included in Flink). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-5944) Flink should support reading Snappy Files

2017-09-04 Thread Mikhail Lipkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152293#comment-16152293
 ] 

Mikhail Lipkovich commented on FLINK-5944:
--

I started working on this issue and I would like to get your opinion about one 
question.
Desired codec for InputFormat is selected based on file extension (e.g. '.gzip' 
or '.snappy'). So the question is how we can distinguish whether the Hadoop 
Snappy codec or Java Snappy codec is needed.

I can propose the following options:
1. Add new config option to flink-conf.yaml like fs.hadoop-snappy and select 
InputStreamFactory based on this option
2. Add flag parameter to API method readTextFile whether the file is Hadoop 
Snappy
3. Add separate API method for reading snappy-compressed files
4. Ask users to use '.snappy' extension for Java Snappy and some other 
extension like '.hsnappy' for Hadoop Snappy

> Flink should support reading Snappy Files
> -
>
> Key: FLINK-5944
> URL: https://issues.apache.org/jira/browse/FLINK-5944
> Project: Flink
>  Issue Type: New Feature
>  Components: Batch Connectors and Input/Output Formats
>Reporter: Ilya Ganelin
>Assignee: Mikhail Lipkovich
>  Labels: features
>
> Snappy is an extremely performant compression format that's widely used 
> offering fast decompression/compression. 
> This can be easily implemented by creating a SnappyInflaterInputStreamFactory 
> and updating the initDefaultInflateInputStreamFactories in FileInputFormat.
> Flink already includes the Snappy dependency in the project. 
> There is a minor gotcha in this. If we wish to use this with Hadoop, then we 
> must provide two separate implementations since Hadoop uses a different 
> version of the snappy format than Snappy Java (which is the xerial/snappy 
> included in Flink). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-5944) Flink should support reading Snappy Files

2017-08-31 Thread Mikhail Lipkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148863#comment-16148863
 ] 

Mikhail Lipkovich commented on FLINK-5944:
--

Hi,
Can I assign this task to me? This task seems doable for the beginner
Thanks

> Flink should support reading Snappy Files
> -
>
> Key: FLINK-5944
> URL: https://issues.apache.org/jira/browse/FLINK-5944
> Project: Flink
>  Issue Type: New Feature
>  Components: Batch Connectors and Input/Output Formats
>Reporter: Ilya Ganelin
>  Labels: features
>
> Snappy is an extremely performant compression format that's widely used 
> offering fast decompression/compression. 
> This can be easily implemented by creating a SnappyInflaterInputStreamFactory 
> and updating the initDefaultInflateInputStreamFactories in FileInputFormat.
> Flink already includes the Snappy dependency in the project. 
> There is a minor gotcha in this. If we wish to use this with Hadoop, then we 
> must provide two separate implementations since Hadoop uses a different 
> version of the snappy format than Snappy Java (which is the xerial/snappy 
> included in Flink). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)