Felix Neutatz created FLINK-1271:
------------------------------------

             Summary: Extend HadoopOutputFormat and HadoopInputFormat to handle Void.class
                 Key: FLINK-1271
                 URL: https://issues.apache.org/jira/browse/FLINK-1271
             Project: Flink
          Issue Type: Wish
          Components: Hadoop Compatibility
            Reporter: Felix Neutatz
            Priority: Minor


Parquet, one of the most popular and efficient columnar storage formats in the
Hadoop ecosystem, uses Void.class as its key type!

At the moment, only keys that extend Writable are allowed.

For example, we would need to be able to do something like:

HadoopInputFormat<Void, AminoAcid> hadoopInputFormat = new HadoopInputFormat<>(
    new ParquetThriftInputFormat<AminoAcid>(), Void.class, AminoAcid.class, job);
ParquetThriftInputFormat.addInputPath(job, new Path("newpath"));
ParquetThriftInputFormat.setReadSupportClass(job, AminoAcid.class);

// Create a Flink job with it
DataSet<Tuple2<Void, AminoAcid>> data = env.createInput(hadoopInputFormat);

Here, AminoAcid is a Thrift-generated class.

However, I have already figured out how to write Parquet files by creating a
class that extends HadoopOutputFormat.
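
A wrapper along these lines could look roughly as follows. This is a minimal,
untested sketch assuming Flink's mapreduce HadoopOutputFormat wrapper accepts
Void.class keys and using parquet-thrift's ParquetThriftOutputFormat; the
method name writeParquet and the output path are illustrative:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.hadoopcompatibility.mapreduce.HadoopOutputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import parquet.hadoop.thrift.ParquetThriftOutputFormat;

public class ParquetWriteSketch {

    // Sketch: write a DataSet of (Void, AminoAcid) pairs to Parquet via the
    // Hadoop mapreduce API. AminoAcid is the Thrift-generated class from the
    // input example; the key side carries no data.
    public static void writeParquet(DataSet<Tuple2<Void, AminoAcid>> data)
            throws Exception {
        Job job = Job.getInstance();

        // Wrap the Parquet output format in Flink's Hadoop compatibility layer.
        HadoopOutputFormat<Void, AminoAcid> hadoopOutputFormat =
                new HadoopOutputFormat<Void, AminoAcid>(
                        new ParquetThriftOutputFormat<AminoAcid>(), job);

        // Tell parquet-thrift which Thrift class to serialize, and where to write.
        ParquetThriftOutputFormat.setThriftClass(job, AminoAcid.class);
        ParquetThriftOutputFormat.setOutputPath(job, new Path("outputpath"));

        data.output(hadoopOutputFormat);
    }
}
```

The sink side mirrors the input example: the same Tuple2<Void, AminoAcid>
element type flows straight from createInput to output, which is why allowing
Void.class keys in both wrappers matters.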

Now we need to discuss the best approach to making the Parquet integration
happen.






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
