Hi, I extended the Spark Streaming word count example to save files to the Hadoop file system, just to test how that interface works. In doing so, I ran into an API problem that I hope folks here can help clarify.
My goal was to see how I could save the final word counts generated in each micro-batch to HDFS. The final word counts DStream is of type JavaPairDStream<String, Integer>.

To write it out, I called the following overload of saveAsHadoopFiles:

    saveAsHadoopFiles(String prefix, String suffix, Class<?> keyClass, Class<?> valueClass,
        Class<? extends org.apache.hadoop.mapred.OutputFormat<?,?>> outputFormatClass)

and invoked it as:

    wordCounts.saveAsHadoopFiles("hdfs://localhost:...", "txt", Text.class, IntWritable.class, TextOutputFormat.class);

This fails to compile with an error saying that TextOutputFormat.class cannot be applied to the parameter Class<? extends org.apache.hadoop.mapred.OutputFormat<?,?>> outputFormatClass. I believe this is because the class literal TextOutputFormat.class has the raw type Class<TextOutputFormat>, and Java gives no way to write a class literal with generic type arguments, so it cannot be passed where Class<? extends OutputFormat<?,?>> is expected.

The way out was to declare the word counts variable as a raw JavaPairDStream, i.e. JavaPairDStream rather than JavaPairDStream<String, Integer>, and with that the call compiled and ran fine (a sketch of what I ended up with is at the end of this mail).

Is this the recommended solution, or is there something better I can do? If this is the right solution, could the Spark API be simplified to

    saveAsHadoopFiles(String prefix, String suffix, Class<?> keyClass, Class<?> valueClass,
        Class<? extends org.apache.hadoop.mapred.OutputFormat> outputFormatClass)

i.e. without making OutputFormat itself generic as well?

The version of Spark is 1.0.2.

Thanks
Hemanth
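
PS: For reference, here is a minimal sketch of the version that compiles for me, with the raw-typed counts DStream. The socket source on port 9999, the local[2] master, and the HDFS output prefix are just placeholders for illustration, not my exact setup.

import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class HdfsWordCount {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("HdfsWordCount").setMaster("local[2]");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(1000));

    // Placeholder source: read lines from a local socket.
    JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

    // Split each line into words.
    JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String line) {
        return Arrays.asList(line.split(" "));
      }
    });

    // Map each word to a (word, 1) pair.
    JavaPairDStream<String, Integer> pairs = words.mapToPair(
        new PairFunction<String, String, Integer>() {
          @Override
          public Tuple2<String, Integer> call(String word) {
            return new Tuple2<String, Integer>(word, 1);
          }
        });

    // Workaround: declare the counts DStream as a raw JavaPairDStream (no generics).
    // On a raw receiver the method signature is erased, so the raw
    // TextOutputFormat.class literal is accepted. This produces rawtypes/unchecked
    // compiler warnings.
    JavaPairDStream wordCounts = pairs.reduceByKey(
        new Function2<Integer, Integer, Integer>() {
          @Override
          public Integer call(Integer i1, Integer i2) {
            return i1 + i2;
          }
        });

    // Placeholder HDFS prefix -- substitute a real path.
    wordCounts.saveAsHadoopFiles("hdfs://localhost:9000/user/spark/wordCounts", "txt",
        Text.class, IntWritable.class, TextOutputFormat.class);

    jssc.start();
    jssc.awaitTermination();
  }
}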