Hi, I extended the Spark Streaming word count example to save files to the Hadoop file system, just to test how that interface works. In doing so, I ran into an API problem that I hope folks here can help clarify.
My goal was to see how I could save the final word counts generated in each micro-batch to HDFS. The final word counts DStream is of type JavaPairDStream<String, Integer>.

To write it out, I called the following overload of saveAsHadoopFiles:

    saveAsHadoopFiles(String prefix, String suffix, Class<?> keyClass, Class<?> valueClass,
        Class<? extends org.apache.hadoop.mapred.OutputFormat<?,?>> outputFormatClass)

and invoked it as:

    wordCounts.saveAsHadoopFiles("hdfs://localhost:...", "txt", Text.class, IntWritable.class, TextOutputFormat.class);

This fails to compile with an error saying that TextOutputFormat.class cannot be applied to the parameter Class<? extends org.apache.hadoop.mapred.OutputFormat<?,?>> outputFormatClass. I believe this is because the class literal TextOutputFormat.class has the raw type Class<TextOutputFormat>, and Java gives no way to write a class literal with generic type arguments, so it cannot be passed where Class<? extends OutputFormat<?,?>> is expected.

The way out was to declare the word counts variable as a raw JavaPairDStream, i.e. JavaPairDStream rather than JavaPairDStream<String, Integer>, and with that the call compiled and ran fine (a sketch of what I ended up with is at the end of this mail).

Is this the recommended solution, or is there something better I can do? If this is the right solution, could the Spark API be simplified to

    saveAsHadoopFiles(String prefix, String suffix, Class<?> keyClass, Class<?> valueClass,
        Class<? extends org.apache.hadoop.mapred.OutputFormat> outputFormatClass)

i.e. without making OutputFormat itself generic as well?

The version of Spark is 1.0.2.

Thanks
Hemanth
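
PS: For reference, here is a minimal sketch of the version that compiles for me, with the raw-typed counts DStream. The socket source on port 9999, the local[2] master, and the HDFS output prefix are just placeholders for illustration, not my exact setup.

import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class HdfsWordCount {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("HdfsWordCount").setMaster("local[2]");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(1000));

    // Placeholder source: read lines from a local socket.
    JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

    // Split each line into words.
    JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String line) {
        return Arrays.asList(line.split(" "));
      }
    });

    // Map each word to a (word, 1) pair.
    JavaPairDStream<String, Integer> pairs = words.mapToPair(
        new PairFunction<String, String, Integer>() {
          @Override
          public Tuple2<String, Integer> call(String word) {
            return new Tuple2<String, Integer>(word, 1);
          }
        });

    // Workaround: declare the counts DStream as a raw JavaPairDStream (no generics).
    // On a raw receiver the method signature is erased, so the raw
    // TextOutputFormat.class literal is accepted. This produces rawtypes/unchecked
    // compiler warnings.
    JavaPairDStream wordCounts = pairs.reduceByKey(
        new Function2<Integer, Integer, Integer>() {
          @Override
          public Integer call(Integer i1, Integer i2) {
            return i1 + i2;
          }
        });

    // Placeholder HDFS prefix -- substitute a real path.
    wordCounts.saveAsHadoopFiles("hdfs://localhost:9000/user/spark/wordCounts", "txt",
        Text.class, IntWritable.class, TextOutputFormat.class);

    jssc.start();
    jssc.awaitTermination();
  }
}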