I changed the code to the one below:

  JavaPairRDD<NullWritable, String> rdd = sc.newAPIHadoopFile(
      inputFile, ParquetInputFormat.class,
      NullWritable.class, String.class, mrConf);

  JavaRDD<String> words = rdd.values().flatMap(
      new FlatMapFunction<String, String>() {
        public Iterable<String> call(String x) {
          return Arrays.asList(x.split(","));
        }
      });

With this I get the error below:

  java.lang.NullPointerException
      at org.apache.parquet.hadoop.ParquetInputFormat.getReadSupportInstance(ParquetInputFormat.java:280)
      at org.apache.parquet.hadoop.ParquetInputFormat.getReadSupport(ParquetInputFormat.java:257)
      at org.apache.parquet.hadoop.ParquetInputFormat.createRecordReader(ParquetInputFormat.java:245)
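Reading the stack trace, getReadSupportInstance seems to fail because no read-support class was ever registered on the configuration, so ParquetInputFormat ends up instantiating null. Below is a minimal sketch of the setup I think is required, assuming parquet-hadoop's example GroupReadSupport is on the classpath and that the text I want lives in the first column (both are assumptions on my part):

  import org.apache.hadoop.mapreduce.Job;
  import org.apache.parquet.example.data.Group;
  import org.apache.parquet.hadoop.ParquetInputFormat;
  import org.apache.parquet.hadoop.example.GroupReadSupport;
  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.function.Function;

  // Register a ReadSupport so ParquetInputFormat knows how to
  // materialize records; without this the configuration lookup
  // returns null and getReadSupportInstance throws the NPE above.
  Job job = Job.getInstance(mrConf);
  ParquetInputFormat.setReadSupportClass(job, GroupReadSupport.class);

  // With GroupReadSupport, ParquetInputFormat acts as a
  // FileInputFormat<Void, Group>, so the key class is Void
  // (not NullWritable) and the value class is Group (not String).
  JavaPairRDD<Void, Group> rdd = sc.newAPIHadoopFile(
      inputFile, ParquetInputFormat.class,
      Void.class, Group.class, job.getConfiguration());

  // Each Group is one row; pull a String out of it. getString(0, 0)
  // assumes the first field is a string column (again, my assumption).
  JavaRDD<String> lines = rdd.values().map(
      new Function<Group, String>() {
        public String call(Group g) {
          return g.getString(0, 0);
        }
      });

Does that match what others do?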
My input file is a simple comma-separated employee record: I created a Hive table with STORED AS PARQUET and then loaded it from another Hive table. I can treat the records as plain lines, since all I need is a word count. So, do my key and value classes make sense?

Thanks a lot for your support.

Best..
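P.S. If configuring a ReadSupport by hand is the wrong track, I could also go through Spark SQL, which wires up the Parquet read support internally. A rough word-count sketch of that route, assuming Spark 1.x (1.4+ for read(), and FlatMapFunction still returning an Iterable) and that joining the row fields with commas is an acceptable way to flatten a row:

  import java.util.Arrays;
  import scala.Tuple2;
  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.function.*;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SQLContext;

  SQLContext sqlContext = new SQLContext(sc);

  // read().parquet() handles the ReadSupport wiring internally and
  // yields one Row per record; flatten each Row back into a "line".
  JavaRDD<String> lines = sqlContext.read().parquet(inputFile)
      .javaRDD()
      .map(new Function<Row, String>() {
        public String call(Row row) { return row.mkString(","); }
      });

  // Plain word count over the flattened lines.
  JavaPairRDD<String, Integer> counts = lines
      .flatMap(new FlatMapFunction<String, String>() {
        public Iterable<String> call(String line) {
          return Arrays.asList(line.split(","));
        }
      })
      .mapToPair(new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String w) {
          return new Tuple2<String, Integer>(w, 1);
        }
      })
      .reduceByKey(new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer a, Integer b) { return a + b; }
      });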