Hi All,

I am a newbie to Spark/Hadoop. I want to read a parquet file and perform a simple word-count, but I get the following error:

Exception in thread "main" java.io.IOException: No input paths specified in job
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:239)
        at org.apache.parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:349)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
        at org.apache.parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:304)
        at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:120)
Below is my code. I guess I am missing some core concepts w.r.t. Hadoop InputFormats and how to make them work with Spark. Could you please explain the cause and how to get this working?

-----------------------------code snippet-----------------------------
JavaSparkContext sc = new JavaSparkContext(conf);

org.apache.hadoop.conf.Configuration mrConf = new Configuration();
mrConf.addResource(inputFile);

JavaPairRDD<String, String> textInputFormatObjectJavaPairRDD =
        sc.newAPIHadoopRDD(mrConf, ParquetInputFormat.class, String.class, String.class);

JavaRDD<String> words = textInputFormatObjectJavaPairRDD.values().flatMap(
        new FlatMapFunction<String, String>() {
            public Iterable<String> call(String x) {
                return Arrays.asList(x.split(","));
            }
        });

long x = words.count();
-----------------------------------------------------------------------

thanks!
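
For what it's worth, since the exception says no input paths were specified, below is a sketch of what I have been experimenting with instead. I am not sure it is right: I am guessing at the read-support class (GroupReadSupport from parquet-hadoop's example package) and the key/value types, and the path /tmp/input.parquet is just a stand-in for my real file.

-----------------------------sketch (unverified)-----------------------------
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetInputFormat;
import org.apache.parquet.hadoop.example.GroupReadSupport;

Job job = Job.getInstance();  // wraps a fresh Hadoop Configuration

// Tell ParquetInputFormat how to materialize records (Group = generic record).
ParquetInputFormat.setReadSupportClass(job, GroupReadSupport.class);

// Register the actual input path. My earlier mrConf.addResource(inputFile) only
// loads the file as extra configuration XML; it does not set an input path.
// "/tmp/input.parquet" is a hypothetical placeholder for the real file.
FileInputFormat.addInputPath(job, new Path("/tmp/input.parquet"));

// I believe ParquetInputFormat yields (Void, Group) pairs rather than (String, String).
JavaPairRDD<Void, Group> parquetRDD = sc.newAPIHadoopRDD(
        job.getConfiguration(),
        ParquetInputFormat.class,
        Void.class,
        Group.class);

long rows = parquetRDD.count();
-----------------------------------------------------------------------------

Is this the right direction, or should I be reading the file some other way before splitting it into words?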