Re: word count on parquet file

2016-08-22 Thread shamu
I changed the code to below...

JavaPairRDD<NullWritable, String> rdd = sc.newAPIHadoopFile(inputFile,
    ParquetInputFormat.class, NullWritable.class, String.class, mrConf);
JavaRDD<String> words = rdd.values().flatMap(
    new FlatMapFunction<String, String>() {
      public Iterable<String> call(String x) {
        return Arrays.asList(x.split(","));
      }
    });
With this I get the below error:
java.lang.NullPointerException
        at org.apache.parquet.hadoop.ParquetInputFormat.getReadSupportInstance(ParquetInputFormat.java:280)
        at org.apache.parquet.hadoop.ParquetInputFormat.getReadSupport(ParquetInputFormat.java:257)
        at org.apache.parquet.hadoop.ParquetInputFormat.createRecordReader(ParquetInputFormat.java:245)

My input file is a simple comma-separated employee record. I created a Hive
table with STORED AS PARQUET and then loaded it from another Hive table... I
can treat the records as simple lines, and I just need to do a word count.
So, do my Key class and Value class make sense?

Thanks a lot for your support.
Best..




Re: word count on parquet file

2016-08-22 Thread ayan guha
You are missing the input. mrConf is not the way to add input files;
addResource expects a Hadoop configuration XML, not a data path. In Spark,
try the DataFrame read functions or sc.textFile.
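
For example, a rough, untested sketch along those lines (this assumes the Spark
1.6-style Java API, matching the Iterable-returning FlatMapFunction in your code,
and "/path/to/employee_parquet" is just a placeholder path):

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

SQLContext sqlContext = new SQLContext(sc);
// Read the parquet data directly; no Hadoop InputFormat plumbing needed.
DataFrame employees = sqlContext.read().parquet("/path/to/employee_parquet");

JavaRDD<String> words = employees.javaRDD().flatMap(
    new FlatMapFunction<Row, String>() {
      public Iterable<String> call(Row row) {
        // Treat each column value of the row as one word.
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < row.length(); i++) {
          out.add(String.valueOf(row.get(i)));
        }
        return out;
      }
    });
long count = words.count();

On Spark 2.x the read side would instead be spark.read().parquet(...) on a
SparkSession, and the FlatMapFunction would return an Iterator rather than an
Iterable.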

Best
Ayan
On 23 Aug 2016 07:12, "shamu"  wrote:

> Hi All,
> I am a newbie to Spark/Hadoop.
> I want to read a parquet file and perform a simple word-count. Below is my
> code; however, I get an error:
> Exception in thread "main" java.io.IOException: No input paths specified in job
>         at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:239)
>         at org.apache.parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:349)
>         at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
>         at org.apache.parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:304)
>         at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:120)
>
> Below is my code. I guess I am missing some core concepts wrt Hadoop
> InputFormats and making them work with Spark. Could you please explain the
> cause and the solution to get this working?
> ----- code snippet -----
> JavaSparkContext sc = new JavaSparkContext(conf);
> org.apache.hadoop.conf.Configuration mrConf = new Configuration();
> mrConf.addResource(inputFile);
> JavaPairRDD<String, String> textInputFormatObjectJavaPairRDD =
>     sc.newAPIHadoopRDD(mrConf, ParquetInputFormat.class, String.class, String.class);
> JavaRDD<String> words = textInputFormatObjectJavaPairRDD.values().flatMap(
>     new FlatMapFunction<String, String>() {
>       public Iterable<String> call(String x) {
>         return Arrays.asList(x.split(","));
>       }
>     });
> long x = words.count();
>
> --thanks!