Re: Spark with Parquet

2016-08-22 Thread shamu
Create a Hive table x and load your CSV data into it:

  LOAD DATA INPATH 'file/path' INTO TABLE x;

Create a Hive table y with the same structure as x, except add STORED AS PARQUET. Then copy the data across:

  INSERT OVERWRITE TABLE y SELECT * FROM x;

This would get you parquet files under /user/hive/warehouse/y (as an example).
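A fuller sketch of the whole sequence, assuming a two-column comma-delimited CSV (the schema, delimiter, and column names below are hypothetical; substitute your own):

  -- Plain-text staging table for the CSV (hypothetical schema).
  CREATE TABLE x (id INT, line STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

  LOAD DATA INPATH 'file/path' INTO TABLE x;

  -- Same structure as x, but stored as parquet.
  CREATE TABLE y (id INT, line STRING)
    STORED AS PARQUET;

  INSERT OVERWRITE TABLE y SELECT * FROM x;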

Re: word count on parquet file

2016-08-22 Thread shamu
I changed the code to below...

  JavaPairRDD<NullWritable, String> rdd = sc.newAPIHadoopFile(inputFile,
      ParquetInputFormat.class, NullWritable.class, String.class, mrConf);
  JavaRDD<String> words = rdd.values().flatMap(
      new FlatMapFunction<String, String>() {
        public Iterable<String> call(String
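The archive truncates the message mid-signature. As a sketch, the standard Spark 1.x word-count continuation usually reads as below; the whitespace split and the counting stage are assumptions, not part of the original message:

  import java.util.Arrays;
  import scala.Tuple2;
  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.function.FlatMapFunction;
  import org.apache.spark.api.java.function.Function2;
  import org.apache.spark.api.java.function.PairFunction;

  // rdd is the JavaPairRDD<NullWritable, String> built above.
  JavaRDD<String> words = rdd.values().flatMap(
      new FlatMapFunction<String, String>() {
        public Iterable<String> call(String line) {
          // Assumed tokenization: split each record on whitespace.
          return Arrays.asList(line.split("\\s+"));
        }
      });

  JavaPairRDD<String, Integer> counts = words
      .mapToPair(new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String word) {
          return new Tuple2<String, Integer>(word, 1);
        }
      })
      .reduceByKey(new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer a, Integer b) { return a + b; }
      });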

word count on parquet file

2016-08-22 Thread shamu
Hi All, I am a newbie to Spark/Hadoop. I want to read a parquet file and perform a simple word count. Below is my code; however, I get an error:

  Exception in thread "main" java.io.IOException: No input paths specified in job
      at
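For readers hitting the same error: FileInputFormat throws "No input paths specified in job" when the Hadoop job configuration handed to it contains no input path. A minimal sketch of one way to set the path explicitly, assuming the mapreduce-API ParquetInputFormat and a variable named inputFile (both assumptions, since the original code is truncated):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  Job job = Job.getInstance(new Configuration());
  // Register the parquet file so FileInputFormat has at least one input path.
  FileInputFormat.addInputPath(job, new Path(inputFile));
  Configuration mrConf = job.getConfiguration();
  // mrConf can now be passed to sc.newAPIHadoopFile(...), as in the follow-up.

Note that sc.newAPIHadoopFile also sets the path it receives as its first argument, so an empty or misspelled path string is another common cause of this exception.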