Re: Spark with Parquet
Create a Hive table x and load your CSV data into it:

    LOAD DATA INPATH 'file/path' INTO TABLE x;

Then create a Hive table y with the same structure as x, except add STORED AS PARQUET, and rewrite the data:

    INSERT OVERWRITE TABLE y SELECT * FROM x;

This gets you Parquet files under /user/hive/warehouse/y (as an example); you can point your processing at that path. The full statements are sketched below.
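For concreteness, a minimal HiveQL sketch of those steps. The employee columns (name, dept) are illustrative, not from the original post; substitute your actual schema:

    -- 1. Staging table over the raw CSV
    CREATE TABLE x (name STRING, dept STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- 2. Load the CSV file into it
    LOAD DATA INPATH 'file/path' INTO TABLE x;

    -- 3. Same columns, but Parquet storage
    CREATE TABLE y (name STRING, dept STRING) STORED AS PARQUET;

    -- 4. Rewrite the rows as Parquet files under the warehouse directory for y
    INSERT OVERWRITE TABLE y SELECT * FROM x;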
Re: word count on parquet file
I changed the code to the following:

    JavaPairRDD<NullWritable, String> rdd = sc.newAPIHadoopFile(
        inputFile, ParquetInputFormat.class, NullWritable.class, String.class, mrConf);
    JavaRDD<String> words = rdd.values().flatMap(
        new FlatMapFunction<String, String>() {
          public Iterable<String> call(String x) {
            return Arrays.asList(x.split(","));
          }
        });

With this I get the error below:

    java.lang.NullPointerException
        at org.apache.parquet.hadoop.ParquetInputFormat.getReadSupportInstance(ParquetInputFormat.java:280)
        at org.apache.parquet.hadoop.ParquetInputFormat.getReadSupport(ParquetInputFormat.java:257)
        at org.apache.parquet.hadoop.ParquetInputFormat.createRecordReader(ParquetInputFormat.java:245)

My input file is a simple comma-separated employee record. I created a Hive table with STORED AS PARQUET and then loaded it from another Hive table. I can treat the records as plain lines, since I just need a word count. So, do my key and value classes make sense? Thanks a lot for your support. Best.
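The NullPointerException comes from ParquetInputFormat having no ReadSupport configured: getReadSupportInstance tries to instantiate the class named by the parquet.read.support.class property, which was never set. The key/value classes are also off: ParquetInputFormat yields null keys (Void) and, with the bundled example ReadSupport, Group values rather than Strings. A minimal sketch, assuming the parquet-hadoop GroupReadSupport and an illustrative string column called "name" (a hypothetical column name, not one from the original post):

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetInputFormat;
    import org.apache.parquet.hadoop.example.GroupReadSupport;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.FlatMapFunction;

    Configuration mrConf = new Configuration();
    // Without this property, getReadSupportInstance has no class to
    // instantiate and throws the NullPointerException shown above.
    mrConf.set(ParquetInputFormat.READ_SUPPORT_CLASS, GroupReadSupport.class.getName());

    JavaPairRDD<Void, Group> rdd = sc.newAPIHadoopFile(
        inputFile,                // path to the Parquet file or directory
        ParquetInputFormat.class,
        Void.class,               // ParquetInputFormat keys are always null
        Group.class,              // GroupReadSupport materializes Group records
        mrConf);

    JavaRDD<String> words = rdd.values().flatMap(
        new FlatMapFunction<Group, String>() {
          public Iterable<String> call(Group g) {
            // Parquet is columnar: fields are read by name rather than by
            // splitting a raw text line. "name" is a hypothetical column.
            return Arrays.asList(g.getString("name", 0).split(","));
          }
        });

(Note that Spark 2.x changed FlatMapFunction.call to return Iterator<String>; the Iterable form above matches the Spark 1.x code in this thread.)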
word count on parquet file
Hi All, I am a newbie to Spark/Hadoop. I want to read a Parquet file and perform a simple word count. Below is my code; however, I get this error:

    Exception in thread "main" java.io.IOException: No input paths specified in job
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:239)
        at org.apache.parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:349)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
        at org.apache.parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:304)
        at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:120)

I guess I am missing some core concepts about Hadoop InputFormats and how to make them work with Spark. Could you please explain the cause and how to get this working?

    - code snippet -
    JavaSparkContext sc = new JavaSparkContext(conf);
    org.apache.hadoop.conf.Configuration mrConf = new Configuration();
    mrConf.addResource(inputFile);
    JavaPairRDD<String, String> textInputFormatObjectJavaPairRDD = sc.newAPIHadoopRDD(
        mrConf, ParquetInputFormat.class, String.class, String.class);
    JavaRDD<String> words = textInputFormatObjectJavaPairRDD.values().flatMap(
        new FlatMapFunction<String, String>() {
          public Iterable<String> call(String x) {
            return Arrays.asList(x.split(","));
          }
        });
    long x = words.count();

thanks!
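The "No input paths specified in job" error means exactly what it says: mrConf.addResource(inputFile) merges inputFile in as an XML configuration resource; it never registers it as job input. A minimal sketch of two ways to supply the path, reusing the sc, inputFile, and mrConf names from the snippet above (key/value classes as in the reply earlier in the thread):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    // Option 1: let Spark set the input path from the file argument.
    JavaPairRDD<Void, Group> rdd = sc.newAPIHadoopFile(
        inputFile, ParquetInputFormat.class, Void.class, Group.class, mrConf);

    // Option 2: set the path on the configuration yourself, then hand
    // that configuration to newAPIHadoopRDD.
    Job job = Job.getInstance(mrConf);
    FileInputFormat.addInputPath(job, new Path(inputFile));
    JavaPairRDD<Void, Group> rdd2 = sc.newAPIHadoopRDD(
        job.getConfiguration(), ParquetInputFormat.class, Void.class, Group.class);

Either way, the read-support property from the reply above is still required, or the job fails with the NullPointerException shown there.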