Re: Spark with Parquet

2016-08-22 Thread shamu
Create a Hive table x.
Load your CSV data into table x (LOAD DATA INPATH 'file/path' INTO TABLE x;).

Create a Hive table y with the same structure as x, except add STORED AS PARQUET.
INSERT OVERWRITE TABLE y SELECT * FROM x;

This will give you Parquet files under /user/hive/warehouse/y (as an
example); you can use that path for your processing...
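
If the rest of your processing is in Spark, here is a minimal sketch of
reading that directory (assuming the Spark 1.6.x Java API; the path is the
example above and the app name is a placeholder):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class ReadParquetOutput {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("read-parquet-example");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);

    // Spark SQL infers the schema from the Parquet footers; no Hive
    // metastore access is needed when reading the directory directly.
    DataFrame df = sqlContext.read().parquet("/user/hive/warehouse/y");
    df.show();

    sc.stop();
  }
}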







Re: word count on parquet file

2016-08-22 Thread shamu
I changed the code to the following...

JavaPairRDD<NullWritable, String> rdd = sc.newAPIHadoopFile(inputFile,
    ParquetInputFormat.class, NullWritable.class, String.class, mrConf);
JavaRDD<String> words = rdd.values().flatMap(
    new FlatMapFunction<String, String>() {
      public Iterable<String> call(String x) {
        return Arrays.asList(x.split(","));
      }
    });
With this I get the below error:

java.lang.NullPointerException
    at org.apache.parquet.hadoop.ParquetInputFormat.getReadSupportInstance(ParquetInputFormat.java:280)
    at org.apache.parquet.hadoop.ParquetInputFormat.getReadSupport(ParquetInputFormat.java:257)
    at org.apache.parquet.hadoop.ParquetInputFormat.createRecordReader(ParquetInputFormat.java:245)

My input file is a simple comma-separated employee record. I created a Hive
table with STORED AS PARQUET and then loaded it from another Hive table... I
can treat the records as simple lines, and I just need to do a word count.
So, do my key class and value class make sense?

Thanks a lot for your support.
Best..
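
For what it's worth, the NPE in getReadSupportInstance() is what
ParquetInputFormat throws when no ReadSupport implementation has been
configured, so the key/value classes are not quite right: the record reader
emits Void keys, and the value type is whatever the ReadSupport
materializes, not String. Below is a minimal sketch, assuming the
parquet-hadoop example classes (GroupReadSupport/Group) are on the
classpath; turning a row into text via Group.toString() is just for
illustration:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetInputFormat;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

// ... with an existing JavaSparkContext sc and input path inputFile:

Configuration mrConf = new Configuration();
// Tell ParquetInputFormat how to materialize records; without this
// setting, getReadSupportInstance() has no class to instantiate.
mrConf.set(ParquetInputFormat.READ_SUPPORT_CLASS,
    GroupReadSupport.class.getName());

// Parquet's record reader produces (Void, Group) pairs.
JavaPairRDD<Void, Group> rdd = sc.newAPIHadoopFile(inputFile,
    ParquetInputFormat.class, Void.class, Group.class, mrConf);

JavaRDD<String> words = rdd.values().flatMap(
    new FlatMapFunction<Group, String>() {
      public Iterable<String> call(Group row) {
        // Group.toString() renders "field: value" lines; split on
        // whitespace as a rough "treat the row as text" stand-in.
        return Arrays.asList(row.toString().split("\\s+"));
      }
    });

If the goal is only a word count over rows, reading the file through
sqlContext.read().parquet(...) (as sketched in the other reply) avoids the
InputFormat plumbing entirely.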






word count on parquet file

2016-08-22 Thread shamu
Hi All,
I am a newbie to Spark/Hadoop. 
I want to read a Parquet file and perform a simple word count. Below is my
code; however, I get an error:

Exception in thread "main" java.io.IOException: No input paths specified in job
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:239)
    at org.apache.parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:349)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
    at org.apache.parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:304)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:120)

Below is my code. I guess I am missing some core concepts w.r.t. Hadoop
InputFormats and making them work with Spark. Could you please explain the
cause and a solution to get this working?

----- code snippet -----
JavaSparkContext sc = new JavaSparkContext(conf);
org.apache.hadoop.conf.Configuration mrConf = new Configuration();
mrConf.addResource(inputFile);
JavaPairRDD<String, String> textInputFormatObjectJavaPairRDD =
    sc.newAPIHadoopRDD(mrConf, ParquetInputFormat.class, String.class,
        String.class);
JavaRDD<String> words = textInputFormatObjectJavaPairRDD.values().flatMap(
    new FlatMapFunction<String, String>() {
      public Iterable<String> call(String x) {
        return Arrays.asList(x.split(","));
      }
    });
long x = words.count();

--thanks!
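
The immediate cause here looks like mrConf.addResource(inputFile), which
registers the file as a Hadoop *configuration* resource (an XML config
file), not as job input, so no input path is ever set and
FileInputFormat.listStatus() finds nothing. One way to supply the path,
sketched under the same assumptions as the replies above (ReadSupport
configured, parquet-hadoop example classes available):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetInputFormat;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.apache.spark.api.java.JavaPairRDD;

// Build a Job only to carry the input path and read-support settings.
Job job = Job.getInstance(new Configuration());
FileInputFormat.addInputPath(job, new Path(inputFile));
job.getConfiguration().set(ParquetInputFormat.READ_SUPPORT_CLASS,
    GroupReadSupport.class.getName());

JavaPairRDD<Void, Group> rdd = sc.newAPIHadoopRDD(job.getConfiguration(),
    ParquetInputFormat.class, Void.class, Group.class);

Alternatively, newAPIHadoopFile(path, ...) sets the input path for you,
which is what the follow-up post above switched to.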


