Spark with Parquet

2014-04-27 Thread Sai Prasanna
Hi All,

I want to store a CSV text file in Parquet format in HDFS and then do some
processing on it in Spark.

Somehow my search for a way to do this was futile; most of the help I found
was for Parquet with Impala.

Any guidance here? Thanks !!


Re: Spark with Parquet

2014-04-27 Thread Matei Zaharia
Spark uses the Hadoop InputFormat and OutputFormat classes, so you can simply 
create a JobConf to read the data and pass that to SparkContext.hadoopFile. 
There are some examples for Parquet usage here: 
http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/ and here: 
http://engineering.ooyala.com/blog/using-parquet-and-scrooge-spark.
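
For illustration, here is a minimal sketch of that route using Parquet's Avro bindings. This is not from the original thread: the input path and the GenericRecord value type are assumptions, the newer mapreduce API is used via newAPIHadoopFile rather than a JobConf with hadoopFile, and recent Parquet releases put these classes under org.apache.parquet.* instead of parquet.*.

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}
import parquet.avro.AvroReadSupport
import parquet.hadoop.ParquetInputFormat

// In spark-shell, reuse the provided sc instead of creating a new context.
val sc = new SparkContext(new SparkConf().setAppName("parquet-hadoop-api"))

// The Job object carries the Hadoop configuration for the input format.
val job = Job.getInstance()
// Tell ParquetInputFormat how to materialize records (here: Avro GenericRecord).
ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[GenericRecord]])

// Parquet's input format yields (Void, record) pairs.
val pairs = sc.newAPIHadoopFile(
  "hdfs:///data/myfile.parquet",               // illustrative path
  classOf[ParquetInputFormat[GenericRecord]],
  classOf[Void],
  classOf[GenericRecord],
  job.getConfiguration)

val records = pairs.map(_._2)                  // drop the Void keys
records.take(5).foreach(println)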

Matei

On Apr 27, 2014, at 11:41 PM, Sai Prasanna  wrote:

> Hi All,
> 
> I want to store a CSV text file in Parquet format in HDFS and then do some 
> processing on it in Spark.
> 
> Somehow my search for a way to do this was futile; most of the help I found 
> was for Parquet with Impala. 
> 
> Any guidance here? Thanks !!
> 



Re: Spark with Parquet

2016-08-22 Thread shamu
Create a Hive table x and load your CSV data into it:
LOAD DATA INPATH 'file/path' INTO TABLE x;

Then create a Hive table y with the same structure as x, but declared STORED AS PARQUET, and copy the data across:
INSERT OVERWRITE TABLE y SELECT * FROM x;

This gives you Parquet files under the table's warehouse directory (e.g.
/user/hive/warehouse/y); you can point your Spark processing at that path.
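
For reference, a minimal sketch of the same flow driven from Spark SQL rather than the Hive CLI. This is not from the post; it assumes a Spark 2.x SparkSession built with Hive support, and the column names, types, and paths are illustrative.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-to-parquet-via-hive")
  .enableHiveSupport()                 // requires a Spark deployment with Hive support
  .getOrCreate()

// Plain-text staging table for the CSV data (illustrative columns).
spark.sql("CREATE TABLE IF NOT EXISTS x (id INT, name STRING) " +
  "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
spark.sql("LOAD DATA INPATH 'hdfs:///data/myfile.csv' INTO TABLE x")

// Parquet-backed table with the same structure, then copy the rows across.
spark.sql("CREATE TABLE IF NOT EXISTS y (id INT, name STRING) STORED AS PARQUET")
spark.sql("INSERT OVERWRITE TABLE y SELECT * FROM x")

// The Parquet files land under the table's warehouse directory,
// e.g. /user/hive/warehouse/y, and can be processed directly:
val df = spark.read.parquet("/user/hive/warehouse/y")
df.show(5)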






Re: Spark with Parquet

2016-08-23 Thread Mohit Jaggi
Something like this should work:

val df = sparkSession.read.csv("myfile.csv") // you may have to provide a schema if the inferred schema is not accurate
df.write.parquet("myfile.parquet")
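
If the inferred schema is not what you need, you can declare one explicitly before writing. A minimal sketch follows; the column names, types, header option, and HDFS paths are illustrative, not from the thread.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

// Declare the column names and types instead of relying on inference.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true)))

val df = spark.read
  .option("header", "true")    // if the first CSV line holds column names
  .schema(schema)              // use the declared types, skip inference
  .csv("hdfs:///data/myfile.csv")

df.write.parquet("hdfs:///data/myfile.parquet")

// Read the Parquet output back for further processing.
val parquetDf = spark.read.parquet("hdfs:///data/myfile.parquet")
parquetDf.printSchema()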


Mohit Jaggi
Founder,
Data Orchard LLC
www.dataorchardllc.com




> On Apr 27, 2014, at 11:41 PM, Sai Prasanna  wrote:
> 
> Hi All,
> 
> I want to store a CSV text file in Parquet format in HDFS and then do some 
> processing on it in Spark.
> 
> Somehow my search for a way to do this was futile; most of the help I found 
> was for Parquet with Impala. 
> 
> Any guidance here? Thanks !!
>