Hi Ashok,

On the Spark SQL side, when you create a DataFrame, it has a schema (each 
column has a type such as Int or String). When you save that DataFrame in 
Parquet format, Spark translates the DataFrame schema into Parquet data types 
(see org.apache.spark.sql.execution.datasources.parquet). Parquet then applies 
dictionary encoding automatically, where applicable, based on the data values; 
the encoding is not specified by the user. Parquet figures out the right 
encoding to use for you.
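
For example, here's a minimal sketch in Scala (the app name, column names, 
and output path are just placeholders; on older Spark versions a SQLContext 
plays the same role as SparkSession):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ParquetDictionaryExample")
      .getOrCreate()
    import spark.implicits._

    // The schema is inferred from the tuple element types:
    // "id" becomes an integer column, "word" a string column.
    val df = Seq((1, "apple"), (2, "banana"), (3, "apple")).toDF("id", "word")
    df.printSchema()

    // On write, Spark maps this schema to Parquet types; Parquet then
    // chooses encodings per column chunk on its own (e.g. dictionary
    // encoding for the repetitive "word" column), with no user config.
    df.write.parquet("/tmp/words.parquet")

If you want to check which encodings Parquet actually chose, the 
parquet-tools utility can print per-column metadata from the file footer 
(its meta command); dictionary-encoded columns show up in the encodings list.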

Xinh

> On Mar 3, 2016, at 7:32 PM, ashokkumar rajendran 
> <ashokkumar.rajend...@gmail.com> wrote:
> 
> Hi, 
> 
> I am exploring using Apache Parquet with Spark SQL in our project. I notice 
> that Apache Parquet uses different encodings for different columns. 
> Dictionary encoding in Parquet looks like one of the good ones for our 
> performance. I do not see much documentation in Spark or Parquet on how to 
> configure this. For example, how would Parquet know the dictionary of words 
> if there is no schema provided by the user? Where/how do I specify my 
> schema/config for the Parquet format?
> 
> I could not find the Apache Parquet mailing list on the official site. It 
> would be great if anyone could share it as well.
> 
> Regards
> Ashok

