One is a normal PySpark DataFrame; the other is a pandas work-alike wrapper
around a PySpark DataFrame. They're the same thing underneath, exposed
through different APIs. Neither has its own 'storage format'.

spark-excel might be fine, and it's used with Spark DataFrames. Because
pyspark.pandas emulates the pandas read_excel API, the PySpark pandas
DataFrame also has a read_excel method that could work.
You can try both and see which works for you.

On Thu, Jan 12, 2023 at 9:56 PM second_co...@yahoo.com.INVALID
<second_co...@yahoo.com.invalid> wrote:

>
> Good day,
>
> May I know what is the difference between pyspark.sql.dataframe.DataFrame
> and pyspark.pandas.frame.DataFrame? Are both stored in Spark DataFrame
> format?
>
> I'm looking for a way to load a huge Excel file (4-10 GB). I wonder whether
> I should use the third-party library spark-excel or just native
> pyspark.pandas. I prefer to use a Spark DataFrame so that it uses Spark's
> parallelization in the executors instead of running on the driver.
>
> Can you advise?
>
>
> Detail
> ---
>
> df = spark.read \
>     .format("com.crealytics.spark.excel") \
>     .option("header", "true") \
>     .load("/path/big_excel.xls")
> print(type(df))  # output: pyspark.sql.dataframe.DataFrame
>
>
> import pyspark.pandas as ps
> from pyspark.sql import DataFrame
>
> path = "/path/big-excel.xls"
> df = ps.read_excel(path)
> print(type(df))  # output: pyspark.pandas.frame.DataFrame
>
>
> Thank you.
>
