Good day,
May I know what the difference is between pyspark.sql.dataframe.DataFrame and
pyspark.pandas.frame.DataFrame? Are both stored in Spark's DataFrame format?
I'm looking for a way to load a huge Excel file (4-10 GB). Should I use the
third-party spark-excel library, or just the native pyspark.pandas? I would
prefer a Spark DataFrame so the load is parallelized across the executors
instead of running on the driver.

Can anyone advise?

Details:

df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .load("/path/big_excel.xls")

print(type(df))  # pyspark.sql.dataframe.DataFrame


import pyspark.pandas as ps

path = "/path/big-excel.xls"

df = ps.read_excel(path)  # pyspark.pandas.frame.DataFrame
Thank you.
