Good day,
May I know what the difference is between pyspark.sql.dataframe.DataFrame and
pyspark.pandas.frame.DataFrame? Are both stored in Spark's DataFrame format?
I'm looking for a way to load a huge Excel file (4-10 GB). Should I use the
third-party library spark-excel, or just the native pyspark.pandas? I would
prefer a Spark DataFrame so that the load is parallelized across the
executors instead of running on the driver.
Can you help to advise?
Details:

df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .load("/path/big_excel.xls")
print(type(df))  # output: pyspark.sql.dataframe.DataFrame
import pyspark.pandas as ps

path = "/path/big-excel.xls"
df = ps.read_excel(path)
print(type(df))  # output: pyspark.pandas.frame.DataFrame
Thank you.