One is a normal PySpark DataFrame; the other is a pandas work-alike wrapper around a PySpark DataFrame. They're the same thing underneath, with different APIs. Neither has a 'storage format'.
spark-excel might be fine, and it's used with Spark DataFrames. Because it emulates pandas's read_excel API, the pyspark.pandas DataFrame also has a read_excel method that could work. You can try both and see which works for you.

On Thu, Jan 12, 2023 at 9:56 PM second_co...@yahoo.com.INVALID <second_co...@yahoo.com.invalid> wrote:
>
> Good day,
>
> May I know what is the difference between pyspark.sql.dataframe.DataFrame
> versus pyspark.pandas.frame.DataFrame? Are both stored in Spark DataFrame
> format?
>
> I'm looking for a way to load a huge Excel file (4-10GB). I wonder, should
> I use the third-party library spark-excel or just use native pyspark.pandas?
> I prefer to use a Spark DataFrame so that it uses the parallelization
> feature of Spark in the executors instead of running on the driver.
>
> Can you help to advise?
>
> Detail
> ---
>
> df = spark.read \
>     .format("com.crealytics.spark.excel") \
>     .option("header", "true") \
>     .load("/path/big_excel.xls")
> print(type(df))  # output: pyspark.sql.dataframe.DataFrame
>
> import pyspark.pandas as ps
> from pyspark.sql import DataFrame
>
> path = "/path/big-excel.xls"
> df = ps.read_excel(path)
> print(type(df))  # output: pyspark.pandas.frame.DataFrame
>
> Thank you.