When reading a Parquet file created from a pandas DataFrame with an unnamed index, Spark creates a column named "__index_level_0__", since Spark DataFrames do not support row indexing. This looks like a bug to me: as a Spark user I would expect unnamed index columns to be dropped on read. It might be intended, though.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pandas_frame = pd.DataFrame({'str_col': ['a', 'b'], 'num_col': [1, 2]})
pandas_frame.to_parquet('test.parquet')

spark_frame = spark.read.parquet('test.parquet')
spark_frame.show()

+-------+-------+-----------------+
|num_col|str_col|__index_level_0__|
+-------+-------+-----------------+
|      1|      a|                0|
|      2|      b|                1|
+-------+-------+-----------------+

Thanks,
Jesse