Github user CodingCat commented on the issue:

    https://github.com/apache/spark/pull/16578

I made a simple test in a single-node Spark environment. I used a synthetic dataset of 20,000,000 rows (that's 20M), generated as:

```scala
import spark.implicits._
import org.apache.spark.sql.SaveMode

case class Job(title: String, department: String)
case class Person(id: Int, name: String, job: Job)

(0 until 20000000)
  .map(id => Person(id, id.toString, Job(id.toString, id.toString)))
  .toDF
  .write.mode(SaveMode.Overwrite)
  .parquet("/home/zhunan/parquet_test")
```

I then read the directory back and wrote a projection of a single nested field to another location:

```scala
val df = spark.read.parquet("/home/zhunan/parquet_test")
df.select("job.title").write.mode(SaveMode.Overwrite).parquet("/home/zhunan/parquet_out")
```

Without the patch, the second job reads 169 MB; with the patch, it reads around 86 MB — roughly half. Basically, this shows the PR is working: only the `job.title` column of the nested struct is read from Parquet.
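For anyone who wants to reproduce the "bytes read" numbers above, one way (a sketch, not part of the PR) is to register a `SparkListener` and sum `inputMetrics.bytesRead` across finished tasks. The `bytesRead` counter name here is my own, and this assumes an existing `SparkSession` named `spark`:

```scala
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.sql.SaveMode

// Accumulates the bytes read by all tasks on this application.
val bytesRead = new AtomicLong(0L)

spark.sparkContext.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // inputMetrics.bytesRead counts bytes read from Hadoop/data sources.
    bytesRead.addAndGet(taskEnd.taskMetrics.inputMetrics.bytesRead)
  }
})

val df = spark.read.parquet("/home/zhunan/parquet_test")
df.select("job.title").write.mode(SaveMode.Overwrite).parquet("/home/zhunan/parquet_out")

println(s"input bytes read: ${bytesRead.get} (~${bytesRead.get / (1024 * 1024)} MB)")
```

The same figure is also visible per-job in the Spark UI under "Input" on the stages page, which may be simpler than wiring up a listener.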