Github user CodingCat commented on the issue:

    https://github.com/apache/spark/pull/16578
  
    I made a simple test in a single-node Spark environment.

    I used a synthetic dataset of 20 million rows, generated as follows:
     
    ```scala
    import org.apache.spark.sql.SaveMode
    import spark.implicits._

    case class Job(title: String, department: String)

    case class Person(id: Int, name: String, job: Job)

    // Generate 20M rows with a nested struct column and write them out as Parquet.
    (0 until 20000000)
      .map(id => Person(id, id.toString, Job(id.toString, id.toString)))
      .toDF
      .write.mode(SaveMode.Overwrite)
      .parquet("/home/zhunan/parquet_test")
    ```
     
    I then read the directory back and wrote a projection of the nested field to another location:
     
    ```scala
    val df = spark.read.parquet("/home/zhunan/parquet_test")

    // Select only the nested field job.title and write the result back out.
    df.select("job.title").write.mode(SaveMode.Overwrite).parquet("/home/zhunan/parquet_out")
    ```
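
    As a sanity check (a suggestion, not part of the original test), the pruning should also be visible in the physical plan: with the patch, the Parquet scan's `ReadSchema` should list only `title` under `job` rather than the full struct. The exact output format varies across Spark versions.

    ```scala
    // Print the physical plan; with nested-schema pruning in effect, the
    // FileScan's ReadSchema should cover only job.title instead of the
    // entire Job struct. (Exact formatting differs across Spark versions.)
    df.select("job.title").explain()
    ```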
    
    
    Without the patch, the job reads 169 MB; with the patch, it reads around 86 MB.
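
    For anyone reproducing the measurement, here is a minimal sketch of one way to capture the bytes-read figure programmatically, by summing task input metrics with a `SparkListener` (the same number also appears in the Spark UI's input column):

    ```scala
    import java.util.concurrent.atomic.AtomicLong
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
    import org.apache.spark.sql.SaveMode

    // Sum bytesRead across all completed tasks of subsequent jobs.
    val bytesRead = new AtomicLong(0L)
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        // taskMetrics can be null for failed tasks, hence the Option guard.
        Option(taskEnd.taskMetrics).foreach { m =>
          bytesRead.addAndGet(m.inputMetrics.bytesRead)
        }
      }
    })

    // Re-run the same read/write and report how much input was scanned.
    spark.read.parquet("/home/zhunan/parquet_test")
      .select("job.title")
      .write.mode(SaveMode.Overwrite)
      .parquet("/home/zhunan/parquet_out")

    println(s"bytes read: ${bytesRead.get()}")
    ```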
    
    Basically, this demonstrates that the PR is working as intended.

