Hi all,
I am using CDH 5.7, which ships with Spark 1.6.0. I am saving my data set as Parquet and then querying it. The query executes fine, but when I checked the files generated by Spark, I found that the column statistics (min/max) are missing for all columns. As a result, filters are not pushed down and the entire file is scanned.

    (1 to 30000).map(i => (i, i.toString)).toDF("a", "b").sort("a").write.parquet("/hdfs/path/to/store")

    parquet-tools meta part-r-00186-03addad8-c19d-4812-b83b-a8708606183b.gz.parquet

    creator:     parquet-mr version 1.5.0-cdh5.7.1 (build ${buildNumber})
    extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}},{"name":"b","type":"string","nullable":true,"metadata":{}}]}

    file schema: spark_schema
    --------------------------------------------------------------------------------
    a:           OPTIONAL INT32 R:0 D:1
    b:           OPTIONAL BINARY O:UTF8 R:0 D:1

    row group 1: RC:148 TS:2012
    --------------------------------------------------------------------------------
    a:           INT32 GZIP DO:0 FPO:4 SZ:297/635/2.14 VC:148 ENC:BIT_PACKED,PLAIN,RLE
    b:           BINARY GZIP DO:0 FPO:301 SZ:301/1377/4.57 VC:148 ENC:BIT_PACKED,PLAIN,RLE

As you can see from the parquet-tools output, the statistics (ST) field is missing for both columns, and Spark ends up scanning all the data in every file.

Any suggestions?

Thanks
// RB
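P.S. For completeness, this is roughly how I am checking whether the filter is pushed down (a sketch from the Spark 1.6 shell, where sqlContext is provided automatically; the path is the same one as above):

    // Read the Parquet data back with a predicate on column "a".
    val df = sqlContext.read.parquet("/hdfs/path/to/store").filter("a < 100")

    // The physical plan lists the predicates handed to the Parquet reader,
    // e.g. "PushedFilters: [LessThan(a,100)]". Even when a filter appears
    // there, row groups can only be skipped if the min/max statistics
    // are present in the file footer.
    df.explain(true)

Note that spark.sql.parquet.filterPushdown must be true (it is the default in 1.6) for the filter to appear in the plan at all.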