Hi All,

I have an external table in Spark whose underlying data files are in Parquet format, and the table is partitioned. When I compute statistics and then look at the plan of a query that filters on the partition column, the statistics reported contain only sizeInBytes and no row count.
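For what it's worth, this is roughly how I check what statistics the catalog actually holds after the ANALYZE commands in the repro below (a minimal sketch; it assumes the test_p table and the Korea partition created further down already exist):

// Table-level statistics written by ANALYZE TABLE show up in the "Statistics" row.
spark.sql("describe formatted test_p").show(100, truncate = false)

// Partition-level statistics written by ANALYZE TABLE ... PARTITION(Country)
// can be inspected for a single partition the same way.
spark.sql("describe formatted test_p partition (Country='Korea')").show(100, truncate = false)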
Here is the repro:

val ddl = """create external table test_p (Address String, Age String, CustomerID string,
CustomerName string, CustomerSuffix string, Location string, Mobile String,
Occupation String, Salary String)
PARTITIONED BY (Country string)
Stored as PARQUET LOCATION '/dev/test3'"""
spark.sql(ddl)
spark.sql("msck repair table test_p")
spark.sql("Analyze table test_p compute statistics for columns Address,Age,CustomerID,CustomerName,CustomerSuffix,Location,Mobile,Occupation,Salary,Country").show()
spark.sql("Analyze table test_p partition(Country) compute statistics").show()
println(spark.sql("select * from test_p where country='Korea'").queryExecution.toStringWithStats)

The output I get is:

== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('country = Korea)
   +- 'UnresolvedRelation `test_p`

== Analyzed Logical Plan ==
Address: string, Age: string, CustomerID: string, CustomerName: string, CustomerSuffix: string, Location: string, Mobile: string, Occupation: string, Salary: string, Country: string
Project [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9]
+- Filter (country#9 = Korea)
   +- SubqueryAlias test_p
      +- Relation[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] parquet

== Optimized Logical Plan ==
Project [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9], Statistics(sizeInBytes=2.2 KB, hints=none)
+- Filter (isnotnull(country#9) && (country#9 = Korea)), Statistics(sizeInBytes=2.2 KB, hints=none)
   +- Relation[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] parquet, Statistics(sizeInBytes=2.2 KB, hints=none)

== Physical Plan ==
*FileScan parquet default.test_p[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[file:/C:/dev/tests2/Country=Korea], PartitionCount: 1, PartitionFilters: [isnotnull(Country#9), (Country#9 = Korea)], PushedFilters: [], ReadSchema: struct<Address:string,Age:string,CustomerID:string,CustomerName:string,CustomerSuffix:string,Loca...

The same works fine if the table's underlying data file format is TextFile (a sketch of that comparison is in the P.S. below). Am I missing any step above, or is this a known issue in Spark? Any help would be appreciated.

Thanks,
Rajat
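P.S. For comparison, this is roughly the TextFile-backed variant that does report a row count for me; a minimal sketch along the same lines as the repro above (the table name test_t and the location '/dev/test_text' are just placeholders here):

// Same schema and partitioning as test_p, but stored as plain text instead of Parquet.
val ddlText = """create external table test_t (Address String, Age String, CustomerID string,
CustomerName string, CustomerSuffix string, Location string, Mobile String,
Occupation String, Salary String)
PARTITIONED BY (Country string)
STORED AS TEXTFILE LOCATION '/dev/test_text'"""
spark.sql(ddlText)
spark.sql("msck repair table test_t")
spark.sql("Analyze table test_t partition(Country) compute statistics").show()

// For this table the optimized plan of the same partition-filtered query reports both
// sizeInBytes and rowCount in Statistics(...).
println(spark.sql("select * from test_t where country='Korea'").queryExecution.toStringWithStats)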