[ https://issues.apache.org/jira/browse/SPARK-46996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-46996:
-----------------------------------
    Labels: pull-request-available  (was: )

> Use AnalyzeTableCommand overwrite statistics information incorrectly
> --------------------------------------------------------------------
>
>                 Key: SPARK-46996
>                 URL: https://issues.apache.org/jira/browse/SPARK-46996
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4, 3.4.1
>            Reporter: Davy Xu
>            Priority: Major
>              Labels: pull-request-available
>
> When the size of the table changes but the total number of rows in the table
> does not change, I use the SQL statement "analyze table student compute
> statistics" to analyze the external table statistics. The Statistics
> information returned by the SQL statement "desc extended student" then
> contains only the table size and no longer includes the row count.
> Specific operating instructions:
> {code:sql}
> Create external table
> create table student(id int, name string, age int) row format delimited
> fields terminated by ','
> lines terminated by '\n' location 'hdfs://nameservice/spark/student';
> The contents of the external table file are as follows:
> class1.txt:
> 1,'Jack',25
> 2,'Thompson',28
> class2.txt:
> 3,'Davy',30
> 4,'Thompson',35
> class3.txt:
> 5,'Curry',40
> 6,'Morgan',20
> Import external table data
> hdfs dfs -put 1.txt /spark/student
> hdfs dfs -put 2.txt /spark/student
> Analyze external table statistics
> analyze table student compute statistics;
> desc extended student;
> Return results
> Type EXTERNAL
> Provider hive
> Table Properties [transient_lastDdlTime=1707265554]
> Statistics 56 bytes, 4 rows
> Location hdfs://nameservice/spark/student
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat org.apache.hadoop.mapred.TextInputFormat
> OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Storage Properties [serialization.format=,, line.delim=
> , field.delim=,]
> Partition Provider Catalog
> Modify external table
> hdfs dfs -rm /spark/student/student2.txt
> hdfs dfs -put student3.txt /spark/student
> Analyze the external table again
> analyze table student compute statistics;
> desc extended student;
> Return results
> Type EXTERNAL
> Provider hive
> Table Properties [transient_lastDdlTime=1707265719]
> Statistics 55 bytes
> Location hdfs://nameservice/spark/student
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat org.apache.hadoop.mapred.TextInputFormat
> OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Storage Properties [serialization.format=,, line.delim=
> , field.delim=,]
> Partition Provider Catalog
> {code}
> Based on the results above, when the table size changes but the number of
> rows does not, I would expect the statistics to include both the new table
> size and the total number of rows. I don’t know whether it is correct to
> display only the table size.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
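For anyone reproducing the report, the symptom can be checked mechanically by parsing the Statistics value from the "desc extended student" output. The sketch below is only an illustration, not part of Spark: the helper names `parse_statistics` and `row_count_missing` are hypothetical, and the accepted format is assumed from the two values shown in the report ("56 bytes, 4 rows" and "55 bytes").

```python
import re
from typing import Optional, Tuple

# Assumed format, taken from the two Statistics values in the report:
#   "56 bytes, 4 rows"   (size and row count)
#   "55 bytes"           (size only)
_STATS_RE = re.compile(r"^\s*(\d+)\s+bytes(?:,\s*(\d+)\s+rows)?\s*$")


def parse_statistics(value: str) -> Tuple[int, Optional[int]]:
    """Return (size_in_bytes, row_count); row_count is None when absent."""
    match = _STATS_RE.match(value)
    if match is None:
        raise ValueError(f"unrecognized Statistics value: {value!r}")
    size_in_bytes = int(match.group(1))
    rows = match.group(2)
    return size_in_bytes, (int(rows) if rows is not None else None)


def row_count_missing(value: str) -> bool:
    """True when the Statistics value carries a size but no row count."""
    return parse_statistics(value)[1] is None


# The two values from the report: before and after the file swap.
before = parse_statistics("56 bytes, 4 rows")
after = parse_statistics("55 bytes")
```

With the values from the report, `before` carries a row count and `after` does not, which is the behavior being questioned.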