[ https://issues.apache.org/jira/browse/HIVE-20574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mikko Kivistö resolved HIVE-20574. ---------------------------------- Resolution: Not A Problem The config properties pointed by Mr. Jadhav did the trick, so not a problem > Column statistics give erraneous numDistinct > -------------------------------------------- > > Key: HIVE-20574 > URL: https://issues.apache.org/jira/browse/HIVE-20574 > Project: Hive > Issue Type: Bug > Components: Metastore, Statistics > Affects Versions: 2.3.2 > Environment: Amazon EMR (BigTop based) from emr-5.9.0 to emr-5.16.0. > Reporter: Mikko Kivistö > Assignee: Ajay Jadhav > Priority: Major > Labels: Statistics, statsCollection > > 1) Download the parquet file to s3/hdfs (e.g. hdfs:///tmp/testi_parquet/) > using some tool (aws cli, hdfs command or anything) > - S3: s3://www.smartdatahub.io/data/test.parquet > - HTTP: [http://www.smartdatahub.io/data/test.parquet] > - or the attachmen > eg. with aws cli, wget/curl/distcp can also be used > {\{ aws s3 cp s3://www.smartdatahub.io/data/test.parquet . }} > {\{ hdfs dfs -mkdir hdfs:///tmp/testi_parquet/}}hdfs dfs -put test.parquet }} > {{ hdfs:///tmp/testi_parquet/test.parquet}} > 2) Create table default.testi_parquet2 on top of that using the schema > provided > {{CREATE TABLE `default.testi_parquet2`(}} > {{ `rakennustu` int, }} > {{ `kohdenimi` string, }} > {{ `tekstisuun` int, }} > {{ `tekstikoko` float, }} > {{ `tekstifont` string, }} > {{ `buix_bid` int, }} > {{ `paivitetty` string, }} > {{ `datanomist` string, }} > {{ `geom_geojson` string, }} > {{ `geom` binary, }} > {{ `extractdate` string)}} > {{ ROW FORMAT SERDE }} > {{ 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' }} > {{ STORED AS INPUTFORMAT }} > {{ 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' }} > {{ OUTPUTFORMAT }} > {{ 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'}} > {{ LOCATION}} > {{ 'hdfs:///tmp/testi_parquet/';}} > {{ \{{ – CHANGE THE LOCATION TO THE PREFIX/DIRECTORY YOU DOWNLOADED THE FILE > FROM STEP 1}}}} > 3) To collect the values showing you the actual reality of the data: Query > the distinct count, min and max of column "tekstisuun" > {\{ SELECT COUNT(DISTINCT tekstisuun), MAX(tekstisuun), MIN(tekstisuun) FROM > default.testi_parquet2; }} > and note them (min 0, max 0, distinct 1) > 4) Compute statistics for the table using > {{ANALYZE TABLE default.testi_parquet2 COMPUTE STATISTICS FOR COLUMNS;}} > 5) See erroneous statistics entry for numDistincts: Query the statistics by > using " > {{DESCRIBE FORMATTED default.testi_parquet2 tekstisuun}} > " and note the ERRANEOUS numDistincts value: 2 -- This message was sent by Atlassian JIRA (v7.6.3#76005)