[ https://issues.apache.org/jira/browse/HIVE-24066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340225#comment-17340225 ]
Shivam Sharma commented on HIVE-24066:
--------------------------------------

I am also facing the same issue. Can we get an update here?

> Hive query on parquet data should identify if column is not present in file schema and show NULL value instead of Exception
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-24066
>                 URL: https://issues.apache.org/jira/browse/HIVE-24066
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 2.3.5, 3.1.2
>            Reporter: Jainik Vora
>            Priority: Major
>         Attachments: day_01.snappy.parquet

I created a Hive table containing columns with struct data types:

{code:java}
CREATE EXTERNAL TABLE test_dwh.sample_parquet_table (
  `context` struct<
    `app`: struct<
      `build`: string,
      `name`: string,
      `namespace`: string,
      `version`: string
    >,
    `device`: struct<
      `adtrackingenabled`: boolean,
      `advertisingid`: string,
      `id`: string,
      `manufacturer`: string,
      `model`: string,
      `type`: string
    >,
    `locale`: string,
    `library`: struct<
      `name`: string,
      `version`: string
    >,
    `os`: struct<
      `name`: string,
      `version`: string
    >,
    `screen`: struct<
      `height`: bigint,
      `width`: bigint
    >,
    `network`: struct<
      `carrier`: string,
      `cellular`: boolean,
      `wifi`: boolean
    >,
    `timezone`: string,
    `userAgent`: string
  >
) PARTITIONED BY (day string)
STORED AS PARQUET
LOCATION 's3://xyz/events'
{code}

All columns are nullable, so the Parquet files read by the table don't always contain every column. If any file in a partition lacks the "context.os" struct and "context.os.name" is queried, Hive throws the exception below. The same happens for "context.screen".

{code:java}
2020-10-23T00:44:10,496 ERROR [db58bfe6-d0ca-4233-845a-8a10916c3ff1 main([])]: CliDriver (SessionState.java:printError(1126)) - Failed with exception java.io.IOException:java.lang.RuntimeException: Primitive type osshould not doesn't match typeos[name]
java.io.IOException: java.lang.RuntimeException: Primitive type osshould not doesn't match typeos[name]
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:521)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
	at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
	at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
	at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:787)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
Caused by: java.lang.RuntimeException: Primitive type osshould not doesn't match typeos[name]
	at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.projectLeafTypes(DataWritableReadSupport.java:330)
	at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.projectLeafTypes(DataWritableReadSupport.java:322)
	at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getProjectedSchema(DataWritableReadSupport.java:249)
	at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:379)
	at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:84)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
	at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
	at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
	... 16 more
{code}
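Judging from the top "Caused by" frame, the garbled message text is produced by string concatenation inside DataWritableReadSupport.projectLeafTypes: the projection walk hits a case where the requested column path still has children ("os[name]") while the type it resolved against the file schema is primitive, and it throws instead of projecting the column as NULL. Below is a minimal, self-contained sketch of that check; FileType and FieldNode here are hypothetical stand-ins, not Hive's actual classes:

{code:java}
import java.util.List;

public class ProjectionSketch {

  // Hypothetical stand-in for a (possibly primitive) type in the file schema.
  record FileType(String name, boolean isPrimitive) {}

  // Hypothetical stand-in for a requested column path node.
  record FieldNode(String fieldName, List<FieldNode> children) {
    @Override
    public String toString() {
      // Renders os -> name as "os[name]", matching the error text.
      return children.isEmpty() ? fieldName : fieldName + children;
    }
  }

  static FileType projectLeaf(FileType fileType, FieldNode requested) {
    if (fileType.isPrimitive()) {
      if (requested.children().isEmpty()) {
        return fileType; // a primitive leaf was requested: keep it
      }
      // A nested projection (os[name]) resolved against a primitive file
      // type lands here. Concatenating without spaces reproduces the
      // garbled "osshould not doesn't match typeos[name]" message.
      throw new RuntimeException("Primitive type " + requested.fieldName()
          + "should not " + "doesn't match type" + requested);
    }
    // The recursive projectLeafTypes frames in the trace show the real walk
    // descending into group (struct) fields here; projecting a missing
    // nested field as NULL instead of throwing is what this ticket requests.
    return fileType;
  }

  public static void main(String[] args) {
    FieldNode osName =
        new FieldNode("os", List.of(new FieldNode("name", List.of())));
    try {
      projectLeaf(new FileType("os", true), osName);
    } catch (RuntimeException e) {
      // Prints: Primitive type osshould not doesn't match typeos[name]
      System.out.println(e.getMessage());
    }
  }
}
{code}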
Querying "context.os" by itself returns NULL:

{code:java}
hive> select context.os from test_dwh.sample_parquet_table where day='01' limit 5;
OK
NULL
NULL
NULL
NULL
NULL
{code}

As a workaround, I tried querying "context.os.name" only if "context.os" is not null, but that gave the same error. *To verify the CASE statement's null check, I ran the query below, which should have produced "0" for every row, yet it produced "1" for all of them.* The only distinct value of context.os in the partition is NULL, which rules out sampling differences from the LIMIT. Running the same query in SparkSQL gives the correct result.

{code:java}
hive> select case when context.os is null then 0 else 1 end status from test_dwh.sample_parquet_table where year=2020 and month='07' and day=26 and hour='03' limit 5;
OK
1
1
1
1
1
{code}

This reproduces on Hive 2.x and 3.x; we tested on Hive 2.3.5 and Hive 3.1.4 (HDP cluster).

To reproduce:
1. Copy the attached day_01.snappy.parquet to an S3 location "s3://<bucket_name>/events/day=01"
2. Create the table {{test_dwh.sample_parquet_table}}
3. Run {{msck repair table test_dwh.sample_parquet_table}}
4. Run {{select context.os.name from test_dwh.sample_parquet_table;}}
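For a repro that doesn't depend on the attachment, a Parquet file whose schema omits context.os can be written with parquet-mr's example writer API. This is a minimal sketch, assuming parquet-hadoop and its example classes are on the classpath; the output file name and the reduced field set are illustrative:

{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteNarrowParquet {
  public static void main(String[] args) throws Exception {
    // File schema deliberately omits context.os (and context.screen),
    // mirroring the partition files that trigger the exception.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message event { "
      + "  optional group context { "
      + "    optional binary locale (UTF8); "
      + "    optional binary timezone (UTF8); "
      + "  } "
      + "}");

    // Build one row with only the fields the narrow schema declares.
    SimpleGroupFactory factory = new SimpleGroupFactory(schema);
    Group row = factory.newGroup();
    row.addGroup("context")
       .append("locale", "en-US")
       .append("timezone", "UTC");

    // Write a snappy-compressed file to drop into the day=01 partition.
    try (ParquetWriter<Group> writer = ExampleParquetWriter
        .builder(new Path("day_01_repro.snappy.parquet"))
        .withType(schema)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
      writer.write(row);
    }
  }
}
{code}

Copying the resulting file into the day=01 partition (step 1) and repeating steps 2 through 4 should surface the same RuntimeException when context.os.name is selected.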