[ 
https://issues.apache.org/jira/browse/HIVE-24066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220516#comment-17220516
 ] 

Chao Gao commented on HIVE-24066:
---------------------------------

Next, I wrote some Python code to inspect the schema and metadata of the Parquet file.
{code:python}
>>> import pyarrow.parquet as pq
>>>
>>> # read_table returns a pyarrow Table; its .schema exposes the file schema
>>> table = pq.read_table('day_01.snappy.parquet')
>>> print(table.schema)
context: struct<app: struct<build: string, name: string, namespace: string, 
version: string>, device: struct<adTrackingEnabled: bool, advertisingId: 
string, id: string, manufacturer: string, model: string, name: string>, 
library: struct<name: string, version: string>, locale: string, network: 
struct<bluetooth: bool, carrier: string, cellular: bool, wifi: bool>, screen: 
struct<density: double, height: int64, width: int64>, timezone: string, 
userAgent: string>
......
......
-- schema metadata --
org.apache.spark.sql.parquet.row.metadata: '{"type":"struct","fields":[{"' + 
1937

>>> data = table.to_pandas()
>>> print(data.to_string())
0  {'app': {'build': '123', 'name': 'User App', 'namespace': 'com.abc.xyz', 
'version': '1.0.0'}, 'device': {'adTrackingEnabled': True, 'advertisingId': 
'test', 'id': '1c61295af65611b6', 'manufacturer': 'Quanta', 'model': 'QTAIR7', 
'name': 'QTAIR7'}, 'library': {'name': 'analytics-android', 'version': 
'1.0.0'}, 'locale': 'en-US', 'network': {'bluetooth': False, 'carrier': '', 
'cellular': False, 'wifi': False}, 'screen': {'density': 1.5, 'height': 1128, 
'width': 1920}, 'timezone': 'America/Los_Angeles', 'userAgent': 'Dalvik/2.1.0 
(Linux; U; Android 5.1; QTAIR7 Build/LMY47D)'}
......
......{code}
*From the Parquet analysis, we can see that the original Parquet metadata was 
generated by Spark SQL, and that there is NO* *_`os`: struct<`name`: string, 
`version`: string>_* *column.*

 

+Could anyone from the open-source community please help answer the following 
question?+

If the original Parquet file schema does NOT contain a STRUCT column that the 
Hive table schema DOES contain, is Hive expected to handle the query 
gracefully? i.e., should _*select context.os.name from 
sample_parquet_table;*_ return NULL instead of throwing an exception?
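The semantics being asked for can be sketched outside Hive: projecting a nested field that is absent from the data should yield NULL for every row rather than raise. A minimal pure-Python illustration (the `project` helper is hypothetical, not a Hive API, and the row values are invented):

```python
# Rows as plain dicts standing in for the 'context' struct; like the
# attached file, none of them has an 'os' key.
rows = [
    {"app": {"name": "User App", "version": "1.0.0"}, "locale": "en-US"},
    {"app": {"name": "User App", "version": "1.0.0"}, "locale": "fr-FR"},
]

def project(row, *path):
    """Walk a nested path (e.g. os.name); return None when any step is missing."""
    current = row
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None  # missing column -> NULL, not an exception
        current = current[key]
    return current

print([project(r, "os", "name") for r in rows])   # [None, None]
print(project(rows[0], "app", "name"))            # User App
```

This mirrors what Spark SQL already does for the same query, per the report below.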

> Hive query on parquet data should identify if column is not present in file 
> schema and show NULL value instead of Exception
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-24066
>                 URL: https://issues.apache.org/jira/browse/HIVE-24066
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 3.1.2, 2.3.5
>            Reporter: Jainik Vora
>            Priority: Major
>         Attachments: day_01.snappy.parquet
>
>
> I created a Hive table containing columns with struct data types.
>   
> {code:java}
> CREATE EXTERNAL TABLE test_dwh.sample_parquet_table (
>   `context` struct<
>         `app`: struct<
>             `build`: string,
>             `name`: string,
>             `namespace`: string,
>             `version`: string
>             >,
>         `device`: struct<
>             `adtrackingenabled`: boolean,
>             `advertisingid`: string,
>             `id`: string,
>             `manufacturer`: string,
>             `model`: string,
>             `type`: string
>             >,
>         `locale`: string,
>         `library`: struct<
>             `name`: string,
>             `version`: string
>             >,
>         `os`: struct<
>             `name`: string,
>             `version`: string
>             >,
>         `screen`: struct<
>             `height`: bigint,
>             `width`: bigint
>             >,
>         `network`: struct<
>             `carrier`: string,
>             `cellular`: boolean,
>             `wifi`: boolean
>              >,
>         `timezone`: string,
>         `userAgent`: string
>     >
> ) PARTITIONED BY (day string)
> STORED as PARQUET
> LOCATION 's3://xyz/events'{code}
>  
>  All columns are nullable, so the Parquet files read by the table don't 
> always contain every column. If any file in a partition lacks the 
> "context.os" struct and "context.os.name" is queried, Hive throws the 
> exception below. The same happens for "context.screen".
>   
> {code:java}
> 2020-10-23T00:44:10,496 ERROR [db58bfe6-d0ca-4233-845a-8a10916c3ff1 
> main([])]: CliDriver (SessionState.java:printError(1126)) - Failed with 
> exception java.io.IOException:java.lang.RuntimeException: Primitive type 
> osshould not doesn't match typeos[name]
> 2020-10-23T00:44:10,496 ERROR [db58bfe6-d0ca-4233-845a-8a10916c3ff1 
> main([])]: CliDriver (SessionState.java:printError(1126)) - Failed with 
> exception java.io.IOException:java.lang.RuntimeException: Primitive type 
> osshould not doesn't match typeos[name]java.io.IOException: 
> java.lang.RuntimeException: Primitive type osshould not doesn't match 
> typeos[name] 
>   at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:521)
>   at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
>   at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
>   at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208)
>   at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
>   at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
>   at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:787)
>   at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
>   at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.hadoop.util.RunJar.run(RunJar.java:239)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
> Caused by: java.lang.RuntimeException: Primitive type osshould not doesn't 
> match typeos[name] 
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.projectLeafTypes(DataWritableReadSupport.java:330)
>  
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.projectLeafTypes(DataWritableReadSupport.java:322)
>  
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getProjectedSchema(DataWritableReadSupport.java:249)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:379)
>  
>   at 
> org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:84)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
>   at 
> org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
>   at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
>   at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
>  ... 16 more{code}
>  
>  Querying "context.os" by itself returns NULL:
> {code:java}
> hive> select context.os from test_dwh.sample_parquet_table where day='01' 
> limit 5;
> OK
> NULL
> NULL
> NULL
> NULL
> NULL
>   {code}
>  
>  As a workaround, I tried querying "context.os.name" only when "context.os" 
> is not null, but that gave the same error. *To verify the CASE-statement 
> null check, I ran the query below; it should have produced "0" for every 
> row, but every row produced "1".* The distinct value of context.os for the 
> partition is NULL, which rules out differences caused by SELECT with LIMIT. 
> Running the same query in Spark SQL returns the correct result.
> {code:java}
> hive> select case when context.os is null then 0 else 1 end status from 
> test_dwh.sample_parquet_table where year=2020 and month='07' and day=26 and 
> hour='03' limit 5;
> OK
> 1
> 1
> 1
> 1
> 1 {code}
> This is reproducible on Hive 2.x and 3.x - we tested on Hive 2.3.5 and 
> Hive 3.1.4 (HDP cluster).
>  
> To reproduce: 
>  1. Copy the attached day_01.snappy.parquet to an S3 location 
> "s3://<bucket_name>/events/day=01"
>  2. Create the table {{test_dwh.sample_parquet_table}}
>  3. Run {{msck repair table test_dwh.sample_parquet_table}}
>  4. Run the query {{select context.os.name from 
> test_dwh.sample_parquet_table;}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
